GANSpeech Demo

GANSpeech: Adversarial Training for High-Fidelity Multi-Speaker Speech Synthesis

Jinhyeok Yang*, Jae-Sung Bae*, Taejun Bak, Young-Ik Kim, Hoon-Young Cho


Paper
arXiv:2106.15153, accepted to INTERSPEECH 2021

Abstract
    Recent advances in neural multi-speaker text-to-speech (TTS) models have made it possible to generate reasonably good speech with a single model and to synthesize the voice of a speaker with limited training data. Fine-tuning the multi-speaker model on target speaker data can improve quality; however, a gap to real speech samples remains, and the fine-tuned model becomes speaker-dependent. In this work, we propose GANSpeech, a high-fidelity multi-speaker TTS model that applies adversarial training to a non-autoregressive multi-speaker TTS model. In addition, we propose simple but efficient automatic scaling methods for the feature matching loss used in adversarial training. In subjective listening tests, GANSpeech significantly outperformed the baseline multi-speaker FastSpeech and FastSpeech2 models and achieved a better mean opinion score (MOS) than the speaker-specific fine-tuned FastSpeech2.
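The abstract mentions automatic scaling of the feature matching loss but this page does not spell out the rule. The sketch below illustrates one plausible form under stated assumptions: the feature matching loss is taken as the layer-averaged L1 distance between discriminator features of real and generated mel spectrograms, and the scale factor is assumed to be the ratio of the reconstruction loss to the feature matching loss. All function names (`feature_matching_loss`, `auto_scale`, `generator_loss`) and the toy numbers are hypothetical and are not taken from the authors' code.

```python
from typing import Sequence


def l1_mean(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def feature_matching_loss(
    real_feats: Sequence[Sequence[float]],
    fake_feats: Sequence[Sequence[float]],
) -> float:
    """Average, over discriminator layers, of the L1 distance between the
    (flattened) features of a real and a generated mel spectrogram."""
    if len(real_feats) != len(fake_feats):
        raise ValueError("both inputs must have the same number of layers")
    per_layer = [l1_mean(r, f) for r, f in zip(real_feats, fake_feats)]
    return sum(per_layer) / len(per_layer)


def auto_scale(rec_loss: float, fm_loss: float, eps: float = 1e-8) -> float:
    """ASSUMED scaling rule: reconstruction loss divided by feature matching
    loss. In an autograd framework this ratio would be treated as a constant
    (no gradient flows through it)."""
    return rec_loss / (fm_loss + eps)


def generator_loss(adv_loss: float, rec_loss: float, fm_loss: float) -> float:
    """Adversarial + reconstruction + automatically scaled feature matching."""
    alpha = auto_scale(rec_loss, fm_loss)
    return adv_loss + rec_loss + alpha * fm_loss


if __name__ == "__main__":
    # Toy discriminator features for two layers (hypothetical values).
    real = [[1.0, 2.0], [0.0, 4.0]]
    fake = [[0.0, 2.0], [2.0, 4.0]]
    fm = feature_matching_loss(real, fake)
    print("feature matching loss:", fm)
    print("scale factor:", auto_scale(0.3, fm))
    print("generator loss:", generator_loss(0.5, 0.3, fm))
```

Under this assumed rule, the scaled feature matching term has the same magnitude as the reconstruction loss by construction, so no weight has to be hand-tuned for it.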

Most of the samples are of high quality, so the differences between systems can be subtle.
Listening with headphones or earphones is strongly recommended for a clear comparison.

GANSpeech Samples (Korean)

Total dataset: 190 hours, including the target speakers' datasets (1 hour each).
MS: multi-speaker. FT: speaker-specific fine-tuning.

Recording | GT mel + VocGAN | FastSpeech1(MS) | FastSpeech2(MS) | FastSpeech2(FT) | FS1-based GANSpeech | FS2-based GANSpeech
Sentence: "음극선은 두꺼운 검은색 종이에 쌓여 있었으므로, 그 빛이 음극선일리가 없었다."
(Pronunciation): "eumgeukseoneun dukkeoun geomeunsaek jongie ssahyeo isseosseumeuro, geu bicci eumgeukseonilliga eopseossda."
Sentence: "소자 다녀왔습니다. 저를 급히 부르셨다고 들었습니다만."
(Pronunciation): "soja danyeowassseupnida. jeoreul geuphi bureusyeossdago deureossseupnidaman."
Sentence: "예. 집안일은 제게 맡겨두세요."
(Pronunciation): "ye. jibanireun jege matgyeoduseyo."
Sentence: "한때 명성이 자자했던 화씨 대 가문도 과거에 묻히게 되었다."
(Pronunciation): "hanttae myeongseongi jajahaessdeon hwassi dae gamundo gwageoe muthige doeeossda."



VocGAN Samples synthesized from true mel spectrograms



SFML: scaled feature matching loss.
Ground Truth | VocGAN w/o SFML | VocGAN w/ SFML
Sentence: "차대웅, 너 그렇게 안봤는데 그럼 책임져야지, 멋지다."
(Pronunciation): "chadaeung, neo geureohge anbwassneunde geureom chaegimjyeoyaji, meosjida."
Sentence: "본문 이백구쪽 일단의 북진은 가공할 사태 부분."
(Pronunciation): "bonmun ibaekgujjok ildanui bukjineun gagonghal satae bubun."
Sentence: "뭐 그런 투에."
(Pronunciation): "mwo geureon tue."



Appendix: Multi-speaker conversation sample


Trained on the same dataset as the experiments above; each speaker in this sample has one hour of recordings.

  



Appendix: English samples (VCTK)


All models were trained on grapheme inputs.
Recording | Synthesized: FastSpeech1(MS), FastSpeech2(MS), GANSpeech
Sentence: "I didn't play well last year."
Sentence: "When the sunlight strikes raindrops in the air, they act as a prism and form a rainbow."
Sentence: "Nonetheless, the overall picture is healthy."
Sentence: "Their courage, and their honesty, should be respected."