HIERARCHICAL AND MULTI-SCALE VARIATIONAL AUTOENCODER FOR DIVERSE AND NATURAL SPEECH SYNTHESIS
Jae-Sung Bae, Jinhyeok Yang, Tae-Jun Bak, Young-Sun Joo, Hoon-Young Cho
jaesungbae@ncsoft.com
Abstract
We propose a hierarchical and multi-scale variational autoencoder-based text-to-speech (HiMuV-TTS) model
that first determines the global-scale prosody and then determines the local-scale prosody via conditioning on the global-scale
prosody and the learned text representation. In addition, we improve speech
quality by adopting adversarial training. Experimental results verify that the proposed HiMuV-TTS
model can generate more diverse and natural speech than TTS models with single-scale variational autoencoders.
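
The two-stage order described above (global-scale prosody first, then local-scale prosody conditioned on it and on the text representation) can be illustrated with a minimal sampling sketch. This is not the HiMuV-TTS implementation. The function names, the linear projections `w_text` and `w_global`, the standard normal prior on the global latent, and the unit-variance Gaussian on the local latents are all assumptions made for illustration only.

```python
import random


def matvec(vec, mat):
    """Multiply a row vector of length n by an n x m matrix given as nested lists."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def sample_hierarchical_prosody(text_repr, w_text, w_global, rng):
    """Sample one global latent, then one local latent per text token.

    All names and distributions here are hypothetical placeholders:
    text_repr : list of token vectors (the learned text representation)
    w_text    : projection from token vectors to the local latent space
    w_global  : projection from the global latent to the local latent space
    """
    # Stage 1: global-scale prosody, a single vector for the whole utterance
    # (standard normal prior assumed).
    z_global = [rng.gauss(0.0, 1.0) for _ in range(len(w_global))]
    global_shift = matvec(z_global, w_global)

    # Stage 2: local-scale prosody, one vector per token. The mean of each
    # local latent depends on both the token vector and the global latent.
    z_local = []
    for token_vec in text_repr:
        mean = [t + g for t, g in zip(matvec(token_vec, w_text), global_shift)]
        z_local.append([m + rng.gauss(0.0, 1.0) for m in mean])
    return z_global, z_local


def random_matrix(rows, cols, rng):
    """Build a rows x cols matrix of standard normal draws (toy stand-in for learned weights)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
text_repr = random_matrix(5, 8, rng)   # 5 tokens, 8-dim text representation
w_text = random_matrix(8, 4, rng)      # text -> 4-dim local latent space
w_global = random_matrix(2, 4, rng)    # 2-dim global latent -> local latent space
z_global, z_local = sample_hierarchical_prosody(text_repr, w_text, w_global, rng)
print(len(z_global), len(z_local), len(z_local[0]))
```

Because every local latent shares the same `z_global` term, resampling only the global latent shifts the prosody of the whole utterance, while resampling only the local noise varies individual tokens. A single-scale model has just one of these two levels.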