
HIERARCHICAL AND MULTI-SCALE VARIATIONAL AUTOENCODER
FOR DIVERSE AND NATURAL SPEECH SYNTHESIS

Jae-Sung Bae, Jinhyeok Yang, Tae-Jun Bak, Young-Sun Joo, Hoon-Young Cho

jaesungbae@ncsoft.com


Abstract
    We propose a hierarchical and multi-scale variational autoencoder-based text-to-speech (HiMuV-TTS) model that first determines the global-scale prosody and then determines the local-scale prosody by conditioning on the global-scale prosody and the learned text representation. In addition, we improve speech quality by adopting adversarial training. Experimental results verify that the proposed HiMuV-TTS model generates more diverse and natural speech than TTS models with single-scale variational autoencoders.

Contents
    1. Audio Samples for Model Comparison
    2. Latent Representation of Each Scale of HiMuV-TTS Model
    3. Additional Examples on Sampling
Demo page of HiMuV-TTS


1. Audio Samples for Model Comparison

    The VAE-based TTS models (GVAE, LVAE, and HiMuV-TTS) can generate multiple renditions of the same text via sampling. Here, τ is a temperature that scales the prior standard deviation. With τ=0.0 the scaled standard deviation is zero, so the HiMuV-TTS (τ=0.0) model generates speech without sampling.
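    The effect of τ can be illustrated with a minimal sketch. It assumes a diagonal Gaussian prior; the function name `sample_with_temperature`, the 4-dimensional latent, and the zero-mean, unit-variance prior are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sample_with_temperature(mean, std, tau, rng):
    """Draw z = mean + tau * std * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mean.shape)
    return mean + tau * std * eps

rng = np.random.default_rng(0)

# Hypothetical 4-dimensional prior (illustrative values only).
prior_mean = np.zeros(4)
prior_std = np.ones(4)

# tau = 1.0: ordinary sampling from the prior.
z_sampled = sample_with_temperature(prior_mean, prior_std, tau=1.0, rng=rng)

# tau = 0.0: the noise term vanishes, so the latent equals the prior mean.
z_det = sample_with_temperature(prior_mean, prior_std, tau=0.0, rng=rng)
```

    Under these assumptions, values of τ between 0 and 1 would trade diversity for stability by shrinking the spread of the sampled latents around the prior mean.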
Columns: Text | Ground Truth | FastPitch | GANSpeech | GVAE | LVAE | HiMuV-TTS (Ours) | HiMuV-TTS (τ=0.0) (Ours)

Text: In another portion of the garden more clothing partly male and partly female was discovered.

Text: The lifting had been so complete in this case that there was no trace of the print on the rifle itself when it was examined by Latona.

Text: However for the first time in five years the relief rolls have declined instead of increased during the winter months.

Text: In reaching the conclusion that the shots came from the sixth floor southeast corner window of the depository building.

2. Latent Representation of Each Scale of HiMuV-TTS Model

    In the HiMuV-TTS-G model, the global-scale prosody embedding is sampled while the local-scale prosody embedding is fixed. Conversely, in the HiMuV-TTS-L model, the global-scale prosody embedding is fixed while the local-scale prosody embedding is sampled.
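    The two settings can be sketched as follows. This is a toy illustration under loud assumptions: "fixed" is interpreted here as using the prior mean with no noise, `local_prior` is a hypothetical stand-in for the learned conditional prior, and `GLOBAL_DIM` and the array shapes are arbitrary. None of these details are taken from the actual model.

```python
import numpy as np

GLOBAL_DIM = 8  # hypothetical size of the global-scale latent

def local_prior(global_z, text_hidden):
    """Toy stand-in for a learned prior conditioned on global prosody and text."""
    mean = text_hidden + global_z.mean()  # assumption: simple additive conditioning
    std = np.ones_like(text_hidden)
    return mean, std

def sample_prosody(text_hidden, rng, sample_global, sample_local):
    """Global latent first, then the local latent conditioned on it."""
    # Assumption: "fixed" means temperature 0, i.e. the prior mean is used.
    tau_g = 1.0 if sample_global else 0.0
    tau_l = 1.0 if sample_local else 0.0

    g_mean, g_std = np.zeros(GLOBAL_DIM), np.ones(GLOBAL_DIM)
    global_z = g_mean + tau_g * g_std * rng.standard_normal(GLOBAL_DIM)

    l_mean, l_std = local_prior(global_z, text_hidden)
    local_z = l_mean + tau_l * l_std * rng.standard_normal(l_mean.shape)
    return global_z, local_z

rng = np.random.default_rng(0)
text_hidden = np.zeros((5, 3))  # hypothetical: 5 text tokens, 3 features each

# HiMuV-TTS-G style: sample the global scale, no noise at the local scale.
g_G, l_G = sample_prosody(text_hidden, rng, sample_global=True, sample_local=False)

# HiMuV-TTS-L style: no noise at the global scale, sample the local scale.
g_L, l_L = sample_prosody(text_hidden, rng, sample_global=False, sample_local=True)
```

    Note that in this sketch the local mean still shifts with the sampled global latent in the G-style call, because the local prior is conditioned on it; whether the actual model holds the local embedding constant in that setting is not stated here.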
Columns: Text | HiMuV-TTS-G | HiMuV-TTS-L

Text: All the allowances of food passed through his hands; he had the control of the poor box for chance charities.

Text: They entered a stone cold room and were presently joined by the prisoner.

3. Additional Examples on Sampling

    The following speech samples were generated by sampling from the proposed HiMuV-TTS model and show diverse speaking styles.
Text: They entered a stone cold room and were presently joined by the prisoner.
Columns: Low | Middle | High
Rows: Average Pitch | Pitch Variance | Speaking Speed

Text: All the allowances of food passed through his hands; he had the control of the poor box for chance charities.
Columns: Low | Middle | High
Rows: Average Pitch | Pitch Variance | Speaking Speed