WaveGlowGAN: the bipartite flow based vocoder with generative adversarial networks for high quality speech synthesis

Authors

Gyeong-Hoon Lee, Junmo Lee, Young-Ik Kim, Hoon-Young Cho (NCSOFT)

Abstract

In this study, we propose an effective learning method to improve the quality of speech synthesized by WaveGlow. WaveGlow is a recently proposed neural vocoder for parallel speech generation based on a bipartite flow. Because the transformation in WaveGlow is invertible, the network (a) is trained in the forward direction to maximize the log-likelihood of real audio data and (b) generates speech in the backward direction from a sample drawn from a simple known distribution. WaveGlow generates high-quality speech from a mel-spectrogram faster than autoregressive neural vocoders. However, even when the log-likelihood loss has converged sufficiently, the quality of speech generated by WaveGlow is not significantly better than that of other generative models trained with a regression or adversarial loss, which directly compares the output sample with real data. We therefore propose a novel learning method for WaveGlow that combines the log-likelihood loss with an adversarial loss. In our method, we exploit the invertible mappings of WaveGlow to train the network alternately in the forward and backward directions. In the forward direction, as in the conventional method, we train the network to maximize the log-likelihood of the training dataset. In the backward direction, we train the network to minimize the loss between the spectrograms of the synthesized and real audio samples, using the $L_1$ loss and an adversarial loss. Applying this reconstruction loss to a generative model with a bipartite flow significantly reduces the distortion and noise of the generated speech compared with conventional learning alone. The experimental results show that our method is an effective way to train the WaveGlow network.
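The alternating objective described in the abstract can be illustrated with a toy, single-layer bipartite (affine coupling) flow. This is a minimal NumPy sketch, not the authors' implementation: the conditioning network `_net`, the scalar parameters `w` and `b`, the frame sizes, the least-squares form of the adversarial term, and the `discriminator` callable are all assumptions made for illustration, and the mel-spectrogram conditioning used by the real vocoder is omitted.

```python
import numpy as np

def _net(xa, w, b):
    # Hypothetical stand-in for the network that predicts log-scale and shift.
    return np.tanh(w * xa), b * xa

def coupling_forward(x, w, b):
    """Forward direction: audio x -> latent z, plus log|det Jacobian|."""
    xa, xb = np.split(x, 2)               # bipartite split of the signal
    log_s, t = _net(xa, w, b)
    zb = xb * np.exp(log_s) + t           # affine transform of the second half
    return np.concatenate([xa, zb]), log_s.sum()

def coupling_inverse(z, w, b):
    """Backward direction: latent z -> audio x, the exact inverse of the above."""
    za, zb = np.split(z, 2)
    log_s, t = _net(za, w, b)
    return np.concatenate([za, (zb - t) * np.exp(-log_s)])

def forward_loss(x, w, b, sigma=1.0):
    """Negative log-likelihood under a Gaussian prior, up to a constant."""
    z, log_det = coupling_forward(x, w, b)
    return float(z @ z / (2.0 * sigma ** 2) - log_det)

def magnitude_spectrogram(x, n_fft=8, hop=4):
    """Magnitude STFT with a Hann window (frame sizes are toy values)."""
    starts = range(0, len(x) - n_fft + 1, hop)
    frames = np.stack([x[i:i + n_fft] for i in starts])
    return np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=-1))

def backward_loss(x_real, z, w, b, discriminator, adv_weight=1.0):
    """L1 spectrogram loss plus an (assumed least-squares) adversarial term."""
    x_fake = coupling_inverse(z, w, b)
    l1 = np.mean(np.abs(magnitude_spectrogram(x_fake)
                        - magnitude_spectrogram(x_real)))
    adv = np.mean((discriminator(x_fake) - 1.0) ** 2)
    return float(l1 + adv_weight * adv)
```

In a training loop, one would alternate between a gradient step on `forward_loss` and a gradient step on `backward_loss`; automatic differentiation and the discriminator update are left out of this sketch.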

LJ Speech dataset with NCTTS

Sentences

"He could indulge in snuff if a snuff-taker."
"who had fraudulent warrants out of their own to the extent of one hundred fifty thousand pounds, suspended payment and absconded."
"Here, however, the evidence was strong and sufficient."
"In the fall of that year John Pic and Robert Oswald went to a military academy."
"This arrangement would provide a continuing high-level contact for agencies that may wish to consult respecting particular protective measures."

Ground truth audio samples

Audio samples from true mel-spectrograms

WaveGlow-F
WaveGlow-FB
WaveGlow-GAN

Audio samples from mel-spectrograms with NCTTS

WaveGlow-F
WaveGlow-FB
WaveGlow-GAN

Korean Speech dataset with NCTTS

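Sentences

"지난 시월 이십이일 발생한 경기지역 정모씨의 살인사건의 수법이 이십육년전 경기남부 연쇄살인사건과 흡사하다는 판단을 내린 수사팀은 주변 씨씨티브이 영상과 지문, 혈흔 증거 등으로 유력한 용의자를 발견하였습니다." (The investigation team, judging that the method used in the murder of a Mr. Jeong in the Gyeonggi area on October 22 closely resembles that of the serial murders in southern Gyeonggi 26 years ago, identified a strong suspect through nearby CCTV footage, fingerprints, bloodstain evidence, and other material.)
"내가 대명동 방을 뺄 때 가지고 있던 전셋돈도 그와의 생활비로 이미 반 이상이 없어진 뒤였으니까요." (By the time I moved out of my room in Daemyeong-dong, more than half of the lease deposit I had was already gone, spent on living expenses with him.)
"최 회장은 당시 그룹 내 주식 분산과 관련해 세무조사를 받고 있었으나, 이석희 당시 국세청 차장의 요청으로 대선자금 오억원을 건네주자, 세무조사가 보류됐다고 밝혔습니다." (Chairman Choi said that he was under a tax audit at the time in connection with the distribution of shares within the group, but that the audit was put on hold after he handed over 500 million won in presidential campaign funds at the request of Lee Seok-hee, then deputy commissioner of the National Tax Service.)
"초등학교 어린이들이 생각하는 왕따의 원인과 극복 비결입니다." (These are the causes of bullying and the keys to overcoming it, as elementary school children see them.)
"따라서, 이 범위 내에서 중요시 되는 문제들은 내가 엄선한 예상 문제의 범주를 결코 벗어날 수 없다." (Therefore, the issues regarded as important within this scope can never fall outside the range of expected questions that I have carefully selected.)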

Ground truth audio samples

Ground Truth
Female
Male

Audio samples from true mel-spectrograms

WaveGlow-F
Female
Male
WaveGlow-FB
Female
Male
WaveGlow-GAN
Female
Male

Audio samples from mel-spectrograms with NCTTS

WaveGlow-F
Female
Male
WaveGlow-FB
Female
Male
WaveGlow-GAN
Female
Male