Avocodo: Generative Adversarial Network for Artifact-free Vocoder


Abstract

Neural vocoders based on the generative adversarial neural network (GAN) have been widely used due to their fast inference speed and lightweight networks while generating high-quality speech waveforms. Since the perceptually important speech components are primarily concentrated in the low-frequency bands, most GAN-based vocoders perform multi-scale analysis that evaluates downsampled speech waveforms. This multi-scale analysis helps the generator improve speech intelligibility. However, in preliminary experiments, we discovered that multi-scale analysis focused on the low-frequency bands causes unintended artifacts, e.g., aliasing and imaging artifacts, which degrade the quality of the synthesized speech waveform. Therefore, in this paper, we investigate the relationship between these artifacts and GAN-based vocoders and propose a GAN-based vocoder, called Avocodo, that allows the synthesis of high-fidelity speech with reduced artifacts. We introduce two kinds of discriminators to evaluate speech waveforms from various perspectives: a collaborative multi-band discriminator and a sub-band discriminator. We also utilize a pseudo quadrature mirror filter bank to obtain downsampled multi-band speech waveforms while avoiding aliasing. According to experimental results, Avocodo outperforms baseline GAN-based vocoders, both objectively and subjectively, while reproducing speech with fewer artifacts.
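The aliasing mentioned in the abstract can be illustrated with a small sketch. It is not the Avocodo implementation and it does not build a pseudo quadrature mirror filter bank; it only shows, under assumed values (a 16 kHz sample rate, a 6 kHz test tone, and a hypothetical 101-tap Hamming-windowed sinc low-pass filter), that dropping every other sample folds the 6 kHz tone down to 2 kHz, while low-pass filtering before downsampling suppresses that folded component.

```python
import math

def tone(freq_hz, sample_rate, num_samples):
    """Unit-amplitude sine wave as a list of floats."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(num_samples)]

def bin_magnitude(signal, freq_hz, sample_rate):
    """Amplitude estimate of one frequency component (single-bin DFT)."""
    step = 2 * math.pi * freq_hz / sample_rate
    re = sum(x * math.cos(step * n) for n, x in enumerate(signal))
    im = sum(x * math.sin(step * n) for n, x in enumerate(signal))
    return 2 * math.hypot(re, im) / len(signal)

def lowpass_taps(cutoff_hz, sample_rate, num_taps=101):
    """Hamming-windowed sinc low-pass FIR, normalized to unit DC gain."""
    fc = cutoff_hz / sample_rate  # cutoff in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        k = n - mid
        ideal = 2 * fc if k == 0 else math.sin(2 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [t / gain for t in taps]

def fir_filter(signal, taps):
    """Convolution restricted to fully overlapping positions ('valid' mode)."""
    m = len(taps)
    return [sum(taps[j] * signal[i + m - 1 - j] for j in range(m))
            for i in range(len(signal) - m + 1)]

SAMPLE_RATE = 16000                    # assumed for illustration
x = tone(6000, SAMPLE_RATE, 1600)      # 6 kHz lies above the new Nyquist (4 kHz)

# Naive downsampling by 2: keep every other sample, no filtering.
naive = x[::2]

# Anti-aliased downsampling: low-pass below the new Nyquist first.
filtered = fir_filter(x, lowpass_taps(3000, SAMPLE_RATE))[::2]

# After downsampling, the 6 kHz tone folds to 8000 - 6000 = 2000 Hz.
naive_alias = bin_magnitude(naive, 2000, SAMPLE_RATE // 2)
filtered_alias = bin_magnitude(filtered, 2000, SAMPLE_RATE // 2)

print(f"alias amplitude without filtering: {naive_alias:.4f}")
print(f"alias amplitude with low-pass:     {filtered_alias:.4f}")
```

Without filtering, the folded tone keeps essentially its full amplitude, so a discriminator looking at this downsampled signal would see energy at 2 kHz that was never present in that band of the original waveform.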


Structure

Demo page of Avocodo

1. Single Speaker Synthesis (LJ Speech Dataset)

RED: Artifacts due to inaccurate harmonic components
BLUE: High-frequency artifacts caused by imaging

Ground Truth HiFi-GAN V1 VocGAN StyleMelGAN Avocodo V1
and see for himself how a revolutionary society operates, a Marxist society.
Solomons was now also admitted as a witness, and his evidence, with that of Moss, secured the transportation of the principal actors in the theft.
The demands on the President in the execution of His responsibilities in today's world are so varied and complex.
When before have you found them really at your side in your fights for progress?

2. Unseen Speaker Synthesis


VCTK dataset

Ground Truth HiFi-GAN V1 VocGAN StyleMelGAN V1 Avocodo V1

Internal Korean Datasets

Ground Truth HiFi-GAN V1 VocGAN StyleMelGAN Avocodo V1
1. 러시아의 옛 수도원과 사원에서는 러시아 최초의 연대기들이 발견되는데.
=> reosiaui yet sudowongwa sawoneseoneun reosia choechoui yeondaegideuri balgyeondoeneunde.
2. 시몬 경을 어서 안전한 곳으로 옮겨야겠다. 상황이 일단락 되면 연락할테니 기다리도록 해.
=> simon gyeongeul eoseo anjeonhan goseuro olmgyeoyagessda. sanghwangi ildanrak doemyeon yeonrakhalteni gidaridorok hae.
3. 신애가 앞으로 의사가 될지 교수님이 될지, 화가가 될지 아직은 아무도 모르지.
=> sinaega apeuro uisaga doelji gyosunimi doelji, hwagaga doelji ajigeun amudo moreuji.
4. 낯선 환경이 얼마나 로맨틱한 감정을 불러일으키는지 확인해보자구요.
=> naccseon hwangyeongi eolmana romaentikhan gamjeongeul bulleoireukineunji hwaginhaebojaguyo.

3. Discriminator-wise Comparison (LJ Speech dataset)


Ground Truth MSD MPD CoMBD (proposed) SBD (proposed)

4. Analysis of artifacts - aliasing: Singing voice synthesis (Internal Korean Dataset)


Ground Truth HiFi-GAN V1 Avocodo V1
Vocal Scale

5. Analysis of artifacts - upsampling artifacts: Expressive speech synthesis (Internal Korean Dataset)

We trained each vocoder on expressive speech datasets for 800k iterations. The samples below were excluded from training.
In expressive speech, the perceptual quality degradation caused by upsampling artifacts, especially high-frequency noise, is noticeable.
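As a rough illustration of where such high-frequency noise can come from, the sketch below upsamples a tone by inserting zeros between samples. Zero insertion is used here only as a simplified stand-in for a learned upsampling layer, and the sample rates, tone frequency, and three-tap linear-interpolation filter are all assumptions chosen for the example. The zero-stuffed signal contains a mirror image of the 1 kHz tone at 7 kHz with the same amplitude; the interpolation filter reduces the image but does not remove it.

```python
import math

def tone(freq_hz, sample_rate, num_samples):
    """Unit-amplitude sine wave as a list of floats."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(num_samples)]

def bin_magnitude(signal, freq_hz, sample_rate):
    """Amplitude estimate of one frequency component (single-bin DFT)."""
    step = 2 * math.pi * freq_hz / sample_rate
    re = sum(x * math.cos(step * n) for n, x in enumerate(signal))
    im = sum(x * math.sin(step * n) for n, x in enumerate(signal))
    return 2 * math.hypot(re, im) / len(signal)

def zero_stuff(signal, factor):
    """Upsample by inserting factor - 1 zeros after every sample."""
    out = [0.0] * (len(signal) * factor)
    out[::factor] = signal
    return out

def fir_filter(signal, taps):
    """Convolution restricted to fully overlapping positions ('valid' mode)."""
    m = len(taps)
    return [sum(taps[j] * signal[i + m - 1 - j] for j in range(m))
            for i in range(len(signal) - m + 1)]

LOW_RATE = 8000                  # assumed input rate
HIGH_RATE = 2 * LOW_RATE         # rate after 2x upsampling

x = tone(1000, LOW_RATE, 800)
up = zero_stuff(x, 2)

# The 1 kHz tone is mirrored around the old Nyquist: 8000 - 1000 = 7000 Hz.
base_before = bin_magnitude(up, 1000, HIGH_RATE)
image_before = bin_magnitude(up, 7000, HIGH_RATE)

# Linear interpolation expressed as a 3-tap FIR filter on the zero-stuffed signal.
smoothed = fir_filter(up, [0.5, 1.0, 0.5])
base_after = bin_magnitude(smoothed, 1000, HIGH_RATE)
image_after = bin_magnitude(smoothed, 7000, HIGH_RATE)

print(f"zero-stuffed: 1 kHz = {base_before:.3f}, 7 kHz image = {image_before:.3f}")
print(f"interpolated: 1 kHz = {base_after:.3f}, 7 kHz image = {image_after:.3f}")
```

The residual image after the short filter is small but nonzero, which matches the general picture of upsampling leaving traces in the upper part of the spectrum when the following filter is imperfect.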

Ground Truth HiFi-GAN V1 Avocodo V1
1. 내게 아무런 의미가 없어.
=> naege amureon uimiga eopseo.
2. 흥! 늦었어. 이젠 날 막을 순 없어!
=> heung! neujeosseo. ijen nal mageul sun eopseo!
3. 후... 엉망진창이로군!
=> hu... eongmangjinchangirogun!
4. 절대 기대하지 말라고. 그런 건 아주 오래전에 사라졌으니.
=> jeoldae gidaehaji mallago. geureon geon aju oraejeone sarajyeosseuni.
5. 천명에만 달린 것이 아닌 것을.
=> cheonmyeongeman dallin geosi anin geoseul.