Demo Page, VocGAN-PS & Improved FastPitch

Enhancement of Pitch Controllability using Timbre-Preserving Pitch Augmentation in FastPitch

Hanbin Bae, Young-Sun Joo

bhb0722@ncsoft.com

Abstract

      FastPitch, a recently developed pitch-controllable text-to-speech (TTS) model, is conditioned on pitch contours. However, the quality of its synthesized speech degrades considerably for pitch values that deviate significantly from the average pitch; that is, its ability to control vocal pitch is limited. To address this issue, we propose two algorithms to improve the robustness of FastPitch. First, we propose a novel timbre-preserving pitch-shifting algorithm for natural pitch augmentation. Pitch-shifted speech samples sound more natural with the proposed algorithm because the speaker's vocal timbre is maintained. Second, we propose a learning algorithm that trains FastPitch using pitch-augmented speech datasets with different pitch ranges for the same sentence. The experimental results demonstrate that the proposed algorithms improve the pitch controllability of FastPitch.

Proposed VocGAN-based Pitch-Shifting Algorithms

Improved Pitch-Controllable TTS Models with Pitch-Augmented Datasets

FastPitch for NCFemale (KOR)
FastSpeech2 for LJSpeech (ENG)

A. Proposed VocGAN-based PS (VocGAN-PS) Algorithm

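We express the pitch adjustment value α in semitones, such that $\mathbf{\alpha}=12\cdot \log_{2}(\mathbf{f^{\alpha}_{0}}/\mathbf{f_0})$, where $\mathbf{f_0}$ is the original fundamental frequency and $\mathbf{f^{\alpha}_{0}}$ is the pitch-shifted one.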

The table below shows the ratio between the pitch-shifted fundamental frequency and the original fundamental frequency.

α	-6	-4	-2	0	2	4	6
$r_{\alpha}=(\mathbf{f^{\alpha}_0}/\mathbf{f_0})\times 100$	71%	79%	89%	100%	112%	126%	141%
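
The ratio row follows directly from inverting the semitone definition, which gives a frequency ratio of 2^(α/12). The short sketch below recomputes that ratio for each α; the function names are illustrative and are not taken from the paper's code.

```python
def semitones_to_ratio(alpha: float) -> float:
    """Return f0_shifted / f0 for a pitch shift of `alpha` semitones."""
    # Inverting alpha = 12 * log2(f0_shifted / f0) gives 2 ** (alpha / 12).
    return 2.0 ** (alpha / 12.0)


def ratio_percent(alpha: float) -> int:
    """Return the frequency ratio as a percentage, rounded to an integer."""
    return round(100.0 * semitones_to_ratio(alpha))


if __name__ == "__main__":
    for alpha in (-6, -4, -2, 0, 2, 4, 6):
        print(f"alpha = {alpha:+d} -> {ratio_percent(alpha)}%")
```

A shift of 12 semitones doubles the fundamental frequency, so ±6 semitones corresponds to half an octave in either direction.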

Figure 1 Schematic of the VocGAN-PS algorithm and spectral envelopes of the three types of speech waveforms.

	Sample 1 Output samples by the Source and Filter gates.
	Input Speech	Source Gate	(Vocal Tract) Filter Gate *Warning
Korean (NCFemale)
English (LJSpeech)

	Sample 2 VocGAN-PS samples / Timbre-preserving or not
α	-3	-2	-1	Input (KOR)	+1	+2	+3
Sampling Rate Conversion
VocGAN-PS

α	-3	-2	-1	Input (ENG)	+1	+2	+3
Sampling Rate Conversion
VocGAN-PS
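
The page does not say how the "Sampling Rate Conversion" rows were produced. As a rough illustration only, the sketch below shifts pitch by reading the waveform at a step of 2^(α/12) samples with linear interpolation and keeping the original playback rate. This scales every frequency by the same factor, so the formants move together with the pitch, which is why this kind of shift does not preserve timbre. The function names and the choice of linear interpolation are assumptions.

```python
def resample_linear(x, ratio):
    """Read `x` at a step of `ratio` samples using linear interpolation.

    Played back at the original sampling rate, the result has every
    frequency scaled by `ratio` and its duration scaled by 1 / ratio.
    """
    n_out = int((len(x) - 1) / ratio) + 1
    y = []
    for n in range(n_out):
        pos = n * ratio              # fractional read position in x
        i = int(pos)                 # left neighbour index
        frac = pos - i               # distance from the left neighbour
        nxt = x[i + 1] if i + 1 < len(x) else x[i]
        y.append((1.0 - frac) * x[i] + frac * nxt)
    return y


def pitch_shift_by_resampling(x, alpha):
    """Shift pitch by `alpha` semitones; formants shift by the same factor."""
    return resample_linear(x, 2.0 ** (alpha / 12.0))
```

Note that the output is shorter or longer than the input by a factor of 1/ratio; restoring the original duration would need a separate time-scale modification step, which is not shown here.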

	Sample 3 Timbre-Preserving PS Algorithms: (1) TD-PSOLA-PS, (2) WORLD-PS, and (3) VocGAN-PS (Proposed)
α	-3	-2	-1	Input (KOR)	+1	+2	+3
(1) TD-PSOLA-PS
(2) WORLD-PS
(3) VocGAN-PS

B. Improved Pitch-Controllable TTS Model with Pitch-Augmented Datasets

To compare the pitch controllability of the two models, we used the same pitch and duration from the test set to generate the samples for this demo.

(* In our paper, we used only the predicted pitch and duration for all experiments.)
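
To produce the α-shifted samples, the pitch contour given to each model has to be moved by α semitones. A minimal sketch of that step follows, assuming the contour is a list of per-frame F0 values in Hz with unvoiced frames marked as 0; this representation is an assumption, and the actual models may use a normalized pitch scale instead.

```python
def shift_f0_contour(f0_hz, alpha):
    """Shift a per-frame F0 contour (Hz) by `alpha` semitones.

    Frames with F0 <= 0 are treated as unvoiced and left at 0.
    """
    ratio = 2.0 ** (alpha / 12.0)
    return [f * ratio if f > 0.0 else 0.0 for f in f0_hz]
```

Durations are left untouched, so the same phoneme alignment can be reused for both models at every value of α.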

B.1. FastPitch for NCFemale (KOR)

Figure 2 Results of performance evaluations for each model.

[For α=±4]

Text shown in red marks mispronunciations by the baseline model.

Sample #1

정신과의사

도움을 받아 원인을 알아봤더니,

부모가 싸우는 날

밤엔

꼭 천식증상이 나타났다.

jeongsingwa uisa

doum-eul bad-a won-in-eul al-a bwassdeoni,

bumoga ssauneun nal

bam-en

kkog cheonsig jeungsang-i natanassda.

Sample #2

물론 옛날에

나였으면

붙잡았을 거 같은데, 지금은 아예

연락을 안 하지

않을까?

mullon yesnal-e

na yeoss-eumyeon

but jab-ass-eul geo gat-eunde, jigeum-eun aye

yeonlag-eul anhaji

anh-eulkka?

Sample #3

근데 약간 그렇게 키가 비슷한 친구를

딱

이렇게 막상 마주하니까,

친근한

느낌이 들었어

설렘보다는.

geunde yaggan geuleohge kiga biseushan chinguleul

ttag

ileohge magsang maju hanikka,

chingeunhan

neukkim-i deul-eoss-eo

seollembodaneun.

Sample #4

얘한테는 그런 아쉬움이 없어서, 나랑 엄마는 다행인데

모르겠어.

계속 연락오면

또 흔들릴 수

있잖아.

yae hanteneun geuleon aswium-i eobs-eoseo, nalang eommaneun dahaeng-inde

moleugess-eo.

gyesog yeonlag omyeon

tto heundeullil su

issjanh-a.

Sample #5

집 오는 길 골목에

도는데

거기서 강아지 나와가지고

깜짝

놀라서 도망가버렸어.

jib oneun gil golmog-e

doneunde

geogiseo gang-ajiga nawagajigo

kkamjjag

nollaseo domang-ga beolyeo sseo.

Audio Samples for α = ±4

Sample	1	2	3	4	5
α	-4	-4	+4	+4	+4
Baseline Model (w/o Pitch Augmentation)
Proposed Model (w/ Pitch Augmentation)

Audio Samples for α ∈ {-2, 0, +2}

Sample	1	2	3	4	5
α	+2	+2	+2	+2	+2
Baseline Model (w/o Pitch Augmentation)
Proposed Model (w/ Pitch Augmentation)
α	0	0	0	0	0
Baseline Model (w/o Pitch Augmentation)
Proposed Model (w/ Pitch Augmentation)
α	-2	-2	-2	-2	-2
Baseline Model (w/o Pitch Augmentation)
Proposed Model (w/ Pitch Augmentation)