
Enhancement of Pitch Controllability using Timbre-Preserving Pitch Augmentation in FastPitch

Hanbin Bae, Young-Sun Joo

bhb0722@ncsoft.com


Abstract


Demo Page, VocGAN-PS & Improved FastPitch





A. Proposed VocGAN-based PS (VocGAN-PS) Algorithm



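We specify the pitch adjustment value α in semitones, such that $F_0' = 2^{\alpha/12} \cdot F_0$, where $F_0$ is the original fundamental frequency and $F_0'$ is the pitch-shifted one.
The table below shows the ratio between the pitch-shifted fundamental frequency and the original for each α.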

α (semitones) -6 -4 -2 0 +2 +4 +6
F0 ratio 70% 79% 89% 100% 112% 126% 141%
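
The percentages in the table can be checked against the usual equal-temperament relation, in which a shift of α semitones scales frequency by 2^(α/12). The sketch below assumes that relation; the function name `semitone_to_ratio` is introduced here for illustration only and is not taken from the authors' code.

```python
def semitone_to_ratio(alpha: float) -> float:
    """Frequency ratio for a shift of `alpha` semitones (equal temperament)."""
    return 2.0 ** (alpha / 12.0)


# Percentages as listed in the table above.
table = {-6: 70, -4: 79, -2: 89, 0: 100, 2: 112, 4: 126, 6: 141}

for alpha, listed in table.items():
    computed = 100.0 * semitone_to_ratio(alpha)
    print(f"alpha={alpha:+d}  computed={computed:6.2f}%  listed={listed}%")
```

Under this assumption, every computed value agrees with the listed percentage to within one percentage point.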


Figure 1 Schematic of the VocGAN-PS algorithm and spectral envelopes of the three types of speech waveforms.

Sample 1 Output samples by the Source and Filter gates.
Input Speech Source Gate (Vocal Tract) Filter Gate
*Warning
Korean (NCFemale)
English (LJSpeech)

Sample 2 VocGAN-PS samples / Timbre-preserving or not
α -3 -2 -1 Input (KOR) +1 +2 +3
Sampling Rate Conversion
VocGAN-PS
α -3 -2 -1 Input (ENG) +1 +2 +3
Sampling Rate Conversion
VocGAN-PS
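
For context on the comparison in Sample 2: pitch shifting by sampling rate conversion reads the waveform faster or slower, so every frequency component (formants as well as F0) is scaled by the same ratio and the duration changes too, which is why it does not preserve timbre. This page does not say how the sampling-rate-conversion samples were produced; the following is only a minimal sketch of the general idea using linear interpolation in NumPy, with hypothetical function names and a synthetic 220 Hz tone standing in for speech.

```python
import numpy as np


def shift_by_resampling(wave: np.ndarray, alpha: float) -> np.ndarray:
    """Naive pitch shift: read the waveform 2**(alpha/12) times faster.

    Played back at the original sampling rate, all frequencies are scaled
    by the ratio and the duration is divided by it (assumed method, not
    the authors' implementation).
    """
    ratio = 2.0 ** (alpha / 12.0)
    n_out = int(round(len(wave) / ratio))
    src_pos = np.arange(n_out) * ratio  # fractional read positions
    return np.interp(src_pos, np.arange(len(wave)), wave)


def dominant_freq(wave: np.ndarray, sr: int) -> float:
    """Frequency (Hz) of the largest peak in the magnitude spectrum."""
    spec = np.abs(np.fft.rfft(wave * np.hanning(len(wave))))
    return float(np.fft.rfftfreq(len(wave), d=1.0 / sr)[np.argmax(spec)])


sr = 22050
t = np.arange(sr) / sr                    # one second of audio
tone = np.sin(2.0 * np.pi * 220.0 * t)    # synthetic 220 Hz test tone
shifted = shift_by_resampling(tone, alpha=3)

print(dominant_freq(tone, sr), dominant_freq(shifted, sr), len(shifted) / sr)
```

The shifted tone peaks near 220 Hz × 2^(3/12) and is shorter than the input; on real speech the same scaling would also move the formants.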

Sample 3 Timbre-Preserving PS Algorithms:
(1) TD-PSOLA-PS, (2) WORLD-PS, and (3) VocGAN-PS (Proposed)
α -3 -2 -1 Input (KOR) +1 +2 +3
(1) TD-PSOLA-PS
(2) WORLD-PS
(3) VocGAN-PS







B. Improved Pitch-Controllable TTS Model with Pitch-Augmented Datasets


To compare the pitch controllability of the two models, we generated the samples for this demo using the same pitch and duration values taken from the test set.
(* In our paper, all experiments used only predicted values.)
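
As an illustration of what an α-semitone adjustment means for a pitch contour, the sketch below scales the voiced frames of an F0 track by 2^(α/12) and leaves unvoiced frames untouched. It assumes the contour is in Hz with unvoiced frames marked as 0; the function name and the toy contour are hypothetical, and this page does not describe how the models actually consume the shifted pitch (for example, whether it is normalized first).

```python
import numpy as np


def shift_f0_contour(f0_hz, alpha: float) -> np.ndarray:
    """Shift voiced frames of an F0 contour by `alpha` semitones.

    Assumption: F0 is in Hz and unvoiced frames are stored as 0.
    """
    f0 = np.asarray(f0_hz, dtype=float)
    voiced = f0 > 0
    shifted = f0.copy()
    shifted[voiced] *= 2.0 ** (alpha / 12.0)
    return shifted


# Toy contour: frames 0 and 3 are unvoiced.
contour = np.array([0.0, 200.0, 210.0, 0.0, 190.0])
up4 = shift_f0_contour(contour, +4)
down4 = shift_f0_contour(contour, -4)
print(up4)
print(down4)
```

Working in semitones keeps upward and downward shifts symmetric: applying +4 and then -4 returns the original contour.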



B.1. FastPitch for NCFemale (KOR)



Figure 2 Results of performance evaluations for each model.


[For α = ±4]
Red text indicates words mispronounced by the baseline model.

Sample #1

정신과의사
도움을 받아 원인을 알아봤더니,
부모가 싸우는 날
밤엔
꼭 천식증상이 나타났다.

jeongsingwa uisa
doum-eul bad-a won-in-eul al-a bwassdeoni,
bumoga ssauneun nal
bam-en
kkog cheonsig jeungsang-i natanabnida.

English (approximate): With a psychiatrist's help, the cause was looked into, and it turned out that the asthma symptoms always appeared on nights when the parents fought.

Sample #2

물론 옛날에
나였으면
붙잡았을 거 같은데, 지금은 아예
연락을 안 하지
않을까?

mullon yesnal-e
na yeoss-eumyeon
but jab-ass-eul geo gat-eunde, jigeum-eun aye
yeonlag-eul anhaji
anh-eulkka?

English (approximate): Of course, the old me would probably have tried to hold on, but now I probably wouldn't get in touch at all, right?

Sample #3

근데 약간 그렇게 키가 비슷한 친구를
이렇게 막상 마주하니까,
친근한
느낌이 들었어
설렘보다는.

geunde yaggan geuleohge kiga deo chinguleul
ttag
ileohge magsang maju hanikka,
chingeunhan
neukkim-i deul-eoss-eo
seollembodaneun.

English (approximate): But when I actually came face to face with a friend of about the same height, it felt familiar rather than thrilling.

Sample #4

얘한테는 그런 아쉬움이 없어서, 나랑 엄마는 다행인데
모르겠어.
계속 연락오면
또 흔들릴 수
있잖아.

yae hanteneun geuleon aswium-i eobs-eoseo, nalang eommaneun dahaeng-inde
moleugess-eo.
gyesog yeonlag omyeon
tto heundeullil su
issjanh-a.

English (approximate): There are no such lingering regrets toward this one, which is a relief for Mom and me, but I don't know. If the contact keeps coming, things could waver again.

Sample #5

집 오는 길 골목에
도는데
거기서 강아지 나와가지고
깜짝
놀라서 도망가버렸어.

jib oneun gil golmog-e
doneunde
geogiseo gang-ajiga nawagajigo
kkamjjag
nollaseo domang-ga beolyeo sseo.

English (approximate): I was turning into the alley on the way home when a puppy came out, and I got so startled that I ran away.



Audio Samples for α = ±4

Sample 1 2 3 4 5
α -4 -4 +4 +4 +4
Baseline Model (w/o Pitch Augmentation)
Augmented Model (Proposed)


Audio Samples for α ∈ {-2, 0, +2}

Sample 1 2 3 4 5
α +2 +2 +2 +2 +2
Baseline Model (w/o Pitch Augmentation)
Augmented Model (Proposed)
α 0 0 0 0 0
Baseline Model (w/o Pitch Augmentation)
Augmented Model (Proposed)
α -2 -2 -2 -2 -2
Baseline Model (w/o Pitch Augmentation)
Augmented Model (Proposed)


B.2. FastSpeech2 for LJSpeech (ENG)



For listeners who do not speak Korean, we also trained FastSpeech2, another pitch-controllable TTS model, on the English female speaker dataset (LJSpeech).
(* We referred to ming024's FastSpeech2 implementation.)



[For α = ±6]
Red text indicates words mispronounced by the baseline model.

Sample #1
Printing, in the only
sense with which we are at present concerned,
differs from most if not from all
the arts
and crafts represented in the Exhibition.

Sample #2
And though more
Roman than that, yet
scarcely more like the complete Roman
type of the earliest printers of Rome.

Sample #3
The Roman letter
was used side by side with the Gothic.

Sample #4
And things got worse and worse through the whole of the
seventeenth century, so that in the eighteenth
printing
was very miserably performed.

Sample #5
There is a
grossness in the
upper
finishings of letters like the c, the a, and so on.



Audio Samples for α = ±6

Sample 1 2 3 4 5
α -6 +6 +6 -6 -6
Baseline Model (w/o Pitch Augmentation)
Augmented Model (Proposed)


Audio Samples for α ∈ {-3, 0, +3}

Sample 1 2 3 4 5
α +3 +3 +3 +3 +3
Baseline Model (w/o Pitch Augmentation)
Augmented Model (Proposed)
α 0 0 0 0 0
Baseline Model (w/o Pitch Augmentation)
Augmented Model (Proposed)
α -3 -3 -3 -3 -3
Baseline Model (w/o Pitch Augmentation)
Augmented Model (Proposed)