
FastPitchFormant: Source-filter based Decomposed Modeling
for Speech Synthesis

Taejun Bak, Jae-Sung Bae, Hanbin Bae, Young-Ik Kim, Hoon-Young Cho

happyjun@ncsoft.com


Abstract
    Methods for modeling and controlling prosody with acoustic features have been proposed for neural
text-to-speech (TTS) models. Prosodic speech can be generated by conditioning on acoustic features.
However, synthesized speech with a large pitch-shift scale suffers from audio quality degradation
and deformation of speaker characteristics. To address this problem, we propose a feed-forward
Transformer-based TTS model designed according to the source-filter theory. This model, called FastPitchFormant,
has a unique structure that handles text and acoustic features in parallel. By modeling each feature
separately, the model's tendency to learn the relationship between the two features can be mitigated.
Owing to its structural characteristics, FastPitchFormant is robust and accurate for pitch control and
generates prosodic speech while preserving speaker characteristics. The experimental results show that the
proposed model outperforms the baseline FastPitch.

Contents
  A. Decomposition (Korean)
  B. Audio Samples - pitch-shift
    B.1. Korean Female Speaker
    B.2. Korean Male Speaker
    B.3. LJSpeech (English)
Demo page of FastPitchFormant

A. Decomposition (Korean)


Figure 1. Mel-spectrograms of (a) the excitation representation, (b) the formant representation, and (c) the final output. (a) and (b) were generated by passing the excitation and formant representations through the spectrogram decoder individually.

Sentence: 진짜 귀찮으면 번갈아서 나오는데 내가 모르는 거 아닐까?
(Pronunciation): jinjja gwichanheumyeon beongaraseo naoneunde naega moreuneun geo anilkka?
Excitation Representation    Formant Representation    Final Output
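
For readers unfamiliar with the source-filter theory that the decomposition follows, a classical signal-processing toy example may help. This is only a minimal sketch and not the FastPitchFormant architecture: the pulse-train source, the single two-pole resonator, and every name and parameter value below (`pulse_train`, `resonator_coeffs`, `apply_resonator`, 700 Hz, 100 Hz, 16 kHz) are illustrative assumptions.

```python
import numpy as np


def pulse_train(f0_hz, sr, n_samples):
    """Source (excitation): one unit impulse per pitch period."""
    x = np.zeros(n_samples)
    period = int(round(sr / f0_hz))
    x[::period] = 1.0
    return x


def resonator_coeffs(formant_hz, bandwidth_hz, sr):
    """Filter: feedback coefficients of a two-pole resonator at one formant."""
    r = np.exp(-np.pi * bandwidth_hz / sr)   # pole radius, below 1 so the filter is stable
    theta = 2.0 * np.pi * formant_hz / sr    # pole angle set by the formant frequency
    return 2.0 * r * np.cos(theta), -r * r


def apply_resonator(x, a1, a2):
    """Run y[n] = x[n] + a1 * y[n-1] + a2 * y[n-2] over the signal."""
    y = np.zeros_like(x)
    for n in range(len(x)):
        y[n] = x[n]
        if n >= 1:
            y[n] += a1 * y[n - 1]
        if n >= 2:
            y[n] += a2 * y[n - 2]
    return y


# Hypothetical values: 16 kHz sampling, one formant at 700 Hz, 100 Hz bandwidth.
sr = 16000
a1, a2 = resonator_coeffs(700.0, 100.0, sr)

# Same filter, two different sources: only the excitation changes with pitch.
low = apply_resonator(pulse_train(100.0, sr, sr // 4), a1, a2)
high = apply_resonator(pulse_train(200.0, sr, sr // 4), a1, a2)
```

In this toy model the filter coefficients are computed once and shared by both signals, so raising the pitch changes only the source. The audio samples below make the analogous comparison for the learned excitation and formant representations.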



B. Audio Samples - pitch-shift

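We specify the pitch adjustment value λ in semitone units, such that $f_{\lambda} = 2^{\lambda/12} f_{0}$, where $f_{0}$ is the original fundamental frequency and $f_{\lambda}$ is the pitch-shifted one.
The table below shows the ratio between the pitch-shifted fundamental frequency and the original.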

λ (semitones)    -8     -6     -4     0      +4     +6     +8
F0 ratio         63%    71%    79%    100%   126%   141%   159%
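
The ratios in the table can be reproduced with a few lines of Python. This is a minimal sketch that assumes the usual equal-tempered definition of a semitone (a factor of 2 per 12 semitones); the function names `pitch_ratio` and `shift_f0` are hypothetical and not taken from the FastPitchFormant code.

```python
def pitch_ratio(semitones: float) -> float:
    """Return the F0 scaling factor for a shift of `semitones` (assumes 12 semitones per octave)."""
    return 2.0 ** (semitones / 12.0)


def shift_f0(f0_hz, semitones):
    """Scale every F0 value in Hz; zeros (commonly used for unvoiced frames) stay zero."""
    ratio = pitch_ratio(semitones)
    return [f * ratio for f in f0_hz]


if __name__ == "__main__":
    # Print the ratio for each pitch adjustment value used on this page.
    for lam in (-8, -6, -4, 0, 4, 6, 8):
        print(f"{lam:+d} semitones -> {pitch_ratio(lam) * 100:.0f}%")
```

For example, `shift_f0([0.0, 200.0], 12)` doubles the voiced value to 400 Hz and leaves the zero-valued frame untouched.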

You can listen to several samples generated by FastPitch and FastPitchFormant at each pitch-shift value.
The last two rows of each table contain audio samples of the excitation and formant representations from FastPitchFormant,
respectively. The excitation samples differ at every pitch-shift value.
However, the pitch-shifted samples of the formant representation remain similar to the formant sample without pitch-shift.


B.1. Audio Samples - pitch-shift (Korean) - Female Speaker



Sentence: 나는 몇 달째 계속 못 들고 있는데 넌 어떻게 들었어?
(Pronunciation): naneun myeot daljjae gyesok mot deulgo issneunde neon eotteohge deureosseo?
λ -8 -6 -4 Female (KOR) +4 +6 +8
FastPitch (baseline)
FastPitchFormant (proposed)
Excitation Representation from FastPitchFormant (proposed)
Formant Representation from FastPitchFormant (proposed)



B.2. Audio Samples - pitch-shift (Korean) - Male Speaker



Sentence: 그리고 내일은 조금 무더울 거예요.
(Pronunciation): geurigo naeireun jogeum mudeoul geoyeyo.
λ -8 -6 -4 Male (KOR) +4 +6 +8
FastPitch (baseline)
FastPitchFormant (proposed)
Excitation Representation from FastPitchFormant (proposed)
Formant Representation from FastPitchFormant (proposed)



B.3. Audio Samples - pitch-shift (English) - LJSpeech



Sentence: Warm and cold baths, or commodious bathing tubs,
λ -8 -6 -4 LJSpeech (ENG) +4 +6 +8
FastPitchFormant (proposed)
Excitation Representation from FastPitchFormant (proposed)
Formant Representation from FastPitchFormant (proposed)