Speaking Speed Control of End-to-End Speech Synthesis using Sentence-Level Conditioning

Paper

arXiv:2007.15281, accepted to INTERSPEECH 2020

Authors

Jae-Sung Bae, Hanbin Bae, Young-Sun Joo, Junmo Lee, Gyeong-Hoon Lee, Hoon-Young Cho

Abstract

This paper proposes a controllable end-to-end text-to-speech (TTS) system that controls the speaking speed of synthesized speech (speed-controllable TTS; SCTTS) by taking a sentence-level speaking-rate value as an additional input. The speaking-rate value, defined as the ratio of the number of input phonemes to the length of the input speech, is adopted in the proposed system to control the speaking speed. Furthermore, the proposed SCTTS system can control the speaking speed while retaining other speech attributes, such as the pitch, by adopting a global style token-based style encoder. The proposed SCTTS does not require any additional well-trained model or an external speech database to extract phoneme-level duration information, and it can be trained in an end-to-end manner. In addition, our listening tests on fast-, normal-, and slow-speed speech showed that the SCTTS can generate more natural speech than other phoneme duration control approaches that increase or decrease duration at the same rate for the entire sentence, especially in the case of slow-speed speech.
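As a rough illustration of the two ideas contrasted in the abstract, the sketch below computes a sentence-level speaking-rate value (phoneme count divided by speech length) and, separately, the kind of uniform duration scaling used by the baseline duration-control approaches. The function names, the choice of seconds as the length unit, and the sample values are assumptions made for illustration only; they are not taken from the paper's implementation.

```python
from typing import List, Sequence


def speaking_rate(phonemes: Sequence[str], speech_seconds: float) -> float:
    """Sentence-level speaking rate: number of phonemes / length of speech.

    Assumption: length is measured in seconds; the paper's unit may differ.
    """
    if speech_seconds <= 0:
        raise ValueError("speech_seconds must be positive")
    return len(phonemes) / speech_seconds


def scale_durations_uniformly(durations: Sequence[float], factor: float) -> List[float]:
    """Baseline-style control: stretch every phoneme duration by one factor."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return [d * factor for d in durations]


# Hypothetical example: 5 phonemes spoken in 0.5 s.
phonemes = ["s", "i", "g", "a", "n"]
rate = speaking_rate(phonemes, 0.5)

# Hypothetical per-phoneme durations (seconds), slowed down by a factor of 2.
slow = scale_durations_uniformly([0.25, 0.5, 0.125], 2.0)

print(rate)
print(slow)
```

In this sketch the speaking rate is a single scalar per sentence, which is what makes it usable as a sentence-level conditioning input, whereas the uniform scaling needs per-phoneme durations to begin with.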

Speaking Speed Control (Neutral Speaker)
Speaking Speed Control (Expressive Speaker)
Disentanglement of Speaking Speed and Other Speech Attributes

1. Speaking Speed Control (Neutral Speaker)

Sentence: 시간 순서대로 내용을 정리합시다. (English: Let's organize the content in chronological order.)
	SCTTS (ours)	FastSpeech	PDC-TTS	DCTTS (normal-speed only)
Fast
Normal
Slow

Sentence: 설날에 아이들은 어른에게 절을 합니다. (English: On Lunar New Year's Day, children bow to their elders.)
	SCTTS (ours)	FastSpeech	PDC-TTS	DCTTS (normal-speed only)
Fast
Normal
Slow

2. Speaking Speed Control (Expressive Speaker)

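Sentence: 음... 그로 인해서 일루에 조금 오래 머물 수밖에 없었던 최훈재였거든요. (English: Um... because of that, Choi Hoon-jae had no choice but to stay on first base a little longer.)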
	SCTTS (ours)	FastSpeech	PDC-TTS	DCTTS (normal-speed only)
Fast
Normal
Slow

Sentence: 그, 자, 지금 이루수 김지수 선수가 잘 쫓아가봤지만 옆으로 빠져나갔습니다. (English: Well, now, second baseman Kim Ji-soo chased it well, but it slipped past to the side.)
	SCTTS (ours)	FastSpeech	PDC-TTS	DCTTS (normal-speed only)
Fast
Normal
Slow

Sentence: 잡아당긴 타구는 왼쪽에 파울! (English: The pulled ball goes foul to the left!)
	SCTTS (ours)	FastSpeech	PDC-TTS	DCTTS (normal-speed only)
Fast
Normal
Slow

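Sentence: 삼루 주자가 홈인! 그리고 강영식이 삼루까지! 점수차를 두 점 차로 벌립니다! (English: The runner on third comes home! And Kang Young-sik goes all the way to third! That widens the lead to two runs!)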
	SCTTS (ours)	FastSpeech	PDC-TTS	DCTTS (normal-speed only)
Fast
Normal
Slow

Sentence: 와우! 육구! 바깥쪽 툭 갖다 댔습니다! 그리고 김대유! (English: Wow! The sixth pitch! He just tapped it on the outside! And Kim Dae-yu!)
	SCTTS (ours)	FastSpeech	PDC-TTS	DCTTS (normal-speed only)
Fast
Normal
Slow

3. Disentanglement of Speaking Speed and Other Speech Attributes

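Sentence: 음... 그로 인해서 일루에 조금 오래 머물 수밖에 없었던 최훈재였거든요. (English: Um... because of that, Choi Hoon-jae had no choice but to stay on first base a little longer.)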
	SCTTS	SCTTS-GST (N)	SCTTS-GST (H)
Fast
Normal
Slow

Sentence: 그, 자, 지금 이루수 김지수 선수가 잘 쫓아가봤지만 옆으로 빠져나갔습니다. (English: Well, now, second baseman Kim Ji-soo chased it well, but it slipped past to the side.)
	SCTTS	SCTTS-GST (N)	SCTTS-GST (H)
Fast
Normal
Slow

Sentence: 잡아당긴 타구는 왼쪽에 파울! (English: The pulled ball goes foul to the left!)
	SCTTS	SCTTS-GST (N)	SCTTS-GST (H)
Fast
Normal
Slow

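Sentence: 삼루 주자가 홈인! 그리고 강영식이 삼루까지! 점수차를 두 점 차로 벌립니다! (English: The runner on third comes home! And Kang Young-sik goes all the way to third! That widens the lead to two runs!)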
	SCTTS	SCTTS-GST (N)	SCTTS-GST (H)
Fast
Normal
Slow