Audio samples from "Effective Emotion Transplantation in an End-to-End Text-to-Speech System"

Paper: Accepted to IEEE Access. (The paper can be read without an IEEE login because IEEE Access is open access.)
Authors: Young-Sun Joo, Hanbin Bae, Young-Ik Kim, Hoon-Young Cho, Hong-Goo Kang

Abstract

In this paper, we propose an effective technique to transplant a source speaker's emotional expression to a new target speaker's voice within an end-to-end text-to-speech (TTS) framework. We modify an expressive TTS model pre-trained using a source speaker's emotional speech database to reflect the voice characteristics of a target speaker for which only a neutral speech database is available. We set two adaptation criteria to achieve this. One criterion is to minimize the reconstruction loss between the target speaker's recorded and synthesized speech, such that the synthesized speech has the target speaker's voice characteristics. The other criterion is to minimize the emotion loss between the emotion embedding vectors extracted from the reference expressive speech and the target speaker's synthesized expressive speech, which is essential to preserve expressiveness. Since the two criteria are applied alternately in the adaptation process, we are able to avoid the kind of bias issues frequently encountered in similar tasks. The proposed adaptation technique demonstrates more effective performance compared to conventional approaches in both quantitative and qualitative evaluations.  
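The alternating use of the two criteria described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the "TTS model" is a weight vector that scales input features, the "emotion encoder" is a fixed two-value summary, the gradients are finite differences, and the one-step-per-criterion schedule is only one possible reading of "applied alternately". It is not the paper's implementation.

```python
# Toy sketch: every function here is a hypothetical stand-in, not the paper's model.

def synthesize(weights, text_features):
    """Stand-in TTS model: scales each input feature by a learnable weight."""
    return [w * x for w, x in zip(weights, text_features)]


def emotion_embedding(speech):
    """Stand-in fixed emotion encoder: (mean level, end-minus-start slope)."""
    return [sum(speech) / len(speech), speech[-1] - speech[0]]


def reconstruction_loss(weights, text_features, recorded):
    """Mean squared error between synthesized and recorded target speech."""
    synth = synthesize(weights, text_features)
    return sum((s - r) ** 2 for s, r in zip(synth, recorded)) / len(recorded)


def emotion_loss(weights, text_features, reference_embedding):
    """Squared distance between synthesized and reference emotion embeddings."""
    emb = emotion_embedding(synthesize(weights, text_features))
    return sum((e - r) ** 2 for e, r in zip(emb, reference_embedding))


def gradient(loss_fn, weights, eps=1e-5):
    """Central finite-difference gradient of loss_fn at weights."""
    grad = []
    for i in range(len(weights)):
        up = list(weights)
        down = list(weights)
        up[i] += eps
        down[i] -= eps
        grad.append((loss_fn(up) - loss_fn(down)) / (2 * eps))
    return grad


def adapt(weights, neutral_batch, expressive_batch, steps=400, lr=0.1):
    """Alternate one gradient step per criterion: even steps use the
    reconstruction loss, odd steps use the emotion loss."""
    weights = list(weights)
    for step in range(steps):
        if step % 2 == 0:
            loss_fn = lambda w: reconstruction_loss(w, *neutral_batch)
        else:
            loss_fn = lambda w: emotion_loss(w, *expressive_batch)
        grad = gradient(loss_fn, weights)
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights


# Made-up toy data: (text features, recorded neutral target speech) and
# (text features, reference emotion embedding from the source speaker).
neutral_batch = ([1.0, 2.0, 3.0], [2.0, 2.0, 3.0])
expressive_batch = ([1.0, 1.0, 1.0], [4.0 / 3.0, -1.0])
adapted = adapt([1.0, 1.0, 1.0], neutral_batch, expressive_batch)
```

In this toy setup the two losses share a minimizer, so alternating steps drive both toward zero. With real data the criteria can pull in different directions, which is where the choice of update schedule would matter.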




Source Speaker's Expressive Speech

For the source speaker, we used an internal expressive speech database that consists of four emotion classes: neutral, joyful, angry, and sad.
The database contains about 11 hours of speech in total, recorded by a single professional voice actress.


Example of Expressive Speech (recorded)

Unfortunately, there are no recordings of the same sentence spoken in each emotion, so the samples below use different sentences. Please listen to them with a focus on the emotion.
NEU ANG SAD JOY
Record


Synthesized Expressive Speech

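Sentence: "내가 대명동 방을 뺄 때 가지고 있던 전셋돈도 그와의 생활비로 이미 반 이상 없어진 뒤였으니까요." (English: "Because by the time I moved out of my room in Daemyeong-dong, more than half of the lease deposit I had was already gone, spent on living expenses with him.")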
NEU ANG SAD JOY
Record
Synth.




Target Speaker's Expressive Speech based on Emotion Transplantation

* We adapt the pre-trained TTS model using part of the target speaker's neutral speech database: one hour of speech.


Target speaker A (female)


Sentence: "즉 태양 에너지가 증가하면 지구의 바다가 데워집니다." (English: "In other words, when solar energy increases, the Earth's oceans warm up.")
NEU ANG SAD JOY
Record
Conv. approach (w/o emo_loss)
Prop. approach (w/ emo_loss)

Sentence: "욕하는건 기본이고 뺨까지 때려요." (English: "Cursing is a given, and it even goes as far as slapping across the face.")
NEU ANG SAD JOY
Record
Conv. approach (w/o emo_loss)
Prop. approach (w/ emo_loss)


Target speaker B (male)


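Sentence: "그러자 관노인 권송이가 분연히 일어나 부싯돌을 쳐서 횃불에 불을 붙였습니다." (English: "Then Gwon Song-i, a government slave, rose resolutely, struck a flint, and lit the torch.")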
NEU ANG SAD JOY
Record
Conv. approach (w/o emo_loss)
Prop. approach (w/ emo_loss)
Prop. approach (w/ emo_loss) 20
Sentence: "이렇게 뛰어가세요." (English: "Please run like this.")
Conv. approach (w/o emo_loss)
Prop. approach (w/ emo_loss)
Sentence: "점심 시간에 밖에 나가서 돼지고기를 맛있게 먹었습니다." (English: "At lunchtime I went out and enjoyed a delicious pork meal.")
Conv. approach (w/o emo_loss)
Prop. approach (w/ emo_loss)