Audio samples from "Effective Emotion Transplantation in an End-to-End Text-to-Speech System"
Paper: Accepted to IEEE Access. (IEEE Access is open access, so you can read the paper without an IEEE login.)
Authors: Young-Sun Joo, Hanbin Bae, Young-Ik Kim, Hoon-Young Cho, Hong-Goo Kang
Abstract
In this paper, we propose an effective technique to transplant a source
speaker's emotional expression to a new target speaker's voice within an
end-to-end text-to-speech (TTS) framework. We modify an expressive TTS
model pre-trained using a source speaker's emotional speech database
to reflect the voice characteristics of a target speaker for which only
a neutral speech database is available. We set two adaptation criteria
to achieve this. One criterion is to minimize the reconstruction loss
between the target speaker's recorded and synthesized speech, such that
the synthesized speech has the target speaker's voice characteristics.
The other criterion is to minimize the emotion loss between the emotion
embedding vectors extracted from the reference expressive speech and the
target speaker's synthesized expressive speech, which is essential to
preserve expressiveness. Since the two criteria are applied alternately
in the adaptation process, we are able to avoid the kind of bias issues
frequently encountered in similar tasks. The proposed adaptation technique
demonstrates more effective performance compared to conventional approaches
in both quantitative and qualitative evaluations.
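
As a rough illustration of the alternating adaptation described in the abstract, the following toy sketch switches between a reconstruction-style update and an emotion-style update on successive steps. Everything in it is an assumption made for illustration: the parameter vector, the two quadratic losses, the function names (`reconstruction_grad`, `emotion_grad`, `adapt`), the step count, and the learning rate are hypothetical stand-ins, not the paper's TTS model or its actual loss definitions.

```python
# Toy illustration only: the "model" is a short list of floats, and both
# losses are hypothetical quadratic stand-ins, not the paper's losses.

def reconstruction_grad(params, target_voice):
    # Gradient of L_rec = sum((p - t)^2), which pulls the parameters
    # toward the (toy) target-speaker characteristics.
    return [2.0 * (p - t) for p, t in zip(params, target_voice)]

def emotion_grad(params, reference_emotion):
    # Gradient of L_emo = (mean(params) - reference_emotion)^2, where the
    # mean plays the role of a (toy) emotion embedding of the output.
    n = len(params)
    diff = sum(params) / n - reference_emotion
    return [2.0 * diff / n for _ in params]

def adapt(params, target_voice, reference_emotion, steps=200, lr=0.1):
    # Apply the two criteria alternately: even steps use the
    # reconstruction gradient, odd steps use the emotion gradient.
    params = list(params)
    for step in range(steps):
        if step % 2 == 0:
            grad = reconstruction_grad(params, target_voice)
        else:
            grad = emotion_grad(params, reference_emotion)
        params = [p - lr * g for p, g in zip(params, grad)]
    return params

if __name__ == "__main__":
    print(adapt([0.0, 0.0], [1.0, 3.0], 2.0))
```

In this toy setup the two objectives share a minimizer, so the alternating updates settle on it; in a real system the two losses would be computed from recorded and synthesized speech and would generally pull in different directions.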
We used an internal expressive speech database for the source speaker, consisting of four emotion classes: neutral, joyful, angry, and sad. It contains about 11 hours of speech in total, recorded by a single professional voice actress.