ULF-TTS: An Uncluttered Hybrid TTS System using Language and Flow Matching Models

Jae Hyun Park, Seung Jae Choi, Young Sik Eom, Allison Shindell, Min-Gwan Seo and Gyeong-hoon Lee

Abstract:

Hybrid text-to-speech (TTS) systems that integrate a language model with a flow matching decoder have been proposed to generate speech for voice-related applications such as dubbing. However, these systems suffer from latency and computational overhead caused by next-token prediction and iterative denoising steps. We therefore propose an uncluttered hybrid TTS system that integrates a lightweight flow matching decoder and supports variable-length token sequence prediction. In addition, we adapt a training strategy from diffusion-based approaches to the language model to enhance its generative performance. Objective and subjective evaluations show that the proposed method is significantly more efficient than existing hybrid systems while maintaining competitive speech quality.

Model Architecture

Lightweight Flow-Matching Decoder

A multi-modal block module in the flow matching decoder, composed exclusively of multi-layer perceptrons (MLPs).
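To make the cost argument concrete: a flow matching decoder produces its output by integrating a learned velocity field over several solver steps, and each step costs one forward pass through the decoder blocks. The sketch below is a minimal illustration of that sampling loop using a fixed-step Euler solver. It is not the ULF-TTS implementation: `euler_sample` and `make_toy_velocity` are hypothetical names, and the toy velocity field (a straight path toward a known target) only stands in for the trained MLP-based decoder.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(velocity: Callable[[Vector, float], Vector],
                 x0: Vector, num_steps: int) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t)  # in a real decoder: one network forward pass per step
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Hypothetical stand-in for a trained decoder (assumption, not the paper's model).

    Returns the velocity field of a straight-line path that ends at `target`:
    v(x, t) = (target - x) / (1 - t).
    """
    def velocity(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


noise = [0.5, -1.0, 2.0]    # starting point (would be Gaussian noise in practice)
target = [1.0, 0.0, -1.0]   # stand-in for the acoustic features to generate
sample = euler_sample(make_toy_velocity(target), noise, num_steps=4)
```

Because the loop calls `velocity` once per step, total decoding cost is roughly the number of steps times the cost of one forward pass, which is why a lighter block reduces latency at a fixed step count.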

Variable Token Sequence Prediction

A token prediction approach that controls the length of the token sequence predicted by the language model.
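As a rough illustration of why predicting several tokens at once reduces latency, the sketch below counts language model decoding steps when m tokens are emitted per step, using the m values that appear in the demo table on this page (1, 2, 4, 6). It assumes that m denotes the number of speech tokens predicted per forward pass; the helper names and the sequence length of 250 tokens are hypothetical and chosen only for the example.

```python
from typing import List, Sequence


def lm_decoding_steps(num_tokens: int, tokens_per_step: int) -> int:
    """Number of autoregressive steps needed to emit `num_tokens` tokens
    when each step predicts `tokens_per_step` tokens (ceiling division)."""
    if num_tokens < 0 or tokens_per_step < 1:
        raise ValueError("num_tokens must be >= 0 and tokens_per_step >= 1")
    return (num_tokens + tokens_per_step - 1) // tokens_per_step


def group_tokens(tokens: Sequence[int], tokens_per_step: int) -> List[List[int]]:
    """Split a token sequence into the groups that would be predicted together."""
    if tokens_per_step < 1:
        raise ValueError("tokens_per_step must be >= 1")
    return [list(tokens[i:i + tokens_per_step])
            for i in range(0, len(tokens), tokens_per_step)]


# Hypothetical utterance of 250 speech tokens.
steps = {m: lm_decoding_steps(250, m) for m in (1, 2, 4, 6)}
groups = group_tokens(list(range(10)), 4)
```

Here `steps` maps each m to the step count for the hypothetical utterance, and `groups` shows that the final group may be shorter than m when the sequence length is not a multiple of m.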

Natural and Expressive Speech

A training methodology that applies a diffusion training strategy to the language model to enhance generation.

Contents

   - Intra-Lingual Generation
   - Cross-Lingual Generation
   - Multi-Lingual Generation
   - Impact of Multi-Token Prediction and Decoder Types

Demo Samples

Zero-shot Scenario

Intra-Lingual Generation

Prompt | Text | Ground Truth | CosyVoice | FireRedTTS | ULF-TTS
Text: When a private in the eighth Cavalry, he had been on the point of quitting the army at twenty eight years of age, but unexpectedly he had been appointed orderly to Captain Servadac.
Hector Servadac was thirty years of age, an orphan without lineage and almost without means.
Text: When the alternating current was introduced for practical purposes it was not needed for arc lighting, the circuit for which, from a single dynamo, would often be twenty or thirty miles in length, its current having a pressure of not less than five or six thousand volts.
It consisted of one small dynamo of a capacity of two hundred and eighty lights of ten c.p. each, and was housed in an unpretentious wooden shed.
Text: Now let us to business.
The little knot of Indians drew back in a body, and suffered, as they thought, the conjurer and his inspired assistant to proceed.
Text: A young lady quietly joined the party at the supper table.
But this subject will be more properly discussed when we treat of the different races of mankind.

Cross-Lingual Generation

Prompt | Text | CosyVoice | FireRedTTS | ULF-TTS
Text: [KO]총자산이 십 퍼센트 상승하였습니다.
Also, a draft on futurity, sometimes honored, but generally extended.
Text: [ZH]城门顶端有桃色的陶瓦、屋顶以龙凤等瑞兽装饰。
I could write to my man and enclose the key; he could send down the packet as he finds it. It was to me in particular that he appeared to propound this appeared almost to appeal for aid not to hesitate.
Text: [KO]우리 엄만 미워요 미경인 보고싶지도 않은가봐요
They would think that they might do so too, and that would make you a great deal of trouble.
Text: [KO]언제 한 번 마주치나 했더니 여기서 보게 되는구나!
Whatever she thought, she was not idly musing, as one might see by the expression of her face.

Multi-Lingual Scenario

Natural and Expressive Voice Generation

Language | Text | Ground Truth | Generated
EN
Everything is shimmering and glowing brilliantly, catching everyone's attention.
The last photo I have before leaving for the airport that year, ten years ago, is of me draped over my sister's chest, kneeling on the grass, eyes red with tears.
Oh! You noticed! We've learned that your kind's life force does wonders for our golems' power levels!
What type of international assistance is needed for this type of situation?
KO
소문에 의하면 당신이 영혼을 심판한다던데. 하아, 날 좀 봐주시오.
어휴, 무턱대고 야생 류크를 습격했다간 아무것도 남지 않을 것 같아.
어머! 아, 탓하는 것처럼 들렸나요? 당신 때문은 아니니 미안해 마세요.
자, 자! 집중하자. 이제 조금만 더 가면 사막의 모닥불 지역이다!
TW
活动间隙,谢海英拿出手机,收到一条暖心的短信.
其家属反映,近期刘小华身体状态不好,精神较差.
其昨天打电话说好的两三天就回来
他问你开车走没有,我说没有走呢.
JA
おこまりですか. ごらんのひよーでおなやみごとおかいけついたします.
でも, そんなこといみがあるんでしょーか?
とまってください.それいじょーちかずかないでください
なにもききたくありません. よんじゅーねんもなにもいわずにいたくせに, いまさらきけって?

Impact of Multi-Token Prediction and Decoder Types

Multi-Token Prediction & Decoder Types

Text | Ground Truth | m=1, Tokenizer Decoder | m=1, Flow-Matching Decoder | m=2, Flow-Matching Decoder | m=4, Flow-Matching Decoder | m=6, Flow-Matching Decoder
the application of reenforced suggestion or even of hypnotism in the doctor's office is even for him no possible source of danger.
there was thus no great sign of depression to be noticed when we came back into the tent after finishing our work, and had to while away the time.
there are two main classes of simple measures, two beat measure, and three beat measure.
the first rains had fallen on the lowlands, and the first snows on the mountains, and everything was fresh and bracing, while an abundance of balmy sunshine filled all the noonday hours.