When Humans Growl and Birds Speak:
High-Fidelity Voice Conversion from Human to Animal and Designed Sounds

Accepted to Interspeech 2025

Minsu Kang, Seolhee Lee, Choonghyeon Lee, and Namhyun Cho

Human-to-non-human voice conversion (H2NH-VC) transforms human speech into animal or designed vocalizations. Unlike prior studies, which focused on dog sounds and on 16 kHz or 22.05 kHz audio, this work addresses a broader range of non-speech sounds, including natural sounds (lion roars, birdsong) and designed voices (synthetic growls). To support the generation of diverse non-speech sounds at high-quality 44.1 kHz, we introduce a preprocessing pipeline and an improved CVAE-based H2NH-VC model, both optimized for human and non-human voices. Experimental results show that the proposed method outperforms the baselines in quality, naturalness, and similarity MOS, achieving effective voice conversion across diverse non-human timbres.
   

Comparison of Baseline Models

This section presents a comparison between our proposed model and baseline models.

Baseline Models:

(Source) Human Speech + (Reference) Non-human Timbre

[Audio samples: four sets, each with Source, Reference, Proposed (ours), w/o pp (ours), DDDM-VC, Diff-HierVC, Free-VC]

(Source) Human Non-verbal Vocalizations + (Reference) Non-human Timbre

[Audio samples: four sets, each with Source, Reference, Proposed (ours), w/o pp (ours), DDDM-VC, Diff-HierVC, Free-VC]

(Source) Non-human Non-verbal Vocalizations + (Reference) Non-human Timbre

[Audio samples: four sets, each with Source, Reference, Proposed (ours), w/o pp (ours), DDDM-VC, Diff-HierVC, Free-VC]