When Humans Growl and Birds Speak:
High-Fidelity Voice Conversion from Human to Animal and Designed Sounds

Accepted to Interspeech 2025

Minsu Kang, Seolhee Lee, Choonghyeon Lee, and Namhyun Cho

Human-to-non-human voice conversion (H2NH-VC) transforms human speech into animal or designed vocalizations. Unlike prior studies, which focused on dog sounds and 16 or 22.05 kHz audio, this work addresses a broader range of non-speech sounds, including natural sounds (lion roars, birdsongs) and designed voices (synthetic growls). To support the generation of diverse non-speech sounds and high-quality 44.1 kHz audio, we introduce a preprocessing pipeline and an improved CVAE-based H2NH-VC model, both optimized for human and nonhuman voices. Experimental results show that the proposed method outperforms the baselines in quality, naturalness, and similarity MOS, achieving effective voice conversion across diverse nonhuman timbres.

Comparison with Baseline Models

This section presents a comparison between our proposed model and baseline models.

Compared Models:

Proposed - Our proposed H2NH-VC model, designed for high-fidelity nonhuman voice conversion using a CVAE-based architecture with a specialized preprocessing pipeline.
w/o pp - The proposed model without our specialized preprocessing pipeline, instead using a conventional speech-focused preprocessing approach.
DDDM-VC [Choi 23] - A VC model integrating a source-filter–based diffusion acoustic model with a separate HiFiGAN vocoder [Kong 21], designed to enhance the disentanglement of prosodic elements in human speech.
Diff-HierVC [Choi 23] - A diffusion-based model employing a hierarchical structure to more accurately capture the source-filter properties of human speech over time and frequency. A separate HiFiGAN vocoder [Kong 21] was used for waveform reconstruction.
Free-VC [Li 22] - A CVAE-VC model implementing text-free VC, focusing on detailed linguistic representation extraction.

(Source) Human Speech + (Reference) Non-human Timbre

Four samples, each presented with the following tracks: Source, Reference, Proposed (ours), w/o pp (ours), DDDM-VC, Diff-HierVC, Free-VC

(Source) Human Non-verbal Vocalizations + (Reference) Non-human Timbre

Four samples, each presented with the following tracks: Source, Reference, Proposed (ours), w/o pp (ours), DDDM-VC, Diff-HierVC, Free-VC

(Source) Non-human Non-verbal Vocalizations + (Reference) Non-human Timbre

Four samples, each presented with the following tracks: Source, Reference, Proposed (ours), w/o pp (ours), DDDM-VC, Diff-HierVC, Free-VC