Accepted to Interspeech 2025
Minsu Kang, Seolhee Lee, Choonghyeon Lee, and Namhyun Cho
This section compares our proposed model with the baseline models.
(Source) Human Speech + (Reference) Non-human Timbre
Source
Reference
Proposed (ours)
w/o pp (ours)
DDDM-VC
Diff-HierVC
Free-VC
(Source) Human Non-verbal Vocalizations + (Reference) Non-human Timbre
Source
Reference
Proposed (ours)
w/o pp (ours)
DDDM-VC
Diff-HierVC
Free-VC
(Source) Non-human Non-verbal Vocalizations + (Reference) Non-human Timbre
Source
Reference
Proposed (ours)
w/o pp (ours)
DDDM-VC
Diff-HierVC
Free-VC