Audio samples from "A Synthetic Corpus Generation Method for Neural Vocoder Training"

Abstract: Nowadays, neural vocoders are preferred for their ability to synthesize high-fidelity audio. However, training a neural vocoder requires a massive corpus of high-quality real audio, and the recording process is often labor-intensive. In this work, we propose a synthetic corpus generation method for neural vocoder training that can generate an unlimited amount of synthetic audio at nearly no cost. We explicitly model the prior characteristics of audio from multiple target domains simultaneously (e.g., speech, singing voices, and instrumental pieces) so that the generated audio carries these characteristics. We show that our synthetic corpus allows the neural vocoder to achieve competitive results without any real audio in the training process. To validate the effectiveness of the proposed method, we performed empirical experiments on both speech and music utterances using subjective and objective metrics. The experimental results show that a neural vocoder trained on the synthetic corpus produced by our method generalizes to multiple target scenarios and achieves excellent synthesis results for singing voices (MOS: 4.20) and instrumental pieces (MOS: 4.00).

Our implementation is available in the GitHub repository.

Contents

Synthetic Corpus
Singing Voices
Instrumental Pieces
Female Speeches
Male Speeches

Pipeline of proposed paradigm for universal vocoding

The pipeline of our proposed method. First, based on prior knowledge of the different target audio domains, we model the distributions of the acoustic features. Then we sample the acoustic features, namely the fundamental frequency \(f_0\), harmonic amplitude \(A\), harmonic distribution \(D\), and time-varying filtered noise signal \(N\), from their corresponding distributions.
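The sampling step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `synth_clip`, the sampling ranges, the random-walk model for \(f_0\), the power-law harmonic distribution, and the fixed random FIR filter for the noise are all assumptions chosen for illustration. It assumes the sampled features are combined by additive harmonic synthesis plus noise, which the description suggests but does not spell out.

```python
import numpy as np

def synth_clip(rng, sr=16000, dur=1.0, hop=160, n_harm=20):
    """Sample hypothetical acoustic features and render one synthetic clip."""
    n_frames = int(dur * sr / hop)
    n = n_frames * hop
    k = np.arange(1, n_harm + 1, dtype=float)  # harmonic indices 1..n_harm

    # f0: random walk in semitones around a random start pitch (assumed prior).
    semitones = np.cumsum(rng.normal(0.0, 0.2, n_frames))
    f0 = rng.uniform(80.0, 400.0) * 2.0 ** (semitones / 12.0)

    # A: frame-level overall harmonic amplitude (assumed uniform prior).
    A = rng.uniform(0.2, 0.8, n_frames)

    # D: relative harmonic weights, a power-law decay normalized to sum to 1.
    D = k ** -rng.uniform(0.5, 2.0)
    D /= D.sum()

    # Upsample frame-level features to sample rate by linear interpolation.
    t = np.arange(n)
    t_frames = np.arange(n_frames) * hop
    f0_s = np.interp(t, t_frames, f0)
    A_s = np.interp(t, t_frames, A)

    # Additive synthesis: integrate f0 to get phase, sum weighted harmonics,
    # and silence any harmonic at or above the Nyquist frequency.
    phase = 2.0 * np.pi * np.cumsum(f0_s) / sr
    below_nyquist = (k[:, None] * f0_s[None, :]) < sr / 2
    harmonics = D[:, None] * below_nyquist * np.sin(k[:, None] * phase[None, :])
    harmonic_part = A_s * harmonics.sum(axis=0)

    # N: white noise through a random FIR filter with a time-varying gain
    # (a simplification of a fully time-varying filter).
    kernel = rng.normal(size=16)
    kernel /= np.linalg.norm(kernel)
    noise = np.convolve(rng.normal(size=n), kernel, mode="same")
    gain = np.interp(t, t_frames, rng.uniform(0.0, 0.05, n_frames))

    audio = harmonic_part + gain * noise
    audio = 0.9 * audio / (np.max(np.abs(audio)) + 1e-9)  # peak-normalize
    return audio, {"f0": f0, "A": A, "D": D}

audio, feats = synth_clip(np.random.default_rng(0))
```

Calling `synth_clip` repeatedly with fresh random draws yields as many clips as needed, which is the property the synthetic corpus relies on. Domain-specific priors (for example, narrower \(f_0\) ranges for speech) would replace the uniform ranges assumed here.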

Synthetic Dataset

Please lower the volume to protect your hearing. NOISE WARNING!!!

Synthetic audio

Singing Voices

"Neural Vocoder (Real)" denotes the vocoder trained on the real corpus LJSpeech. "Neural Vocoder (10K)" denotes the vocoder trained on 10,000 pieces of synthetic audio, and "Neural Vocoder (1M)" denotes the vocoder trained on 1,000,000 pieces of synthetic audio.