MuCodec: Ultra Low-Bitrate Music Codec

TL;DR

Music codecs are a vital area of audio codec research, and ultra-low-bitrate compression is particularly important for music transmission and generation. Because music combines complex backgrounds with rich vocals, modeling only semantic or only acoustic information cannot effectively reconstruct music that contains both vocals and background. To address this, we propose MuCodec, which specifically targets music compression and reconstruction at ultra-low bitrates. MuCodec uses MuEncoder to extract both acoustic and semantic features, discretizes them with residual vector quantization (RVQ), and obtains Mel-VAE features from the discrete tokens via flow matching. The music is then reconstructed with a pre-trained Mel-VAE decoder and HiFi-GAN. MuCodec can reconstruct high-fidelity music at ultra-low (0.35kbps) or high (1.35kbps) bitrates, achieving the best results to date in both subjective and objective metrics.
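To make the discretization step more concrete, here is a minimal sketch of greedy residual vector quantization in Python/NumPy. It assumes the encoder output is a (frames x dims) feature matrix; the random codebooks, their sizes, and the function names are hypothetical (in a trained codec the codebooks are learned). The helper `bitrate_bps` computes the raw token bitrate as frame rate x number of quantizers x log2(codebook size), assuming fixed-length indices with no entropy coding.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Greedy residual vector quantization (illustrative sketch).

    x: (T, D) array of continuous features, one row per frame (assumed layout).
    codebooks: list of (K, D) arrays, one per quantizer stage (hypothetical, random here).
    Returns a (T, Q) integer array with one code index per frame and stage.
    """
    residual = x.astype(np.float64).copy()
    indices = []
    for cb in codebooks:
        # Squared Euclidean distance from every residual vector to every codeword: (T, K).
        d = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = d.argmin(axis=1)
        indices.append(idx)
        # Remove the chosen codeword; the next stage quantizes what is left over.
        residual = residual - cb[idx]
    return np.stack(indices, axis=1)

def rvq_decode(indices, codebooks):
    """Reconstruct features by summing the selected codeword from every stage."""
    out = np.zeros((indices.shape[0], codebooks[0].shape[1]))
    for q, cb in enumerate(codebooks):
        out += cb[indices[:, q]]
    return out

def bitrate_bps(frame_rate_hz, num_quantizers, codebook_size):
    """Raw bits per second for fixed-length code indices (no entropy coding)."""
    return frame_rate_hz * num_quantizers * np.log2(codebook_size)

# Usage with hypothetical sizes: 50 frames of 8-dim features, 4 stages of 256 codewords.
rng = np.random.default_rng(0)
feats = rng.normal(size=(50, 8))
books = [rng.normal(size=(256, 8)) for _ in range(4)]
codes = rvq_encode(feats, books)   # codes.shape -> (50, 4)
approx = rvq_decode(codes, books)  # same shape as feats
```

For example, a hypothetical configuration of 25 token frames per second with a single 16384-entry codebook (14 bits per token) gives 25 x 14 = 350 bps, i.e. 0.35kbps. These values are assumptions chosen only to illustrate the arithmetic; this page does not state MuCodec's frame rate or codebook size.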

MuCodec's overall process
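MuCodec obtains Mel-VAE features from the discrete tokens via flow matching. The sketch below shows a generic conditional flow-matching recipe with straight-line paths, which is one common formulation; whether MuCodec uses exactly this path and an Euler solver is an assumption. The token conditioning and the neural velocity network are omitted: a closed-form "oracle" velocity for a single known target stands in for the network so the sampler can be checked. All function names are hypothetical.

```python
import numpy as np

def cfm_training_pair(x0, x1, t):
    """Build one conditional flow-matching training example (generic recipe).

    x0: noise sample; x1: target latent (e.g. a Mel-VAE feature frame); t in [0, 1].
    On the straight path x_t = (1 - t) * x0 + t * x1 the target velocity is x1 - x0.
    A velocity network v(x_t, t, cond) would be regressed onto it with an MSE loss.
    """
    xt = (1.0 - t) * x0 + t * x1
    return xt, x1 - x0

def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x

# Toy check: for a single known target x1, the exact velocity along the straight path
# is (x1 - x) / (1 - t). In a real codec, a network conditioned on the RVQ tokens
# would predict this velocity instead (assumption, not shown here).
rng = np.random.default_rng(0)
target = rng.normal(size=(4, 8))
noise = rng.normal(size=(4, 8))
oracle = lambda x, t: (target - x) / (1.0 - t)
sample = euler_sample(oracle, noise, num_steps=10)  # lands on target up to float error
```

With the exact velocity, each Euler step shrinks the remaining gap by the factor (1 - t_next) / (1 - t), so the sample reaches the target at t = 1; with a learned network the result is only approximate and more steps generally trade speed for accuracy.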


Music Samples

Here we provide music samples in several languages, including English, Chinese, and others. The samples are taken from YouTube, and the source link is provided below each sample.

English Music

Sampled from YouTube; the source link is shown below each sample.

Sample 1 Sample 2 Sample 3 Sample 4 Sample 5
Link Link Link Link Link
Original Audio
low-bitrate scenario (0.35kbps)
GAN-based (0.35kbps)
SemantiCodec (0.375kbps)
MuCodec-proposed (0.35kbps)
high-bitrate scenario (1.33kbps)
GAN-based (1.33kbps)
SemantiCodec (1.40kbps)
MuCodec-proposed (1.33kbps)

Chinese Music

Sampled from YouTube; the source link is shown below each sample.

Sample 1 Sample 2 Sample 3 Sample 4 Sample 5
Link Link Link Link Link
Original Audio
low-bitrate scenario (0.35kbps)
GAN-based (0.35kbps)
SemantiCodec (0.375kbps)
MuCodec-proposed (0.35kbps)
high-bitrate scenario (1.33kbps)
GAN-based (1.33kbps)
SemantiCodec (1.40kbps)
MuCodec-proposed (1.33kbps)

Other Language Music

Sampled from YouTube; the source link is shown below each sample.

French Music Korean Music Japanese Music Indian Music
Link Link Link Link
Original Audio
low-bitrate scenario (0.35kbps)
GAN-based (0.35kbps)
SemantiCodec (0.375kbps)
MuCodec-proposed (0.35kbps)
high-bitrate scenario (1.33kbps)
GAN-based (1.33kbps)
SemantiCodec (1.40kbps)
MuCodec-proposed (1.33kbps)

Other Types of Audio (Domain Transfer Capability)

Please note that MuCodec is not designed for other types of audio, and we did not use any audio data other than music; the samples below only demonstrate its domain transfer capability. In future work, we will focus on developing a universal audio codec for ultra-low bitrates.

Music Background

Sampled from YouTube; the source link is shown below each sample.

Sample 1 Sample 2 Sample 3 Sample 4 Sample 5
Link Link Link Link Link
Original Audio
low-bitrate scenario (0.35kbps)
GAN-based (0.35kbps)
SemantiCodec (0.375kbps)
MuCodec-proposed (0.35kbps)
high-bitrate scenario (1.33kbps)
GAN-based (1.33kbps)
SemantiCodec (1.40kbps)
MuCodec-proposed (1.33kbps)

Vocal

Sampled from the Opencpop dataset

Sample 1 Sample 2 Sample 3 Sample 4 Sample 5
Original Audio
low-bitrate scenario (0.35kbps)
GAN-based (0.35kbps)
SemantiCodec (0.375kbps)
MuCodec-proposed (0.35kbps)
high-bitrate scenario (1.33kbps)
GAN-based (1.33kbps)
SemantiCodec (1.40kbps)
MuCodec-proposed (1.33kbps)

Audio Event

Sampled from AudioSet

Sample 1 Sample 2 Sample 3
Original Audio
low-bitrate scenario (0.35kbps)
GAN-based (0.35kbps)
SemantiCodec (0.375kbps)
MuCodec-proposed (0.35kbps)
high-bitrate scenario (1.33kbps)
GAN-based (1.33kbps)
SemantiCodec (1.40kbps)
MuCodec-proposed (1.33kbps)

Chinese Speech

Sampled from THCHS-30

Sample 1 Sample 2 Sample 3
Original Audio
low-bitrate scenario (0.35kbps)
GAN-based (0.35kbps)
SemantiCodec (0.375kbps)
MuCodec-proposed (0.35kbps)
high-bitrate scenario (1.33kbps)
GAN-based (1.33kbps)
SemantiCodec (1.40kbps)
MuCodec-proposed (1.33kbps)

English Speech

Sampled from LibriSpeech

Sample 1 Sample 2 Sample 3
Original Audio
low-bitrate scenario (0.35kbps)
GAN-based (0.35kbps)
SemantiCodec (0.375kbps)
MuCodec-proposed (0.35kbps)
high-bitrate scenario (1.33kbps)
GAN-based (1.33kbps)
SemantiCodec (1.40kbps)
MuCodec-proposed (1.33kbps)