MuCodec: Ultra Low-Bitrate Music Codec

TL;DR

Music codecs are a vital part of audio codec research, and ultra low-bitrate compression is important for both music transmission and music generation. Because music combines complex backgrounds with rich vocals, modeling only semantic or only acoustic information cannot effectively reconstruct music that contains both vocals and backgrounds. To address this issue, we propose MuCodec, designed specifically for music compression and reconstruction at ultra low bitrates. MuCodec uses MuEncoder to extract both acoustic and semantic features, discretizes them with RVQ, and obtains Mel-VAE features via flow matching. The music is then reconstructed using a pre-trained Mel-VAE decoder and HiFi-GAN. MuCodec can reconstruct high-fidelity music at ultra low (0.35kbps) or high bitrates (1.35kbps), achieving the best results to date on both subjective and objective metrics.
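To illustrate the RVQ (residual vector quantization) step mentioned above, here is a minimal sketch with tiny, hand-picked 2-D codebooks. The codebooks, dimensions, and function names are hypothetical and are not MuCodec's learned quantizer; the sketch only shows the general technique: each stage quantizes the residual left by the previous stages, and decoding sums the selected codewords.

```python
from typing import List, Sequence

Vector = List[float]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rvq_encode(x: Sequence[float], codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Residual VQ: every stage quantizes what the earlier stages left over."""
    residual = list(x)
    indices = []
    for codebook in codebooks:
        # Pick the codeword closest to the current residual.
        best = min(range(len(codebook)),
                   key=lambda i: squared_distance(residual, codebook[i]))
        indices.append(best)
        # Subtract the chosen codeword; the next stage sees the remaining error.
        residual = [r - c for r, c in zip(residual, codebook[best])]
    return indices


def rvq_decode(indices: Sequence[int], codebooks: Sequence[Sequence[Vector]]) -> Vector:
    """Reconstruction is the sum of the selected codewords over the stages used."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical 2-D codebooks: a coarse first stage and a finer second stage.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0], [1.0, -1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, -0.25]],
]

feature = [1.2, 0.9]
codes = rvq_encode(feature, CODEBOOKS)
recon = rvq_decode(codes, CODEBOOKS)
print(codes, recon)
```

Decoding with only the first index gives a coarser reconstruction than decoding with both, which is the general reason an RVQ codec can trade bitrate for fidelity by keeping more or fewer stages.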

MuCodec's overall process
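The abstract states that Mel-VAE features are obtained via flow matching. As a rough illustration of how a flow-matching model is typically sampled, the sketch below integrates an ODE dx/dt = v(x, t, cond) from noise at t = 0 to t = 1 with fixed Euler steps. The velocity function here is an analytic stand-in (the straight-line conditional field toward a single known target), not MuCodec's network; how MuCodec conditions the flow on the discretized features, and how many steps it uses, are assumptions this sketch does not cover. `toy_velocity` and `euler_sample` are hypothetical names.

```python
from typing import Callable, List

Vector = List[float]
VelocityFn = Callable[[Vector, float, Vector], Vector]


def toy_velocity(x: Vector, t: float, cond: Vector) -> Vector:
    """Stand-in for a learned network: the straight-line conditional field
    (cond - x) / (1 - t), which carries x exactly to `cond` by t = 1."""
    return [(c - xi) / (1.0 - t) for xi, c in zip(x, cond)]


def euler_sample(x0: Vector, cond: Vector, velocity: VelocityFn, steps: int = 8) -> Vector:
    """Integrate dx/dt = velocity(x, t, cond) from t = 0 to t = 1 with Euler steps."""
    dt = 1.0 / steps
    x = list(x0)
    for k in range(steps):
        t = k * dt  # t stays below 1, so toy_velocity never divides by zero
        v = velocity(x, t, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise = [0.3, -1.1, 0.7]   # starting point (Gaussian noise in a real sampler)
target = [1.0, 0.5, -0.2]  # hypothetical latent the conditioning points to
print(euler_sample(noise, target, toy_velocity))
```

With this analytic field the sampler lands on the target regardless of step count; a learned velocity field is only approximate, which is why real systems tune the number of integration steps.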

Other Types of Audio (Domain Transfer Capability)

Music Background
Vocal
Audio Event
English Speech
Chinese Speech

Music Samples

Here we provide music samples in different languages: English, Chinese, and several others. The samples are taken from YouTube, and the source link is given below each sample.

English Music

Sampled from YouTube; the source link is shown below each sample.

	Sample 1	Sample 2	Sample 3	Sample 4	Sample 5
	Link	Link	Link	Link	Link
Original Audio
low-bitrate scenario (0.35kbps)
GAN-based (0.35kbps)
SemantiCodec (0.375kbps)
MuCodec-proposed (0.35kbps)
high-bitrate scenario (1.33kbps)
GAN-based (1.33kbps)
SemantiCodec (1.40kbps)
MuCodec-proposed (1.33kbps)
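The bitrates in these tables follow from how many discrete tokens are sent per second and how many bits each token costs. The helper below computes that for a plain token stream (one index per codebook per frame, log2(codebook size) bits per index, no entropy coding). The 25 Hz frame rate and the 16384-entry codebook are hypothetical values chosen only so the result lands on 0.35 kbps; this page does not state MuCodec's actual token rate or codebook sizes. The second part only does arithmetic on the bitrate labels shown in the table.

```python
import math
from typing import Sequence


def token_bitrate_bps(frame_rate_hz: float, codebook_sizes: Sequence[int]) -> float:
    """Bits per second for a token stream: one index per codebook per frame,
    each index costing log2(codebook size) bits (no entropy coding)."""
    bits_per_frame = sum(math.log2(size) for size in codebook_sizes)
    return frame_rate_hz * bits_per_frame


# Hypothetical configuration: 25 frames per second, one codebook of 16384 entries.
low = token_bitrate_bps(25, [16384])
print(f"{low:.0f} bps = {low / 1000:.2f} kbps")  # 350 bps = 0.35 kbps

# Relative bitrate difference implied by the table labels (MuCodec vs. SemantiCodec).
for mucodec_kbps, semanticodec_kbps in [(0.35, 0.375), (1.33, 1.40)]:
    saving = 1 - mucodec_kbps / semanticodec_kbps
    print(f"{mucodec_kbps} vs {semanticodec_kbps} kbps: {saving:.1%} lower bitrate")
```

By the labels alone, MuCodec runs at a slightly lower bitrate than SemantiCodec in both scenarios, while the GAN-based baseline is matched to MuCodec's bitrate.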


Chinese Music

Sampled from YouTube; the source link is shown below each sample.

	Sample 1	Sample 2	Sample 3	Sample 4	Sample 5
	Link	Link	Link	Link	Link
Original Audio
low-bitrate scenario (0.35kbps)
GAN-based (0.35kbps)
SemantiCodec (0.375kbps)
MuCodec-proposed (0.35kbps)
high-bitrate scenario (1.33kbps)
GAN-based (1.33kbps)
SemantiCodec (1.40kbps)
MuCodec-proposed (1.33kbps)


Other Language Music

Sampled from YouTube; the source link is shown below each sample.

	French Music	Korean Music	Japanese Music	Indian Music
	Link	Link	Link	Link
Original Audio
low-bitrate scenario (0.35kbps)
GAN-based (0.35kbps)
SemantiCodec (0.375kbps)
MuCodec-proposed (0.35kbps)
high-bitrate scenario (1.33kbps)
GAN-based (1.33kbps)
SemantiCodec (1.40kbps)
MuCodec-proposed (1.33kbps)


Other Types of Audio (Domain Transfer Capability)

Please note that MuCodec is not designed for other types of audio: it was trained only on music data, and the samples below serve solely to demonstrate its domain transfer capability. In future work, we will focus on developing a universal audio codec at ultra low bitrates.

Music Background

Sampled from YouTube; the source link is shown below each sample.

	Sample 1	Sample 2	Sample 3	Sample 4	Sample 5
	Link	Link	Link	Link	Link
Original Audio
low-bitrate scenario (0.35kbps)
GAN-based (0.35kbps)
SemantiCodec (0.375kbps)
MuCodec-proposed (0.35kbps)
high-bitrate scenario (1.33kbps)
GAN-based (1.33kbps)
SemantiCodec (1.40kbps)
MuCodec-proposed (1.33kbps)


Vocal

Sampled from the Opencpop dataset

	Sample 1	Sample 2	Sample 3	Sample 4	Sample 5
	Original Audio
low-bitrate scenario (0.35kbps)
GAN-based (0.35kbps)
SemantiCodec (0.375kbps)
MuCodec-proposed (0.35kbps)
high-bitrate scenario (1.33kbps)
GAN-based (1.33kbps)
SemantiCodec (1.40kbps)
MuCodec-proposed (1.33kbps)


Audio Event

Sampled from AudioSet

	Sample 1	Sample 2	Sample 3
Original Audio
low-bitrate scenario (0.35kbps)
GAN-based (0.35kbps)
SemantiCodec (0.375kbps)
MuCodec-proposed (0.35kbps)
high-bitrate scenario (1.33kbps)
GAN-based (1.33kbps)
SemantiCodec (1.40kbps)
MuCodec-proposed (1.33kbps)


Chinese Speech

Sampled from THCHS-30

	Sample 1	Sample 2	Sample 3
Original Audio
low-bitrate scenario (0.35kbps)
GAN-based (0.35kbps)
SemantiCodec (0.375kbps)
MuCodec-proposed (0.35kbps)
high-bitrate scenario (1.33kbps)
GAN-based (1.33kbps)
SemantiCodec (1.40kbps)
MuCodec-proposed (1.33kbps)


English Speech

Sampled from LibriSpeech

	Sample 1	Sample 2	Sample 3
Original Audio
low-bitrate scenario (0.35kbps)
GAN-based (0.35kbps)
SemantiCodec (0.375kbps)
MuCodec-proposed (0.35kbps)
high-bitrate scenario (1.33kbps)
GAN-based (1.33kbps)
SemantiCodec (1.40kbps)
MuCodec-proposed (1.33kbps)
