Abstract
Music recordings often suffer from audio quality issues such as excessive reverberation, distortion, clipping, tonal imbalances, and a narrowed stereo image, especially when created in non-professional settings without specialized equipment or expertise. These problems are typically corrected using separate specialized tools and manual adjustments. In this paper, we introduce SonicMaster, the first unified generative model for music restoration and mastering that addresses a broad spectrum of audio artifacts with text-based control. SonicMaster is conditioned on natural language instructions to apply targeted enhancements, or can operate in an automatic mode for general restoration. To train this model, we construct the SonicMaster dataset, a large collection of paired degraded and high-quality tracks, by simulating common degradation types with nineteen degradation functions that fall into five enhancement groups: equalization, dynamics, reverb, amplitude, and stereo. Our approach uses a flow-matching generative training paradigm to learn an audio transformation that maps degraded inputs to their cleaned, mastered versions, guided by text prompts. Objective audio quality metrics demonstrate that SonicMaster significantly improves sound quality across all artifact categories. Furthermore, subjective listening tests confirm that listeners prefer SonicMaster's enhanced outputs over those of the baselines.
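To make the idea of simulated training pairs concrete, here is a minimal NumPy sketch of two degradations (hard clipping and stereo narrowing) and how each could be paired with a text instruction. The function names, parameter values, and the prompt-to-degradation mapping are illustrative assumptions, not the actual nineteen degradation functions used to build the SonicMaster dataset.

```python
import numpy as np

def hard_clip(audio: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Simulate clipping by limiting every sample to [-threshold, threshold]."""
    return np.clip(audio, -threshold, threshold)

def narrow_stereo(audio: np.ndarray, width: float = 0.2) -> np.ndarray:
    """Shrink the stereo image of a (2, n_samples) signal via mid/side scaling.

    width=1.0 leaves the signal unchanged; width=0.0 collapses it to mono.
    """
    left, right = audio
    mid = 0.5 * (left + right)
    side = 0.5 * (left - right)
    return np.stack([mid + width * side, mid - width * side])

# Hypothetical mapping from a degradation name to (function, instruction).
# The prompts are taken from the sample table on this page; which degradation
# each prompt belongs to is an assumption made for illustration.
DEGRADATIONS = {
    "clipping": (hard_clip, "Make the audio smoother and less distorted."),
    "stereo": (
        narrow_stereo,
        "Disentangle the left and right channels to give this song a stereo feeling.",
    ),
}

def make_training_pair(clean: np.ndarray, name: str):
    """Return (degraded, clean, prompt) for one simulated degradation."""
    degrade, prompt = DEGRADATIONS[name]
    return degrade(clean), clean, prompt

# Usage: a one-second stereo test tone at 44.1 kHz stands in for a music track.
t = np.linspace(0.0, 1.0, 44100, endpoint=False)
clean = np.stack([np.sin(2 * np.pi * 440 * t), 0.8 * np.sin(2 * np.pi * 660 * t)])
degraded, target, prompt = make_training_pair(clean, "clipping")
```

In this sketch the clean track serves as the restoration target, so the same source can be reused with different degradations and prompts.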
🔍 Key Contributions of SonicMaster
- Unified and Controllable Music Restoration Framework
SonicMaster is the first model to address multiple real-world music degradations—such as reverb, clipping, spectral imbalance, and stereo artifacts—within a single generative framework. It supports both automatic and instruction-based restoration through natural-language prompts.
- Hybrid Flow Matching Architecture for High-Fidelity Output
By combining Multimodal Diffusion Transformers (MM-DiT) with standard DiT blocks in the latent domain, SonicMaster achieves high-quality restoration while preserving musical fidelity and stereo realism across diverse degradation types.
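For readers unfamiliar with flow matching, the following NumPy sketch shows a common linear-path (rectified-flow style) formulation: a network is trained to predict the velocity that carries a degraded latent toward its clean counterpart, and restoration integrates that velocity with Euler steps. The linear path, the choice of the degraded latent as the starting point, and all function names are assumptions for illustration. This page does not specify SonicMaster's exact objective, and text conditioning is omitted here.

```python
import numpy as np

def flow_matching_example(degraded_latent, clean_latent, rng):
    """Build one training example for a linear-path flow-matching objective."""
    t = rng.uniform(0.0, 1.0)                              # random time in [0, 1)
    x_t = (1.0 - t) * degraded_latent + t * clean_latent   # point on the straight path
    target_velocity = clean_latent - degraded_latent       # constant along that path
    return t, x_t, target_velocity

def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((predicted_velocity - target_velocity) ** 2))

def euler_sample(velocity_fn, degraded_latent, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = degraded_latent.copy()
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

# Usage with random latents and an oracle velocity that stands in for the
# trained network (in the real model, an MM-DiT/DiT stack would play this role).
rng = np.random.default_rng(0)
degraded_latent = rng.normal(size=(8, 16))
clean_latent = rng.normal(size=(8, 16))

t, x_t, v_target = flow_matching_example(degraded_latent, clean_latent, rng)
oracle = lambda x, time: clean_latent - degraded_latent
restored = euler_sample(oracle, degraded_latent, steps=10)
```

With the oracle velocity, the Euler integration lands on the clean latent. A trained network only approximates this velocity, so the step count becomes a trade-off between speed and quality.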
🎼 Dataset Contribution
We introduce a large-scale synthetic and real-world degradation dataset paired with natural-language restoration instructions. This benchmark enables training and evaluation of controllable music restoration models and supports research in multi-task generative audio modeling.
Comparative Samples
| Text Prompt | Ground Truth | Music to Enhance | SonicMaster |
|---|---|---|---|
| Increase the clarity of this song by emphasizing treble frequencies. | |||
| Can you make this sound louder, please? | |||
| Improve the balance in this song. | |||
| Correct the unnatural frequency emphasis. Reduce the roominess or echo. | |||
| Increase the clarity of this song by emphasizing treble frequencies. | |||
| Clean this off any echoes! | |||
| Make the sound less squashed and more open. | |||
| Make this song sound more boomy by amplifying the low end bass frequencies. | |||
| Make the audio smoother and less distorted. | |||
| Disentangle the left and right channels to give this song a stereo feeling. | |||
| Raise the level of the vocals, please. | |||
| Please, dereverb this audio. | |||
| Disentangle the left and right channels to give this song a stereo feeling. | |||
Baseline Comparison
| Ground Truth | Music to Enhance | Mel2Mel | Text2FX-EQ | SonicMaster |
|---|---|---|---|---|
Historic Piano Recordings
| Original | LTAS-EQ | BEHM-GAN | BABE | BABE-2 | SonicMaster |
|---|---|---|---|---|---|
Resources
1. Code repository: GitHub Link
2. Model checkpoints: Hugging Face Link