This paper proposes Separate and Diffuse (SAD), an approach to source separation built on a pretrained diffusion model. SAD uses the diffusion model to separate the individual sources in a mixed signal. The authors evaluate SAD on several benchmark datasets and report that it outperforms state-of-the-art source separation methods.
The key insight behind SAD is to use a pretrained diffusion model to transform the mixed signal into a "diffused" representation in which each source is separated from the others. The transformation is built from a series of invertible steps, which the authors describe as allowing efficient and exact source separation. They show that applying these steps separates the sources while minimizing the distortion between the original and separated signals.
The proposed SAD model consists of two main components: (1) a pretrained diffusion model, and (2) an adaptation module that fine-tunes the diffusion model for the specific source separation task at hand. The diffusion model is trained on a large dataset of audio samples and learns to transform the mixed signal into a diffused representation that captures the underlying sources. The adaptation module then refines this representation for the task by learning a mapping from the diffused representation to the desired separated signals.
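The two-component design described above can be illustrated with a toy sketch. This is not the authors' implementation: the "pretrained model" here is a fixed orthogonal matrix standing in for a frozen network, the "adaptation module" is a linear map fit by least squares, and the sources are synthetic frames that occupy disjoint halves of each frame. All names (`frozen_model`, `fit_adapter`, `separate`, `DIM`) are hypothetical. The only point is the division of labor: the first component stays fixed, and only the second is fit to the separation task.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # frame length (hypothetical, chosen for illustration)

# Stand-in for the pretrained weights: a fixed orthogonal matrix, never updated.
FROZEN_W, _ = np.linalg.qr(rng.standard_normal((DIM, DIM)))


def frozen_model(mixture):
    """Map mixture frames to a representation using the frozen weights."""
    return mixture @ FROZEN_W


def make_frames(n):
    """Toy sources: source 1 lives in the first half of a frame, source 2 in the second."""
    s1 = np.zeros((n, DIM))
    s2 = np.zeros((n, DIM))
    s1[:, : DIM // 2] = rng.standard_normal((n, DIM // 2))
    s2[:, DIM // 2 :] = rng.standard_normal((n, DIM // 2))
    return s1, s2


def fit_adapter(representation, targets):
    """Fit the adaptation module: least-squares map from representation to sources."""
    adapter, *_ = np.linalg.lstsq(representation, targets, rcond=None)
    return adapter


def separate(mixture, adapter):
    """Run the frozen model, then the adapter, and split the output per source."""
    estimate = frozen_model(mixture) @ adapter
    return estimate[:, :DIM], estimate[:, DIM:]


# Fit only the adapter on training mixtures; the frozen model is untouched.
train_s1, train_s2 = make_frames(200)
adapter = fit_adapter(
    frozen_model(train_s1 + train_s2), np.hstack([train_s1, train_s2])
)

# Evaluate on held-out mixtures against a naive "half the mixture" baseline.
test_s1, test_s2 = make_frames(50)
test_mix = test_s1 + test_s2
est_s1, est_s2 = separate(test_mix, adapter)
adapted_mse = float(np.mean((est_s1 - test_s1) ** 2 + (est_s2 - test_s2) ** 2))
baseline_mse = float(
    np.mean((0.5 * test_mix - test_s1) ** 2 + (0.5 * test_mix - test_s2) ** 2)
)
print(f"adapted MSE: {adapted_mse:.2e}, baseline MSE: {baseline_mse:.2e}")
```

Because the toy sources occupy disjoint subspaces, a linear adapter can recover them almost exactly; real audio sources overlap, which is why the summary describes a learned, task-specific adaptation module rather than a closed-form fit.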
The authors evaluate SAD on several benchmark datasets, including Clean Mix, Dirty Mix, and LibriSpeech. They report that SAD outperforms state-of-the-art source separation methods on both objective metrics (e.g., Signal-to-Noise Ratio) and subjective evaluations (e.g., human listening tests). They also demonstrate its versatility by applying it to several source separation tasks, including speech separation, music separation, and mixture separation.
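As a concrete reference for the objective metric mentioned above, here is a minimal sketch of the standard Signal-to-Noise Ratio in decibels, 10 * log10(signal power / error power), computed between a reference source and its separated estimate. The function name `snr_db` and the small `eps` stabilizer are assumptions made for illustration; the summary does not state which exact SNR variant the authors use.

```python
import numpy as np


def snr_db(reference, estimate, eps=1e-12):
    """Signal-to-Noise Ratio in dB between a reference signal and an estimate.

    eps guards against division by zero and log of zero (an illustrative choice).
    """
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    signal_power = np.sum(reference ** 2)
    noise_power = np.sum((reference - estimate) ** 2)
    return float(10.0 * np.log10((signal_power + eps) / (noise_power + eps)))


# Example: an estimate whose amplitude is 10% too large.
t = np.linspace(0.0, 1.0, 1000, endpoint=False)
reference = np.sin(2.0 * np.pi * 5.0 * t)
estimate = 1.1 * reference
example_snr = snr_db(reference, estimate)
print(f"SNR: {example_snr:.1f} dB")
```

In this example the error is one tenth of the signal amplitude, so the error power is one hundredth of the signal power and the SNR is about 20 dB. Higher values indicate an estimate closer to the reference.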
In summary, Separate and Diffuse is an approach to source separation that leverages the strengths of diffusion models. By transforming the mixed signal into a diffused representation in which the sources are separated, SAD can separate sources effectively in a variety of scenarios. Its simplicity and efficiency make it a promising method for applications such as speech recognition and music processing.