This paper proposes Amphion, a framework that simplifies audio generation. Here "audio" has two senses: narrowly, it refers to sound effects; broadly, it covers sound effects, music, and speech. The authors aim to give beginners a way to generate high-quality audio by unifying scattered repositories, which often lack systematic evaluation metrics and are difficult to compare.
The Amphion framework consists of three layers: a bottom layer for data processing, a middle layer for optimization algorithms, and a top layer that provides a unified infrastructure for all audio generation tasks. This design lets users switch between audio generation tasks by modifying a single recipe.
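To make the layering concrete, the following is a minimal sketch of how a single recipe could select a task on top of shared lower layers. It is not Amphion's actual API: every name here (`preprocess`, `OptimizerConfig`, `TASKS`, `run_recipe`, and the recipe keys `task` and `optimizer`) is a hypothetical stand-in chosen for illustration, and the "tasks" only return a description string instead of training a model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


# Bottom layer (hypothetical): data processing shared by every task.
def preprocess(samples: List[float]) -> List[float]:
    """Peak-normalize raw samples into the range [-1, 1]."""
    peak = max((abs(s) for s in samples), default=0.0)
    return [s / peak for s in samples] if peak > 0 else list(samples)


# Middle layer (hypothetical): optimization settings shared by every task.
@dataclass
class OptimizerConfig:
    learning_rate: float = 1e-4
    max_steps: int = 1000


# Top layer (hypothetical): one registry through which all tasks are run.
TaskFn = Callable[[List[float], OptimizerConfig], str]
TASKS: Dict[str, TaskFn] = {}


def register(name: str) -> Callable[[TaskFn], TaskFn]:
    """Decorator that adds a task function to the registry under `name`."""
    def decorator(fn: TaskFn) -> TaskFn:
        TASKS[name] = fn
        return fn
    return decorator


@register("tts")
def run_tts(features: List[float], opt: OptimizerConfig) -> str:
    # Placeholder for a text-to-speech pipeline; only reports its inputs.
    return f"tts: {len(features)} samples, {opt.max_steps} steps"


@register("svc")
def run_svc(features: List[float], opt: OptimizerConfig) -> str:
    # Placeholder for a singing-voice-conversion pipeline.
    return f"svc: {len(features)} samples, {opt.max_steps} steps"


def run_recipe(recipe: dict, samples: List[float]) -> str:
    """Run the task named in the recipe using the shared lower layers."""
    features = preprocess(samples)
    opt = OptimizerConfig(**recipe.get("optimizer", {}))
    return TASKS[recipe["task"]](features, opt)
```

Under these assumptions, switching tasks means changing one field of the recipe, for example from `{"task": "tts"}` to `{"task": "svc"}`, while the data processing and optimizer configuration stay the same.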
To make Amphion more accessible, the authors provide visualizations that show the internal workings of generative models. They also supply a self-contained, easy-to-follow recipe for each model.
In summary, Amphion streamlines audio generation by integrating scattered repositories into a single framework. Its visualizations and self-contained recipes lower the barrier for beginners who want to generate high-quality audio.