SAM Audio by Meta

Meta shipped SAM (Segment Anything Model) Audio, a first-of-its-kind model for segmenting/isolating any sound of interest from complex audio tracks using multimodal prompts (e.g. text, visual cues).

Meta built SAM Audio using a new Perception Encoder Audiovisual (PE-AV). Under the hood, PE-AV is built on top of Meta's existing Perception Encoder model. In a nutshell, PE-AV processes the video frame by frame, processes the audio, and syncs the two temporally to identify which sounds belong to which visual events.
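
As a rough illustration of that temporal syncing idea (this is not Meta's PE-AV code; the toy encoders, feature sizes, and similarity step are all assumptions), the sketch below projects per-frame visual features and per-chunk audio features into a shared space and scores how well they line up over time:

```python
import torch
import torch.nn.functional as F

# Toy stand-ins for the visual and audio encoders (assumptions, not PE-AV).
# Real encoders would be deep networks; a single projection is enough here to
# illustrate aligning the two streams on a shared timeline.
visual_proj = torch.nn.Linear(2048, 512)   # per-frame visual features -> shared space
audio_proj = torch.nn.Linear(128, 512)     # per-chunk audio features -> shared space

# Fake inputs: 30 video frames and 30 matching audio chunks (one chunk per frame).
frame_feats = torch.randn(30, 2048)
audio_feats = torch.randn(30, 128)

v = F.normalize(visual_proj(frame_feats), dim=-1)  # (T, 512)
a = F.normalize(audio_proj(audio_feats), dim=-1)   # (T, 512)

# Frame-by-frame similarity: high values on the diagonal suggest the sound at
# time t lines up with what is visible at time t, which is the intuition behind
# tying sounds to visual events.
sim = v @ a.T                    # (T, T) similarity matrix over time
per_frame_match = sim.diagonal() # alignment score at each time step
print(per_frame_match.shape)     # torch.Size([30])
```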

High-level architecture of SAM Audio.

SAM Audio lets users isolate/segment audio through text prompting (describing the specific sound they want), visual prompting (clicking on objects or people in a video to extract the sound they make), span prompting (marking the time segments in the source video/audio where the desired sound occurs), or any combination of the three.
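
To make the three prompt types concrete, here is a minimal sketch of how they could be represented and combined. The class names and fields are hypothetical, not SAM Audio's real interface:

```python
from dataclasses import dataclass

# Hypothetical prompt containers -- names and fields are illustrative only.

@dataclass
class TextPrompt:
    description: str                  # e.g. "the barking dog"

@dataclass
class VisualPrompt:
    frame_index: int                  # which video frame was clicked
    point_xy: tuple[float, float]     # click location in that frame

@dataclass
class SpanPrompt:
    start_sec: float                  # beginning of the time span containing the sound
    end_sec: float                    # end of the time span

# Any combination of the three can describe the same target sound.
prompts = [
    TextPrompt("the barking dog"),
    VisualPrompt(frame_index=42, point_xy=(0.61, 0.33)),
    SpanPrompt(start_sec=3.0, end_sec=7.5),
]

for p in prompts:
    print(type(p).__name__, p)
```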

The model was trained on 100 million high-fidelity videos and released in six variants: small (& small-tv), base (& base-tv), and large (& large-tv).

Meta also released an audio separation benchmark (SAM Audio-Bench) covering speech, music, and sound effects across text, visual, and span prompts, to help evaluate future audio extraction models.
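
For context on how separation quality is typically scored, below is a small sketch computing SI-SDR (scale-invariant signal-to-distortion ratio), a common metric for comparing an extracted track against a reference. The write-up does not say which metrics SAM Audio-Bench uses, so treat this purely as a generic example:

```python
import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant SDR in dB between an extracted track and the reference.

    A generic separation metric; not necessarily what SAM Audio-Bench reports.
    """
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to remove any gain difference.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return float(10 * np.log10((np.sum(target**2) + eps) / (np.sum(noise**2) + eps)))

# Toy check: a slightly noisy copy of the reference should score well.
rng = np.random.default_rng(0)
ref = rng.standard_normal(16_000)                # 1 second of audio at 16 kHz
est = ref + 0.01 * rng.standard_normal(16_000)   # near-perfect "extraction"
print(f"SI-SDR: {si_sdr(est, ref):.1f} dB")
```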