SAM Audio by Meta
Meta shipped SAM (Segment Anything Model) Audio, a first-of-its-kind model for segmenting and isolating any sound of interest from complex audio tracks using multimodal prompts (e.g., text or visual cues).
Meta built SAM Audio on a new Perception Encoder Audiovisual (PE-AV). Under the hood, PE-AV extends Meta's existing Perception Encoder model. In a nutshell, PE-AV processes the video frame by frame, processes the audio, and syncs the two temporally to identify which sounds belong to which visual events.
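The details of PE-AV's alignment mechanism aren't spelled out here, but the core idea of tying sounds to co-occurring visual events can be sketched as embedding both modalities into a shared space and comparing them over time. The encoder objects and tensor shapes below are hypothetical placeholders, not the actual PE-AV interface.

```python
# Illustrative sketch only: the encoder callables and shapes are hypothetical
# stand-ins, not the real PE-AV API.
import torch
import torch.nn.functional as F

def align_audio_to_video(video_frames: torch.Tensor,
                         audio_windows: torch.Tensor,
                         video_encoder, audio_encoder) -> torch.Tensor:
    """Return a (num_audio_windows, num_video_frames) similarity matrix.

    A high value at (t_a, t_v) suggests the sound in audio window t_a
    co-occurs with the visual content of video frame t_v.
    """
    v = video_encoder(video_frames)    # (T_v, D) per-frame visual embeddings
    a = audio_encoder(audio_windows)   # (T_a, D) per-window audio embeddings

    # Cosine similarity between every audio window and every video frame.
    v = F.normalize(v, dim=-1)
    a = F.normalize(a, dim=-1)
    return a @ v.T                     # (T_a, T_v)
```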
SAM Audio lets users isolate or segment audio via text prompting (describing the specific sound they want), visual prompting (clicking on objects or people in a video to extract their sound), span prompting (marking the time segments in the source video/audio where the desired sound occurs), or any combination of the three.
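As a rough illustration of how these prompt types might combine in code, here is a hypothetical sketch; the class names, fields, and the `separate()` signature are illustrative placeholders, not the published SAM Audio API.

```python
# Hypothetical usage sketch: none of these names come from the actual release.
from dataclasses import dataclass
from typing import Optional

@dataclass
class VisualPrompt:
    frame_idx: int  # video frame the user clicked on
    x: int          # click coordinates of the sounding object
    y: int

@dataclass
class SpanPrompt:
    start_s: float  # start of the time span containing the target sound
    end_s: float    # end of the time span

def separate(audio, text: Optional[str] = None,
             visual: Optional[VisualPrompt] = None,
             span: Optional[SpanPrompt] = None):
    """Isolate one sound from `audio` using any combination of the three prompts."""
    if text is None and visual is None and span is None:
        raise ValueError("Provide at least one prompt: text, visual, or span.")
    ...  # model inference would go here

# Example calls mixing the prompt types:
# separate(audio, text="the acoustic guitar")
# separate(audio, visual=VisualPrompt(frame_idx=120, x=640, y=360))
# separate(audio, text="dog barking", span=SpanPrompt(start_s=3.0, end_s=5.5))
```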
The model was trained on 100 million high-fidelity videos and released in six variants: small (& small-tv), base (& base-tv), and large (& large-tv).
Meta also released an audio separation benchmark (SAM Audio-Bench) covering speech, music, and sound effects across text, visual, and span prompts, to help evaluate future audio extraction models.
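The exact metrics SAM Audio-Bench reports aren't stated here, but separation benchmarks are commonly scored with SI-SDR (scale-invariant signal-to-distortion ratio); a minimal reference implementation is sketched below for context, and its use by this benchmark is an assumption.

```python
# SI-SDR is a standard separation metric; whether SAM Audio-Bench uses it
# is an assumption, not something stated in the release.
import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray, eps: float = 1e-8) -> float:
    """Higher is better: how closely the extracted track matches the reference."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to remove scale differences.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return 10 * np.log10(np.dot(target, target) / (np.dot(noise, noise) + eps))
```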