Add real-time audio separation to any pipeline
Available on iOS, macOS, Android, Windows, and Linux, with local inference times optimized for each platform, so you always get the best speed wherever you deploy.
What is a stem separation SDK?
Separation models across voice, film, TV, and music
Isolates spoken dialogue from background sound in real-time streams. Cleans voice inputs before they reach ASR, transcription, translation, or A1 audio engineer workflows, delivering a ~25% improvement in ASR accuracy in noisy audio environments.
Removes copyrighted background music from live or streamed audio while preserving dialogue and effects. Built for broadcasters, sports commentary, and content platforms managing copyright compliance on live feeds.
Isolates vocals, individual instruments, or up to 14 distinct instrument stems from any song in real time. Used by music apps, learning platforms, and DJ tools to give users stem-level control inside a consumer experience.
Built for production workflows
Authentic and scalable speech recovery (11ms latency)
Isolate clean dialogue from crowd noise (11ms latency)
Get started with our SDK
FAQ
What kinds of applications does the SDK power?
The SDK powers a broad range of real-time and on-device applications built on isolated audio: music remixing and mashup tools, karaoke and vocal isolation features, interactive fan engagement experiences, music education and practice apps, mobile songwriting tools, and broadcast or streaming workflows requiring clean separated audio.
Which stems can the model separate?
The model supports separation into individual stems including vocals, lead vocals, backing vocals, drums, bass, guitar (acoustic and electric), piano, keys, strings, wind instruments, and more. See the Stem Separation page for the complete list of available stems and model options.
What is AI instrument stem separation?
AI instrument stem separation is the process of using machine learning to isolate individual musical components from a fully mixed audio recording. AudioShake's models can extract individual stems — including vocals, drums, bass, guitar, piano, strings, and more — directly from any mix, without requiring the original multi-track session. The result is a set of isolated audio elements that can be used for remixing, creative tools, education, and interactive experiences.
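To make that concrete, here is a minimal C++ sketch of what extracting stems from a mixed file could look like. Everything in it, including the header name, the as:: namespace, and the StemSeparator type and its methods, is a hypothetical placeholder rather than the SDK's documented API:

```cpp
// Illustrative sketch only: audioshake_sdk.h, as::StemSeparator, and
// separateFile are invented names, not AudioShake's actual API.
#include "audioshake_sdk.h"  // placeholder header

#include <iostream>
#include <string>
#include <vector>

int main() {
    // Assumed: a separator loaded from an encrypted model file.
    as::StemSeparator separator("models/multi_stem.enc");

    // Assumed: one call that returns a named audio buffer per stem.
    std::vector<as::Stem> stems = separator.separateFile("song_mix.wav");

    for (const as::Stem& stem : stems) {
        std::cout << "Extracted stem: " << stem.name << "\n";  // e.g. "vocals"
        stem.writeWav("stems/" + stem.name + ".wav");
    }
    return 0;
}
```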
Why remove music from live or recorded content?
Music captured incidentally in live events, venue footage, sports clips, and social content routinely triggers copyright claims, muted audio, and distribution blocks. AudioShake's system identifies the music present — including song identity and rights holder metadata — and removes it before or during distribution, allowing teams to publish or redistribute content without infringing on music rights they don't hold. This supports DMCA compliance workflows for broadcasters, sports leagues, and creator platforms at scale.
Can music be removed while preserving dialogue and crowd noise?
Yes. AudioShake's separation model isolates music as a discrete audio element, leaving dialogue, crowd noise, commentary, and ambient sound intact in the output. This makes it well-suited for live sports broadcasts, events, and streaming workflows where crowd atmosphere and commentary are part of the content's value.
How do the music detection and removal models work together?
AudioShake's music detection model scans audio or video content, identifies where music is present, and returns song-level metadata including track title, artist, and rights holder information. The music removal model then separates and eliminates the detected music from the audio, producing a clean output with all other elements — dialogue, ambient sound, and effects — fully preserved. The two models work together as AudioShake's Copyright Compliance system and are available via SDK for real-time and on-device workflows.
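As a rough illustration of that two-stage flow, the sketch below chains a hypothetical detector and remover. The type names, metadata fields, and method signatures are invented for this example:

```cpp
// Hypothetical two-stage pipeline: detect music, then remove it.
#include "audioshake_sdk.h"  // placeholder header

#include <iostream>

int main() {
    as::MusicDetector detector("models/detection.enc");  // invented name
    as::MusicRemover  remover("models/removal.enc");     // invented name

    // Stage 1 (assumed API): scan the file, collect song-level metadata.
    for (const as::MusicSegment& seg : detector.scanFile("broadcast.wav")) {
        std::cout << seg.startSec << "s-" << seg.endSec << "s: "
                  << seg.title << " / " << seg.artist
                  << " (rights: " << seg.rightsHolder << ")\n";
    }

    // Stage 2 (assumed API): strip the detected music, keep everything else.
    remover.removeMusic("broadcast.wav", "broadcast_clean.wav");
    return 0;
}
```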
What kinds of background noise can the dialogue isolation model handle?
The model handles a wide range of noise conditions including crowd noise, PA bleed, wind, music bleed, and ambient environmental sound. Unlike noise suppression tools that model and subtract a noise profile, AudioShake uses AI source separation — isolating the speech signal directly — which makes it more resilient to sudden or unpredictable noise without manual configuration.
Which use cases is dialogue isolation built for?
The SDK is designed for applications that require clean speech in complex acoustic environments: live broadcast and sports production, real-time captioning and transcription pipelines, voice AI and ASR preprocessing, multilingual localization, streaming infrastructure, and conferencing tools.
Does dialogue isolation work in real time?
Yes. AudioShake's dialogue isolation model separates clean speech from background noise, crowd noise, music, and other competing audio at latencies as low as 11ms, making it suitable for live production as well as file-based workflows. The model produces two output streams simultaneously — a clean dialogue stem and a separate background stem — giving applications independent control over both.
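The dual-output design can be sketched as a per-frame processing loop. The frame size, callback shape, and processFrame signature below are assumptions for illustration, not the SDK's real interface:

```cpp
// Hypothetical real-time loop: one input frame in, two stems out per call.
#include "audioshake_sdk.h"  // placeholder header

#include <array>

constexpr int kFrameSamples = 512;  // assumed frame size, ~11 ms at 48 kHz

void onAudioFrame(const float* input, as::DialogueIsolator& isolator) {
    std::array<float, kFrameSamples> dialogue{};
    std::array<float, kFrameSamples> background{};

    // Assumed API: fills both output buffers in a single pass.
    isolator.processFrame(input, dialogue.data(), background.data(),
                          kFrameSamples);

    // The app now controls each stream independently, e.g. send the clean
    // dialogue to an ASR engine and mix the background back in quietly.
}
```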
What do I need to get started with the SDK?
SDK access requires a Client ID and Client Secret from AudioShake, along with encrypted model files for your target platform. Contact info@audioshake.ai to request access and get started.
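Initialization might look roughly like the following, where as::init, as::Config, and the field names are invented stand-ins, and reading credentials from environment variables is simply a common convention:

```cpp
// Hypothetical initialization with credentials and an encrypted model file.
#include "audioshake_sdk.h"  // placeholder header

#include <cstdlib>
#include <stdexcept>
#include <string>

// Read a required environment variable or fail loudly.
static std::string requireEnv(const char* name) {
    const char* value = std::getenv(name);
    if (value == nullptr) {
        throw std::runtime_error(std::string("missing env var: ") + name);
    }
    return value;
}

as::Context createContext() {
    as::Config config;
    config.clientId     = requireEnv("AUDIOSHAKE_CLIENT_ID");      // from AudioShake
    config.clientSecret = requireEnv("AUDIOSHAKE_CLIENT_SECRET");  // from AudioShake
    config.modelPath    = "models/dialogue_isolation.enc";  // platform-specific file

    return as::init(config);  // assumed entry point
}
```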
Which platforms and hardware does the SDK support?
The SDK supports iOS, macOS, Android, Windows, and Linux. GPU acceleration is available on all platforms — Metal and Neural Engine on Apple devices, CUDA on Linux and Windows, DirectX 12 on Windows, and OpenGL ES on Android — with CPU-only processing available as a fallback.
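A plausible way an application could pick an acceleration backend per platform, with CPU fallback, reusing the same invented as::Config from the sketch above:

```cpp
// Hypothetical backend selection mirroring the documented acceleration paths.
#include "audioshake_sdk.h"  // placeholder header

as::Backend pickBackend() {
#if defined(__APPLE__)
    return as::Backend::Metal;      // Metal / Neural Engine on Apple devices
#elif defined(_WIN32)
    return as::Backend::DirectX12;  // or CUDA on NVIDIA hardware
#elif defined(__ANDROID__)
    return as::Backend::OpenGLES;   // check before the generic Linux case
#else
    return as::Backend::CUDA;       // Linux
#endif
}

void configure(as::Config& config) {
    config.backend       = pickBackend();
    config.fallbackToCpu = true;  // assumed flag: CPU-only processing as fallback
}
```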
Does the SDK process audio on-device?
Yes. The AudioShake SDK runs all audio processing models locally on-device, with no network calls or cloud processing required. Audio never leaves the device, making it well-suited for privacy-sensitive workflows, offline applications, embedded systems, and any use case where network latency is not acceptable. Both real-time streaming and file-based processing are supported without a connection.
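Finally, a sketch of fully offline, file-based use; processFile is again a hypothetical method name:

```cpp
// Hypothetical file-based (offline) processing: no network calls involved.
#include "audioshake_sdk.h"  // placeholder header

int main() {
    as::DialogueIsolator isolator("models/dialogue_isolation.enc");

    // Assumed API: reads a local file, writes both stems locally.
    isolator.processFile("interview_raw.wav",
                         "interview_dialogue.wav",
                         "interview_background.wav");
    return 0;
}
```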