How AI Actually Splits A Song Into Stems

Pulling the vocal, drums and bass out of a finished mix sounds impossible, like un-baking a cake. Here's what's really happening, and why some parts come out clean while others stay a little messy.

When you play a song, your speaker cone is doing exactly one thing: moving back and forth along a single path. Every instrument, every voice, the whole band, has already been summed into one blended waveform. There is no hidden "vocal track" tucked inside the file waiting to be extracted. There's just the sum.
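To make "there's just the sum" concrete, here is a minimal sketch in plain Python. The tones, the sample rate, and names like `sine` and `mix_down` are illustrative assumptions, not anything from a real tool. It mixes two sources by adding them sample by sample, then shows that a different pair of sources produces exactly the same mix, which is why the mix alone can't tell you what went into it.

```python
import math

SAMPLE_RATE = 8000  # samples per second (illustrative value)


def sine(freq_hz, n_samples, amplitude):
    """Return a list of samples for a pure tone."""
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE)
        for n in range(n_samples)
    ]


def mix_down(*sources):
    """Mixing is addition: one output sample per time step."""
    return [sum(samples) for samples in zip(*sources)]


# Two stand-in "instruments" (hypothetical: a 440 Hz voice, a 110 Hz bass).
vocal = sine(440, 64, 0.5)
bass = sine(110, 64, 0.8)
song = mix_down(vocal, bass)

# A different pair of sources: shift one up and the other down by the same amount.
fake_vocal = [v + 0.1 for v in vocal]
fake_bass = [b - 0.1 for b in bass]
other_song = mix_down(fake_vocal, fake_bass)

# Both pairs add up to the same waveform, so the waveform cannot say which pair made it.
same_mix = all(
    math.isclose(a, b, abs_tol=1e-12) for a, b in zip(song, other_song)
)
print(same_mix)
```

The second pair is not a realistic recording, and that is the point: only outside knowledge about what vocals and basses usually sound like can rule it out.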

So asking software to hand you back the isolated vocal is asking it to un-bake the cake, to look at the finished mix and reconstruct ingredients that were blended together and, strictly speaking, thrown away. That it works at all is genuinely impressive. Understanding how tells you exactly when to trust it.

Separation isn't magic. It's a very good guess, and the cleaner the mix, the better the guess.

01 The model learns what each part looks like

A separation model is trained on enormous libraries of songs where the isolated tracks are known: the studio has the real vocal, the real drums, the real bass on their own. The network is shown the full mix alongside those true parts, over and over, until it learns what a vocal, a snare, or a bass note tends to look like in a spectrogram, a picture of frequencies over time.

Then, faced with a brand-new song it has never heard, it estimates a kind of stencil for each part (engineers usually call it a mask), deciding which frequencies at each instant belong to the vocal, which to the drums, and so on, and lifts them out. It isn't recovering the original files. It's making an educated guess, informed by every song it trained on.
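The stencil idea can be sketched in a few lines of plain Python. This is a toy under loud assumptions: it handles a single short frame rather than a full spectrogram, the two "instruments" are pure tones, and the stencil is computed from the known parts (an oracle) where a real system would have a trained network predict it. The helper names `dft`, `idft` and `tone` are made up for the example.

```python
import cmath
import math

N = 64  # frame length in samples (illustrative)


def dft(frame):
    """Plain discrete Fourier transform: time samples -> frequency bins."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(bins):
    """Inverse transform: frequency bins -> real time samples."""
    n = len(bins)
    return [
        (sum(bins[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def tone(cycles_per_frame, amplitude):
    return [
        amplitude * math.sin(2 * math.pi * cycles_per_frame * t / N)
        for t in range(N)
    ]


vocal = tone(8, 0.5)  # stand-in "vocal", higher pitch
bass = tone(2, 0.8)   # stand-in "bass", lower pitch
mix = [v + b for v, b in zip(vocal, bass)]

# Oracle stencil: the vocal's share of the energy in each frequency bin.
# ASSUMPTION: a real separator predicts this from the mix alone.
vocal_mag = [abs(x) for x in dft(vocal)]
bass_mag = [abs(x) for x in dft(bass)]
eps = 1e-9  # avoids dividing by zero in empty bins
mask = [v / (v + b + eps) for v, b in zip(vocal_mag, bass_mag)]

# Lay the stencil over the mix's spectrum and convert back to audio samples.
estimate = idft([m * x for m, x in zip(mask, dft(mix))])
max_error = max(abs(e - v) for e, v in zip(estimate, vocal))
print(max_error < 1e-6)
```

Here the two tones live in different frequency bins, so the stencil lifts the vocal out almost perfectly. Real instruments are rarely that polite.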

02 Why some parts come out cleaner

Not every instrument is equally easy to guess, and this is the part worth knowing before you judge a result:

Vocals and drums separate cleanly. Voices have a distinctive fingerprint and drums are sharp bursts in time, so the model can spot them confidently. On a good mix they come out close to the original.
Bass and piano usually come out well. Bass sits in a predictable low range and a piano's notes have a recognizable attack and decay, so there's less guessing.
Synths, strings, and "everything else" are the noisiest. They smear across the same frequencies as the other instruments, so the model can't cleanly decide who owns what. That leftover "other" bucket is the roughest part of any separation, in every tool on the market, not just one.
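Why the overlap hurts can be shown with a toy two-bin "spectrum" in plain Python. As before, this is a sketch with assumptions stated up front: each bin is a single complex number, the stencil is an oracle built from the known parts rather than a network's prediction, and `ratio_mask_estimate` is a hypothetical helper name.

```python
import cmath
import math


def ratio_mask_estimate(target_bins, other_bins):
    """Estimate the target by scaling each bin of the mix by the target's share."""
    estimate = []
    for target, other in zip(target_bins, other_bins):
        mix = target + other
        total = abs(target) + abs(other)
        mask = abs(target) / total if total > 0 else 0.0
        estimate.append(mask * mix)
    return estimate


def worst_error(estimate, truth):
    return max(abs(e - t) for e, t in zip(estimate, truth))


# Case 1: the two parts sit in different bins, like drums against bass.
drums = [1 + 0j, 0j]
bass = [0j, 1 + 0j]
clean_error = worst_error(ratio_mask_estimate(drums, bass), drums)

# Case 2: the two parts share a bin at equal loudness but different phase,
# like strings and a synth pad playing the same note.
strings = [1 + 0j, 0j]
synth = [cmath.rect(1.0, math.pi / 2), 0j]
smeared_error = worst_error(ratio_mask_estimate(strings, synth), strings)

print(clean_error, smeared_error)
```

In the first case the stencil recovers the drums exactly. In the second, the best it can do is hand back half of the blended bin, which is neither instrument. That leftover error is the "rough" sound in the "other" stem.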

03 The recording matters more than the tool

People blame the software when a stem comes out watery, but the bigger factor is usually the source. A clean, well-mixed studio recording gives the model clear ingredients to pull apart. A phone-recorded live clip, a lo-fi upload, or a heavily processed track hands it a mix where the sources are already blurred into each other, and no amount of cleverness fully un-blurs them.

So if a result disappoints, try a cleaner version of the same song before you blame the tool. Garbage in doesn't mean garbage out exactly, but it does mean harder out. That's a limit of what the recording still contains, not a bug.

04 On your device, or up to a server

There's one more design choice that quietly matters: where the separation runs. Many tools upload your audio to their servers, do the work in the cloud, and send stems back. That means your files, including anything you recorded yourself, leave your machine.

Riffloop runs the separation on your device instead, working directly on the YouTube video you're watching or a file you load yourself. It pulls the song into six stems (vocals, drums, bass, guitar, piano and the rest), and nothing is uploaded. For a practice tool you'll point at demos and lesson recordings, keeping the audio local means you never have to think about where your files end up.

05 Knowing the guess makes you use it better

Once you stop expecting perfection, stems become an incredibly useful practice tool. Trust the vocal, drum, and bass isolations. Expect the strings-and-synth bucket to be rough and don't try to transcribe fine detail out of it. Feed it the cleanest recording you can find. Do that and you'll get exactly what separation is good for: hearing a single part clearly enough to finally learn it.

It was never magic. It's a very good guess, trained on a mountain of music, running quietly on your laptop. That's more than enough to change how you practice.

Roy Rosenberg
