Audio Decomposition – open-source seperation of music to constituent instruments

312 points by thunderbong 5 days ago | 61 comments

The title is a bit confusing as open-source separation of ... reads like source separation, which this is not. Rather, it is a pitch detection algorithm which also classifies the instrument the pitch originated with.

I think it's really neat, but the results look like it could take more time to fix the output than using a manual approach (if really accurate results are required).

earthnail 4 days ago | root | parent | next |

Thanks for clarifying.

In fairness to the author, he is still at high school: https://matthew-bird.com/about.html

Amazing work for that age.

veunes 4 days ago | root | parent | next |

He's definitely a talent to watch!

timlod 4 days ago | root | parent | prev |

Wow, I didn't see that. Great to see this level of interest early on!

Is “source separation” better known as “stem separation” or is that something else? I think the latter term is the one I usually hear from musicians who are interested in taking a single audio file and recovering (something approximating) the original tracks prior to mixing (i.e. the “stems”).

timlod 4 days ago | root | parent | next |

Audio Source Separation I think is the general term used in research. It is often applied to musical audio though, where you want to do stem separation - that's source separation where you want to isolate audio stems, a term referring to audio from related groups of signals, e.g. drums (which can contain multiple individual signals, like one for each drum/cymbal).

Stem separation refers to doing it with audio playback fidelity (or an attempt at that). So it should pull the bass part out at high enough fidelity to be reused as a bass part.

This is a partly solved problem right now. Some tracks and signal types can be unmixed easier than others, it depends on what the sources are and how much post-processing (reverb, side chaining, heavy brick wall limiting and so on)

dylan604 4 days ago | root | parent |

> This is a partly solved problem right now.

I'd agree with the partly. I have yet to find one that either isolates an instrument as a separate file or removes one from the rest of the mix that does not negatively impact the sound. The common issues I hear are similar to the early internet low bit rate compression. The new "AI" versions are really bad at this, but even the ones available before the AI craze were still susceptible

mh- 4 days ago | root | parent |

I'm far (far) from an expert in this field, but when you think about how audio is quantized into digital form, I'm really not sure how one solves this with the current approaches.

That is: frequencies from one instrument will virtually always overlap with another one (including vocals), especially considering harmonics.

Any kind of separation will require some pretty sophisticated "reconstruction" it seems to me, because the operation is inherently destructive. And then the problem becomes one of how faithful the "reproduction" is.

This feels pretty similar to the inpainting/outpainting stuff being done in generative image editing (a la Photoshop) nowadays, but I don't think anywhere near the investment is being made in this field.

Very interested to hear anyone with expertise weigh in!

nineteen999 3 days ago | root | parent |

I won't say expertise, but what I've done recently:

1) used PixBim AI to extract "stems" (drums, bass, piano, all guitars, vocals). Obviously a lossless source like FLAC works better than MP3 here

2) imported the stems to ProTools.

3) from there, I will usually re-record the bass, guitars, pianos and vocals myself. Occassionally the drums as well.

This is a pretty good way I found to record covers of tracks at home, re-using the original drums if I want to, keeping the tempo of the original track intact etc. I can embellish/replace/modify/simplify parts that I re-record obviously.

It's a bit like drawing using tracing paper, you're creating a copy to the best of your ability, but you have a guide underneath to help you with placement.

popalchemist 4 days ago | root | parent | prev |

Source separation is a general term, stem separation is a specific instance of source separation.

emptiestplace 4 days ago | root | parent | prev |

No, it doesn't read like that. The hyphen completely eliminates any possible ambiguity.

ipsum2 4 days ago | root | parent | next |

The title of the submission was modified. It you read the article it says:

Audio Decomposition [Blind Source Seperation]

croes 4 days ago | root | parent | prev |

Maybe added later by OP? Because there is no hyphen in the article’s subtitle.

>Open source seperation of music into constituent instruments.

emptiestplace 4 days ago | root | parent |

The complaint:

> The title is a bit confusing as open-source separation of ... reads like source separation, which this is not.

loubbrad 4 days ago | prev | next |

I didn't see it referenced directly anywhere in this post. However, for those interested, automatic music transcription (i.e., audio->MIDI) is actually a decently sized subfield of deep learning and music information retrieval.

There have been several successful models for multi-track music transcription - see Google's MT3 project (https://research.google/pubs/mt3-multi-task-multitrack-music...). In the case of piano transcription, accuracy is nearly flawless at this point, even for very low-quality audio:

https://github.com/EleutherAI/aria-amt

Full disclaimer: I am the author of the above repo.

Earw0rm 4 days ago | root | parent | next |

He's trying to solve a second (also hard ish) problem as well, deriving an accurate musical score from MIDI data. It's a "sounds easy but isn't" problem, especially when audio to MIDI transcribers are great at pitch and onset times, but rather less reliable at duration and velocity.

loubbrad 4 days ago | root | parent |

I agree that the audio->score and MIDI->score problems are quite hard. There has been research in this area too, however it is far less developed than audio->MIDI.

Earw0rm 4 days ago | root | parent |

That's because MIDI doesn't contain all the information that was in a score.

Scores are interpreted by musicians to create a performance, and MIDI is a capture of (some of) the data about that performance. Music engraving is full of implicit and explicit cultural rules, and getting it _right_ has parallels with handwritten kanji script in terms of both the importance of correctness to the reader, and the amount of traps for the unwary or uncultured.

All of which can be taken to mean "classical musicians are incredibly picky and anal about this stuff", or, "well-formed music notation conveys all sorts of useful contextual information beyond simply 'what note to play when'".

pclmulqdq 4 days ago | root | parent |

A lot of modern scores are written with MIDI in mind (whether or not the composer knows it - that's how they hear it the first 50 or so times). That should make it somewhat easier to go MIDI -> score for similar pieces. Current attempts I have seen still make a lot of stupid errors like making note durations too precise and spelling accidentals badly. There's probably still a lot of low-hanging fruit.

This is absolutely not easy, though, given all the cultural context. Things like picking up a "legato" or "cantabile" marking and choosing an accent vs a dagger or a marcato mark are going to be very difficult no matter what.

I know the reported scores of MT3 are very good, but have you had success with using it yourself?

https://replicate.com/turian/multi-task-music-transcription

I ported their colab to runtime so I could use it more easily.

The MIDI output is... puzzling?

I've tried feeding it even simple stems and found the output unusable for some tracks, i.e. the MIDI output and audio were not well aligned and there were timing issues. On other audio it seemed to work fine.

loubbrad 4 days ago | root | parent | next |

Multi-track transcription has a long way to go before it seriously useful for real-world applications. Ultimately I think that converting audio into MIDI makes a lot more sense for piano/guitar transcription than it does for complex multi-instrument works with sound effects ect...

Luckily for me, audio-to-seq approaches do work very well for piano, which turns out to be an amazing way of getting expressive MIDI data for training generative models.

air217 4 days ago | root | parent | prev |

I developed https://pyaar.ai, it uses MT3 under the hood. I realized that continuous string instruments (guitar) that have things like slides, bends are quite difficult to capture in MIDI. Piano works much better because it's more discrete (the keys abstract away the strings) and so the MIDI file has better representation

duped 4 days ago | root | parent |

> I realized that continuous string instruments (guitar) that have things like slides, bends are quite difficult to capture in MIDI.

It's just pitch bend?

I think trying to transcribe as MIDI is just a fundamentally flawed approach that has too many (well known) pitfalls to be useful.

A trained human can listen to a piece and transcribe it in seconds, but programming it as MIDI could take minutes/hours. If you're not trying to replicate how humans learn by ear, you're probably approaching this wrong.

WiSaGaN 4 days ago | root | parent | prev |

How does the problem simplify when it's restricted to piano?

loubbrad 4 days ago | root | parent |

Essentially, the leading way to do automatic music transcription is to train a neural network on supervised data, i.e., paired audio-MIDI data. In the case of piano recordings, there is a very good dataset for this task which was released by Google in 2018:

https://magenta.tensorflow.org/datasets/maestro

Most current research involves refining deep learning based approaches to this task. When I worked on this problem earlier this year, I was interested in adding robustness to these models by training a sort of musical awareness into them. You can see a good example of it in this tweet:

https://x.com/loubbrad/status/1794747652191777049

fxj 4 days ago | prev | next |

If you are interested in audio (or stem) separation have a look at RipX

https://hitnmix.com/ripx-daw-pro/

It can even export the separated tracks as midi files. It still has some problems but works very well. Stem separation is now standard in the musical software and almost every DAW provides it.

tasty_freeze 4 days ago | root | parent | next |

RipX can do stem separation and allows repitching notes in the mix. If that is what you want to do it is great.

I find moises (https://moises.ai/) to be easy to use for the tasks I need to do. It allows transposing or time scaling the entire song. It does stem separation and has a simple interface for muting and changing the volume on a per-track basis. It auto-detects the beat and chords.

I'm not affiliated, just a happy nearly-daily user for learning and practicing songs. I boost the original bass part and put everything else at < 10% volume to hear the bass part clearly clearly (which often shows how bad online transcriptions are, even paid ones). Once once I know the part, I mute the bass part and play along with the original song as if I was the bass player.

alok-g 3 days ago | root | parent |

Moises looks promising.

I wonder why pricing information is so hard to find these days. Would like to get an idea of the same.

Stemroller[0] has been around for a while too, it's free and based on Meta's models:

0: https://www.stemroller.com/

cloudking 4 days ago | root | parent |

I've heard Meta's Demucs is SOTA, has anything else better come out since?

adzm 4 days ago | root | parent |

It's still pretty much the best, though there are fine tunings and tweaks on top of that and the runner-up MDX that work well for specific scenarios.

> almost every DAW provides it.

It's an up and coming feature that nearly every DAW should have, but most don't yet.

Ableton Live - No

Bigwig - No

Cubase - No

FL - Yes

Logic - Yes

Pro Tools - No

Reason - No

Reaper - No

Studio One - Yes

fxj 4 days ago | root | parent |

MPC3 - Yes

Mixcraft - Yes

Maschine3 - Yes

It appears to be related to Polymath.

https://github.com/samim23/polymath

Polymath is effective at isolating and extracting individual instrument tracks from MP3s. It works very well.

makz 4 days ago | root | parent | prev |

Thanks for the information. I’m a long time Logic Pro user and I wasn’t aware of this feature.

Sporktacular 4 days ago | root | parent |

On an M1/2/3/4 processor. Not Intel.

bottom999mottob 4 days ago | prev | next |

This is really cool, but there's real-world instrument physics that might not be captured by simple Fourier transform templates, like a trumpet playing softly can have a significantly different harmonic spectrum than the same trumpet playing loudly, even at the same pitch

Trumpets produce a rich harmonic series with strong overtones, meaning their Fourier transform would show prominent peaks at integer multiples of the fundamental frequency. Instruments like flutes have more pure tones, but brass instruments typically have stronger higher harmonics, which would lead to more complex partial derivatives in the matrix equation shown in the article

So this script uses bandpass filtering and cross-correlation of attack/release envelopes to identify note timing. Given that brass instruments can exhibit non-linear behavior where the harmonic content changes significantly with playing intensity (think of the brightness difference between pp and ff passages), not sure how would this algorithm could handle intensity-dependent timbral variations. I'd consider adding intensity-dependent Fourier templates for each instrument to improve accuracy

atoav 4 days ago | root | parent |

As someone who uses source separation twice a week for mixing purposes the number of other instruments that can produce sounds of "vocal" quality is high. These models all stop functiining well when you have bands where the instruments don't sound typical and aren't played and/or mixed in a way that achieves maximum separation between them — e.g. an electrical guitar with a distorted harmonic hitting the same note as your singer while the drummer plays only shrieking noises on their cymbals and the bass player simulates a punching kick drum on their instrument.

In these situations (experimental music) source separation will produce completely unpredictable results, thst may or may not be useful for musical rebalancing.

fnordlord 4 days ago | root | parent |

What tool do you use for the source separation? Everything I've used so far is great for learning or transcribing to MIDI but the separated tracks always have a strange phasing sound to them. Are you doing something to clean that up before mixing back in or are the results already good enough?

atoav 4 days ago | root | parent |

iZotope RX with musical rebalance, great to reduce drum spill from vocal mics

ekianjo 4 days ago | prev | next |

Looks like this may be the work of Joshua Bird's little brother (?). Joshua bird did some impressive projects already, that were featured on HN before: https://www.youtube.com/@joshuabird333

generalizations 4 days ago | prev | next |

No one else is going to mention that "separation" was misspelled four times?

orbitingpluto 4 days ago | root | parent |

If we can all hear the tiny violin, who cares?

generalizations 4 days ago | root | parent |

Degradation of the environment. https://en.wikipedia.org/wiki/Broken_windows_theory#Theoreti...

orbitingpluto 2 days ago | root | parent |

Someone created something. Its value greatly exceeds the perceived "degradation of the environment" of a spelling mistake. Not acknowledging that says more about the pedant than the creator.

baq 4 days ago | prev | next |

Got a flashback of playing audiosurf 15 or so years ago. Time flies.

https://en.wikipedia.org/wiki/Audiosurf

ipsum2 4 days ago | prev | next |

I must be dumb, but none of the YouTube video demos are demonstrating source separation?

Edit: to clarify, source separation in audio research means separating out the audio into separate clips.

atoav 4 days ago | root | parent | next |

I think decomposition is the word, source separation in this case (misleadingly) referes to the fact that the decomposed notes can be separated into different sources.

wkjagt 4 days ago | root | parent | prev |

The "source" here goes with "open source".

fonema 3 days ago | prev | next |

I'm a long-time fan of Ultrastar Deluxe, which is an open-source clone of Singstar. This is a karaoke game where people compete by singing along to the tune. It recognizes the notes you are singing and compares them to a vocals-timings mapping file for that particular song. The better you sing to the tune (getting the words correct doesn't matter), the higher your score.

While there are extensive libraries of fan-made song mappings, it's never enough, and there are very few mapped songs in languages other than English or Spanish (if you or your friends prefer your native language). Doing the entire mapping manually is time-consuming, not to mention that I am almost tone-deaf myself, which would make it even more difficult. I have been wondering for a long time what software I could use to make this process easier to automate. This seems like a great tool for capturing vocal timings and notes from original songs.

I have it on my bucket list to create a Singstar playlist in my native language and host a singing party with friends.

Does anyone have suggestions for other similar tools?

alok-g 3 days ago | root | parent |

Lovely. I did not know of this.

Sounds like the text file needs vocals and pitches along with time stamps. AI is getting there to allow automating it's creation.

For myself: Adding a link I just found for reading further.

https://www.reddit.com/r/karaoke/comments/x61kzy/modern_equi...

DidYaWipe 4 days ago | prev | next |

Some of those videos don't have audio, as far as I can tell...

tjoff 4 days ago | root | parent |

The youtube links explains why: "No audio as a result of copyright." And also has a link to the audio that you can play alongside.

DidYaWipe 4 days ago | root | parent |

Of course, we can't expect Google to respect the obvious fair-use nature of these demonstrations.

bastloing 4 days ago | prev | next |

I can't find the source code, but the project looks interesting.

ssttoo 4 days ago | root | parent |

There’s a GitHub link right below the videos https://github.com/mbird1258/Audio-Decomposition

bastloing 4 days ago | root | parent |

Thanks! Nice! This kid is pretty sharp, can't wait to see what else he does!

kasajian 4 days ago | prev | next |

dude can't spell

berbec 4 days ago | root | parent |

He's in high school and pulls of a project like this. I thought I was slick convincing the 7-11 guy to give me my Twist-a-Pepper soda without charging me bottle deposit or tax.

testoveride 4 days ago | prev |