    Emerging Technology From the arXiv
    April 29, 2015

    Deep Learning Machine Solves the Cocktail Party Problem

    Separating a singer's voice from background music has always been a uniquely human ability. Not anymore.

    The cocktail party effect is the ability to focus on a specific human voice while filtering out other voices or background noise. The ease with which humans perform this trick belies the challenge that scientists and engineers have faced in reproducing it synthetically. By and large, humans easily outperform the best automated methods for singling out voices.

    A particularly challenging cocktail party problem is in the field of music, where humans can easily concentrate on a singing voice superimposed on a musical background that includes a wide range of instruments. By comparison, machines are poor at this task.

    Today, that looks to be changing thanks to the work of Andrew Simpson and pals at the University of Surrey in the U.K. These guys have used some of the most recent advances associated with deep neural networks to separate human voices from the background in a wide range of songs.

    Their approach showcases the huge advances that have been made in recent years in machine learning and neural networks. And it paves the way for a more general solution to the famous cocktail party problem, which should allow, among other things, the vocals to be easily separated from the music they accompany.

    The method these guys use is relatively straightforward. They start with a database of 63 songs that are available as a set of individual tracks that each contain a different instrument or voice, as well as the fully mixed version of the song.

    Simpson and co divide each track into 20-second segments and create a spectrogram for each that shows how the frequencies in the sound vary over time. The result is a kind of unique fingerprint that identifies the instrument or voice.
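
    To make the idea of a spectrogram concrete, here is a minimal sketch using NumPy. It is not the team's code: the sample rate, frame length, hop size and the 1 kHz test tone are all assumptions chosen for illustration.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram: rows are frequency bins, columns are time frames."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # One FFT per frame; keep magnitudes only.
    return np.abs(np.fft.rfft(frames, axis=1)).T

sample_rate = 8000                         # Hz, assumed for this sketch
t = np.arange(sample_rate) / sample_rate   # one second of audio
tone = np.sin(2 * np.pi * 1000 * t)        # a 1 kHz test tone (hypothetical input)

spec = spectrogram(tone)
peak_bin = int(spec.mean(axis=1).argmax())
peak_hz = peak_bin * sample_rate / 256     # convert bin index back to Hz
```

    The energy of the test tone piles up in a single row of the array, which is the sense in which a spectrogram acts as a fingerprint of the sound.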

    They also create a spectrogram of the fully mixed version of the song. This is essentially all of the component spectrograms added together.
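
    The word "essentially" matters here. A short sketch, under assumed signals and frame settings, suggests why: the complex short-time Fourier transform of a mix is exactly the sum of the parts, but the magnitudes that a spectrogram displays only add up approximately, because overlapping sounds can partly cancel.

```python
import numpy as np

def stft(signal, frame_len=256, hop=128):
    """Complex short-time Fourier transform (frequency bins x time frames)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T

rng = np.random.default_rng(0)
t = np.arange(8000) / 8000
vocal = np.sin(2 * np.pi * 440 * t)          # stand-in for a voice (assumption)
backing = 0.5 * rng.standard_normal(8000)    # stand-in for the instruments (assumption)
mix = vocal + backing

# The complex transform is linear, so this holds up to rounding error.
complex_sum_matches = np.allclose(stft(mix), stft(vocal) + stft(backing))

# Magnitudes obey the triangle inequality: |a + b| <= |a| + |b|.
mag_gap = np.abs(stft(mix)) - (np.abs(stft(vocal)) + np.abs(stft(backing)))
```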

    The task of picking out a voice from this mixture is essentially the task of separating the voice's unique spectrogram from the other spectrograms that are present.
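
    One common way to picture this separation is as a mask laid over the mixed spectrogram, keeping the cells where the voice dominates. The article does not say whether the team works this way, so treat the following as an illustrative sketch with made-up numbers. It also cheats by using the true vocal and backing spectrograms to build the mask, which a real system would have to predict from the mix alone.

```python
import numpy as np

# Toy magnitude spectrograms (frequency bins x time frames); values are hypothetical.
vocal = np.array([[0.0, 0.9, 0.8],
                  [0.1, 0.7, 0.0],
                  [0.0, 0.0, 0.1]])
backing = np.array([[0.5, 0.1, 0.1],
                    [0.6, 0.1, 0.4],
                    [0.3, 0.4, 0.5]])
mix = vocal + backing

# "Oracle" binary mask: 1 where the voice is louder than everything else.
mask = (vocal > backing).astype(float)

vocal_estimate = mask * mix            # cells attributed to the voice
backing_estimate = (1 - mask) * mix    # everything else
```

    The two estimates always add back up to the mix, so whatever is not assigned to the voice is left behind as the accompaniment.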

    Simpson and co trained their deep convolutional neural network to do exactly that. They used 50 of these songs to train the network while keeping the remaining 13 to test it on. In total that generated more than 20,000 spectrograms for training purposes.
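
    The important detail is that the split happens at the level of whole songs, so no snippet of a test song leaks into training. A minimal sketch of a 50/13 split follows; the song identifiers are placeholders and the random shuffle is an assumption, since the article does not say how the 13 test songs were chosen.

```python
import random

# Placeholder identifiers standing in for the 63 songs in the database.
songs = [f"song_{i:02d}" for i in range(63)]

rng = random.Random(0)    # fixed seed so the split is repeatable
shuffled = songs[:]
rng.shuffle(shuffled)

train_songs = shuffled[:50]   # used to tune the network
test_songs = shuffled[50:]    # held back, never seen during training
```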

    The task for the neural network was simple. As an input, they gave it the fully mixed spectrogram and expected it to produce, essentially, the vocal spectrogram as the output.

    The task in this kind of machine learning is one of parameter optimization. Their deep neural network has a billion parameters that need to be tuned in a way that produces the desired output.
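
    For a feel of how a network gets to a billion parameters, here is some back-of-the-envelope arithmetic. The layer sizes are hypothetical, picked only to land near that figure, and the sketch assumes plain fully connected layers; the article does not describe the actual architecture.

```python
def count_parameters(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out   # weight matrix plus one bias per output
    return total

# Hypothetical sizes, NOT taken from the article.
hypothetical_layers = [20_000, 20_000, 20_000, 20_000]
n_params = count_parameters(hypothetical_layers)
```

    Three layers of 20,000 units each fed by 20,000 inputs come to roughly 1.2 billion adjustable numbers, which shows how quickly wide layers add up.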

    This process of optimization, or learning, occurs by iteration. So the network begins with these parameters set randomly and then gradually improves the settings each time it scans through the database, which it did over a hundred iterations.
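
    The same loop can be shown at toy scale. The sketch below is not the team's network: it swaps in a single linear layer with 64 parameters, fabricated data, and plain gradient descent, all of which are assumptions. What it shares with the description above is the shape of the process: random starting parameters, then a hundred passes over the data, each one nudging the parameters to shrink the error.

```python
import numpy as np

rng = np.random.default_rng(0)
n_examples, n_bins = 200, 8

# Fabricated toy data: "mixture" vectors and the "vocal" vectors to recover.
mix = rng.random((n_examples, n_bins))
hidden_mapping = rng.random((n_bins, n_bins)) * 0.2   # used only to make targets
vocal = mix @ hidden_mapping

w = rng.standard_normal((n_bins, n_bins)) * 0.1       # parameters start out random
learning_rate = 0.5
losses = []

for epoch in range(100):                  # one hundred passes over the data
    error = mix @ w - vocal               # how far the output is from the target
    losses.append(float(np.mean(error ** 2)))
    grad = 2 * mix.T @ error / error.size # gradient of the mean squared error
    w -= learning_rate * grad             # small step that reduces the error
```

    Comparing `losses[0]` with `losses[-1]` shows the error falling as the passes accumulate.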

    Having found a good setup for the network, Simpson and co then gave it the 13 songs it had not seen before to test how well it could separate the vocals from the mix.

    The outputs turned out to be impressive. "These results demonstrate that a convolutional deep neural network approach is capable of generalizing voice separation, learned in a musical context, to new musical contexts," say the team.

    Simpson and co even compared their results to those from a conventional cocktail party algorithm applied to the same data. "The main advantage of the deep neural network appears to be in its general learning of what vocal sounds are," they say.

    In other words, having learned what a voice sounds like, a deep neural network can use this information to pick out other voices from a mix. But just how good this approach is compared to human performance, they do not say.

    One immediate application is in producing music tracks minus vocals for karaoke machines. That's clearly an, errr, important goal, but there are broader implications as well.

    Deep neural networks are revolutionizing machine learning in a wide range of areas. Until recently, humans had a clear dominance in pattern recognition tasks such as facial recognition and object recognition. That lead has been considerably reduced and in some cases lost altogether.

    Now machines are playing catch-up in the area of cocktail party problems, and only a fool would bet against them triumphing in the not-too-distant future.

    Ref: arxiv.org/abs/1504.04658: Deep Karaoke: Extracting Vocals from Musical Mixtures Using a Convolutional Deep Neural Network

    2015 MIT Technology Review