
Multi-talker Speech Separation and Tracing at AI NEXT Conference



Page 1: Multi-talker Speech Separation and Tracing at AI NEXT Conference

Dong Yu, Distinguished Scientist and Vice General Manager

Tencent AI Lab (work was done while at Microsoft Research)

Joint work with Morten Kolbæk, Zheng-Hua Tan, and Jesper Jensen

Multi-talker Speech Separation and Tracing with Permutation Invariant Training

Page 2: Multi-talker Speech Separation and Tracing at AI NEXT Conference

Outline
• Motivation
• Problem Setup and Prior Arts
• Multi-talker Speech Separation
• Experiments
• Conclusion


Page 3: Multi-talker Speech Separation and Tracing at AI NEXT Conference

Outline
• Motivation
• Problem Setup and Prior Arts
• Multi-talker Speech Separation
• Experiments
• Conclusion


Page 4: Multi-talker Speech Separation and Tracing at AI NEXT Conference

Frontier Shift

• Driven by demand from users to interact with devices without wearing or carrying a close-talk microphone.

• Many difficulties hidden by close-talk microphones now surface:

• The energy of the speech signal is very low when it reaches the microphones.

• Interfering signals, such as background noise, reverberation, and speech from other talkers, become so distinct that they can no longer be ignored.


close-talk microphone vs. far-field microphone

Page 5: Multi-talker Speech Separation and Tracing at AI NEXT Conference

ASR in Real World Scenarios

[Diagram: a source signal reaches the microphone through channel distortion, with reverberation from surface reflections and additive noise from other sound sources.]


Page 6: Multi-talker Speech Separation and Tracing at AI NEXT Conference

Cocktail Party Problem
• Term coined by Cherry

• "One of our most important faculties is our ability to listen to, and follow, one speaker in the presence of others. This is such a common experience that we may take it for granted; we may call it 'the cocktail party problem' …" (Cherry '57)

• Human performance is superior to that of machines
• "For 'cocktail party'-like situations … when all voices are equally loud, speech remains intelligible for normal-hearing listeners even when there are as many as six interfering talkers" (Bronkhorst & Plomp '92)

• Speech separation problem
• Separate and trace audio streams
• Sometimes called speech enhancement when dealing with non-speech interference


Page 7: Multi-talker Speech Separation and Tracing at AI NEXT Conference

Is Speech Separation Work Needed?
• Is an end-to-end ASR system sufficient?

• Current ASR techniques require a huge amount of training data covering various conditions to train well

• Speech separation can be used as an advanced front-end
• The speech separation criterion can be used as regularization to aid and speed up training of ASR systems

• More applications than ASR
• Hearing aids
• Cochlear implants
• Noise reduction for mobile communication
• Audio information retrieval

• Is using a microphone array sufficient?
• A mic array alone is not sufficient, e.g., when talkers are in the same direction
• Many recordings are still collected with a single microphone


Page 8: Multi-talker Speech Separation and Tracing at AI NEXT Conference

Outline
• Motivation
• Problem Setup and Prior Arts
• Multi-talker Speech Separation
• Experiments
• Conclusion


Page 9: Multi-talker Speech Separation and Tracing at AI NEXT Conference

Problem Definition
• Source speech streams
• Mixed speech
• STFT domain
• Estimate mask
• Reconstruct with mask (see the masking sketch below)


• Ill-posed problem (#constraints < #free params):
• There are an infinite number of possible $X_s(t,f)$ combinations that lead to the same $Y(t,f)$

• Solution:
• Learn from the training set to look for hidden regularities (complicated soft constraints)
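As a concrete illustration of the setup above, a minimal sketch in Python/NumPy (hypothetical array shapes, not code from the talk): the mixture magnitude is approximated as the sum of the source magnitudes, and each source is reconstructed by applying its estimated mask to the mixture; the masks themselves would come from a trained network.

```python
import numpy as np

# Hypothetical shapes: S sources, T frames, F = 257 frequency bins.
def mix_magnitude(source_mags):
    # |Y(t,f)| approximated by the sum of the source magnitudes (phase ignored)
    return np.sum(source_mags, axis=0)            # (T, F)

def reconstruct_sources(mixture_mag, masks):
    # Each estimated source magnitude is the element-wise product of its
    # mask with the mixture magnitude: |X̂_s(t,f)| = M_s(t,f) * |Y(t,f)|
    return masks * mixture_mag[np.newaxis, :, :]  # (S, T, F)
```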

Page 10: Multi-talker Speech Separation and Tracing at AI NEXT Conference

Prior Arts Before the Deep Learning Era
• Computational auditory scene analysis (CASA)

• Use perceptual grouping cues to estimate time-frequency masks
• Non-negative matrix factorization (NMF)

• Learn a set of non-negative bases during training
• Estimate mixing factors during evaluation

• Model-based approaches such as factorial GMM-HMM
• Model the interaction between the target and competing speech signals and their temporal dynamics

• Spatial filtering with a microphone array
• Beamforming: extract the target sound from a specific spatial direction
• Independent component analysis: find a demixing matrix from multiple mixtures of sound sources


Page 11: Multi-talker Speech Separation and Tracing at AI NEXT Conference

Training Criteria for Deep Learning
• Ideal amplitude mask (IAM): $M_s(t,f) = \frac{|X_s(t,f)|}{|Y(t,f)|}$

• Minimize the mask estimation error (two problems)

• In silence segments $X_s(t,f) = 0$ and $Y(t,f) = 0$, so $M_s(t,f)$ is not well defined
• A smaller error on the masks may not lead to a smaller error on the magnitude (which is what we care about)

• Minimize the magnitude estimation error (used in this study; see the loss sketch below)

• The magnitude is still estimated through masks: this often leads to better performance, especially when the training set is small
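A minimal NumPy sketch contrasting the two criteria above: the mask-estimation error measured against the IAM, and the magnitude-estimation error used in this study. The epsilon term is only there to make the example safe in the silent bins where the IAM is not well defined.

```python
import numpy as np

def ideal_amplitude_mask(target_mag, mixture_mag, eps=1e-8):
    # IAM: M_s(t,f) = |X_s(t,f)| / |Y(t,f)|; eps guards silent bins.
    return target_mag / (mixture_mag + eps)

def mask_error(est_mask, target_mag, mixture_mag):
    # Criterion 1: error measured on the mask itself.
    iam = ideal_amplitude_mask(target_mag, mixture_mag)
    return np.mean((est_mask - iam) ** 2)

def magnitude_error(est_mask, target_mag, mixture_mag):
    # Criterion 2 (used in this study): error measured on the reconstructed
    # magnitude, with the magnitude still estimated through the mask.
    return np.mean((est_mask * mixture_mag - target_mag) ** 2)
```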


Page 12: Multi-talker Speech Separation and Tracing at AI NEXT Conference

Prior Arts with DL: Speech + Others (many works: OSU, MERL, CUST, etc.)

• Basic architecture: the input is a mix of different types of signals


[Architecture diagram: mixture in; outputs include the estimated speech and the estimated noise/music/other speakers]

Page 13: Multi-talker Speech Separation and Tracing at AI NEXT Conference

Prior Arts with DL: Focus on Speech (many works: OSU, MERL, CUST, etc.)

• Basic architecture: the input is a mix of different types of signals


[Architecture diagram: mixture in; outputs include the estimated speech and the estimated noise/music/other speakers]

Mixture types: speech + noise; speech + music; specific speaker + other speakers

Page 14: Multi-talker Speech Separation and Tracing at AI NEXT Conference

Outline
• Motivation
• Problem Setup and Prior Arts
• Multi-talker Speech Separation
• Experiments
• Conclusion


Page 15: Multi-talker Speech Separation and Tracing at AI NEXT Conference

Multi-Talker Speech Separation
• Label Ambiguity / Label Permutation Problem


Speaker 1 → output 1? Speaker 1 → output 2?

Page 16: Multi-talker Speech Separation and Tracing at AI NEXT Conference

Solution 1: Deep Clustering (Hershey, Chen, Le Roux, Watanabe, 2016)

• Learn a unit-length embedding for each time-frequency bin
• If two bins belong to the same speaker they are close in the embedding space, and farther away otherwise.

• Trained on a large window of frames

• Separation is done by clustering the embedding-space representations, i.e., segmenting the bins (a clustering sketch follows below)

• Shortcomings
• The pipeline is complicated
• Each bin is assumed to belong to one and only one speaker → limits its ability to combine with other techniques
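For context, a rough sketch of the separation step only, assuming the embeddings have already been produced by a trained network (one D-dimensional unit-length vector per time-frequency bin) and using scikit-learn's KMeans for the clustering; the binary masks reflect the assumption that each bin belongs to exactly one speaker.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_to_masks(embeddings, n_speakers, T, F):
    """embeddings: (T*F, D) unit-length vectors from the trained network.
    Returns one binary mask of shape (T, F) per speaker."""
    labels = KMeans(n_clusters=n_speakers, n_init=10).fit_predict(embeddings)
    return [(labels == s).astype(float).reshape(T, F) for s in range(n_speakers)]
```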


Page 17: Multi-talker Speech Separation and Tracing at AI NEXT Conference

Solution 2: Use Manually Defined Rules (Weng, Yu, Seltzer, Droppo '14, '15)

• Use instantaneous energy instead of speaker ID to assign labels: manually designed, limited cues


Low-energy speech

High-energy speech

Page 18: Multi-talker Speech Separation and Tracing at AI NEXT Conference

Our Solution: Permutation Invariant Training (Yu, Kolbæk, Tan, Jensen '16, '17)


Simple to implement (a loss sketch follows after the equations below)
Can easily be extended to 3 speakers

Assignment 1 error: $\|X_1 - \hat{X}_1\|^2 + \|X_2 - \hat{X}_2\|^2$

Assignment 2 error: $\|X_2 - \hat{X}_1\|^2 + \|X_1 - \hat{X}_2\|^2$

Training uses the assignment (permutation) with the smaller total error.
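A minimal NumPy sketch of the permutation invariant loss above (magnitude spectra with hypothetical shape (S, T, F)): compute the error for every speaker-to-output assignment and keep the smallest one.

```python
import itertools
import numpy as np

def pit_loss(target_mags, estimated_mags):
    """Return the lowest total squared error over all speaker-to-output
    assignments, together with the winning permutation.

    target_mags, estimated_mags: arrays of shape (S, T, F)."""
    S = target_mags.shape[0]
    best_err, best_perm = np.inf, None
    for perm in itertools.permutations(range(S)):
        # assign output perm[s] to reference speaker s
        err = sum(np.sum((target_mags[s] - estimated_mags[p]) ** 2)
                  for s, p in enumerate(perm))
        if err < best_err:
            best_err, best_perm = err, perm
    return best_err, best_perm
```

With S = 3 this simply enumerates the 3! = 6 assignments, which is why the approach extends easily to three speakers.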

Page 19: Multi-talker Speech Separation and Tracing at AI NEXT Conference

Testing


• Default assignment: concatenate each output's frames to form the streams
• Optimal assignment: the output of each frame is correctly assigned to the speakers; concatenate the frames belonging to each speaker to form the streams

• The gap between them indicates the gain available from additional speaker tracing (see the sketch below)
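A sketch of the two evaluation modes above (NumPy, same hypothetical (S, T, F) shapes): the default score keeps each output as one whole stream (with a single utterance-level pairing, since speaker order is arbitrary, an assumption made for this sketch), while the optimal score may re-pair outputs on every frame; the gap between the two quantifies what better speaker tracing could gain.

```python
import itertools
import numpy as np

def stream_error(target_mags, estimated_mags, perm):
    # total squared error when output perm[s] is paired with speaker s
    return sum(np.sum((target_mags[s] - estimated_mags[p]) ** 2)
               for s, p in enumerate(perm))

def default_error(target_mags, estimated_mags):
    # Default assignment: whole output streams, one utterance-level pairing.
    S = target_mags.shape[0]
    return min(stream_error(target_mags, estimated_mags, perm)
               for perm in itertools.permutations(range(S)))

def optimal_error(target_mags, estimated_mags):
    # Optimal (oracle) assignment: outputs may be re-paired on every frame.
    S, T, _ = target_mags.shape
    return sum(
        min(sum(np.sum((target_mags[s, t] - estimated_mags[p, t]) ** 2)
                for s, p in enumerate(perm))
            for perm in itertools.permutations(range(S)))
        for t in range(T))
```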

Page 20: Multi-talker Speech Separation and Tracing at AI NEXT Conference

Outline
• Motivation
• Problem Setup and Prior Arts
• Multi-talker Speech Separation
• Experiments
• Conclusion


Page 21: Multi-talker Speech Separation and Tracing at AI NEXT Conference

Experiment Setup: Datasets
• WSJ0-2mix and WSJ0-3mix

• Derived from the WSJ0 corpus
• 2- and 3-speaker mixtures (artificially generated)
• 30 h training set, 10 h validation set, 5 h test set
• Mixed at SIRs between 0 dB and 5 dB

• Danish-2mix and -3mix
• Derived from a Danish corpus
• 2- or 3-speaker mixtures (artificially generated)
• 10k, 1k, and 1k+1k utterances in the training, validation, and test sets
• Mixed at 0 dB

• WSJ0-2mix-other
• Same as WSJ0-2mix but mixed at 0 dB


Page 22: Multi-talker Speech Separation and Tracing at AI NEXT Conference

Models
• Implemented using the Microsoft Cognitive Toolkit (CNTK)
• Input: 257-dim STFT; output: 257 × S streams
• Segment-based PIT (PIT-S): each segment is treated independently, no tracing
• PIT with tracing (PIT-T): force all frames from the same output layer to belong to the same speaker

• Architectures (a rough BLSTM sketch, under stated assumptions, follows below)
• DNN: 3 hidden layers, each with 1024 ReLU units
• LSTM: 3 LSTM layers, each with 1792 units
• BLSTM: 3 BLSTM layers, each with 896 units

• Test conditions
• Closed condition (CC): seen speakers
• Open condition (OC): unseen speakers
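A rough PyTorch sketch of the BLSTM variant listed above, under the stated layout (3 bidirectional LSTM layers, taken here as 896 units per direction, 257-dim STFT magnitude input, S × 257 mask outputs). Details such as the sigmoid mask non-linearity and the mask-times-mixture output are assumptions made for this sketch, not settings stated in the talk.

```python
import torch
import torch.nn as nn

class PitBlstmSeparator(nn.Module):
    """Sketch only: 257-dim magnitude spectra in, S masked streams out."""
    def __init__(self, n_speakers=2, feat_dim=257, hidden=896, layers=3):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=layers,
                             batch_first=True, bidirectional=True)
        self.mask = nn.Linear(2 * hidden, n_speakers * feat_dim)
        self.n_speakers, self.feat_dim = n_speakers, feat_dim

    def forward(self, mixture_mag):                  # (B, T, 257)
        h, _ = self.blstm(mixture_mag)               # (B, T, 2*hidden)
        m = torch.sigmoid(self.mask(h))              # (B, T, S*257)
        masks = m.view(*m.shape[:2], self.n_speakers, self.feat_dim)
        # estimated magnitude per speaker: mask times mixture magnitude
        return masks * mixture_mag.unsqueeze(2)      # (B, T, S, 257)
```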


Page 23: Multi-talker Speech Separation and Tracing at AI NEXT Conference

PIT-S Training Behavior: WSJ0-2mix


Page 24: Multi-talker Speech Separation and Tracing at AI NEXT Conference

PIT-S: SDR Gain (dB) on WSJ0-2mix


Page 25: Multi-talker Speech Separation and Tracing at AI NEXT Conference

PIT-T Training Behavior: WSJ0-2mix


Page 26: Multi-talker Speech Separation and Tracing at AI NEXT Conference

PIT-T: SDR Gain (dB) on WSJ0-2mix


Page 27: Multi-talker Speech Separation and Tracing at AI NEXT Conference

SDR (dB) and PESQ Gain Comparison


Page 28: Multi-talker Speech Separation and Tracing at AI NEXT Conference

Cross-Language Behavior on 2-talker Mix


Page 29: Multi-talker Speech Separation and Tracing at AI NEXT Conference

PIT-T on WSJ0-3mix


Page 30: Multi-talker Speech Separation and Tracing at AI NEXT Conference

PIT-T Trained with Both 2- and 3-mix


Page 31: Multi-talker Speech Separation and Tracing at AI NEXT Conference

Examples: 2-talker Mix

• Male + Female: Mix, S1, S2 (audio demo)


• Female + Male: Mix, S1, S2 (audio demo)

• Female + Female: Mix, S1, S2 (audio demo)

• Male + Male: Mix, S1, S2 (audio demo)

Page 32: Multi-talker Speech Separation and Tracing at AI NEXT Conference

Examples: 3-talker Mix

• Male + 2 Female: Mix, S1, S2, S3 (audio demo)


• Female + 2 Male: Mix, S1, S2, S3 (audio demo)

Page 33: Multi-talker Speech Separation and Tracing at AI NEXT Conference

Example: Trained on 3-mix, Tested on 2-mix

• Different gender: Mix, S1, S2, S3 (audio demo)


• Same gender: Mix, S1, S2, S3 (audio demo)

Page 34: Multi-talker Speech Separation and Tracing at AI NEXT Conference

Example: Trained on 2- and 3-mix, Tested on 2-mix


• Different gender: Mix, S1, S2, S3 (audio demo)

• Same gender: Mix, S1, S2, S3 (audio demo)

Page 35: Multi-talker Speech Separation and Tracing at AI NEXT Conference

Outline
• Motivation
• Problem Setup and Prior Arts
• Multi-talker Speech Separation
• Experiments
• Conclusion


Page 36: Multi-talker Speech Separation and Tracing at AI NEXT Conference

Conclusion

• PIT solves the label permutation problem
• PIT is effective for speech separation even without knowing the number of speakers

• PIT-trained models generalize well to unseen speakers and languages
• PIT is simple to implement
• PIT has great potential since it can easily be integrated and combined with other techniques


Classification view (supervised approach)

Segmentation view (deep clustering)

Separation view (PIT)

PIT is an important ingredient in the final solution to the cocktail party problem