CS388: Natural Language Processing Lecture 24 ... · This Lecture ‣ Morphology: effects and...

Preview:

Citation preview

CS388:NaturalLanguageProcessingLecture24:Mul9linguality+Morphology

GregDurrett

Administrivia

‣ Project2graded;average=19.0

‣ Finalprojectpresenta9onsnextweek

‣ SeeCanvasannouncementforwhoispresen9ngwhen

‣ Canbe“workinprogress”,butthereshouldbeatleastpreliminaryresults

‣ FinalreportsdueonDecember14;noslipdays

Dealingwithotherlanguages

‣ManyalgorithmssofarhavebeendevelopedforEnglish

‣ Somestructureslikecons9tuencyparsingdon’tmakesenseforotherlanguages‣ NeuralmethodsaretypicallytunedtoEnglish-scaleresources,maynotbethebestforotherlanguageswherelessdataisavailable

1)Whatotherphenomena/challengesdoweneedtosolve?

‣ Ques9on:

2)Howcanweleverageexis9ngresourcestodobe]erinotherlanguageswithoutjustannota9ngmassivedata?

‣ OtherlanguagespresentsomeproblemsnotseeninEnglishatall!

ThisLecture

‣Morphology:effectsandchallenges

‣ Cross-lingualtaggingandparsing

‣Morphologytasks:analysis,inflec9on,wordsegmenta9on

Morphology

Whatismorphology?‣ Studyofhowwordsform

‣ Deriva9onalmorphology:createanewlexemefromabaseestrange(v)=>estrangement(n)become(v)=>unbecoming(adj)

Ibecome/shebecomes

‣ Inflec9onalmorphology:wordisinflectedbasedonitscontext

‣Maynotbetotallyregular:enflame=>inflammable

‣Mostlyappliestoverbsandnouns

MorphologicalInflec9on‣ InEnglish: Iarrive youarrive he/she/itarrives

wearrive youarrive theyarrive[X]arrived

‣ InFrench:

MorphologicalInflec9on‣ InSpanish:

NounInflec9on

‣ Nomina9ve:I/he/she,accusa9ve:me/him/her,geni9ve:mine/his/hers

‣ Notjustverbseither;gender,number,casecomplicatethings

Igivethechildrenabook<=>IchgebedenKinderneinBuchItaughtthechildren<=>IchunterrichtedieKinder

‣ Da9ve:mergedwithaccusa9veinEnglish,showsrecipientofsomething

IrregularInflec9on‣ Commonwordsareolenirregular

‣ Iam/youare/sheis

‣ However,lesscommonwordstypicallyfallintosomeregularparadigm—thesearesomewhatpredictable

‣ Jesuis/tues/elleest

‣ Yosoy/ustedestá/ellaes

Agglu9na9ngLangauges‣ Finnish/Turkish/Hungarian(Finno-Ugric):whatapreposi9onwoulddoinEnglishisinsteadpartoftheverb

‣Manypossibleforms—andinnewswiredata,onlyafewareobservedilla9ve:“into” adessive:“on”

Morphologically-RichLanguages‣ManylanguagesspokenallovertheworldhavemuchrichermorphologythanEnglish(Chineseisthemainexcep9on)

‣ CoNLL2006/2007:dependencyparsing+morphologicalanalysesfor~15mostlyIndo-Europeanlanguages

‣Wordpiece/byte-pairencodingmodelsforMTarepre]ygoodathandlingtheseifthere’senoughdata

‣ SPMRLsharedtasks(2013-2014):Syntac9cParsingofMorphologically-RichLanguages

Morphologically-RichLanguages

‣ Greatresourcesforchallengingyourassump9onsaboutlanguageandforunderstandingmul9lingualmodels!

MorphologicalAnalysis/Inflec9on

MorphologicalAnalysis‣ InEnglish,notthatmanywordforms,lexicalfeaturesonwordsandwordvectorsarepre]yeffec9ve

‣Whenwe’rebuildingsystems,weprobablywanttoknowbaseform+morphologicalfeaturesexplicitly

‣ Inotherlanguages,*lots*moreunseenwords!Affectsparsing,transla9on,…

‣ Howtodothiskindofmorphologicalanalysis?

MorphologicalAnalysis

Ámakormányegyetlenadócsökkentésétsemjavasolja.

n=singular|case=nomina9ve|proper=no

deg=posi9ve|n=singular|case=nomina9ve

n=singular|case=nomina9ve|proper=no

n=singular|case=accusa9ve|proper=no|pperson=3rd|pnumber=singular

mood=indica9ve|t=present|p=3rd|n=singular|def=yes

Butthegovernmentdoesnotrecommendreducingtaxes.

‣Whyisthisuseful?

MorphologicalAnalysis‣ Givenaword,needtorecognizewhatitsmorphologicalfeaturesare

‣ LotsofworkonArabicinflec9on(highamountsofambiguity)

‣ Basicapproach:

‣ Lexicon:tellsyouwhatpossibili9esare

‣ Analyzer:sta9s9calmodelthatdisambiguates

‣ModelsarelargelyCRF-like:scoremorphologicalfeaturesincontext

Predic9ngInflec9on‣ Otherdirec9on:givenbaseform+features,inflecttheword

Durre]andDeNero(2013)

‣ Hardforunknownwords—needmodelsthatgeneralize

w i n d e n

Predic9ngInflec9on

w i n d e n

i

iiia...

i1

en

eestet-...

en2

n

-sttte...

n1

i1 n1

= =====

n1

en1

en2

to wind (de)

en

estt-...

en1

‣ Otherdirec9on:givenbaseform+features,inflecttheword

‣ Hardforunknownwords—needmodelsthatgeneralize

‣ Takeabunchofexis9ngverbsfromWik9onary,extractthesechangerulesusingcharacteralignments

Durre]andDeNero(2013)

‣ TrainaCRFwithcharactern-gramcontextfeaturestolearnwheretoapplythem

MorphologicalReinflec9on

Chahuneauetal.(2013)

‣Machinetransla9onwherephrasetableisdefinedintermsoflemmas‣ “Translate-and-inflect”:translateintouninflectedwordsandpredictinflec9onbasedonsourceside

WordSegmenta9on

MorphemeSegmenta9on‣ Canwedosomethingunsupervisedratherthanthesecomplicatedanalyses?

CreutzandLagus(2002)

‣ unbecoming=>un+becom+ing—weshouldbeabletorecognizethesecommonpiecesandsplitthemoff

‣ Howdoweodothis?

MorphemeSegmenta9on‣ Simpleprobabilis9cmodel

CreutzandLagus(2002)

‣ p(mi)=count(token)/count(alltokens)

‣ TrainwithEM:E-stepinvolveses9ma9ngbestsegmenta9onwithViterbi,M-step:collecttokencounts

allowedexpectedneedneeded all+owe+dexpe+ctedn+e+edne+ed+ed E0

M0:edhascount3 all+ow+edexpect+edne+edne+ed+ed

‣ Someheuris9cs:rejectraremorphemes,one-le]ermorphemes

‣ Doesn’thandlestemchanges:becoming=>becom+ing

E1

ChineseWordSegmenta9on

‣ LSTMsovercharacterembeddings/characterbigramembeddingstopredictwordboundaries

‣ SomelanguagesincludingChinesearetotallyuntokenized

Chenetal.(2015)

‣ Havingtherightsegmenta9oncanhelpmachinetransla9on

Cross-LingualTaggingandParsing

Cross-LingualTagging‣Mul9lingualPOSinduc9on

Snyderetal.(2008)

‣ Genera9vemodeloftwolanguagessimultaneously,jointalignment+taglearning

‣ Complexgenera9vemodel,requiresGibbssamplingforinference

Cross-LingualTagging‣WehaveresourcesforlanguageslikeEnglish—canweusethesemoredirectly?

DasandPetrov(2011)

Ilikeitalot

Jel’aimebeaucoup

NVPRDTADJ

NPRV??

‣ TagwithEnglishtagger,projectacrossbitext,trainFrenchtagger?‣ Candosomethingsmarter

Cross-LingualTagging

DasandPetrov(2011)

{Ilikeitalot

Jel’aimebeaucoup

NVPRDTADJ

{‣ Formagraphoftrigrams,usethesetopropagateknowledgeabouttags

Cross-LingualTaggingDasandPetrov(2011)

Ilikeit

l’aimebeaucoup l’adoreunpeul’adorebeaucoup

edgeweightsbasedonsimilarityofcontextsthesetrigramsoccurin

Iloveit helovesitshelovesit

edgeweightsbasedonalignments(middlewordmustbealigned)

‣ Eachnodeisassociatedwithadistribu9onovertags,labelpropaga9onupdatestheseusingthegraph

Cross-LingualTagging

DasandPetrov(2011)

Cross-LingualTagging

DasandPetrov(2011)

‣ Takethesetrigramsandtreatthemas“soltrainingexamples”andlearnanHMMtagger

‣ Labelpropaga9on:encouragesnodeswithhigher-weightedgesbetweenthemtohavesimilartags

‣ Prunetoonlykeeptagsabovesomeprobabilitytogetthelexicon(validtag-wordpairs)

Cross-LingualTagging

DasandPetrov(2011)

‣ EM-HMM/featureHMM:unsupervisedmethodswithagreedymappingfromlearnedtagstogoldtags

‣ Projec9on:projecttagsacrossbitexttomakepseudogoldcorpus,trainonthat

Cross-LingualParsing

McDonaldetal.(2011)

‣ NowthatwecanPOStagotherlanguages,canweparsethemtoo?

‣ Directtransfer:trainaparseroverPOSsequencesinonelanguage,thenapplyittoanotherlanguage

Iliketomatoes

PRONVERBNOUN

Jelesaime

PRONPRONVERB

Ilikethem

PRONVERBPRON ‣ Eventhoughwe'veneverseenthissequenceinEnglishanddon’tknowthewords,wecans9llfigureitout

Cross-LingualParsing

McDonaldetal.(2011)

‣Mul9-dir:transferaparsertrainedonseveralsourcetreebankstothetargetlanguage

‣Mul9-proj:morecomplexannota9onprojec9onapproach

Cross-LingualEmbeddings

Ammaretal.(2016)

‣ mul9Cluster:usebilingualdic9onariestoformclustersofwordsthataretransla9onsofoneanother,replacecorporawithclusterIDs,train“monolingual”embeddingsoverallthesecorpora

‣ mul9CCA:“project”allotherlanguagesintoEnglish

‣ CCA:learnaprojec9onofaligneddatapointsintoasharedspace

‣ Learnasharedmul9lingualembeddingspacesoanyneuralsystemcantransferover

Cross-LingualEmbeddings

Ammaretal.(2016)

‣Wordvectorsworkpre]ywellat“intrinsic”tasks,someimprovementonthingslikedocumentclassifica9onanddependencyparsingaswell

Wherearewenow?‣ Universaldependencies:treebanks(+tags)for70+languages

‣Manylanguagesares9llsmall,soprojec9ontechniquesmays9llhelp

‣Morecorporainotherlanguages,lessandlessrelianceonstructuredtoolslikeparsers,andpretrainingonunlabeleddatameansthatperformanceonotherlanguagesisbe]erthanever

‣ BERThaspretrainedmul9lingualmodelsthatseemtoworkpre]ywell(trainedonawholebunchoflanguages)

Takeaways

‣ManylanguageshaverichermorphologythanEnglishandposedis9nctchallenges

‣ Problems:howtoanalyzerichmorphology,howtogeneratewithit

‣ CanleverageresourcesforEnglishusingbitexts

‣ Next9me:wrapup+ethicsofNLP

Recommended