CS6120/CS4120: Natural Language Processing
Instructor: Prof. Lu Wang
College of Computer and Information Science, Northeastern University
Webpage: www.ccs.neu.edu/home/luwang
Prerequisites
• Programming
  • Being able to write code proficiently in some programming language (e.g. Python, Java, C/C++, Matlab)
• Courses
  • Algorithms
  • Some calculus
  • Probability and statistics
  • Linear algebra (optional but highly recommended)
Prerequisites
• A quiz:
  • This Friday, in class
  • 22 simple questions, 20 of them True or False questions (relevant to probability, statistics, and linear algebra)
  • The purpose of this quiz is to indicate the expected background of students.
  • 80% of the questions should be easy to answer.
  • Not counted in your final score!
Textbook and References
• Main textbook (and some slides)
  • Dan Jurafsky and James H. Martin, "Speech and Language Processing, 2nd Edition", Prentice Hall, 2009.
  • We will use some material from the 3rd edition when it is available.
    • http://web.stanford.edu/~jurafsky/slp3/
  • YouTube video: https://www.youtube.com/watch?v=s3kKlUBa3b0
• Other reference
  • Chris Manning and Hinrich Schütze, "Foundations of Statistical Natural Language Processing", MIT Press, 1999.
• Machine learning textbooks:
  • Christopher M. Bishop, "Pattern Recognition and Machine Learning", Springer, 2006.
  • Tom Mitchell, "Machine Learning", McGraw Hill, 1997.
Topics of the Course (tentative)
• Language Modeling
• Part-of-Speech Tagging
• Text Categorization: Word Sense Disambiguation, Named Entity Recognition
• Syntax: Formal Grammars of English, Syntactic Parsing, Statistical Parsing, Dependency Parsing
• Semantics: Vector-Space, Lexical Semantics, Semantics with Dense Vectors
• Information Extraction
• Question Answering
• Machine Translation
• Summarization
• Sentiment Analysis, Opinion Mining
• NLP and Social Media
• Dialog Systems and Chatbots
The Goal
• Study fundamental tasks in NLP
• Learn some classic and state-of-the-art techniques
• Acquire hands-on skills for solving NLP problems
  • Even some research experience!
Grading
• Assignment (30%)
  • 2 assignments, 15% for each
• Quiz (5%)
  • 8 in-class tests, 1% for each (three lowest scores are dropped)
  • Tuesdays, starting next week
• Final Exam (35%)
• Project (25%)
• Participation (5%)
  • Classes: ask and answer questions, participate in discussions…
  • Piazza: help your peers, address questions…
Exam
• Open book
• Time and place: TBD (still scheduling with the college)
• Please don’t make travel arrangements for exam weeks.
Course Project Grading
• We want to see novel projects!
  • The problem needs to be well-defined, novel, useful, and practical.
  • Reasonable results and observations.
• We encourage you to tackle a research-driven problem.
• Option 1: a research project discussed with the instructor
  • A better solution for an existing problem
  • Or a novel problem
• Option 2: The Fake News Challenge
Sample Projects from Previous Offering
• Project reports are listed here: http://www.ccs.neu.edu/home/luwang/courses/cs6120_fa2017.html
• Neural Semantic Parsing Natural Language into SQL
• Short Passages Reading Comprehension and Question Answering
• Political Promise Evaluation (PPE)
• Predicting Personality Traits using Tweets
• STORYNEXT 2.0: A TEXT INSIGHTS/VISUALIZATION TOOL
• Android Application for Visual QA
• Novel Summarizer and Keyword Identifier Using TextRank with Sentence Farm Detection
• Paraphrase Generation
• Hashtag Similarity based on Tweet Text
• Stance Detection for the Fake News Challenge
• Machine Comprehension Using match-LSTM and Answer-Pointer
• Online Abuse Detection
• Plagiarism Detection Using FP-Growth Algorithm
• An Examination of Influential Framing of Controversial Topics on Twitter
Online Abuse Detection
Automating the process of identifying abusive comments would not only save time for social media platforms but would also increase user safety and improve online discussions.
Build a classifier that labels each statement in the test data as either “Abuse” (aggression, a personal attack, or a toxic statement) or “Not-Abuse”, using multiple techniques.
The Fake News Challenge
• Website: http://www.fakenewschallenge.org/
• Goal: “The goal of the Fake News Challenge is to explore how artificial intelligence technologies, particularly machine learning and natural language processing, might be leveraged to combat the fake news problem. We believe that these AI technologies hold promise for significantly automating parts of the procedure human fact checkers use today to determine if a story is real or a hoax.”
Course Project Grading
• Feel free to talk to the instructor or TAs about project topics during office hours.
Course Project Grading
• Three reports
  • Proposal (3%), due by the end of January
  • Progress, with code (7%)
  • Final, with code (10%)
• One presentation
  • In class (5%)
Audience Award
• Bonus points!
• All teams vote for their favorite project(s).
• The best project gets 1% as a bonus (one best project in each batch, if we need more than one batch/lecture for the presentations).
Submission and Late Policy
• Each assignment or report is due at the beginning of class on the corresponding due date.
• Programming language
  • Python (encouraged), Java, C/C++
• Electronic version
  • On Blackboard
Submission and Late Policy
• An assignment or report turned in late loses 20 points (out of 100) for each late day (i.e. 24 hours).
• Each student has a budget of 5 late days throughout the semester before a late penalty is applied.
• Late days are not applicable to the final presentation.
• Each group member is charged the same number of late days, if any, for their submission.
How to find us?
• Course webpage:
  • http://www.ccs.neu.edu/home/luwang/courses/cs6120_sp2018/cs6120_sp2018.html
• Office hours
  • Prof. Lu Wang: Tuesdays, from 5:15pm to 6:15pm, or by appointment, 258 WVH
  • TA Liwen Hou
  • TA Tirthraj Maheshkumar Parmar
  • TA Manthan Thakar
• Piazza
  • http://piazza.com/northeastern/sp2018/cs6120/home
  • All course-relevant questions should go here; it is also the best way to reach the instructor and TAs!
What is Natural Language Processing?
• Allowing machines to communicate with humans
• Natural language understanding + natural language generation
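What does it mean to understand a language?
[Figure: levels of linguistic analysis (Phonology, Morphology, Lexemes, Syntax, Semantics, Pragmatics, Discourse), annotated with the representations involved: Sound waves, Words, Parse trees, Meanings]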
What does it mean to understand a language?
[Figure: levels of linguistic analysis ordered from shallower to deeper analysis: Phonology, Morphology, Lexemes, Syntax, Semantics, Pragmatics, Discourse]
Syntax, Semantics, Pragmatics
• Syntax concerns the proper ordering of words and its effect on meaning.
  • The dog bit the boy.
  • The boy bit the dog.
  • Bit boy dog the the.
• Semantics concerns the (literal) meaning of words, phrases, and sentences.
  • “plant” as a photosynthetic organism
  • “plant” as a manufacturing facility
  • “plant” as the act of sowing
• Pragmatics concerns the overall communicative and social context and its effect on interpretation.
  • The ham sandwich wants another beer.
  • John thinks vanilla.
[Modified from Ray Mooney’s Slides]
Ambiguity is Ubiquitous
• Speech Recognition
  • “recognize speech” vs. “wreck a nice beach”
  • “youth in Asia” vs. “euthanasia”
• Syntactic Analysis
  • “I ate spaghetti with chopsticks” vs. “I ate spaghetti with meatballs.”
• Semantic Analysis
  • “The dog is in the pen.” vs. “The ink is in the pen.”
  • “I put the plant in the window” vs. “Ford put the plant in Mexico”
• Pragmatic Analysis
  • From “The Pink Panther Strikes Again”:
    Clouseau: Does your dog bite?
    Hotel Clerk: No.
    Clouseau: [bowing down to pet the dog] Nice doggie. [Dog barks and bites Clouseau in the hand]
    Clouseau: I thought you said your dog did not bite!
    Hotel Clerk: That is not my dog.
Ambiguity is Explosive
• Ambiguities compound to generate enormous numbers of possible interpretations.
• In English, a sentence ending in n prepositional phrases has over 2^n syntactic interpretations (cf. Catalan numbers).
  • “I saw the man with the telescope”: 2 parses
  • “I saw the man on the hill with the telescope.”: 5 parses
  • “I saw the man on the hill in Texas with the telescope”: 14 parses
  • “I saw the man on the hill in Texas with the telescope at noon.”: 42 parses
  • “I saw the man on the hill in Texas with the telescope at noon on Monday”: 132 parses
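The parse counts listed above (2, 5, 14, 42, 132) coincide with consecutive Catalan numbers. The following is a minimal sketch that reproduces them, assuming that a sentence ending in n prepositional phrases has C(n+1) parses, where C(k) = (2k choose k) / (k + 1). The function names are illustrative and not part of the course material.

```python
from math import comb


def catalan(k: int) -> int:
    """Return the k-th Catalan number, C_k = (2k choose k) / (k + 1)."""
    return comb(2 * k, k) // (k + 1)


def pp_attachment_parses(num_pps: int) -> int:
    """Number of parses for a sentence ending in `num_pps` prepositional phrases.

    Assumption: the count is C_{n+1}, which matches the figures on the slide.
    """
    return catalan(num_pps + 1)


for n in range(1, 6):
    print(f"{n} prepositional phrase(s): {pp_attachment_parses(n)} parses")
```

Catalan numbers grow roughly like 4^n, which is why exhaustively enumerating parses quickly becomes impractical.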
Humor and Ambiguity
• Many jokes rely on the ambiguity of language:
  • Policeman to little boy: “We are looking for a thief with a bicycle.” Little boy: “Wouldn’t you be better using your eyes?”
  • Why is the teacher wearing sunglasses? Because the class is so bright.
  • Groucho Marx: One morning I shot an elephant in my pajamas. How he got into my pajamas, I’ll never know.
  • She criticized my apartment, so I knocked her flat.
  • Noah took all of the animals on the ark in pairs. Except the worms, they came in apples.
Why is Language Ambiguous?
• Having a unique linguistic expression for every possible conceptualization that could be conveyed would make language overly complex and linguistic expressions unnecessarily long.
• Allowing resolvable ambiguity permits shorter linguistic expressions, i.e. data compression.
• Language relies on people’s ability to use their knowledge and inference abilities to properly resolve ambiguities.
• Infrequently, disambiguation fails, i.e. the compression is lossy.
Word Segmentation
• Breaking a string of characters into a sequence of words.
• In some written languages (e.g. Chinese) words are not separated by spaces.
• Even in English, characters other than white-space can be used to separate words [e.g. , ; . - : ( ) ]
• Examples from English URLs:
  • jumptheshark.com → jump the shark .com
  • myspace.com/pluckerswingbar → myspace .com pluckers wing bar
                               → myspace .com plucker swing bar
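The URL example above can be reproduced with a dictionary-based segmenter that enumerates every split of a string into known words. This is only a sketch: the toy lexicon is an assumption made for illustration, and practical segmenters score the candidates (for example with a language model) rather than listing them all.

```python
def segmentations(text: str, lexicon: set[str]) -> list[list[str]]:
    """Return every way to split `text` into words drawn from `lexicon`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in lexicon:
            # Keep this word and segment the remainder recursively.
            for rest in segmentations(text[end:], lexicon):
                results.append([prefix] + rest)
    return results


# Hypothetical toy lexicon, just large enough for the slide's example.
TOY_LEXICON = {"plucker", "pluckers", "wing", "swing", "bar"}

for words in segmentations("pluckerswingbar", TOY_LEXICON):
    print(" ".join(words))
```

Both readings from the slide come out as valid segmentations, so the lexicon alone cannot decide between them.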
Morphological Analysis
• Morphology is the field of linguistics that studies the internal structure of words. (Wikipedia)
• A morpheme is the smallest linguistic unit that has semantic meaning. (Wikipedia)
  • e.g. “carry”, “pre”, “ed”, “ly”, “s”
• Morphological analysis is the task of segmenting a word into its morphemes:
  • carried → carry + ed (past tense)
  • independently → in + (depend + ent) + ly
  • Googlers → (Google + er) + s (plural)
  • unlockable → un + (lock + able)?
               → (un + lock) + able?
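As a rough illustration of the segmentation task, the sketch below splits a word into prefixes, a stem, and suffixes using small hand-written affix lists. The lists and the function name are assumptions made for this example. The sketch returns only flat morpheme sequences, so it does not choose between the two bracketings of "unlockable", and it does not handle spelling changes such as carried → carry + ed.

```python
# Hypothetical toy inventories; a real analyzer would use a full lexicon.
PREFIXES = {"un", "in", "pre"}
STEMS = {"lock", "depend", "carry"}
SUFFIXES = {"able", "ent", "ly", "ed", "s"}


def segment(word):
    """Return every flat analysis prefix* + stem + suffix* that spells `word`."""

    def strip_prefixes(rest):
        # Yield (prefixes consumed so far, remaining string).
        yield [], rest
        for prefix in PREFIXES:
            if rest.startswith(prefix):
                for more, tail in strip_prefixes(rest[len(prefix):]):
                    yield [prefix] + more, tail

    def strip_suffixes(rest):
        # Yield suffix sequences that consume `rest` completely.
        if not rest:
            yield []
        for suffix in SUFFIXES:
            if rest.startswith(suffix):
                for more in strip_suffixes(rest[len(suffix):]):
                    yield [suffix] + more

    analyses = []
    for prefixes, rest in strip_prefixes(word):
        for stem in STEMS:
            if rest.startswith(stem):
                for suffixes in strip_suffixes(rest[len(stem):]):
                    analyses.append(prefixes + [stem] + suffixes)
    return analyses


print(segment("unlockable"))
print(segment("independently"))
print(segment("carried"))  # empty: the y -> i spelling change is not modeled
```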
Part Of Speech (POS) Tagging
• Annotate each word in a sentence with a part-of-speech.
• Useful for subsequent syntactic parsing and word sense disambiguation.

I    ate  the  spaghetti  with  meatballs.
Pro  V    Det  N          Prep  N

John  saw  the  saw  and  decided  to    take  it   to    the  table.
PN    V    Det  N    Con  V        Part  V     Pro  Prep  Det  N
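The second example shows why tagging needs context: "saw" is a verb in one position and a noun in another, and "to" is a particle before a verb but a preposition before a noun phrase. The sketch below is a toy rule-based tagger built for this one sentence. The lexicon and the two context rules are assumptions made for illustration; they are not how statistical taggers work.

```python
# Hypothetical toy lexicon: each word maps to its possible tags, default first.
TOY_LEXICON = {
    "john": ["PN"], "saw": ["V", "N"], "the": ["Det"], "and": ["Con"],
    "decided": ["V"], "to": ["Prep", "Part"], "take": ["V"], "it": ["Pro"],
    "table": ["N"],
}


def tag(words):
    """Tag `words` using the lexicon plus two hand-written context rules."""
    tags = []
    for i, word in enumerate(words):
        candidates = TOY_LEXICON[word.lower()]
        prev_tag = tags[-1] if tags else None
        next_candidates = (
            TOY_LEXICON[words[i + 1].lower()] if i + 1 < len(words) else []
        )
        if prev_tag == "Det" and "N" in candidates:
            choice = "N"  # rule 1: a noun reading is preferred after a determiner
        elif "Part" in candidates and "V" in next_candidates:
            choice = "Part"  # rule 2: "to" before a possible verb is a particle
        else:
            choice = candidates[0]  # otherwise fall back to the default tag
        tags.append(choice)
    return tags


sentence = "John saw the saw and decided to take it to the table".split()
print(list(zip(sentence, tag(sentence))))
```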
Phrase Chunking
• Find all non-recursive noun phrases (NPs) and verb phrases (VPs) in a sentence.
  • [NP I] [VP ate] [NP the spaghetti] [PP with] [NP meatballs].
  • [NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only # 1.8 billion] [PP in] [NP September]
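Chunking is often done on top of POS tags. The following sketch groups tagged words into flat chunks with a few hand-written rules, using the tag names from the POS tagging slide. The rules are assumptions made for illustration and only cover simple cases; for instance, they would not produce multi-word verb chunks such as "will narrow".

```python
def chunk(tagged):
    """Group (word, tag) pairs into flat (label, words) chunks."""
    single_word_labels = {"Pro": "NP", "PN": "NP", "V": "VP", "Prep": "PP"}
    chunks = []
    i = 0
    while i < len(tagged):
        word, tag = tagged[i]
        if tag in ("Det", "N"):
            # Noun phrase: an optional determiner followed by a run of nouns.
            j = i + 1 if tag == "Det" else i
            while j < len(tagged) and tagged[j][1] == "N":
                j += 1
            chunks.append(("NP", [w for w, _ in tagged[i:j]]))
            i = j
        else:
            # Everything else becomes a one-word chunk ("O" = outside any chunk).
            chunks.append((single_word_labels.get(tag, "O"), [word]))
            i += 1
    return chunks


def render(chunks):
    """Format chunks in the bracket notation used on the slide."""
    return " ".join(f"[{label} {' '.join(words)}]" for label, words in chunks)


tagged = [("I", "Pro"), ("ate", "V"), ("the", "Det"),
          ("spaghetti", "N"), ("with", "Prep"), ("meatballs", "N")]
print(render(chunk(tagged)))
```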
Word Sense Disambiguation (WSD)
• Words in natural language usually have a fair number of different possible meanings.
  • Ellen has a strong interest in computational linguistics.
  • Ellen pays a large amount of interest on her credit card.
• For many tasks (question answering, translation), the proper sense of each ambiguous word in a sentence must be determined.
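One classic way to pick a sense is to compare the words around the ambiguous word with a short description of each sense, in the spirit of the simplified Lesk algorithm. The sketch below applies that idea to "interest". The sense labels, the sense descriptions, and the stopword list are hand-written assumptions for this example and are not taken from any dictionary or library.

```python
import re

# Hypothetical sense signatures: a short gloss plus one example phrase each.
SENSE_SIGNATURES = {
    "curiosity": "a feeling of wanting to know or learn about something, "
                 "as in: a strong interest in music",
    "finance": "money paid regularly for the use of borrowed money, "
               "as in: interest on a loan or credit card",
}
STOPWORDS = {"a", "an", "the", "of", "in", "on", "or", "to",
             "for", "as", "her", "has", "about"}


def content_words(text):
    """Lowercase `text` and return its words minus stopwords."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def simplified_lesk(target, sentence):
    """Return the sense whose signature shares the most words with the context."""
    context = content_words(sentence) - {target}
    scores = {
        sense: len(context & content_words(signature))
        for sense, signature in SENSE_SIGNATURES.items()
    }
    return max(scores, key=scores.get)


print(simplified_lesk("interest", "Ellen has a strong interest in computational linguistics."))
print(simplified_lesk("interest", "Ellen pays a large amount of interest on her credit card."))
```

With signatures this small the overlap counts are tiny, so the result depends heavily on how the signatures are worded; the methods covered later in the course learn such associations from data instead.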
Semantic Role Labeling (SRL)
• For each clause, determine the semantic role played by each noun phrase that is an argument to the verb.
  Roles: agent, patient, source, destination, instrument
  • [agent John] drove [patient Mary] from [source Austin] to [destination Dallas] in [instrument his Toyota Prius].
  • [instrument The hammer] broke [patient the window].
• Also referred to as “case role analysis,” “thematic analysis,” and “shallow semantic parsing”
Semantic Parsing
• A semantic parser maps a natural-language sentence to a complete, detailed semantic representation (logical form).
• For many applications, the desired output is immediately executable by another program.
• Example: Mapping an English database query to Prolog:
  How many cities are there in the US?
  answer(A, count(B, (city(B), loc(B, C), const(C, countryid(USA))), A))
Textual Entailment
• Determine whether one natural language sentence entails (implies) another under an ordinary interpretation.
• E.g., “A soccer game with multiple males playing.” → “Some men are playing a sport.”
Anaphora Resolution / Co-Reference
• Determine which phrases in a document refer to the same underlying entity.
  • John put the carrot on the plate and ate it.
  • Bush started the war in Iraq. But the president needed the consent of Congress.
• Some cases require difficult reasoning.
  • Today was Jack's birthday. Penny and Janet went to the store. They were going to get presents. Janet decided to get a kite. "Don't do that," said Penny. "Jack has a kite. He will make you take it back."
Information Extraction (IE)
• Identify phrases in language that refer to specific types of entities and relations in text.
• Named entity recognition is the task of identifying names of people, places, organizations, etc. in text.
  Entity types: people, organizations, places
  • [people Michael Dell] is the CEO of [organizations Dell Computer Corporation] and lives in [places Austin Texas].
• Relation extraction identifies specific relations between entities.
  • Michael Dell [is the CEO of] Dell Computer Corporation and lives in Austin Texas.
  • Michael Dell is the CEO of Dell Computer Corporation and [lives in] Austin Texas.
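A very simple way to extract the two relations in the example is a hand-written pattern. The sketch below uses a regular expression that treats any run of capitalized words as a name. The pattern, the relation labels `ceo_of` and `lives_in`, and the assumption that "lives in" shares its subject with "is the CEO of" are all illustrative choices; they would not generalize beyond sentences of exactly this shape.

```python
import re

# A "name" here is just a run of capitalized words (a crude assumption).
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"
CEO_PATTERN = re.compile(
    rf"(?P<person>{NAME}) is the CEO of (?P<org>{NAME})"
    rf"(?: and lives in (?P<place>{NAME}))?"
)


def extract_relations(sentence):
    """Return (subject, relation, object) triples found by the pattern."""
    relations = []
    match = CEO_PATTERN.search(sentence)
    if match:
        relations.append((match["person"], "ceo_of", match["org"]))
        if match["place"]:
            # Assumes the coordinated verb phrase has the same subject.
            relations.append((match["person"], "lives_in", match["place"]))
    return relations


sentence = "Michael Dell is the CEO of Dell Computer Corporation and lives in Austin Texas."
for triple in extract_relations(sentence):
    print(triple)
```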
Question Answering
• Directly answer natural language questions based on information presented in a corpus of textual documents (e.g. the web).
  • Who is the president of the United States?
    • Donald Trump
  • What is the population of Massachusetts?
    • 6.8 million
Text Summarization
• Produce a short summary of one or many longer document(s).
  • Article: An international team of scientists studied diet and mortality in 135,335 people between 35 and 70 years old in 18 countries, following them for an average of more than seven years. Diet information depended on self-reports, and the scientists controlled for factors including age, sex, smoking, physical activity and body mass index. The study is in The Lancet. Compared with people who ate the lowest 20 percent of carbohydrates, those who ate the highest 20 percent had a 28 percent increased risk of death. But high carbohydrate intake was not associated with cardiovascular death. …
  • Summary: Researchers found that people who ate higher amounts of carbohydrates had a higher risk of dying than those who ate more fats.
Ambiguity Resolution is Required for Translation
• Syntactic and semantic ambiguities must be properly resolved for correct translation:
  • “John plays the guitar.” → “John 弹吉他”
  • “John plays soccer.” → “John 踢足球”
• An apocryphal story is that an early MT system gave the following results when translating from English to Russian and then back to English:
  • “The spirit is willing but the flesh is weak.” → “The liquor is good but the meat is spoiled.”
  • “Out of sight, out of mind.” → “Invisible idiot.”
Resolving Ambiguity
• Choosing the correct interpretation of linguistic utterances requires (commonsense) knowledge of:
  • Syntax
    • An agent is typically the subject of the verb
  • Semantics
    • Michael and Ellen are names of people
    • Austin is the name of a city (and of a person)
    • Toyota is a car company and Prius is a brand of car
  • Pragmatics
    • Social norms and communicative goals
    • Asking a question, expecting an answer
  • World knowledge
    • Credit cards require users to pay financial interest
    • Agents must be animate and a hammer is not animate
State of the Art
• Learning from large amounts of text data (cf. rule-based methods)
  • Supervised learning or unsupervised learning
• Statistical machine learning-based methods
  • The probabilistic knowledge acquired allows robust processing that handles linguistic regularities as well as exceptions.
• Now mostly with neural network-based methods
Related Fields
• Artificial Intelligence
• Machine Learning
• Linguistics
• Cognitive science
• Logic
• Data science
• Political science
• Education
• …many more
Relevant Scientific Conferences and Journals
• Association for Computational Linguistics (ACL)
• North American Chapter of the Association for Computational Linguistics (NAACL)
• Empirical Methods in Natural Language Processing (EMNLP)
• International Conference on Computational Linguistics (COLING)
• Conference on Computational Natural Language Learning (CoNLL)
• Transactions of the Association for Computational Linguistics (TACL)
• Journal of Computational Linguistics (CL)