CS 6120/CS4120: Natural Language Processing Instructor: Prof. Lu Wang College of Computer and Information Science Northeastern University Webpage: www.ccs.neu.edu/home/luwang

Time and Location
• Time: Tuesdays and Fridays, 3:25pm - 5:05pm
• Location: West Village H 108

Course Webpage

• http://www.ccs.neu.edu/home/luwang/courses/cs6120_sp2018/cs6120_sp2018.html

Prerequisites
• Programming
  • Being able to write code in some programming languages (e.g. Python, Java, C/C++, Matlab) proficiently
• Courses
  • Algorithms
  • Some calculus
  • Probability and statistics
  • Linear algebra (optional but highly recommended)

Prerequisites
• A quiz:
  • This Friday, in class
  • 22 simple questions, 20 of them as True or False questions (relevant to probability, statistics, and linear algebra)
  • The purpose of this quiz is to indicate the expected background of students.
  • 80% of the questions should be easy to answer.
  • Not counted in your final score!

Textbook and References
• Main textbook (and some slides)
  • Dan Jurafsky and James H. Martin, "Speech and Language Processing, 2nd Edition", Prentice Hall, 2009.
  • We will use some material from the 3rd edition when it is available.
    • http://web.stanford.edu/~jurafsky/slp3/
  • YouTube video: https://www.youtube.com/watch?v=s3kKlUBa3b0
• Other reference
  • Chris Manning and Hinrich Schütze, "Foundations of Statistical Natural Language Processing", MIT Press, 1999.
• Machine learning textbooks:
  • Christopher M. Bishop, "Pattern Recognition and Machine Learning", Springer, 2006.
  • Tom Mitchell, "Machine Learning", McGraw Hill, 1997.

Topics of the Course (tentatively)
• Language Modeling
• Part-of-Speech Tagging
• Text Categorization: Word Sense Disambiguation, Named Entity Recognition
• Syntax: Formal Grammars of English, Syntactic Parsing, Statistical Parsing, Dependency Parsing
• Semantics: Vector-Space, Lexical Semantics, Semantics with Dense Vectors
• Information Extraction
• Question Answering
• Machine Translation
• Summarization
• Sentiment Analysis, Opinion Mining
• NLP and Social Media
• Dialog Systems and Chatbots

The Goal
• Study fundamental tasks in NLP
• Learn some classic and state-of-the-art techniques
• Acquire hands-on skills for solving NLP problems
  • Even some research experience!

Grading
• Assignment (30%)
  • 2 assignments, 15% for each
• Quiz (5%)
  • 8 in-class tests, 1% for each (three lowest scores are dropped)
  • Tuesdays, and starting next week
• Final Exam (35%)
• Project (25%)
• Participation (5%)
  • Classes: ask and answer questions, participate in discussions…
  • Piazza: help your peers, address questions…

Exam
• Open book
• Time and place, TBD (still scheduling with the college)
• Please don’t make travel arrangements for exam weeks.

Course Project
• An NLP-related research project
• 2-3 students as a team

Course Project Grading
• We want to see novel projects!
  • The problem needs to be well-defined, novel, useful, and practical.
  • Reasonable results and observations.
• We encourage you to tackle a research-driven problem.
• Option 1: a research project discussed with the instructor
  • A better solution for an existing problem
  • Or a novel problem
• Option 2: The Fake News Challenge

Sample Projects from Previous Offering
• Project reports are listed here: http://www.ccs.neu.edu/home/luwang/courses/cs6120_fa2017.html
  • Neural Semantic Parsing Natural Language into SQL
  • Short Passages Reading Comprehension and Question Answering
  • Political Promise Evaluation (PPE)
  • Predicting Personality Traits using Tweets
  • STORYNEXT 2.0: A TEXT INSIGHTS/VISUALIZATION TOOL
  • Android Application for Visual QA
  • Novel Summarizer and Keyword Identifier Using TextRank with Sentence Farm Detection
  • Paraphrase Generation
  • Hashtag Similarity based on Tweet Text
  • Stance Detection for the Fake News Challenge
  • Machine Comprehension Using match-LSTM and Answer-Pointer
  • Online Abuse Detection
  • Plagiarism Detection Using FP-Growth Algorithm
  • An Examination of Influential Framing of Controversial Topics on Twitter

Neural Semantic Parsing Natural Language into SQL

Short Passages Reading Comprehension and Question Answering

Online Abuse Detection
• Automating the process of identifying abusive comments would not only save time for social media platforms but would also increase user safety and improve discussions online.
• Build a classifier that labels each test statement as either “Abuse” (aggression, personal attack, or a toxic statement) or “Not-Abuse”, using multiple techniques.

StoryNext 2: Sentiment Analysis for Documents

Option 2: The Fake News Challenge
• Website: http://www.fakenewschallenge.org/
• Goal: “The goal of the Fake News Challenge is to explore how artificial intelligence technologies, particularly machine learning and natural language processing, might be leveraged to combat the fake news problem. We believe that these AI technologies hold promise for significantly automating parts of the procedure human fact checkers use today to determine if a story is real or a hoax.”
• Stage 1: Stance Detection
• Data: https://github.com/FakeNewsChallenge/fnc-1
• Headline: “Robert Plant Ripped up $800M Led Zeppelin Reunion Contract”

Course Project Grading
• We want to see novel projects!
  • The problem needs to be well-defined, novel, useful, and practical.
  • Reasonable results and observations.
• We encourage you to tackle a research-driven problem.
  • A better solution for an existing problem
  • Or a novel problem
• Feel free to talk to the instructor or TAs on project topics during office hours.

Course Project Grading
• Three reports
  • Proposal (3%), due by the end of January
  • Progress, with code (7%)
  • Final, with code (10%)
• One presentation
  • In class (5%)

Audience Award
• Bonus points!
  • All teams vote for their favorite project(s).
  • Best project gets 1% as bonus (one best project in each batch, if we need to have more than one batch/lecture for presentation)

Submission and Late Policy
• Each assignment or report is due at the beginning of class on the corresponding due date.
• Programming language
  • Python (encouraged), Java, C/C++
• Electronic version
  • On Blackboard

Submission and Late Policy
• An assignment or report turned in late loses 20 points (out of 100 points) for each late day (i.e. 24 hours).
• Each student has a budget of 5 days throughout the semester before a late penalty is applied.
• Late days are not applicable to the final presentation.
• Each group member is charged the same number of late days, if any, for their submission.

How to find us?
• Course webpage:
  • http://www.ccs.neu.edu/home/luwang/courses/cs6120_sp2018/cs6120_sp2018.html
• Office hours
  • Prof. Lu Wang: Tuesdays, from 5:15pm to 6:15pm, or by appointment, 258 WVH
  • TA Liwen Hou
  • TA Tirthraj Maheshkumar Parmar
  • TA Manthan Thakar
• Piazza
  • http://piazza.com/northeastern/sp2018/cs6120/home
  • All course-relevant questions should go here; it is also the best way to reach the instructor and TAs!

What is Natural Language Processing?
• Allowing machines to communicate with humans
• Natural language understanding + natural language generation

What does it mean to understand a language?
[Figure: levels of linguistic analysis, from shallower to deeper analysis: Phonology, Morphology, Lexemes, Syntax, Semantics, Pragmatics, Discourse. Representations shown alongside: sound waves, words, parse trees, meanings.]

Syntax, Semantics, Pragmatics
• Syntax concerns the proper ordering of words and its effect on meaning.
  • The dog bit the boy.
  • The boy bit the dog.
  • Bit boy dog the the.
• Semantics concerns the (literal) meaning of words, phrases, and sentences.
  • “plant” as a photosynthetic organism
  • “plant” as a manufacturing facility
  • “plant” as the act of sowing
• Pragmatics concerns the overall communicative and social context and its effect on interpretation.
  • The ham sandwich wants another beer.
  • John thinks vanilla.

[Modified from Ray Mooney’s Slides]

Ambiguity is Ubiquitous
• Speech Recognition
  • “recognize speech” vs. “wreck a nice beach”
  • “youth in Asia” vs. “euthanasia”
• Syntactic Analysis
  • “I ate spaghetti with chopsticks” vs. “I ate spaghetti with meatballs.”
• Semantic Analysis
  • “The dog is in the pen.” vs. “The ink is in the pen.”
  • “I put the plant in the window” vs. “Ford put the plant in Mexico”
• Pragmatic Analysis
  • From “The Pink Panther Strikes Again”:
    Clouseau: Does your dog bite?
    Hotel Clerk: No.
    Clouseau: [bowing down to pet the dog] Nice doggie. [Dog barks and bites Clouseau in the hand]
    Clouseau: I thought you said your dog did not bite!
    Hotel Clerk: That is not my dog.

Ambiguity is Explosive
• Ambiguities compound to generate enormous numbers of possible interpretations.
• In English, a sentence ending in n prepositional phrases has over 2^n syntactic interpretations (cf. Catalan numbers).
  • “I saw the man with the telescope”: 2 parses
  • “I saw the man on the hill with the telescope.”: 5 parses
  • “I saw the man on the hill in Texas with the telescope”: 14 parses
  • “I saw the man on the hill in Texas with the telescope at noon.”: 42 parses
  • “I saw the man on the hill in Texas with the telescope at noon on Monday”: 132 parses
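The parse counts listed above (2, 5, 14, 42, 132) are consecutive Catalan numbers. A minimal sketch that reproduces them, assuming the standard closed form C_n = (2n choose n) / (n + 1) and that a sentence ending in k prepositional phrases corresponds to C_(k+1); the function names are illustrative, not from the slides:

```python
from math import comb


def catalan(n: int) -> int:
    """Return the n-th Catalan number, C_n = (2n choose n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)


def parse_count(num_pps: int) -> int:
    """Parse count for a sentence ending in `num_pps` prepositional phrases.

    Assumption: the slide's figures match C_(k+1) for k prepositional phrases.
    """
    return catalan(num_pps + 1)


# One line per example sentence: number of PPs, number of parses.
for k in range(1, 6):
    print(k, parse_count(k))
```

The growth is roughly exponential in the number of attached phrases, which is why exhaustive enumeration of parses quickly becomes impractical.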

Humor and Ambiguity
• Many jokes rely on the ambiguity of language:
  • Policeman to little boy: “We are looking for a thief with a bicycle.” Little boy: “Wouldn’t you be better using your eyes?”
  • Why is the teacher wearing sunglasses? Because the class is so bright.
  • Groucho Marx: One morning I shot an elephant in my pajamas. How he got into my pajamas, I’ll never know.
  • She criticized my apartment, so I knocked her flat.
  • Noah took all of the animals on the ark in pairs. Except the worms, they came in apples.

Why is Language Ambiguous?
• Having a unique linguistic expression for every possible conceptualization that could be conveyed would make language overly complex and linguistic expressions unnecessarily long.
• Allowing resolvable ambiguity permits shorter linguistic expressions, i.e. data compression.
• Language relies on people’s ability to use their knowledge and inference abilities to properly resolve ambiguities.
• Infrequently, disambiguation fails, i.e. the compression is lossy.

Some NLP Tasks

Syntactic Tasks

Word Segmentation
• Breaking a string of characters into a sequence of words.
• In some written languages (e.g. Chinese) words are not separated by spaces.
• Even in English, characters other than white-space can be used to separate words [e.g. , ; . - : ( ) ]
• Examples from English URLs:
  • jumptheshark.com ⇒ jump the shark .com
  • myspace.com/pluckerswingbar
    ⇒ myspace .com pluckers wing bar
    ⇒ myspace .com plucker swing bar
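The "pluckerswingbar" example can be reproduced with a small dictionary-based segmenter. This is only a sketch: the vocabulary is a hypothetical five-word list chosen for the example, and real segmenters score candidate splits (for instance with a language model) rather than enumerating all of them.

```python
def segmentations(text, vocabulary):
    """Return every way to split `text` into words drawn from `vocabulary`."""
    if not text:
        return [[]]  # exactly one way to segment the empty string
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in vocabulary:
            # Keep this prefix as a word and segment the remainder recursively.
            for rest in segmentations(text[end:], vocabulary):
                results.append([prefix] + rest)
    return results


# Hypothetical toy vocabulary, just large enough to show the ambiguity.
VOCABULARY = {"plucker", "pluckers", "swing", "wing", "bar"}

for words in segmentations("pluckerswingbar", VOCABULARY):
    print(" ".join(words))
```

Both readings from the slide come out of the enumeration; choosing between them requires knowledge beyond the word list.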

Morphological Analysis
• Morphology is the field of linguistics that studies the internal structure of words. (Wikipedia)
• A morpheme is the smallest linguistic unit that has semantic meaning (Wikipedia)
  • e.g. “carry”, “pre”, “ed”, “ly”, “s”
• Morphological analysis is the task of segmenting a word into its morphemes:
  • carried ⇒ carry + ed (past tense)
  • independently ⇒ in + (depend + ent) + ly
  • Googlers ⇒ (Google + er) + s (plural)
  • unlockable ⇒ un + (lock + able)?
    ⇒ (un + lock) + able?

Part Of Speech (POS) Tagging
• Annotate each word in a sentence with a part-of-speech.
• Useful for subsequent syntactic parsing and word sense disambiguation.

I    ate  the  spaghetti  with  meatballs.
Pro  V    Det  N          Prep  N

John  saw  the  saw  and  decided  to    take  it   to    the  table.
PN    V    Det  N    Con  V        Part  V     Pro  Prep  Det  N
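The word "saw" in the second sentence gets two different tags, so a tagger cannot simply look each word up. A toy sketch of that idea, assuming a hypothetical three-word lexicon and a single hand-written context rule (an ambiguous word directly after a determiner is tagged as a noun); practical taggers learn such preferences from annotated data instead:

```python
# Hypothetical lexicon: each word maps to its possible tags, most likely first.
LEXICON = {
    "john": ["PN"],
    "saw": ["V", "N"],
    "the": ["Det"],
}


def tag(words):
    """Tag words left to right, using the previous tag to resolve ambiguity."""
    tags = []
    for word in words:
        candidates = LEXICON[word.lower()]
        follows_det = bool(tags) and tags[-1] == "Det"
        if len(candidates) > 1 and follows_det and "N" in candidates:
            tags.append("N")  # rule: after a determiner, prefer the noun reading
        else:
            tags.append(candidates[0])
    return list(zip(words, tags))


print(tag(["John", "saw", "the", "saw"]))
```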

Phrase Chunking
• Find all non-recursive noun phrases (NPs) and verb phrases (VPs) in a sentence.
  • [NP I] [VP ate] [NP the spaghetti] [PP with] [NP meatballs].
  • [NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only #1.8 billion] [PP in] [NP September]

Syntactic Parsing
• Produce the correct syntactic parse tree for a sentence.

Semantic Tasks

Word Sense Disambiguation (WSD)
• Words in natural language usually have a fair number of different possible meanings.
  • Ellen has a strong interest in computational linguistics.
  • Ellen pays a large amount of interest on her credit card.
• For many tasks (question answering, translation), the proper sense of each ambiguous word in a sentence must be determined.
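One classic way to pick a sense is to compare the sentence with a short definition of each sense and keep the sense with the most shared words (a simplified form of the Lesk algorithm, which is not named on the slide). The sketch below applies it to the two "interest" sentences; the sense labels, glosses, and stopword list are made up for illustration:

```python
# Hypothetical sense inventory for "interest": label -> hand-written gloss.
SENSES = {
    "curiosity": "a feeling of wanting to learn more about a subject or field such as linguistics",
    "finance": "a charge paid on borrowed money such as a loan or credit card debt",
}

# Hypothetical stopword list: function words that should not count as overlap.
STOPWORDS = {"a", "of", "on", "or", "as", "the", "to", "in", "her", "has", "such", "about"}


def content_words(text):
    """Lowercase the text, strip trailing punctuation, and drop stopwords."""
    words = (w.strip(".,").lower() for w in text.split())
    return {w for w in words if w and w not in STOPWORDS}


def disambiguate(sentence, senses=SENSES):
    """Return the sense whose gloss shares the most content words with the sentence."""
    context = content_words(sentence)
    return max(senses, key=lambda s: len(context & content_words(senses[s])))


print(disambiguate("Ellen has a strong interest in computational linguistics."))
print(disambiguate("Ellen pays a large amount of interest on her credit card."))
```

The result depends entirely on how the glosses are worded, which is one reason learned models are usually preferred over raw word overlap.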

Semantic Role Labeling (SRL)
• For each clause, determine the semantic role played by each noun phrase that is an argument to the verb. Roles: agent, patient, source, destination, instrument.
  • [agent John] drove [patient Mary] from [source Austin] to [destination Dallas] in [instrument his Toyota Prius].
  • [instrument The hammer] broke [patient the window].
• Also referred to as “case role analysis,” “thematic analysis,” and “shallow semantic parsing”

Semantic Parsing
• A semantic parser maps a natural-language sentence to a complete, detailed semantic representation (logical form).
• For many applications, the desired output is immediately executable by another program.
• Example: Mapping an English database query to Prolog:
  How many cities are there in the US?
  answer(A, count(B, (city(B), loc(B, C), const(C, countryid(USA))), A))
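To see what "immediately executable" means, the Prolog query above can be mirrored in Python as a count over a small fact base. This is a sketch only: the city facts are invented for the example and the function is a hand translation of the logical form, not the output of a semantic parser.

```python
# Hypothetical fact base of (city, country) pairs; not real course data.
CITY_LOCATIONS = [
    ("austin", "usa"),
    ("dallas", "usa"),
    ("boston", "usa"),
    ("toronto", "canada"),
]


def count_cities_in(country):
    """Mirror count(B, (city(B), loc(B, C), const(C, countryid(Country))), A).

    B ranges over cities, C is bound to the given country, and the
    returned value plays the role of the answer variable A.
    """
    return sum(1 for _city, location in CITY_LOCATIONS if location == country)


print(count_cities_in("usa"))
```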

Textual Entailment
• Determine whether one natural language sentence entails (implies) another under an ordinary interpretation.
• E.g., “A soccer game with multiple males playing.” → “Some men are playing a sport.”

Pragmatics/Discourse Tasks

Anaphora Resolution/Co-Reference
• Determine which phrases in a document refer to the same underlying entity.
  • John put the carrot on the plate and ate it.
  • Bush started the war in Iraq. But the president needed the consent of Congress.
• Some cases require difficult reasoning.
  • Today was Jack's birthday. Penny and Janet went to the store. They were going to get presents. Janet decided to get a kite. "Don't do that," said Penny. "Jack has a kite. He will make you take it back."

More Application-driven Tasks

Information Extraction (IE)
• Identify phrases in language that refer to specific types of entities and relations in text.
• Named entity recognition is the task of identifying names of people, places, organizations, etc. in text.
  • [person Michael Dell] is the CEO of [organization Dell Computer Corporation] and lives in [place Austin Texas].
• Relation extraction identifies specific relations between entities.
  • [Michael Dell] is the CEO of [Dell Computer Corporation] and lives in Austin Texas.
  • [Michael Dell] is the CEO of Dell Computer Corporation and lives in [Austin Texas].
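The output of these two steps is usually stored as structured records rather than text. A minimal sketch of one possible representation for the Michael Dell sentence, with typed entities and relations as triples; the class names and the relation labels `ceo_of` and `lives_in` are assumptions made for this example:

```python
from typing import NamedTuple


class Entity(NamedTuple):
    text: str
    kind: str  # e.g. "person", "organization", "place"


class Relation(NamedTuple):
    name: str
    subject: Entity
    obj: Entity


# Entities a named entity recognizer would be expected to find.
michael = Entity("Michael Dell", "person")
company = Entity("Dell Computer Corporation", "organization")
city = Entity("Austin Texas", "place")

# Relations a relation extractor would be expected to find (labels are hypothetical).
relations = [
    Relation("ceo_of", michael, company),
    Relation("lives_in", michael, city),
]

for relation in relations:
    print(f"{relation.name}({relation.subject.text}, {relation.obj.text})")
```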

Question Answering
• Directly answer natural language questions based on information presented in a corpus of textual documents (e.g. the web).
  • Who is the president of the United States?
    • Donald Trump
  • What is the population of Massachusetts?
    • 6.8 million

Text Summarization
• Produce a short summary of one or many longer document(s).
  • Article: An international team of scientists studied diet and mortality in 135,335 people between 35 and 70 years old in 18 countries, following them for an average of more than seven years. Diet information depended on self-reports, and the scientists controlled for factors including age, sex, smoking, physical activity and body mass index. The study is in The Lancet. Compared with people who ate the lowest 20 percent of carbohydrates, those who ate the highest 20 percent had a 28 percent increased risk of death. But high carbohydrate intake was not associated with cardiovascular death. …
  • Summary: Researchers found that people who ate higher amounts of carbohydrates had a higher risk of dying than those who ate more fats.

Spoken Dialogue Systems -- Chatbots
• Q: Is it going to rain today?
• A: It will be mostly sunny. No rain is expected.

Machine Translation
• Translate a sentence from one natural language to another.
  • 我喜欢汉堡 → I like burgers.

Ambiguity Resolution is Required for Translation
• Syntactic and semantic ambiguities must be properly resolved for correct translation:
  • “John plays the guitar.” → “John弹吉他”
  • “John plays soccer.” → “John踢足球”

• An apocryphal story is that an early MT system gave the following results when translating from English to Russian and then back to English:
  • “The spirit is willing but the flesh is weak.” → “The liquor is good but the meat is spoiled.”
  • “Out of sight, out of mind.” → “Invisible idiot.”
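The "plays" example shows that a translation system cannot translate word by word: the Chinese verb depends on the object. A toy sketch of that dependency, assuming a hypothetical two-entry lexicon that covers only the two sentences from the slide:

```python
# Hypothetical lexicon: the translation of "plays" is chosen by its object.
PLAY_TRANSLATIONS = {
    "guitar": "弹",
    "soccer": "踢",
}

# Hypothetical translations of the two objects.
OBJECT_TRANSLATIONS = {
    "guitar": "吉他",
    "soccer": "足球",
}


def translate_plays(subject, obj):
    """Translate '<subject> plays <obj>', picking the verb that fits the object."""
    return f"{subject}{PLAY_TRANSLATIONS[obj]}{OBJECT_TRANSLATIONS[obj]}"


print(translate_plays("John", "guitar"))
print(translate_plays("John", "soccer"))
```

A lookup table like this does not scale; it only makes explicit the kind of context a real system has to take into account when choosing a translation.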

Resolving Ambiguity
• Choosing the correct interpretation of linguistic utterances requires (commonsense) knowledge of:
  • Syntax
    • An agent is typically the subject of the verb
  • Semantics
    • Michael and Ellen are names of people
    • Austin is the name of a city (and of a person)
    • Toyota is a car company and Prius is a brand of car
  • Pragmatics
    • Social norms, communicative goals
    • Asking a question, expecting an answer
  • World knowledge
    • Credit cards require users to pay financial interest
    • Agents must be animate and a hammer is not animate

State of the Art
• Learning from large amounts of text data (cf. rule-based methods)
  • Supervised learning or unsupervised learning
• Statistical machine learning-based methods
  • The probabilistic knowledge acquired allows robust processing that handles linguistic regularities as well as exceptions.
• Now mostly neural network-based methods

Related Fields
• Artificial Intelligence
• Machine Learning
• Linguistics
• Cognitive science
• Logic
• Data science
• Political science
• Education
• … many more

Relevant Scientific Conferences and Journals
• Association for Computational Linguistics (ACL)
• North American Chapter of the Association for Computational Linguistics (NAACL)
• Empirical Methods in Natural Language Processing (EMNLP)
• International Conference on Computational Linguistics (COLING)
• Conference on Computational Natural Language Learning (CoNLL)
• Transactions of the Association for Computational Linguistics (TACL)
• Journal of Computational Linguistics (CL)