Corpus-Based Computational Linguistics
Alon Itai
Department of Computer Science
Technion—Israel Institute of Technology, Haifa, Israel


Introduction

Language is the communication medium between humans. If we want computers to communicate with us, we have to teach them human language (a.k.a. "Natural Language").

Investigating natural language is a complex and interesting challenge.

It also has many immediate technological benefits, such as:
- Internet search engines
- voice interfaces with computers, e.g. for children and handicapped persons

Search Google for "dog":
- Dogpile - Web Search Home Page
- dog.com - a dog's best friend
- Sausage Software - HotDog Web Editors - HotDog Professional: "Special! Make certain you check out our Dog Packs before making your purchase."
- explodingdog
- Dog-Play: great activities you can do with your dog. Activities for Dogs: Having Fun with Your Dog

Search Google for "dogs":
- I-Love-Dogs.com - Free Resources For Dog Lovers, including Free Dog Lovers…
- Dogs in Canada - Canada's Top Obedience Dogs 2002
- Guide Dogs for the Blind - a nonprofit charitable organization, (800) 295-4050
- Dogs of the Dow - high dividend yield Dow stocks: "In addition to the high dividend yield stocks of the Dogs of the Dow, you will also find a helpful set of long-term stock market charts, investment research…"

Ambiguity—the Main Problem

- Morphological: בחור
- Lexical: bank
- Syntactic: I saw the man with the telescope
- Semantic: hot dogs for the blind
- Pragmatic: Can you pass the salt?

Methods for Understanding Language

- The traditional method: develop a theory, build tools…
- Corpus methods

Hebrew Computational Linguistics

- Most research is in English.
- Many other languages have developed tools.
- Why not in Hebrew?

Special Problems of Hebrew

- The writing system: technical problems (right to left, fonts), different writing systems, no standard.
- Complex morphology: "You have left me" = עזבתני, 4 morphemes in one word.
- Ambiguity: הקפה = ה+קפה, הקפה, הקפה; בחור = ב+ה+חור, בחור. Some words can have up to 13 different readings (שמונה).

Achievements

- Constructed an analyzer that reads Hebrew text with 96% accuracy.
- Constructing a database for automatically building a syntactic analyzer of Hebrew.
- Search engines

Summary and Vision

- Bring research on Hebrew to the state of other languages.
- Learn from Hebrew about similar languages.
- Develop methods for handling any language.
- Develop mechanisms for understanding texts.
- Better understand how the human brain works; pass the Turing test.

The Problem

Written Hebrew texts are ambiguous. The reasons:
- The vowels and gemination are omitted: the unvocalized string קופה stands for several vocalized words, e.g. קֻפָּה and קוֹפָה.
- Small words are prepended: וכשתלך = ו + כש + תלך = "and when you will go".
- Hebrew morphology is complex.

The Structure of a Hebrew Word

Morphemes:
- the lexical lemma
- short words, such as determiners, prepositions, and conjunctions, prepended to the word
- suffixes for possessives and object clitics

Linguistic features:
- mark part-of-speech (POS), tense, person, etc.

Example

$QMTI = שקמתי

- $iqmati (שִׁקְמָתִי) "my sycamore": noun sg + possessive-1sg
- $e-qamti (שֶׁקַּמְתִּי) "that I got up": connective + verb 1sg past
- $e-qammati (שֶׁקַּמָּתִי) "that my hay": connective + noun sg + possessive-1sg
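To make this structure concrete, here is a minimal sketch (in Python; the type and the lemma transliterations are illustrative assumptions, not from the talk) of one morphological reading, populated with the three analyses above:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Analysis:
    """One morphological reading of a surface word."""
    lemma: str                                   # the lexical lemma
    prefixes: list = field(default_factory=list) # prepended determiners/prepositions/conjunctions
    suffix: Optional[str] = None                 # possessive or object clitic, if any
    features: dict = field(default_factory=dict) # POS, tense, person, number, ...

# The three readings of $QMTI (lemma transliterations are guesses for illustration):
readings = [
    Analysis("$iqmh", suffix="poss-1sg", features={"pos": "noun", "num": "sg"}),
    Analysis("qwm", prefixes=["$e"], features={"pos": "verb", "person": "1sg", "tense": "past"}),
    Analysis("qmh", prefixes=["$e"], suffix="poss-1sg", features={"pos": "noun", "num": "sg"}),
]
```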

Previous Work

POS and morphological disambiguation

Three Stages

1. Word stage: find the most probable reading of a word, regardless of its context.
2. Pair stage: correct the analysis of a word based on the analyses of its immediate neighbors.
3. Sentence stage: use a syntactic parser to rule out improbable analyses.

Combining all three stages yielded the best results.

The Word Stage

Give each word its most probable analysis.

How can we estimate the probability of each analysis? Ideally, from a large analyzed corpus. However, a large enough corpus does not exist: since each Hebrew word has many forms, the number of distinct word forms is so large that many of them won't appear even in a 10M-word corpus.

The Word Stage: the Similar Words Method

Following the "Similar Words" method (Levinger, Ornan, and Itai, 1995), estimate the probability of each analysis of an ambiguous word by changing a (single) feature of each analysis and comparing the occurrences of the resultant words in a large corpus.

Example: HQPH (הקפה)
- "the coffee": definite → indefinite gives QPH
- "encirclement": indefinite → definite gives HHQPH
- "her perimeter": feminine possessive → masculine possessive gives HQPW

Distribution: QPH = 180, HHQPH = 18, HQPW = 2

Our Variation of the SW Method

To overcome sparseness, we assumed that the lemma and the other morphemes/linguistic features are statistically independent, namely:

P(the coffee) = P(the) · P(coffee)

Even though the assumption is not valid, the resultant ranking is correct.
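As a sketch of how this assumption turns corpus counts into a ranking (building on the Analysis type sketched earlier; the `prob` lookup stands in for a corpus-estimated unigram table and is an assumption, not the authors' code):

```python
def rank_analyses(analyses, prob):
    """Score each candidate analysis of one ambiguous word as
    P(lemma) * product of P(feature), then normalize over the candidates."""
    scored = []
    for a in analyses:
        p = prob(a.lemma)                 # P(lemma), estimated from the corpus
        for f in a.features.values():
            p *= prob(f)                  # independence: multiply feature probs
        scored.append((p, a))
    total = sum(p for p, _ in scored) or 1.0
    return sorted(((p / total, a) for p, a in scored),
                  key=lambda pair: pair[0], reverse=True)
```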

Evaluation and Complexity

- Errors: 36% → 14.5%.
- Complexity of the algorithm: O(n), where n is the size of the corpus.
- Keeping a copy of the corpus as an inverted file reduces the complexity to linear in the number of distinct similar words.

The Pair Stage

- Following Brill, we learned correction rules from a corpus.
- The initial morphological score of an analysis is its probability as obtained at the word stage.
- Correction rules modify the scores: for each pair of adjacent words, check whether a rule applies, and if so modify the scores.

Example of a Correction Rule

If the POS of the current tag of w1 is a proper noun,
and the POS of the current tag of w2 is a noun,
and w2 has an analysis as a verb that matches w1 in gender and number,
then add 0.5 to the morphological score of w2 as a verb, and normalize the scores.

Example

YWSP &DR (יוסף עדר)

- YWSP = proper noun, masc (Joseph)
- &DR = noun masc sg abs indef ("herd"), score = 0.7
- &DR = verb past 3sg masc ("hoed"), score = 0.3

The rule adds 0.5 to the verb analysis: 0.3 + 0.5 = 0.8. After normalization, score(noun) = 0.7 / 1.5 ≈ 0.467 and score(verb) = 0.8 / 1.5 ≈ 0.533, so the verb reading now wins.
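A runnable sketch of this rule application and renormalization (the tag representation is an assumption; only the 0.5 bonus and the arithmetic come from the slides):

```python
def apply_correction(w1, w2, bonus=0.5):
    """w1, w2: lists of (tags, score) pairs for two adjacent words, each word's
    scores summing to 1. Implements the example rule: a proper noun followed by
    a word currently tagged noun that also has an agreeing verb analysis."""
    best1 = max(w1, key=lambda x: x[1])[0]
    best2 = max(w2, key=lambda x: x[1])[0]
    if best1["pos"] != "proper-noun" or best2["pos"] != "noun":
        return w2                                # rule does not apply
    boosted = []
    for tags, score in w2:
        if (tags["pos"] == "verb"
                and tags.get("gender") == best1.get("gender")
                and tags.get("number") == best1.get("number")):
            score += bonus                       # 0.3 + 0.5 = 0.8
        boosted.append((tags, score))
    total = sum(s for _, s in boosted)           # 0.7 + 0.8 = 1.5
    return [(t, s / total) for t, s in boosted]  # renormalize

w1 = [({"pos": "proper-noun", "gender": "masc", "number": "sg"}, 1.0)]  # YWSP
w2 = [({"pos": "noun", "gender": "masc", "number": "sg"}, 0.7),         # "herd"
      ({"pos": "verb", "gender": "masc", "number": "sg"}, 0.3)]         # "hoed"
print([round(s, 3) for _, s in apply_correction(w1, w2)])  # [0.467, 0.533]
```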

Learning the Rules from a Training Corpus

Input: a training corpus in which each word is correctly analyzed.

- Run the word stage on the training corpus.
- Generate all possible rules.
- For each rule, set the correction factor to the minimum value that does more good than damage.
- Choose the rule that yields the maximum benefit.
- Repeat until no rule improves the overall analysis of the training corpus.

A sketch of this greedy loop appears below.
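In outline, the loop could look like this (a sketch; `evaluate`, `rule.apply`, and `minimal_beneficial_factor` are assumed interfaces standing in for the authors' implementation):

```python
def learn_rules(corpus, candidate_rules, evaluate, minimal_beneficial_factor):
    """Greedy Brill-style learning: repeatedly commit the single rule whose
    application most increases the number of correctly analyzed words."""
    learned = []
    while True:
        base = evaluate(corpus)                 # correct analyses before this round
        best_rule, best_gain = None, 0
        for rule in candidate_rules:
            # smallest correction factor that does more good than damage
            rule.factor = minimal_beneficial_factor(rule, corpus)
            gain = evaluate(rule.apply(corpus)) - base
            if gain > best_gain:
                best_rule, best_gain = rule, gain
        if best_rule is None:                   # no rule still improves the corpus
            return learned
        corpus = best_rule.apply(corpus)        # commit the winning rule
        learned.append(best_rule)
```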

Evaluation and Complexity

- Training corpus: 4,892 word tokens; 93 rules learned; errors: 14.5% → 6.2%.
- Complexity of the learning algorithm: O(c³), where c = size of the training corpus.
- Complexity of the correction: O(r·n), where r = number of rules and n = size of the input text.

The Sentence Stage

- Use a syntactic parser to rule out improbable analyses.
- The pair stage captures adjacent words; the sentence stage captures long-range dependencies.

Example

מורה הכיתה הנמוכה נכנס לכיתה
MWRH HKITH HNMWKH NKNS LKITH
more/mora ha-kitta ha-nmuka niknas … ("the teacher of the lower class entered the classroom")
more is masculine, mora feminine; the verb niknas is masculine.

Since the distant verb agrees only with the masculine reading, the parser selects:
more ha-kitta ha-nmuka niknas …

Score of a Syntax Tree

[Parse tree for "more ha-kitta ha-nmuka niknas la-kitta": S → NP VP, with the NP covering more ha-kitta ha-nmuka and the VP covering niknas la-kitta.]

score(s) = score(more) · score(ha-kitta) · … · score(la-kitta)

The challenge: calculate the score of all syntax trees without enumerating all trees.

Dynamic Programming

Table[i, j, A] = the maximum score of all parses of w_i … w_j rooted at nonterminal A.

Fill the table by increasing values of j − i:
- Base case (j = i): Table[i, i, A] = max { s(t) : t is an analysis of w_i and A → t is in G }.
- General case (j > i): Table[i, j, A] = max { Table[i, k, B] · Table[k+1, j, C] : A → B C is in G, i ≤ k < j }.

Time complexity: O(|G| · n³).
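A minimal weighted-CKY sketch of this recurrence (assuming a grammar already in Chomsky normal form with unweighted binary rules; the leaf scores are the word/pair-stage output):

```python
from collections import defaultdict

def best_parse_score(word_scores, binary_rules, start="S"):
    """word_scores[i]: {preterminal A: morphological score of word i as A};
    binary_rules: {A: [(B, C), ...]}. Returns the best full-parse score,
    i.e. Table[0, n-1, start] from the recurrence above."""
    n = len(word_scores)
    table = [[defaultdict(float) for _ in range(n)] for _ in range(n)]
    for i in range(n):                          # base case: Table[i, i, A]
        for A, s in word_scores[i].items():
            table[i][i][A] = s
    for span in range(1, n):                    # fill by increasing j - i
        for i in range(n - span):
            j = i + span
            for A, rhs in binary_rules.items():
                for B, C in rhs:
                    for k in range(i, j):       # best split point k
                        s = table[i][k][B] * table[k + 1][j][C]
                        if s > table[i][j][A]:
                            table[i][j][A] = s
    return table[0][n - 1][start]
```

Each of the O(n²) table cells is filled by scanning O(|G|) rules and O(n) split points, which gives the O(|G|·n³) bound stated above.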

Evaluation

[Chart: error rate after each stage (Word Stage, Pair Stage, Sentence Stage).]

Language is the communication media between humans If we want to allow computers to communicate with us we have to teach them human language (aka ldquoNatural Languagerdquo)

Investigating natural language is a complex and interesting challange

It also has many immediate technological benefits such asInternet search enginesvoice interface with computersndash for children handicapped persons

IntroductionIntroduction

Search Google for ldquodogrdquoSearch Google for ldquodogrdquo DogpileDogpile Web Search Home Page Web Search Home Page

dogdogcomcom - a - a dogsdogs best friend best friend

Sausage Software - HotDog Web Editors - HotDog Sausage Software - HotDog Web Editors - HotDog Professional Professional Special Make certain you check out our Special Make certain you check out our DogDog Packs before making yourPacks before making yourpurchase purchase דפים דומיםדפים דומים

explodingdog explodingdog DogDog-Play Great activities you can do with your -Play Great activities you can do with your

dogdogActivities for Dogs Having Fun with Your Activities for Dogs Having Fun with Your DogDog

Search Google for ldquodogsrdquoSearch Google for ldquodogsrdquo I-Love-I-Love-DogsDogscomcom - Free Resources For Dog Lovers - Free Resources For Dog Lovers

including Free Dog Loversincluding Free Dog LoversI-Love-I-Love-DogsDogscom - Free Dog Resources For Dog com - Free Dog Resources For Dog Lovers includinghellipLovers includinghellip

DogsDogs in Canada in CanadaCanadas Top Obedience Canadas Top Obedience DogsDogs 2002 2002

Guide Guide DogsDogs for the Blind for the BlindGuide Guide DogsDogs for the Blind A nonprofit charitable for the Blind A nonprofit charitable organization (800) 295-4050organization (800) 295-4050

DogsDogs of the Dow - High dividend yield Dow stocks - www of the Dow - High dividend yield Dow stocks - www In addition to the high dividend yield stocks of the In addition to the high dividend yield stocks of the DogsDogs of the Dow you will also find a helpful set of of the Dow you will also find a helpful set of long-term stock market charts investment research long-term stock market charts investment research

Ambiguitymdashthe Main ProblemAmbiguitymdashthe Main Problem

Morphological Morphological בחורבחור LexicalLexical bankbank SyntacticalSyntactical

I saw the man with the telescopeI saw the man with the telescope SemanticSemantic hot dogs for the blindhot dogs for the blind Pragmatic Pragmatic Can you pass the saltCan you pass the salt

Methods for understanding Methods for understanding LanguageLanguage

The traditional methodThe traditional method

Develop a theory build toolshellipDevelop a theory build toolshellip

Corpus methodsCorpus methods

Hebrew Computational LinguisticsHebrew Computational Linguistics

Most research is in EnglishMost research is in English

Many other languages have developed toolsMany other languages have developed tools

Why not in HebrewWhy not in Hebrew

Special problems of HebrewSpecial problems of Hebrew

The writing systemThe writing systemTechnical problems (right to left) font Technical problems (right to left) font different writing systemsdifferent writing systemsno standardno standard

Complex morphologyComplex morphologyYou have left me = You have left me = עזבתניעזבתני 4 morphemes in one word4 morphemes in one word

AmbiguityAmbiguityהקפה = ה+קפה הקפה הקפההקפה = ה+קפה הקפה הקפה

בחור = ב+ה+חור בחורבחור = ב+ה+חור בחור some words can have upto 13 different some words can have upto 13 different

readings (readings (שמונהשמונה))

AchievementsAchievements

Constructed an analyzer to read Constructed an analyzer to read Hebrew text that is 96 correctHebrew text that is 96 correct

Constructing data base for Constructing data base for automatically constructing a automatically constructing a syntactic analyzer of Hebrewsyntactic analyzer of Hebrew

Search enginesSearch engines

סיכום וחזוןסיכום וחזון

נביא את המחקר בעברית למצבן של שפות נביא את המחקר בעברית למצבן של שפותאחרותאחרות

נלמד מעברית על שפות דומות נלמד מעברית על שפות דומותנפתח שיטות לטיפול בשפה כלשהינפתח שיטות לטיפול בשפה כלשהינפתח מנגנונים להבנת טכסטיםנפתח מנגנונים להבנת טכסטיםנבין טוב יותר כיצד המוח האנושי פועלנבין טוב יותר כיצד המוח האנושי פועל לעבור את מבחן טיורינג לעבור את מבחן טיורינג

The ProblemThe ProblemWritten Hebrew texts are ambiguous Written Hebrew texts are ambiguous The reasonsThe reasons

The vowels and gemination are omitted The vowels and gemination are omitted ה JפKה קו Jהקופה = קופ JפKה קו Jקופה = קופ

small words are prependedsmall words are prependedand when you will goand when you will go וכשתלך = ו + כש + תלך = וכשתלך = ו + כש + תלך =

Hebrew morphology is complexHebrew morphology is complex

The structure of a Hebrew wordThe structure of a Hebrew word

the lexical lemma the lexical lemma short words such as determiners short words such as determiners

prepositions and conjunctions prepositions and conjunctions prepended to the word prepended to the word

suffixes for possessives and object suffixes for possessives and object cliticsclitics

The linguistic features mark part-of-The linguistic features mark part-of-speech (POS) tense person etc speech (POS) tense person etc

morphemesmorphemes

linguistic featureslinguistic features

ExampleExample

$QMTI $QMTI שקמתישקמתי

$iqmati ndash$iqmati ndash י Pת Jמ Qק Pיש Pת Jמ Qק Pש my sycamore my sycamore

$e-qamti ndash$e-qamti ndash י Pת Qמ Rק Sיש Pת Qמ Rק Sש that I got up that I got up

$e-qammati ndash$e-qammati ndash י Pת Jמ Rק Sיש Pת Jמ Rק Sש that my hey that my hey

noun sg possessive-1sgnoun sg possessive-1sg

connective+verb 1sg pastconnective+verb 1sg past

connective + noun sg possessive-1sgconnective + noun sg possessive-1sg

Previous workPrevious work

POS and Morphological POS and Morphological disambiguationdisambiguation

jhjh

Three stagesThree stages

11 Word stage ndashWord stage ndash find the most probable find the most probable reading of a word regardless of its contextreading of a word regardless of its context

22 Pair stagePair stage ndashcorrect the analysis of a word ndashcorrect the analysis of a word based on the analysis of its immediate based on the analysis of its immediate neighbors neighbors

33 Sentence stage ndashSentence stage ndash use a syntactic parser to use a syntactic parser to rule out improbable analysesrule out improbable analyses

Combining all three stages yielded the best results

The Word StageThe Word Stage Give each word its most probable analysisGive each word its most probable analysis

How to estimate the probability of each How to estimate the probability of each analysisanalysis

Estimate the probability of each analysis Estimate the probability of each analysis from a large analyzed corpusfrom a large analyzed corpus

A large enough corpus does not existA large enough corpus does not exist

Since each word has many forms the Since each word has many forms the number of word tokens is so large that many number of word tokens is so large that many word forms wonrsquot appear even in 10M word word forms wonrsquot appear even in 10M word corpuscorpus

The Word StageThe Word Stage Following the ldquoSimilar Words MethodrdquoFollowing the ldquoSimilar Words Methodrdquo

(Levinger Ornan and Itai 1995) estimate the (Levinger Ornan and Itai 1995) estimate the probability of each analysis of an ambiguous word probability of each analysis of an ambiguous word by changing a (single) feature of each analysis and by changing a (single) feature of each analysis and comparing the occurrences of the resultant words in comparing the occurrences of the resultant words in a large corpusa large corpus

ExampleExample HQPH HQPH הקפההקפה the coffee definite to indefinite the coffee definite to indefinite QPHQPH encirclement indefinite to definite encirclement indefinite to definite HHQPHHHQPH her perimeter feminine possessive to masculine her perimeter feminine possessive to masculine

possessive possessive HQPWHQPW Distribution QPH=180 HHQFH=18 HQPW=2Distribution QPH=180 HHQFH=18 HQPW=2

Our variation of the SW methodOur variation of the SW method

To overcome sparseness we assumed To overcome sparseness we assumed that the lemma and the other that the lemma and the other morphemeslinguistic features are morphemeslinguistic features are statistically independentstatistically independentNamely Namely P(the coffee) = P(the)P(the coffee) = P(the)P(coffee)P(coffee)

Even though the assumption is not Even though the assumption is not valid the resultant ranking is correctvalid the resultant ranking is correct

Evaluation and ComplexityEvaluation and Complexity

Errors 36 Errors 36 145 145

Complexity of algorithm O(Complexity of algorithm O(nn) where ) where nn is the size of the corpusis the size of the corpus

Keeping a copy of the corpus as an Keeping a copy of the corpus as an inverse file reduces the complexity to inverse file reduces the complexity to linear in the number of different linear in the number of different similar words similar words

The pair stage The pair stage

Following Brill we learned correction Following Brill we learned correction rules from a corpus rules from a corpus

The initial The initial morphological scoremorphological score of an of an analysis is its probability as obtained analysis is its probability as obtained at the word stageat the word stage

Correction rules modify the scores by Correction rules modify the scores by considering pairs of adjacent words considering pairs of adjacent words checking if the rule applies and if so checking if the rule applies and if so modify the scoresmodify the scores

Example of a correction ruleExample of a correction rule

If If the POS of the current tag of w the POS of the current tag of w11 is a is a proper-noun proper-noun

andand the POS of the current tag of w the POS of the current tag of w22 is is a noun a noun

and and ww22 has an analysis as a verb that has an analysis as a verb that

matches wmatches w11 by gender and number by gender and number

thenthen add 05 to the morphological add 05 to the morphological score of wscore of w22 as a verb and normalize as a verb and normalize the scores the scores

ExampleExample

YWSP ampDR YWSP ampDR יוסףיוסף עדרעדר

YWSPYWSP == proper noun masc(Joseph)proper noun masc(Joseph)

ampDR =ampDR = noun masc sg abs indefnoun masc sg abs indef (herd) score=07 (herd) score=07

ampDR =ampDR = verb past 3sg masc verb past 3sg masc (hoed) score= (hoed) score=0303

08

0467

0533

normalization

Learning the Rules from a training Learning the Rules from a training corpuscorpus

Input A training corpus where each word is Input A training corpus where each word is correctly analyzedcorrectly analyzed

Run the word stage on the training corpusRun the word stage on the training corpus Generate all possible rulesGenerate all possible rules For each rule set the correction factor to be the For each rule set the correction factor to be the

minimum value that does more good than minimum value that does more good than damagedamage

Choose the rule that does the maximum benefitChoose the rule that does the maximum benefit Repeat until no rule improves the overall Repeat until no rule improves the overall

analyses of the training corpusanalyses of the training corpus

Evaluation and ComplexityEvaluation and Complexity

Training corpus 4892 word tokensTraining corpus 4892 word tokenslearned 93 ruleslearned 93 ruleserrors 145 errors 145 62 62

Complexity of the learning algorithm Complexity of the learning algorithm O(O(cc33) where ) where cc = size of the training = size of the training corpuscorpus

Complexity of the correction Complexity of the correction OO((rrnn) ) where where r = r = number of rulesnumber of rules n = n = size of trial textsize of trial text

The sentence stageThe sentence stage

Use a syntactic parser to rule out Use a syntactic parser to rule out improbable analysesimprobable analyses

The pair stage ndash adjacent words The pair stage ndash adjacent words the sentence stage ndash long term the sentence stage ndash long term dependencies dependencies

ExampleExample

מורה הכיתה הנמוכה נכנס לכיתהמורה הכיתה הנמוכה נכנס לכיתה MWRHMWRH HKITH HNMWKH NKNS LKITH HKITH HNMWKH NKNS LKITH

moremoramoremora ha-kitta ha-nmuka ha-kitta ha-nmuka niknasniknas hellip hellip

mascfem verb-mascmascfem verb-masc

more ha-kitta ha-nmuka niknasmore ha-kitta ha-nmuka niknas hellip hellip

Score of a syntax treeScore of a syntax tree

PREP NN J

S

NP VP

V COMPN COMP

more ha-kitta ha-nmuka niknas la-kitta

score(s) = score(more)score(ha-kitta) hellip score(la-kitta)

The challenge calculate the score of all syntax trees without enumerating all trees

Dynamic ProgrammingDynamic Programming

TableTable[[ijAijA] = the maximum score of all ] = the maximum score of all parsesparses

Fill table by incrasing values of Fill table by incrasing values of

i jA w w

max and im im im iTable i i A s A t G t T

[ ] max [ ] [ 1 ]A BC Gi k j

Table i j A Table i k B Table k j C

0

Time complexity 3O G n

0j i

EvaluationEvaluation

53

147

38

362120

14

Word Stage

Pair Stage

Sentence Stage

error rate

Search Google for ldquodogrdquoSearch Google for ldquodogrdquo DogpileDogpile Web Search Home Page Web Search Home Page

dogdogcomcom - a - a dogsdogs best friend best friend

Sausage Software - HotDog Web Editors - HotDog Sausage Software - HotDog Web Editors - HotDog Professional Professional Special Make certain you check out our Special Make certain you check out our DogDog Packs before making yourPacks before making yourpurchase purchase דפים דומיםדפים דומים

explodingdog explodingdog DogDog-Play Great activities you can do with your -Play Great activities you can do with your

dogdogActivities for Dogs Having Fun with Your Activities for Dogs Having Fun with Your DogDog

Search Google for ldquodogsrdquoSearch Google for ldquodogsrdquo I-Love-I-Love-DogsDogscomcom - Free Resources For Dog Lovers - Free Resources For Dog Lovers

including Free Dog Loversincluding Free Dog LoversI-Love-I-Love-DogsDogscom - Free Dog Resources For Dog com - Free Dog Resources For Dog Lovers includinghellipLovers includinghellip

DogsDogs in Canada in CanadaCanadas Top Obedience Canadas Top Obedience DogsDogs 2002 2002

Guide Guide DogsDogs for the Blind for the BlindGuide Guide DogsDogs for the Blind A nonprofit charitable for the Blind A nonprofit charitable organization (800) 295-4050organization (800) 295-4050

DogsDogs of the Dow - High dividend yield Dow stocks - www of the Dow - High dividend yield Dow stocks - www In addition to the high dividend yield stocks of the In addition to the high dividend yield stocks of the DogsDogs of the Dow you will also find a helpful set of of the Dow you will also find a helpful set of long-term stock market charts investment research long-term stock market charts investment research

Ambiguitymdashthe Main ProblemAmbiguitymdashthe Main Problem

Morphological Morphological בחורבחור LexicalLexical bankbank SyntacticalSyntactical

I saw the man with the telescopeI saw the man with the telescope SemanticSemantic hot dogs for the blindhot dogs for the blind Pragmatic Pragmatic Can you pass the saltCan you pass the salt

Methods for understanding Methods for understanding LanguageLanguage

The traditional methodThe traditional method

Develop a theory build toolshellipDevelop a theory build toolshellip

Corpus methodsCorpus methods

Hebrew Computational LinguisticsHebrew Computational Linguistics

Most research is in EnglishMost research is in English

Many other languages have developed toolsMany other languages have developed tools

Why not in HebrewWhy not in Hebrew

Special problems of HebrewSpecial problems of Hebrew

The writing systemThe writing systemTechnical problems (right to left) font Technical problems (right to left) font different writing systemsdifferent writing systemsno standardno standard

Complex morphologyComplex morphologyYou have left me = You have left me = עזבתניעזבתני 4 morphemes in one word4 morphemes in one word

AmbiguityAmbiguityהקפה = ה+קפה הקפה הקפההקפה = ה+קפה הקפה הקפה

בחור = ב+ה+חור בחורבחור = ב+ה+חור בחור some words can have upto 13 different some words can have upto 13 different

readings (readings (שמונהשמונה))

AchievementsAchievements

Constructed an analyzer to read Constructed an analyzer to read Hebrew text that is 96 correctHebrew text that is 96 correct

Constructing data base for Constructing data base for automatically constructing a automatically constructing a syntactic analyzer of Hebrewsyntactic analyzer of Hebrew

Search enginesSearch engines

סיכום וחזוןסיכום וחזון

נביא את המחקר בעברית למצבן של שפות נביא את המחקר בעברית למצבן של שפותאחרותאחרות

נלמד מעברית על שפות דומות נלמד מעברית על שפות דומותנפתח שיטות לטיפול בשפה כלשהינפתח שיטות לטיפול בשפה כלשהינפתח מנגנונים להבנת טכסטיםנפתח מנגנונים להבנת טכסטיםנבין טוב יותר כיצד המוח האנושי פועלנבין טוב יותר כיצד המוח האנושי פועל לעבור את מבחן טיורינג לעבור את מבחן טיורינג

The ProblemThe ProblemWritten Hebrew texts are ambiguous Written Hebrew texts are ambiguous The reasonsThe reasons

The vowels and gemination are omitted The vowels and gemination are omitted ה JפKה קו Jהקופה = קופ JפKה קו Jקופה = קופ

small words are prependedsmall words are prependedand when you will goand when you will go וכשתלך = ו + כש + תלך = וכשתלך = ו + כש + תלך =

Hebrew morphology is complexHebrew morphology is complex

The structure of a Hebrew wordThe structure of a Hebrew word

the lexical lemma the lexical lemma short words such as determiners short words such as determiners

prepositions and conjunctions prepositions and conjunctions prepended to the word prepended to the word

suffixes for possessives and object suffixes for possessives and object cliticsclitics

The linguistic features mark part-of-The linguistic features mark part-of-speech (POS) tense person etc speech (POS) tense person etc

morphemesmorphemes

linguistic featureslinguistic features

ExampleExample

$QMTI $QMTI שקמתישקמתי

$iqmati ndash$iqmati ndash י Pת Jמ Qק Pיש Pת Jמ Qק Pש my sycamore my sycamore

$e-qamti ndash$e-qamti ndash י Pת Qמ Rק Sיש Pת Qמ Rק Sש that I got up that I got up

$e-qammati ndash$e-qammati ndash י Pת Jמ Rק Sיש Pת Jמ Rק Sש that my hey that my hey

noun sg possessive-1sgnoun sg possessive-1sg

connective+verb 1sg pastconnective+verb 1sg past

connective + noun sg possessive-1sgconnective + noun sg possessive-1sg

Previous workPrevious work

POS and Morphological POS and Morphological disambiguationdisambiguation

jhjh

Three stagesThree stages

11 Word stage ndashWord stage ndash find the most probable find the most probable reading of a word regardless of its contextreading of a word regardless of its context

22 Pair stagePair stage ndashcorrect the analysis of a word ndashcorrect the analysis of a word based on the analysis of its immediate based on the analysis of its immediate neighbors neighbors

33 Sentence stage ndashSentence stage ndash use a syntactic parser to use a syntactic parser to rule out improbable analysesrule out improbable analyses

Combining all three stages yielded the best results

The Word StageThe Word Stage Give each word its most probable analysisGive each word its most probable analysis

How to estimate the probability of each How to estimate the probability of each analysisanalysis

Estimate the probability of each analysis Estimate the probability of each analysis from a large analyzed corpusfrom a large analyzed corpus

A large enough corpus does not existA large enough corpus does not exist

Since each word has many forms the Since each word has many forms the number of word tokens is so large that many number of word tokens is so large that many word forms wonrsquot appear even in 10M word word forms wonrsquot appear even in 10M word corpuscorpus

The Word StageThe Word Stage Following the ldquoSimilar Words MethodrdquoFollowing the ldquoSimilar Words Methodrdquo

(Levinger Ornan and Itai 1995) estimate the (Levinger Ornan and Itai 1995) estimate the probability of each analysis of an ambiguous word probability of each analysis of an ambiguous word by changing a (single) feature of each analysis and by changing a (single) feature of each analysis and comparing the occurrences of the resultant words in comparing the occurrences of the resultant words in a large corpusa large corpus

ExampleExample HQPH HQPH הקפההקפה the coffee definite to indefinite the coffee definite to indefinite QPHQPH encirclement indefinite to definite encirclement indefinite to definite HHQPHHHQPH her perimeter feminine possessive to masculine her perimeter feminine possessive to masculine

possessive possessive HQPWHQPW Distribution QPH=180 HHQFH=18 HQPW=2Distribution QPH=180 HHQFH=18 HQPW=2

Our variation of the SW methodOur variation of the SW method

To overcome sparseness we assumed To overcome sparseness we assumed that the lemma and the other that the lemma and the other morphemeslinguistic features are morphemeslinguistic features are statistically independentstatistically independentNamely Namely P(the coffee) = P(the)P(the coffee) = P(the)P(coffee)P(coffee)

Even though the assumption is not Even though the assumption is not valid the resultant ranking is correctvalid the resultant ranking is correct

Evaluation and ComplexityEvaluation and Complexity

Errors 36 Errors 36 145 145

Complexity of algorithm O(Complexity of algorithm O(nn) where ) where nn is the size of the corpusis the size of the corpus

Keeping a copy of the corpus as an Keeping a copy of the corpus as an inverse file reduces the complexity to inverse file reduces the complexity to linear in the number of different linear in the number of different similar words similar words

The pair stage The pair stage

Following Brill we learned correction Following Brill we learned correction rules from a corpus rules from a corpus

The initial The initial morphological scoremorphological score of an of an analysis is its probability as obtained analysis is its probability as obtained at the word stageat the word stage

Correction rules modify the scores by Correction rules modify the scores by considering pairs of adjacent words considering pairs of adjacent words checking if the rule applies and if so checking if the rule applies and if so modify the scoresmodify the scores

Example of a correction ruleExample of a correction rule

If If the POS of the current tag of w the POS of the current tag of w11 is a is a proper-noun proper-noun

andand the POS of the current tag of w the POS of the current tag of w22 is is a noun a noun

and and ww22 has an analysis as a verb that has an analysis as a verb that

matches wmatches w11 by gender and number by gender and number

thenthen add 05 to the morphological add 05 to the morphological score of wscore of w22 as a verb and normalize as a verb and normalize the scores the scores

ExampleExample

YWSP ampDR YWSP ampDR יוסףיוסף עדרעדר

YWSPYWSP == proper noun masc(Joseph)proper noun masc(Joseph)

ampDR =ampDR = noun masc sg abs indefnoun masc sg abs indef (herd) score=07 (herd) score=07

ampDR =ampDR = verb past 3sg masc verb past 3sg masc (hoed) score= (hoed) score=0303

08

0467

0533

normalization

Learning the Rules from a training Learning the Rules from a training corpuscorpus

Input A training corpus where each word is Input A training corpus where each word is correctly analyzedcorrectly analyzed

Run the word stage on the training corpusRun the word stage on the training corpus Generate all possible rulesGenerate all possible rules For each rule set the correction factor to be the For each rule set the correction factor to be the

minimum value that does more good than minimum value that does more good than damagedamage

Choose the rule that does the maximum benefitChoose the rule that does the maximum benefit Repeat until no rule improves the overall Repeat until no rule improves the overall

analyses of the training corpusanalyses of the training corpus

Evaluation and ComplexityEvaluation and Complexity

Training corpus 4892 word tokensTraining corpus 4892 word tokenslearned 93 ruleslearned 93 ruleserrors 145 errors 145 62 62

Complexity of the learning algorithm Complexity of the learning algorithm O(O(cc33) where ) where cc = size of the training = size of the training corpuscorpus

Complexity of the correction Complexity of the correction OO((rrnn) ) where where r = r = number of rulesnumber of rules n = n = size of trial textsize of trial text

The sentence stageThe sentence stage

Use a syntactic parser to rule out Use a syntactic parser to rule out improbable analysesimprobable analyses

The pair stage ndash adjacent words The pair stage ndash adjacent words the sentence stage ndash long term the sentence stage ndash long term dependencies dependencies

ExampleExample

מורה הכיתה הנמוכה נכנס לכיתהמורה הכיתה הנמוכה נכנס לכיתה MWRHMWRH HKITH HNMWKH NKNS LKITH HKITH HNMWKH NKNS LKITH

moremoramoremora ha-kitta ha-nmuka ha-kitta ha-nmuka niknasniknas hellip hellip

mascfem verb-mascmascfem verb-masc

more ha-kitta ha-nmuka niknasmore ha-kitta ha-nmuka niknas hellip hellip

Score of a syntax treeScore of a syntax tree

PREP NN J

S

NP VP

V COMPN COMP

more ha-kitta ha-nmuka niknas la-kitta

score(s) = score(more)score(ha-kitta) hellip score(la-kitta)

The challenge calculate the score of all syntax trees without enumerating all trees

Dynamic ProgrammingDynamic Programming

TableTable[[ijAijA] = the maximum score of all ] = the maximum score of all parsesparses

Fill table by incrasing values of Fill table by incrasing values of

i jA w w

max and im im im iTable i i A s A t G t T

[ ] max [ ] [ 1 ]A BC Gi k j

Table i j A Table i k B Table k j C

0

Time complexity 3O G n

0j i

EvaluationEvaluation

53

147

38

362120

14

Word Stage

Pair Stage

Sentence Stage

error rate

Search Google for ldquodogsrdquoSearch Google for ldquodogsrdquo I-Love-I-Love-DogsDogscomcom - Free Resources For Dog Lovers - Free Resources For Dog Lovers

including Free Dog Loversincluding Free Dog LoversI-Love-I-Love-DogsDogscom - Free Dog Resources For Dog com - Free Dog Resources For Dog Lovers includinghellipLovers includinghellip

DogsDogs in Canada in CanadaCanadas Top Obedience Canadas Top Obedience DogsDogs 2002 2002

Guide Guide DogsDogs for the Blind for the BlindGuide Guide DogsDogs for the Blind A nonprofit charitable for the Blind A nonprofit charitable organization (800) 295-4050organization (800) 295-4050

DogsDogs of the Dow - High dividend yield Dow stocks - www of the Dow - High dividend yield Dow stocks - www In addition to the high dividend yield stocks of the In addition to the high dividend yield stocks of the DogsDogs of the Dow you will also find a helpful set of of the Dow you will also find a helpful set of long-term stock market charts investment research long-term stock market charts investment research

Ambiguitymdashthe Main ProblemAmbiguitymdashthe Main Problem

Morphological Morphological בחורבחור LexicalLexical bankbank SyntacticalSyntactical

I saw the man with the telescopeI saw the man with the telescope SemanticSemantic hot dogs for the blindhot dogs for the blind Pragmatic Pragmatic Can you pass the saltCan you pass the salt

Methods for understanding Methods for understanding LanguageLanguage

The traditional methodThe traditional method

Develop a theory build toolshellipDevelop a theory build toolshellip

Corpus methodsCorpus methods

Hebrew Computational LinguisticsHebrew Computational Linguistics

Most research is in EnglishMost research is in English

Many other languages have developed toolsMany other languages have developed tools

Why not in HebrewWhy not in Hebrew

Special problems of HebrewSpecial problems of Hebrew

The writing systemThe writing systemTechnical problems (right to left) font Technical problems (right to left) font different writing systemsdifferent writing systemsno standardno standard

Complex morphologyComplex morphologyYou have left me = You have left me = עזבתניעזבתני 4 morphemes in one word4 morphemes in one word

AmbiguityAmbiguityהקפה = ה+קפה הקפה הקפההקפה = ה+קפה הקפה הקפה

בחור = ב+ה+חור בחורבחור = ב+ה+חור בחור some words can have upto 13 different some words can have upto 13 different

readings (readings (שמונהשמונה))

AchievementsAchievements

Constructed an analyzer to read Constructed an analyzer to read Hebrew text that is 96 correctHebrew text that is 96 correct

Constructing data base for Constructing data base for automatically constructing a automatically constructing a syntactic analyzer of Hebrewsyntactic analyzer of Hebrew

Search enginesSearch engines

סיכום וחזוןסיכום וחזון

נביא את המחקר בעברית למצבן של שפות נביא את המחקר בעברית למצבן של שפותאחרותאחרות

נלמד מעברית על שפות דומות נלמד מעברית על שפות דומותנפתח שיטות לטיפול בשפה כלשהינפתח שיטות לטיפול בשפה כלשהינפתח מנגנונים להבנת טכסטיםנפתח מנגנונים להבנת טכסטיםנבין טוב יותר כיצד המוח האנושי פועלנבין טוב יותר כיצד המוח האנושי פועל לעבור את מבחן טיורינג לעבור את מבחן טיורינג

The ProblemThe ProblemWritten Hebrew texts are ambiguous Written Hebrew texts are ambiguous The reasonsThe reasons

The vowels and gemination are omitted The vowels and gemination are omitted ה JפKה קו Jהקופה = קופ JפKה קו Jקופה = קופ

small words are prependedsmall words are prependedand when you will goand when you will go וכשתלך = ו + כש + תלך = וכשתלך = ו + כש + תלך =

Hebrew morphology is complexHebrew morphology is complex

The structure of a Hebrew wordThe structure of a Hebrew word

the lexical lemma the lexical lemma short words such as determiners short words such as determiners

prepositions and conjunctions prepositions and conjunctions prepended to the word prepended to the word

suffixes for possessives and object suffixes for possessives and object cliticsclitics

The linguistic features mark part-of-The linguistic features mark part-of-speech (POS) tense person etc speech (POS) tense person etc

morphemesmorphemes

linguistic featureslinguistic features

ExampleExample

$QMTI $QMTI שקמתישקמתי

$iqmati ndash$iqmati ndash י Pת Jמ Qק Pיש Pת Jמ Qק Pש my sycamore my sycamore

$e-qamti ndash$e-qamti ndash י Pת Qמ Rק Sיש Pת Qמ Rק Sש that I got up that I got up

$e-qammati ndash$e-qammati ndash י Pת Jמ Rק Sיש Pת Jמ Rק Sש that my hey that my hey

noun sg possessive-1sgnoun sg possessive-1sg

connective+verb 1sg pastconnective+verb 1sg past

connective + noun sg possessive-1sgconnective + noun sg possessive-1sg

Previous workPrevious work

POS and Morphological POS and Morphological disambiguationdisambiguation

jhjh

Three stagesThree stages

11 Word stage ndashWord stage ndash find the most probable find the most probable reading of a word regardless of its contextreading of a word regardless of its context

22 Pair stagePair stage ndashcorrect the analysis of a word ndashcorrect the analysis of a word based on the analysis of its immediate based on the analysis of its immediate neighbors neighbors

33 Sentence stage ndashSentence stage ndash use a syntactic parser to use a syntactic parser to rule out improbable analysesrule out improbable analyses

Combining all three stages yielded the best results

The Word StageThe Word Stage Give each word its most probable analysisGive each word its most probable analysis

How to estimate the probability of each How to estimate the probability of each analysisanalysis

Estimate the probability of each analysis Estimate the probability of each analysis from a large analyzed corpusfrom a large analyzed corpus

A large enough corpus does not existA large enough corpus does not exist

Since each word has many forms the Since each word has many forms the number of word tokens is so large that many number of word tokens is so large that many word forms wonrsquot appear even in 10M word word forms wonrsquot appear even in 10M word corpuscorpus

The Word StageThe Word Stage Following the ldquoSimilar Words MethodrdquoFollowing the ldquoSimilar Words Methodrdquo

(Levinger Ornan and Itai 1995) estimate the (Levinger Ornan and Itai 1995) estimate the probability of each analysis of an ambiguous word probability of each analysis of an ambiguous word by changing a (single) feature of each analysis and by changing a (single) feature of each analysis and comparing the occurrences of the resultant words in comparing the occurrences of the resultant words in a large corpusa large corpus

ExampleExample HQPH HQPH הקפההקפה the coffee definite to indefinite the coffee definite to indefinite QPHQPH encirclement indefinite to definite encirclement indefinite to definite HHQPHHHQPH her perimeter feminine possessive to masculine her perimeter feminine possessive to masculine

possessive possessive HQPWHQPW Distribution QPH=180 HHQFH=18 HQPW=2Distribution QPH=180 HHQFH=18 HQPW=2

Our variation of the SW methodOur variation of the SW method

To overcome sparseness we assumed To overcome sparseness we assumed that the lemma and the other that the lemma and the other morphemeslinguistic features are morphemeslinguistic features are statistically independentstatistically independentNamely Namely P(the coffee) = P(the)P(the coffee) = P(the)P(coffee)P(coffee)

Even though the assumption is not Even though the assumption is not valid the resultant ranking is correctvalid the resultant ranking is correct

Evaluation and ComplexityEvaluation and Complexity

Errors 36 Errors 36 145 145

Complexity of algorithm O(Complexity of algorithm O(nn) where ) where nn is the size of the corpusis the size of the corpus

Keeping a copy of the corpus as an Keeping a copy of the corpus as an inverse file reduces the complexity to inverse file reduces the complexity to linear in the number of different linear in the number of different similar words similar words

The pair stage The pair stage

Following Brill we learned correction Following Brill we learned correction rules from a corpus rules from a corpus

The initial The initial morphological scoremorphological score of an of an analysis is its probability as obtained analysis is its probability as obtained at the word stageat the word stage

Correction rules modify the scores by Correction rules modify the scores by considering pairs of adjacent words considering pairs of adjacent words checking if the rule applies and if so checking if the rule applies and if so modify the scoresmodify the scores

Example of a correction ruleExample of a correction rule

If If the POS of the current tag of w the POS of the current tag of w11 is a is a proper-noun proper-noun

andand the POS of the current tag of w the POS of the current tag of w22 is is a noun a noun

and and ww22 has an analysis as a verb that has an analysis as a verb that

matches wmatches w11 by gender and number by gender and number

thenthen add 05 to the morphological add 05 to the morphological score of wscore of w22 as a verb and normalize as a verb and normalize the scores the scores

ExampleExample

YWSP ampDR YWSP ampDR יוסףיוסף עדרעדר

YWSPYWSP == proper noun masc(Joseph)proper noun masc(Joseph)

ampDR =ampDR = noun masc sg abs indefnoun masc sg abs indef (herd) score=07 (herd) score=07

ampDR =ampDR = verb past 3sg masc verb past 3sg masc (hoed) score= (hoed) score=0303

08

0467

0533

normalization

Learning the Rules from a training Learning the Rules from a training corpuscorpus

Input A training corpus where each word is Input A training corpus where each word is correctly analyzedcorrectly analyzed

Run the word stage on the training corpusRun the word stage on the training corpus Generate all possible rulesGenerate all possible rules For each rule set the correction factor to be the For each rule set the correction factor to be the

minimum value that does more good than minimum value that does more good than damagedamage

Choose the rule that does the maximum benefitChoose the rule that does the maximum benefit Repeat until no rule improves the overall Repeat until no rule improves the overall

analyses of the training corpusanalyses of the training corpus

Evaluation and ComplexityEvaluation and Complexity

Training corpus 4892 word tokensTraining corpus 4892 word tokenslearned 93 ruleslearned 93 ruleserrors 145 errors 145 62 62

Complexity of the learning algorithm Complexity of the learning algorithm O(O(cc33) where ) where cc = size of the training = size of the training corpuscorpus

Complexity of the correction Complexity of the correction OO((rrnn) ) where where r = r = number of rulesnumber of rules n = n = size of trial textsize of trial text

The sentence stageThe sentence stage

Use a syntactic parser to rule out Use a syntactic parser to rule out improbable analysesimprobable analyses

The pair stage ndash adjacent words The pair stage ndash adjacent words the sentence stage ndash long term the sentence stage ndash long term dependencies dependencies

ExampleExample

מורה הכיתה הנמוכה נכנס לכיתהמורה הכיתה הנמוכה נכנס לכיתה MWRHMWRH HKITH HNMWKH NKNS LKITH HKITH HNMWKH NKNS LKITH

moremoramoremora ha-kitta ha-nmuka ha-kitta ha-nmuka niknasniknas hellip hellip

mascfem verb-mascmascfem verb-masc

more ha-kitta ha-nmuka niknasmore ha-kitta ha-nmuka niknas hellip hellip

Score of a syntax treeScore of a syntax tree

PREP NN J

S

NP VP

V COMPN COMP

more ha-kitta ha-nmuka niknas la-kitta

score(s) = score(more)score(ha-kitta) hellip score(la-kitta)

The challenge calculate the score of all syntax trees without enumerating all trees

Dynamic ProgrammingDynamic Programming

TableTable[[ijAijA] = the maximum score of all ] = the maximum score of all parsesparses

Fill table by incrasing values of Fill table by incrasing values of

i jA w w

max and im im im iTable i i A s A t G t T

[ ] max [ ] [ 1 ]A BC Gi k j

Table i j A Table i k B Table k j C

0

Time complexity 3O G n

0j i

EvaluationEvaluation

53

147

38

362120

14

Word Stage

Pair Stage

Sentence Stage

error rate

Ambiguitymdashthe Main ProblemAmbiguitymdashthe Main Problem

Morphological Morphological בחורבחור LexicalLexical bankbank SyntacticalSyntactical

I saw the man with the telescopeI saw the man with the telescope SemanticSemantic hot dogs for the blindhot dogs for the blind Pragmatic Pragmatic Can you pass the saltCan you pass the salt

Methods for understanding Methods for understanding LanguageLanguage

The traditional methodThe traditional method

Develop a theory build toolshellipDevelop a theory build toolshellip

Corpus methodsCorpus methods

Hebrew Computational LinguisticsHebrew Computational Linguistics

Most research is in EnglishMost research is in English

Many other languages have developed toolsMany other languages have developed tools

Why not in HebrewWhy not in Hebrew

Special problems of HebrewSpecial problems of Hebrew

The writing systemThe writing systemTechnical problems (right to left) font Technical problems (right to left) font different writing systemsdifferent writing systemsno standardno standard

Complex morphologyComplex morphologyYou have left me = You have left me = עזבתניעזבתני 4 morphemes in one word4 morphemes in one word

AmbiguityAmbiguityהקפה = ה+קפה הקפה הקפההקפה = ה+קפה הקפה הקפה

בחור = ב+ה+חור בחורבחור = ב+ה+חור בחור some words can have upto 13 different some words can have upto 13 different

readings (readings (שמונהשמונה))

AchievementsAchievements

Constructed an analyzer to read Constructed an analyzer to read Hebrew text that is 96 correctHebrew text that is 96 correct

Constructing data base for Constructing data base for automatically constructing a automatically constructing a syntactic analyzer of Hebrewsyntactic analyzer of Hebrew

Search enginesSearch engines

סיכום וחזוןסיכום וחזון

נביא את המחקר בעברית למצבן של שפות נביא את המחקר בעברית למצבן של שפותאחרותאחרות

נלמד מעברית על שפות דומות נלמד מעברית על שפות דומותנפתח שיטות לטיפול בשפה כלשהינפתח שיטות לטיפול בשפה כלשהינפתח מנגנונים להבנת טכסטיםנפתח מנגנונים להבנת טכסטיםנבין טוב יותר כיצד המוח האנושי פועלנבין טוב יותר כיצד המוח האנושי פועל לעבור את מבחן טיורינג לעבור את מבחן טיורינג

The ProblemThe ProblemWritten Hebrew texts are ambiguous Written Hebrew texts are ambiguous The reasonsThe reasons

The vowels and gemination are omitted The vowels and gemination are omitted ה JפKה קו Jהקופה = קופ JפKה קו Jקופה = קופ

small words are prependedsmall words are prependedand when you will goand when you will go וכשתלך = ו + כש + תלך = וכשתלך = ו + כש + תלך =

Hebrew morphology is complexHebrew morphology is complex

The structure of a Hebrew wordThe structure of a Hebrew word

the lexical lemma the lexical lemma short words such as determiners short words such as determiners

prepositions and conjunctions prepositions and conjunctions prepended to the word prepended to the word

suffixes for possessives and object suffixes for possessives and object cliticsclitics

The linguistic features mark part-of-The linguistic features mark part-of-speech (POS) tense person etc speech (POS) tense person etc

morphemesmorphemes

linguistic featureslinguistic features

ExampleExample

$QMTI $QMTI שקמתישקמתי

$iqmati ndash$iqmati ndash י Pת Jמ Qק Pיש Pת Jמ Qק Pש my sycamore my sycamore

$e-qamti ndash$e-qamti ndash י Pת Qמ Rק Sיש Pת Qמ Rק Sש that I got up that I got up

$e-qammati ndash$e-qammati ndash י Pת Jמ Rק Sיש Pת Jמ Rק Sש that my hey that my hey

noun sg possessive-1sgnoun sg possessive-1sg

connective+verb 1sg pastconnective+verb 1sg past

connective + noun sg possessive-1sgconnective + noun sg possessive-1sg

Previous workPrevious work

POS and Morphological POS and Morphological disambiguationdisambiguation

jhjh

Three stagesThree stages

11 Word stage ndashWord stage ndash find the most probable find the most probable reading of a word regardless of its contextreading of a word regardless of its context

22 Pair stagePair stage ndashcorrect the analysis of a word ndashcorrect the analysis of a word based on the analysis of its immediate based on the analysis of its immediate neighbors neighbors

33 Sentence stage ndashSentence stage ndash use a syntactic parser to use a syntactic parser to rule out improbable analysesrule out improbable analyses

Combining all three stages yielded the best results

The Word StageThe Word Stage Give each word its most probable analysisGive each word its most probable analysis

How to estimate the probability of each How to estimate the probability of each analysisanalysis

Estimate the probability of each analysis Estimate the probability of each analysis from a large analyzed corpusfrom a large analyzed corpus

A large enough corpus does not existA large enough corpus does not exist

Since each word has many forms the Since each word has many forms the number of word tokens is so large that many number of word tokens is so large that many word forms wonrsquot appear even in 10M word word forms wonrsquot appear even in 10M word corpuscorpus

The Word Stage: Following the "Similar Words" method

(Levinger, Ornan and Itai 1995): estimate the probability of each analysis of an ambiguous word by changing a (single) feature of each analysis and comparing the occurrences of the resultant words in a large corpus.

Example: HQPH הקפה
the coffee: definite to indefinite – QPH
encirclement: indefinite to definite – HHQPH
her perimeter: feminine possessive to masculine possessive – HQPW

Distribution: QPH=180, HHQPH=18, HQPW=2
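A minimal sketch of this counting scheme, assuming the feature-flipped variants have already been generated; the counts are the slide's. Storing the corpus as an inverse file (surface form → frequency) makes each lookup cheap:

```python
from collections import Counter

# Inverse file: surface form -> corpus frequency (values from the slide).
corpus_counts = Counter({"QPH": 180, "HHQPH": 18, "HQPW": 2})

# Each analysis of HQPH, paired with the word obtained by flipping one feature.
similar_words = {
    "the coffee": "QPH",        # definite -> indefinite
    "encirclement": "HHQPH",    # indefinite -> definite
    "her perimeter": "HQPW",    # feminine -> masculine possessive
}

def similar_word_scores(similar_words, counts):
    freqs = {analysis: counts[w] for analysis, w in similar_words.items()}
    total = sum(freqs.values())
    return {analysis: f / total for analysis, f in freqs.items()}

print(similar_word_scores(similar_words, corpus_counts))
# {'the coffee': 0.9, 'encirclement': 0.09, 'her perimeter': 0.01}
```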

Our variation of the SW method

To overcome sparseness, we assumed that the lemma and the other morphemes/linguistic features are statistically independent. Namely: P(the coffee) = P(the) · P(coffee)

Even though the assumption is not valid, the resultant ranking is correct.
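A sketch of that factorization with made-up morpheme frequencies; only the product, never the sparse joint count, is needed to rank the analyses:

```python
# Hypothetical per-morpheme probabilities, each estimable from a modest corpus.
p_prefix = {"H": 0.30, "": 0.55}          # "the" prefix vs. no prefix
p_lemma = {"QPH": 0.002, "HQPH": 0.0004}  # coffee vs. encirclement

def factored_score(prefix, lemma):
    # Independence assumption: P(prefix, lemma) ~= P(prefix) * P(lemma).
    return p_prefix[prefix] * p_lemma[lemma]

candidates = [("H", "QPH"), ("", "HQPH")]  # "the coffee" vs. "encirclement"
best = max(candidates, key=lambda c: factored_score(*c))
```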

Evaluation and Complexity

Errors: 36% → 14.5%

Complexity of the algorithm: O(n), where n is the size of the corpus.

Keeping a copy of the corpus as an inverse file reduces the complexity to linear in the number of different similar words.

The pair stage

Following Brill, we learned correction rules from a corpus.

The initial morphological score of an analysis is its probability as obtained at the word stage.

Correction rules modify the scores: for each pair of adjacent words, check whether the rule applies and, if so, modify the scores.

Example of a correction rule

If the POS of the current tag of w1 is a proper noun,
and the POS of the current tag of w2 is a noun,
and w2 has an analysis as a verb that matches w1 by gender and number,
then add 0.5 to the morphological score of w2 as a verb and normalize the scores.

Example

YWSP &DR יוסף עדר

YWSP = proper noun masc (Joseph)
&DR = noun masc sg abs indef (herd), score = 0.7
&DR = verb past 3sg masc (hoed), score = 0.3

Applying the rule raises the verb score to 0.3 + 0.5 = 0.8; after normalization, score(noun) = 0.7/1.5 ≈ 0.467 and score(verb) = 0.8/1.5 ≈ 0.533.
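A sketch of this rule application and normalization (the record layout is invented; the numbers reproduce the worked example):

```python
def apply_proper_noun_verb_rule(w1_tag, w2_analyses, bonus=0.5):
    """If w1 is tagged proper-noun and w2 is currently tagged noun but also
    has a gender/number-matching verb analysis, boost the verb, then renormalize."""
    current = max(w2_analyses, key=lambda a: a["score"])
    if w1_tag["pos"] == "proper-noun" and current["pos"] == "noun":
        for a in w2_analyses:
            if (a["pos"] == "verb" and a["gender"] == w1_tag["gender"]
                    and a["number"] == w1_tag["number"]):
                a["score"] += bonus
    total = sum(a["score"] for a in w2_analyses)
    for a in w2_analyses:
        a["score"] /= total  # the normalization step from the slide
    return w2_analyses

ywsp = {"pos": "proper-noun", "gender": "masc", "number": "sg"}
adr = [
    {"pos": "noun", "gender": "masc", "number": "sg", "score": 0.7},
    {"pos": "verb", "gender": "masc", "number": "sg", "score": 0.3},
]
print(apply_proper_noun_verb_rule(ywsp, adr))  # noun ≈ 0.467, verb ≈ 0.533
```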

Learning the rules from a training corpus

Input: a training corpus where each word is correctly analyzed.

Run the word stage on the training corpus.
Generate all possible rules.
For each rule, set the correction factor to be the minimum value that does more good than damage.
Choose the rule that yields the maximum benefit.
Repeat until no rule improves the overall analysis of the training corpus (a sketch of this greedy loop follows).
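A sketch of that greedy loop, assuming helpers generate_rules and errors_after (the count of tagging errors on the corpus after applying a rule sequence); neither helper is specified in the slides:

```python
def learn_rules(corpus, generate_rules, errors_after):
    learned = []
    best_err = errors_after(corpus, learned)
    while True:
        # Pick the candidate rule that removes the most remaining errors.
        candidates = generate_rules(corpus)
        rule = min(candidates, key=lambda r: errors_after(corpus, learned + [r]))
        err = errors_after(corpus, learned + [rule])
        if err >= best_err:   # no rule improves the corpus: stop
            return learned
        learned.append(rule)
        best_err = err
```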

Evaluation and Complexity

Training corpus: 4892 word tokens; 93 rules learned; errors: 14.5% → 6.2%

Complexity of the learning algorithm: O(c³), where c = size of the training corpus.

Complexity of the correction: O(r·n), where r = number of rules and n = size of the trial text.

The sentence stage

Use a syntactic parser to rule out improbable analyses.

The pair stage handles adjacent words; the sentence stage handles long-distance dependencies.

Example: מורה הכיתה הנמוכה נכנס לכיתה
MWRH HKITH HNMWKH NKNS LKITH
more/mora (masc/fem) ha-kitta ha-nmuka niknas (verb, masc) …
Agreement with the masculine verb selects: more ha-kitta ha-nmuka niknas …

Score of a syntax tree

[Parse tree: S → NP VP; the NP spans "more ha-kitta ha-nmuka", the VP spans the verb "niknas" and its complement "la-kitta".]

score(s) = score(more) · score(ha-kitta) · … · score(la-kitta)

The challenge: calculate the score of all syntax trees without enumerating all trees.

Dynamic Programming

Table[i,j,A] = the maximum score of all parses of w_i … w_j as category A

Fill the table by increasing values of j − i.

Base case: Table[i,i,A] = max { score_i(t) : A → t ∈ G, t an analysis of w_i }

Recurrence: Table[i,j,A] = max { Table[i,k,B] · Table[k+1,j,C] : A → B C ∈ G, i ≤ k < j }

Time complexity: O(|G| · n³)
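A compact sketch of this computation on a toy CNF grammar (the grammar and scores are invented; rule probabilities are taken as 1, so a tree's score is the product of its word-analysis scores, matching score(s) above):

```python
# Toy grammar in Chomsky normal form: (B, C) -> A means rule A -> B C.
binary_rules = {("NP", "VP"): "S", ("N", "J"): "NP"}

# Per-word analysis scores from the earlier stages: one dict per word.
word_scores = [{"N": 0.9}, {"J": 0.8}, {"VP": 0.7}]

def viterbi_cky(word_scores, binary_rules):
    n = len(word_scores)
    table = {}  # table[(i, j, A)] = max score of any parse of words i..j as A
    for i, scores in enumerate(word_scores):      # base case: length-1 spans
        for cat, s in scores.items():
            table[(i, i, cat)] = s
    for span in range(1, n):                      # fill by increasing j - i
        for i in range(n - span):
            j = i + span
            for k in range(i, j):                 # split point
                for (b, c), a in binary_rules.items():
                    score = (table.get((i, k, b), 0.0)
                             * table.get((k + 1, j, c), 0.0))
                    if score > table.get((i, j, a), 0.0):
                        table[(i, j, a)] = score
    return table.get((0, n - 1, "S"), 0.0)

print(viterbi_cky(word_scores, binary_rules))     # 0.9 * 0.8 * 0.7 = 0.504
```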

Evaluation

[Chart: error rate after each stage — Word Stage, Pair Stage, Sentence Stage.]

Methods for understanding Methods for understanding LanguageLanguage

The traditional methodThe traditional method

Develop a theory build toolshellipDevelop a theory build toolshellip

Corpus methodsCorpus methods

Hebrew Computational LinguisticsHebrew Computational Linguistics

Most research is in EnglishMost research is in English

Many other languages have developed toolsMany other languages have developed tools

Why not in HebrewWhy not in Hebrew

Special problems of HebrewSpecial problems of Hebrew

The writing systemThe writing systemTechnical problems (right to left) font Technical problems (right to left) font different writing systemsdifferent writing systemsno standardno standard

Complex morphologyComplex morphologyYou have left me = You have left me = עזבתניעזבתני 4 morphemes in one word4 morphemes in one word

AmbiguityAmbiguityהקפה = ה+קפה הקפה הקפההקפה = ה+קפה הקפה הקפה

בחור = ב+ה+חור בחורבחור = ב+ה+חור בחור some words can have upto 13 different some words can have upto 13 different

readings (readings (שמונהשמונה))

AchievementsAchievements

Constructed an analyzer to read Constructed an analyzer to read Hebrew text that is 96 correctHebrew text that is 96 correct

Constructing data base for Constructing data base for automatically constructing a automatically constructing a syntactic analyzer of Hebrewsyntactic analyzer of Hebrew

Search enginesSearch engines

סיכום וחזוןסיכום וחזון

נביא את המחקר בעברית למצבן של שפות נביא את המחקר בעברית למצבן של שפותאחרותאחרות

נלמד מעברית על שפות דומות נלמד מעברית על שפות דומותנפתח שיטות לטיפול בשפה כלשהינפתח שיטות לטיפול בשפה כלשהינפתח מנגנונים להבנת טכסטיםנפתח מנגנונים להבנת טכסטיםנבין טוב יותר כיצד המוח האנושי פועלנבין טוב יותר כיצד המוח האנושי פועל לעבור את מבחן טיורינג לעבור את מבחן טיורינג

The ProblemThe ProblemWritten Hebrew texts are ambiguous Written Hebrew texts are ambiguous The reasonsThe reasons

The vowels and gemination are omitted The vowels and gemination are omitted ה JפKה קו Jהקופה = קופ JפKה קו Jקופה = קופ

small words are prependedsmall words are prependedand when you will goand when you will go וכשתלך = ו + כש + תלך = וכשתלך = ו + כש + תלך =

Hebrew morphology is complexHebrew morphology is complex

The structure of a Hebrew wordThe structure of a Hebrew word

the lexical lemma the lexical lemma short words such as determiners short words such as determiners

prepositions and conjunctions prepositions and conjunctions prepended to the word prepended to the word

suffixes for possessives and object suffixes for possessives and object cliticsclitics

The linguistic features mark part-of-The linguistic features mark part-of-speech (POS) tense person etc speech (POS) tense person etc

morphemesmorphemes

linguistic featureslinguistic features

ExampleExample

$QMTI $QMTI שקמתישקמתי

$iqmati ndash$iqmati ndash י Pת Jמ Qק Pיש Pת Jמ Qק Pש my sycamore my sycamore

$e-qamti ndash$e-qamti ndash י Pת Qמ Rק Sיש Pת Qמ Rק Sש that I got up that I got up

$e-qammati ndash$e-qammati ndash י Pת Jמ Rק Sיש Pת Jמ Rק Sש that my hey that my hey

noun sg possessive-1sgnoun sg possessive-1sg

connective+verb 1sg pastconnective+verb 1sg past

connective + noun sg possessive-1sgconnective + noun sg possessive-1sg

Previous workPrevious work

POS and Morphological POS and Morphological disambiguationdisambiguation

jhjh

Three stagesThree stages

11 Word stage ndashWord stage ndash find the most probable find the most probable reading of a word regardless of its contextreading of a word regardless of its context

22 Pair stagePair stage ndashcorrect the analysis of a word ndashcorrect the analysis of a word based on the analysis of its immediate based on the analysis of its immediate neighbors neighbors

33 Sentence stage ndashSentence stage ndash use a syntactic parser to use a syntactic parser to rule out improbable analysesrule out improbable analyses

Combining all three stages yielded the best results

The Word StageThe Word Stage Give each word its most probable analysisGive each word its most probable analysis

How to estimate the probability of each How to estimate the probability of each analysisanalysis

Estimate the probability of each analysis Estimate the probability of each analysis from a large analyzed corpusfrom a large analyzed corpus

A large enough corpus does not existA large enough corpus does not exist

Since each word has many forms the Since each word has many forms the number of word tokens is so large that many number of word tokens is so large that many word forms wonrsquot appear even in 10M word word forms wonrsquot appear even in 10M word corpuscorpus

The Word StageThe Word Stage Following the ldquoSimilar Words MethodrdquoFollowing the ldquoSimilar Words Methodrdquo

(Levinger Ornan and Itai 1995) estimate the (Levinger Ornan and Itai 1995) estimate the probability of each analysis of an ambiguous word probability of each analysis of an ambiguous word by changing a (single) feature of each analysis and by changing a (single) feature of each analysis and comparing the occurrences of the resultant words in comparing the occurrences of the resultant words in a large corpusa large corpus

ExampleExample HQPH HQPH הקפההקפה the coffee definite to indefinite the coffee definite to indefinite QPHQPH encirclement indefinite to definite encirclement indefinite to definite HHQPHHHQPH her perimeter feminine possessive to masculine her perimeter feminine possessive to masculine

possessive possessive HQPWHQPW Distribution QPH=180 HHQFH=18 HQPW=2Distribution QPH=180 HHQFH=18 HQPW=2

Our variation of the SW methodOur variation of the SW method

To overcome sparseness we assumed To overcome sparseness we assumed that the lemma and the other that the lemma and the other morphemeslinguistic features are morphemeslinguistic features are statistically independentstatistically independentNamely Namely P(the coffee) = P(the)P(the coffee) = P(the)P(coffee)P(coffee)

Even though the assumption is not Even though the assumption is not valid the resultant ranking is correctvalid the resultant ranking is correct

Evaluation and ComplexityEvaluation and Complexity

Errors 36 Errors 36 145 145

Complexity of algorithm O(Complexity of algorithm O(nn) where ) where nn is the size of the corpusis the size of the corpus

Keeping a copy of the corpus as an Keeping a copy of the corpus as an inverse file reduces the complexity to inverse file reduces the complexity to linear in the number of different linear in the number of different similar words similar words

The pair stage The pair stage

Following Brill we learned correction Following Brill we learned correction rules from a corpus rules from a corpus

The initial The initial morphological scoremorphological score of an of an analysis is its probability as obtained analysis is its probability as obtained at the word stageat the word stage

Correction rules modify the scores by Correction rules modify the scores by considering pairs of adjacent words considering pairs of adjacent words checking if the rule applies and if so checking if the rule applies and if so modify the scoresmodify the scores

Example of a correction ruleExample of a correction rule

If If the POS of the current tag of w the POS of the current tag of w11 is a is a proper-noun proper-noun

andand the POS of the current tag of w the POS of the current tag of w22 is is a noun a noun

and and ww22 has an analysis as a verb that has an analysis as a verb that

matches wmatches w11 by gender and number by gender and number

thenthen add 05 to the morphological add 05 to the morphological score of wscore of w22 as a verb and normalize as a verb and normalize the scores the scores

ExampleExample

YWSP ampDR YWSP ampDR יוסףיוסף עדרעדר

YWSPYWSP == proper noun masc(Joseph)proper noun masc(Joseph)

ampDR =ampDR = noun masc sg abs indefnoun masc sg abs indef (herd) score=07 (herd) score=07

ampDR =ampDR = verb past 3sg masc verb past 3sg masc (hoed) score= (hoed) score=0303

08

0467

0533

normalization

Learning the Rules from a training Learning the Rules from a training corpuscorpus

Input A training corpus where each word is Input A training corpus where each word is correctly analyzedcorrectly analyzed

Run the word stage on the training corpusRun the word stage on the training corpus Generate all possible rulesGenerate all possible rules For each rule set the correction factor to be the For each rule set the correction factor to be the

minimum value that does more good than minimum value that does more good than damagedamage

Choose the rule that does the maximum benefitChoose the rule that does the maximum benefit Repeat until no rule improves the overall Repeat until no rule improves the overall

analyses of the training corpusanalyses of the training corpus

Evaluation and ComplexityEvaluation and Complexity

Training corpus 4892 word tokensTraining corpus 4892 word tokenslearned 93 ruleslearned 93 ruleserrors 145 errors 145 62 62

Complexity of the learning algorithm Complexity of the learning algorithm O(O(cc33) where ) where cc = size of the training = size of the training corpuscorpus

Complexity of the correction Complexity of the correction OO((rrnn) ) where where r = r = number of rulesnumber of rules n = n = size of trial textsize of trial text

The sentence stageThe sentence stage

Use a syntactic parser to rule out Use a syntactic parser to rule out improbable analysesimprobable analyses

The pair stage ndash adjacent words The pair stage ndash adjacent words the sentence stage ndash long term the sentence stage ndash long term dependencies dependencies

ExampleExample

מורה הכיתה הנמוכה נכנס לכיתהמורה הכיתה הנמוכה נכנס לכיתה MWRHMWRH HKITH HNMWKH NKNS LKITH HKITH HNMWKH NKNS LKITH

moremoramoremora ha-kitta ha-nmuka ha-kitta ha-nmuka niknasniknas hellip hellip

mascfem verb-mascmascfem verb-masc

more ha-kitta ha-nmuka niknasmore ha-kitta ha-nmuka niknas hellip hellip

Score of a syntax treeScore of a syntax tree

PREP NN J

S

NP VP

V COMPN COMP

more ha-kitta ha-nmuka niknas la-kitta

score(s) = score(more)score(ha-kitta) hellip score(la-kitta)

The challenge calculate the score of all syntax trees without enumerating all trees

Dynamic ProgrammingDynamic Programming

TableTable[[ijAijA] = the maximum score of all ] = the maximum score of all parsesparses

Fill table by incrasing values of Fill table by incrasing values of

i jA w w

max and im im im iTable i i A s A t G t T

[ ] max [ ] [ 1 ]A BC Gi k j

Table i j A Table i k B Table k j C

0

Time complexity 3O G n

0j i

EvaluationEvaluation

53

147

38

362120

14

Word Stage

Pair Stage

Sentence Stage

error rate

Hebrew Computational LinguisticsHebrew Computational Linguistics

Most research is in EnglishMost research is in English

Many other languages have developed toolsMany other languages have developed tools

Why not in HebrewWhy not in Hebrew

Special problems of HebrewSpecial problems of Hebrew

The writing systemThe writing systemTechnical problems (right to left) font Technical problems (right to left) font different writing systemsdifferent writing systemsno standardno standard

Complex morphologyComplex morphologyYou have left me = You have left me = עזבתניעזבתני 4 morphemes in one word4 morphemes in one word

AmbiguityAmbiguityהקפה = ה+קפה הקפה הקפההקפה = ה+קפה הקפה הקפה

בחור = ב+ה+חור בחורבחור = ב+ה+חור בחור some words can have upto 13 different some words can have upto 13 different

readings (readings (שמונהשמונה))

AchievementsAchievements

Constructed an analyzer to read Constructed an analyzer to read Hebrew text that is 96 correctHebrew text that is 96 correct

Constructing data base for Constructing data base for automatically constructing a automatically constructing a syntactic analyzer of Hebrewsyntactic analyzer of Hebrew

Search enginesSearch engines

סיכום וחזוןסיכום וחזון

נביא את המחקר בעברית למצבן של שפות נביא את המחקר בעברית למצבן של שפותאחרותאחרות

נלמד מעברית על שפות דומות נלמד מעברית על שפות דומותנפתח שיטות לטיפול בשפה כלשהינפתח שיטות לטיפול בשפה כלשהינפתח מנגנונים להבנת טכסטיםנפתח מנגנונים להבנת טכסטיםנבין טוב יותר כיצד המוח האנושי פועלנבין טוב יותר כיצד המוח האנושי פועל לעבור את מבחן טיורינג לעבור את מבחן טיורינג

The ProblemThe ProblemWritten Hebrew texts are ambiguous Written Hebrew texts are ambiguous The reasonsThe reasons

The vowels and gemination are omitted The vowels and gemination are omitted ה JפKה קו Jהקופה = קופ JפKה קו Jקופה = קופ

small words are prependedsmall words are prependedand when you will goand when you will go וכשתלך = ו + כש + תלך = וכשתלך = ו + כש + תלך =

Hebrew morphology is complexHebrew morphology is complex

The structure of a Hebrew wordThe structure of a Hebrew word

the lexical lemma the lexical lemma short words such as determiners short words such as determiners

prepositions and conjunctions prepositions and conjunctions prepended to the word prepended to the word

suffixes for possessives and object suffixes for possessives and object cliticsclitics

The linguistic features mark part-of-The linguistic features mark part-of-speech (POS) tense person etc speech (POS) tense person etc

morphemesmorphemes

linguistic featureslinguistic features

ExampleExample

$QMTI $QMTI שקמתישקמתי

$iqmati ndash$iqmati ndash י Pת Jמ Qק Pיש Pת Jמ Qק Pש my sycamore my sycamore

$e-qamti ndash$e-qamti ndash י Pת Qמ Rק Sיש Pת Qמ Rק Sש that I got up that I got up

$e-qammati ndash$e-qammati ndash י Pת Jמ Rק Sיש Pת Jמ Rק Sש that my hey that my hey

noun sg possessive-1sgnoun sg possessive-1sg

connective+verb 1sg pastconnective+verb 1sg past

connective + noun sg possessive-1sgconnective + noun sg possessive-1sg

Previous workPrevious work

POS and Morphological POS and Morphological disambiguationdisambiguation

jhjh

Three stagesThree stages

11 Word stage ndashWord stage ndash find the most probable find the most probable reading of a word regardless of its contextreading of a word regardless of its context

22 Pair stagePair stage ndashcorrect the analysis of a word ndashcorrect the analysis of a word based on the analysis of its immediate based on the analysis of its immediate neighbors neighbors

33 Sentence stage ndashSentence stage ndash use a syntactic parser to use a syntactic parser to rule out improbable analysesrule out improbable analyses

Combining all three stages yielded the best results

The Word StageThe Word Stage Give each word its most probable analysisGive each word its most probable analysis

How to estimate the probability of each How to estimate the probability of each analysisanalysis

Estimate the probability of each analysis Estimate the probability of each analysis from a large analyzed corpusfrom a large analyzed corpus

A large enough corpus does not existA large enough corpus does not exist

Since each word has many forms the Since each word has many forms the number of word tokens is so large that many number of word tokens is so large that many word forms wonrsquot appear even in 10M word word forms wonrsquot appear even in 10M word corpuscorpus

The Word StageThe Word Stage Following the ldquoSimilar Words MethodrdquoFollowing the ldquoSimilar Words Methodrdquo

(Levinger Ornan and Itai 1995) estimate the (Levinger Ornan and Itai 1995) estimate the probability of each analysis of an ambiguous word probability of each analysis of an ambiguous word by changing a (single) feature of each analysis and by changing a (single) feature of each analysis and comparing the occurrences of the resultant words in comparing the occurrences of the resultant words in a large corpusa large corpus

ExampleExample HQPH HQPH הקפההקפה the coffee definite to indefinite the coffee definite to indefinite QPHQPH encirclement indefinite to definite encirclement indefinite to definite HHQPHHHQPH her perimeter feminine possessive to masculine her perimeter feminine possessive to masculine

possessive possessive HQPWHQPW Distribution QPH=180 HHQFH=18 HQPW=2Distribution QPH=180 HHQFH=18 HQPW=2

Our variation of the SW methodOur variation of the SW method

To overcome sparseness we assumed To overcome sparseness we assumed that the lemma and the other that the lemma and the other morphemeslinguistic features are morphemeslinguistic features are statistically independentstatistically independentNamely Namely P(the coffee) = P(the)P(the coffee) = P(the)P(coffee)P(coffee)

Even though the assumption is not Even though the assumption is not valid the resultant ranking is correctvalid the resultant ranking is correct

Evaluation and ComplexityEvaluation and Complexity

Errors 36 Errors 36 145 145

Complexity of algorithm O(Complexity of algorithm O(nn) where ) where nn is the size of the corpusis the size of the corpus

Keeping a copy of the corpus as an Keeping a copy of the corpus as an inverse file reduces the complexity to inverse file reduces the complexity to linear in the number of different linear in the number of different similar words similar words

The pair stage The pair stage

Following Brill we learned correction Following Brill we learned correction rules from a corpus rules from a corpus

The initial The initial morphological scoremorphological score of an of an analysis is its probability as obtained analysis is its probability as obtained at the word stageat the word stage

Correction rules modify the scores by Correction rules modify the scores by considering pairs of adjacent words considering pairs of adjacent words checking if the rule applies and if so checking if the rule applies and if so modify the scoresmodify the scores

Example of a correction ruleExample of a correction rule

If If the POS of the current tag of w the POS of the current tag of w11 is a is a proper-noun proper-noun

andand the POS of the current tag of w the POS of the current tag of w22 is is a noun a noun

and and ww22 has an analysis as a verb that has an analysis as a verb that

matches wmatches w11 by gender and number by gender and number

thenthen add 05 to the morphological add 05 to the morphological score of wscore of w22 as a verb and normalize as a verb and normalize the scores the scores

ExampleExample

YWSP ampDR YWSP ampDR יוסףיוסף עדרעדר

YWSPYWSP == proper noun masc(Joseph)proper noun masc(Joseph)

ampDR =ampDR = noun masc sg abs indefnoun masc sg abs indef (herd) score=07 (herd) score=07

ampDR =ampDR = verb past 3sg masc verb past 3sg masc (hoed) score= (hoed) score=0303

08

0467

0533

normalization

Learning the Rules from a training Learning the Rules from a training corpuscorpus

Input A training corpus where each word is Input A training corpus where each word is correctly analyzedcorrectly analyzed

Run the word stage on the training corpusRun the word stage on the training corpus Generate all possible rulesGenerate all possible rules For each rule set the correction factor to be the For each rule set the correction factor to be the

minimum value that does more good than minimum value that does more good than damagedamage

Choose the rule that does the maximum benefitChoose the rule that does the maximum benefit Repeat until no rule improves the overall Repeat until no rule improves the overall

analyses of the training corpusanalyses of the training corpus

Evaluation and ComplexityEvaluation and Complexity

Training corpus 4892 word tokensTraining corpus 4892 word tokenslearned 93 ruleslearned 93 ruleserrors 145 errors 145 62 62

Complexity of the learning algorithm Complexity of the learning algorithm O(O(cc33) where ) where cc = size of the training = size of the training corpuscorpus

Complexity of the correction Complexity of the correction OO((rrnn) ) where where r = r = number of rulesnumber of rules n = n = size of trial textsize of trial text

The sentence stageThe sentence stage

Use a syntactic parser to rule out Use a syntactic parser to rule out improbable analysesimprobable analyses

The pair stage ndash adjacent words The pair stage ndash adjacent words the sentence stage ndash long term the sentence stage ndash long term dependencies dependencies

ExampleExample

מורה הכיתה הנמוכה נכנס לכיתהמורה הכיתה הנמוכה נכנס לכיתה MWRHMWRH HKITH HNMWKH NKNS LKITH HKITH HNMWKH NKNS LKITH

moremoramoremora ha-kitta ha-nmuka ha-kitta ha-nmuka niknasniknas hellip hellip

mascfem verb-mascmascfem verb-masc

more ha-kitta ha-nmuka niknasmore ha-kitta ha-nmuka niknas hellip hellip

Score of a syntax treeScore of a syntax tree

PREP NN J

S

NP VP

V COMPN COMP

more ha-kitta ha-nmuka niknas la-kitta

score(s) = score(more)score(ha-kitta) hellip score(la-kitta)

The challenge calculate the score of all syntax trees without enumerating all trees

Dynamic ProgrammingDynamic Programming

TableTable[[ijAijA] = the maximum score of all ] = the maximum score of all parsesparses

Fill table by incrasing values of Fill table by incrasing values of

i jA w w

max and im im im iTable i i A s A t G t T

[ ] max [ ] [ 1 ]A BC Gi k j

Table i j A Table i k B Table k j C

0

Time complexity 3O G n

0j i

EvaluationEvaluation

53

147

38

362120

14

Word Stage

Pair Stage

Sentence Stage

error rate

Special problems of HebrewSpecial problems of Hebrew

The writing systemThe writing systemTechnical problems (right to left) font Technical problems (right to left) font different writing systemsdifferent writing systemsno standardno standard

Complex morphologyComplex morphologyYou have left me = You have left me = עזבתניעזבתני 4 morphemes in one word4 morphemes in one word

AmbiguityAmbiguityהקפה = ה+קפה הקפה הקפההקפה = ה+קפה הקפה הקפה

בחור = ב+ה+חור בחורבחור = ב+ה+חור בחור some words can have upto 13 different some words can have upto 13 different

readings (readings (שמונהשמונה))

AchievementsAchievements

Constructed an analyzer to read Constructed an analyzer to read Hebrew text that is 96 correctHebrew text that is 96 correct

Constructing data base for Constructing data base for automatically constructing a automatically constructing a syntactic analyzer of Hebrewsyntactic analyzer of Hebrew

Search enginesSearch engines

סיכום וחזוןסיכום וחזון

נביא את המחקר בעברית למצבן של שפות נביא את המחקר בעברית למצבן של שפותאחרותאחרות

נלמד מעברית על שפות דומות נלמד מעברית על שפות דומותנפתח שיטות לטיפול בשפה כלשהינפתח שיטות לטיפול בשפה כלשהינפתח מנגנונים להבנת טכסטיםנפתח מנגנונים להבנת טכסטיםנבין טוב יותר כיצד המוח האנושי פועלנבין טוב יותר כיצד המוח האנושי פועל לעבור את מבחן טיורינג לעבור את מבחן טיורינג

The ProblemThe ProblemWritten Hebrew texts are ambiguous Written Hebrew texts are ambiguous The reasonsThe reasons

The vowels and gemination are omitted The vowels and gemination are omitted ה JפKה קו Jהקופה = קופ JפKה קו Jקופה = קופ

small words are prependedsmall words are prependedand when you will goand when you will go וכשתלך = ו + כש + תלך = וכשתלך = ו + כש + תלך =

Hebrew morphology is complexHebrew morphology is complex

The structure of a Hebrew wordThe structure of a Hebrew word

the lexical lemma the lexical lemma short words such as determiners short words such as determiners

prepositions and conjunctions prepositions and conjunctions prepended to the word prepended to the word

suffixes for possessives and object suffixes for possessives and object cliticsclitics

The linguistic features mark part-of-The linguistic features mark part-of-speech (POS) tense person etc speech (POS) tense person etc

morphemesmorphemes

linguistic featureslinguistic features

ExampleExample

$QMTI $QMTI שקמתישקמתי

$iqmati ndash$iqmati ndash י Pת Jמ Qק Pיש Pת Jמ Qק Pש my sycamore my sycamore

$e-qamti ndash$e-qamti ndash י Pת Qמ Rק Sיש Pת Qמ Rק Sש that I got up that I got up

$e-qammati ndash$e-qammati ndash י Pת Jמ Rק Sיש Pת Jמ Rק Sש that my hey that my hey

noun sg possessive-1sgnoun sg possessive-1sg

connective+verb 1sg pastconnective+verb 1sg past

connective + noun sg possessive-1sgconnective + noun sg possessive-1sg

Previous workPrevious work

POS and Morphological POS and Morphological disambiguationdisambiguation

jhjh

Three stagesThree stages

11 Word stage ndashWord stage ndash find the most probable find the most probable reading of a word regardless of its contextreading of a word regardless of its context

22 Pair stagePair stage ndashcorrect the analysis of a word ndashcorrect the analysis of a word based on the analysis of its immediate based on the analysis of its immediate neighbors neighbors

33 Sentence stage ndashSentence stage ndash use a syntactic parser to use a syntactic parser to rule out improbable analysesrule out improbable analyses

Combining all three stages yielded the best results

The Word StageThe Word Stage Give each word its most probable analysisGive each word its most probable analysis

How to estimate the probability of each How to estimate the probability of each analysisanalysis

Estimate the probability of each analysis Estimate the probability of each analysis from a large analyzed corpusfrom a large analyzed corpus

A large enough corpus does not existA large enough corpus does not exist

Since each word has many forms the Since each word has many forms the number of word tokens is so large that many number of word tokens is so large that many word forms wonrsquot appear even in 10M word word forms wonrsquot appear even in 10M word corpuscorpus

The Word StageThe Word Stage Following the ldquoSimilar Words MethodrdquoFollowing the ldquoSimilar Words Methodrdquo

(Levinger Ornan and Itai 1995) estimate the (Levinger Ornan and Itai 1995) estimate the probability of each analysis of an ambiguous word probability of each analysis of an ambiguous word by changing a (single) feature of each analysis and by changing a (single) feature of each analysis and comparing the occurrences of the resultant words in comparing the occurrences of the resultant words in a large corpusa large corpus

ExampleExample HQPH HQPH הקפההקפה the coffee definite to indefinite the coffee definite to indefinite QPHQPH encirclement indefinite to definite encirclement indefinite to definite HHQPHHHQPH her perimeter feminine possessive to masculine her perimeter feminine possessive to masculine

possessive possessive HQPWHQPW Distribution QPH=180 HHQFH=18 HQPW=2Distribution QPH=180 HHQFH=18 HQPW=2

Our variation of the SW methodOur variation of the SW method

To overcome sparseness we assumed To overcome sparseness we assumed that the lemma and the other that the lemma and the other morphemeslinguistic features are morphemeslinguistic features are statistically independentstatistically independentNamely Namely P(the coffee) = P(the)P(the coffee) = P(the)P(coffee)P(coffee)

Even though the assumption is not Even though the assumption is not valid the resultant ranking is correctvalid the resultant ranking is correct

Evaluation and ComplexityEvaluation and Complexity

Errors 36 Errors 36 145 145

Complexity of algorithm O(Complexity of algorithm O(nn) where ) where nn is the size of the corpusis the size of the corpus

Keeping a copy of the corpus as an Keeping a copy of the corpus as an inverse file reduces the complexity to inverse file reduces the complexity to linear in the number of different linear in the number of different similar words similar words

The pair stage The pair stage

Following Brill we learned correction Following Brill we learned correction rules from a corpus rules from a corpus

The initial The initial morphological scoremorphological score of an of an analysis is its probability as obtained analysis is its probability as obtained at the word stageat the word stage

Correction rules modify the scores by Correction rules modify the scores by considering pairs of adjacent words considering pairs of adjacent words checking if the rule applies and if so checking if the rule applies and if so modify the scoresmodify the scores

Example of a correction ruleExample of a correction rule

If If the POS of the current tag of w the POS of the current tag of w11 is a is a proper-noun proper-noun

andand the POS of the current tag of w the POS of the current tag of w22 is is a noun a noun

and and ww22 has an analysis as a verb that has an analysis as a verb that

matches wmatches w11 by gender and number by gender and number

thenthen add 05 to the morphological add 05 to the morphological score of wscore of w22 as a verb and normalize as a verb and normalize the scores the scores

ExampleExample

YWSP ampDR YWSP ampDR יוסףיוסף עדרעדר

YWSPYWSP == proper noun masc(Joseph)proper noun masc(Joseph)

ampDR =ampDR = noun masc sg abs indefnoun masc sg abs indef (herd) score=07 (herd) score=07

ampDR =ampDR = verb past 3sg masc verb past 3sg masc (hoed) score= (hoed) score=0303

08

0467

0533

normalization

Learning the Rules from a training Learning the Rules from a training corpuscorpus

Input A training corpus where each word is Input A training corpus where each word is correctly analyzedcorrectly analyzed

Run the word stage on the training corpusRun the word stage on the training corpus Generate all possible rulesGenerate all possible rules For each rule set the correction factor to be the For each rule set the correction factor to be the

minimum value that does more good than minimum value that does more good than damagedamage

Choose the rule that does the maximum benefitChoose the rule that does the maximum benefit Repeat until no rule improves the overall Repeat until no rule improves the overall

analyses of the training corpusanalyses of the training corpus

Evaluation and ComplexityEvaluation and Complexity

Training corpus 4892 word tokensTraining corpus 4892 word tokenslearned 93 ruleslearned 93 ruleserrors 145 errors 145 62 62

Complexity of the learning algorithm Complexity of the learning algorithm O(O(cc33) where ) where cc = size of the training = size of the training corpuscorpus

Complexity of the correction Complexity of the correction OO((rrnn) ) where where r = r = number of rulesnumber of rules n = n = size of trial textsize of trial text

The sentence stageThe sentence stage

Use a syntactic parser to rule out Use a syntactic parser to rule out improbable analysesimprobable analyses

The pair stage ndash adjacent words The pair stage ndash adjacent words the sentence stage ndash long term the sentence stage ndash long term dependencies dependencies

ExampleExample

מורה הכיתה הנמוכה נכנס לכיתהמורה הכיתה הנמוכה נכנס לכיתה MWRHMWRH HKITH HNMWKH NKNS LKITH HKITH HNMWKH NKNS LKITH

moremoramoremora ha-kitta ha-nmuka ha-kitta ha-nmuka niknasniknas hellip hellip

mascfem verb-mascmascfem verb-masc

more ha-kitta ha-nmuka niknasmore ha-kitta ha-nmuka niknas hellip hellip

Score of a syntax treeScore of a syntax tree

PREP NN J

S

NP VP

V COMPN COMP

more ha-kitta ha-nmuka niknas la-kitta

score(s) = score(more)score(ha-kitta) hellip score(la-kitta)

The challenge calculate the score of all syntax trees without enumerating all trees

Dynamic ProgrammingDynamic Programming

TableTable[[ijAijA] = the maximum score of all ] = the maximum score of all parsesparses

Fill table by incrasing values of Fill table by incrasing values of

i jA w w

max and im im im iTable i i A s A t G t T

[ ] max [ ] [ 1 ]A BC Gi k j

Table i j A Table i k B Table k j C

0

Time complexity 3O G n

0j i

EvaluationEvaluation

53

147

38

362120

14

Word Stage

Pair Stage

Sentence Stage

error rate

AchievementsAchievements

Constructed an analyzer to read Constructed an analyzer to read Hebrew text that is 96 correctHebrew text that is 96 correct

Constructing data base for Constructing data base for automatically constructing a automatically constructing a syntactic analyzer of Hebrewsyntactic analyzer of Hebrew

Search enginesSearch engines

סיכום וחזוןסיכום וחזון

נביא את המחקר בעברית למצבן של שפות נביא את המחקר בעברית למצבן של שפותאחרותאחרות

נלמד מעברית על שפות דומות נלמד מעברית על שפות דומותנפתח שיטות לטיפול בשפה כלשהינפתח שיטות לטיפול בשפה כלשהינפתח מנגנונים להבנת טכסטיםנפתח מנגנונים להבנת טכסטיםנבין טוב יותר כיצד המוח האנושי פועלנבין טוב יותר כיצד המוח האנושי פועל לעבור את מבחן טיורינג לעבור את מבחן טיורינג

The ProblemThe ProblemWritten Hebrew texts are ambiguous Written Hebrew texts are ambiguous The reasonsThe reasons

The vowels and gemination are omitted The vowels and gemination are omitted ה JפKה קו Jהקופה = קופ JפKה קו Jקופה = קופ

small words are prependedsmall words are prependedand when you will goand when you will go וכשתלך = ו + כש + תלך = וכשתלך = ו + כש + תלך =

Hebrew morphology is complexHebrew morphology is complex

The structure of a Hebrew wordThe structure of a Hebrew word

the lexical lemma the lexical lemma short words such as determiners short words such as determiners

prepositions and conjunctions prepositions and conjunctions prepended to the word prepended to the word

suffixes for possessives and object suffixes for possessives and object cliticsclitics

The linguistic features mark part-of-The linguistic features mark part-of-speech (POS) tense person etc speech (POS) tense person etc

morphemesmorphemes

linguistic featureslinguistic features

ExampleExample

$QMTI $QMTI שקמתישקמתי

$iqmati ndash$iqmati ndash י Pת Jמ Qק Pיש Pת Jמ Qק Pש my sycamore my sycamore

$e-qamti ndash$e-qamti ndash י Pת Qמ Rק Sיש Pת Qמ Rק Sש that I got up that I got up

$e-qammati ndash$e-qammati ndash י Pת Jמ Rק Sיש Pת Jמ Rק Sש that my hey that my hey

noun sg possessive-1sgnoun sg possessive-1sg

connective+verb 1sg pastconnective+verb 1sg past

connective + noun sg possessive-1sgconnective + noun sg possessive-1sg

Previous workPrevious work

POS and Morphological POS and Morphological disambiguationdisambiguation

jhjh

Three stagesThree stages

11 Word stage ndashWord stage ndash find the most probable find the most probable reading of a word regardless of its contextreading of a word regardless of its context

22 Pair stagePair stage ndashcorrect the analysis of a word ndashcorrect the analysis of a word based on the analysis of its immediate based on the analysis of its immediate neighbors neighbors

33 Sentence stage ndashSentence stage ndash use a syntactic parser to use a syntactic parser to rule out improbable analysesrule out improbable analyses

Combining all three stages yielded the best results

The Word StageThe Word Stage Give each word its most probable analysisGive each word its most probable analysis

How to estimate the probability of each How to estimate the probability of each analysisanalysis

Estimate the probability of each analysis Estimate the probability of each analysis from a large analyzed corpusfrom a large analyzed corpus

A large enough corpus does not existA large enough corpus does not exist

Since each word has many forms the Since each word has many forms the number of word tokens is so large that many number of word tokens is so large that many word forms wonrsquot appear even in 10M word word forms wonrsquot appear even in 10M word corpuscorpus

The Word StageThe Word Stage Following the ldquoSimilar Words MethodrdquoFollowing the ldquoSimilar Words Methodrdquo

(Levinger Ornan and Itai 1995) estimate the (Levinger Ornan and Itai 1995) estimate the probability of each analysis of an ambiguous word probability of each analysis of an ambiguous word by changing a (single) feature of each analysis and by changing a (single) feature of each analysis and comparing the occurrences of the resultant words in comparing the occurrences of the resultant words in a large corpusa large corpus

ExampleExample HQPH HQPH הקפההקפה the coffee definite to indefinite the coffee definite to indefinite QPHQPH encirclement indefinite to definite encirclement indefinite to definite HHQPHHHQPH her perimeter feminine possessive to masculine her perimeter feminine possessive to masculine

possessive possessive HQPWHQPW Distribution QPH=180 HHQFH=18 HQPW=2Distribution QPH=180 HHQFH=18 HQPW=2

Our variation of the SW methodOur variation of the SW method

To overcome sparseness we assumed To overcome sparseness we assumed that the lemma and the other that the lemma and the other morphemeslinguistic features are morphemeslinguistic features are statistically independentstatistically independentNamely Namely P(the coffee) = P(the)P(the coffee) = P(the)P(coffee)P(coffee)

Even though the assumption is not Even though the assumption is not valid the resultant ranking is correctvalid the resultant ranking is correct

Evaluation and ComplexityEvaluation and Complexity

Errors 36 Errors 36 145 145

Complexity of algorithm O(Complexity of algorithm O(nn) where ) where nn is the size of the corpusis the size of the corpus

Keeping a copy of the corpus as an Keeping a copy of the corpus as an inverse file reduces the complexity to inverse file reduces the complexity to linear in the number of different linear in the number of different similar words similar words

The pair stage The pair stage

Following Brill we learned correction Following Brill we learned correction rules from a corpus rules from a corpus

The initial The initial morphological scoremorphological score of an of an analysis is its probability as obtained analysis is its probability as obtained at the word stageat the word stage

Correction rules modify the scores by Correction rules modify the scores by considering pairs of adjacent words considering pairs of adjacent words checking if the rule applies and if so checking if the rule applies and if so modify the scoresmodify the scores

Example of a correction ruleExample of a correction rule

If If the POS of the current tag of w the POS of the current tag of w11 is a is a proper-noun proper-noun

andand the POS of the current tag of w the POS of the current tag of w22 is is a noun a noun

and and ww22 has an analysis as a verb that has an analysis as a verb that

matches wmatches w11 by gender and number by gender and number

thenthen add 05 to the morphological add 05 to the morphological score of wscore of w22 as a verb and normalize as a verb and normalize the scores the scores

ExampleExample

YWSP ampDR YWSP ampDR יוסףיוסף עדרעדר

YWSPYWSP == proper noun masc(Joseph)proper noun masc(Joseph)

ampDR =ampDR = noun masc sg abs indefnoun masc sg abs indef (herd) score=07 (herd) score=07

ampDR =ampDR = verb past 3sg masc verb past 3sg masc (hoed) score= (hoed) score=0303

08

0467

0533

normalization

Learning the Rules from a training Learning the Rules from a training corpuscorpus

Input A training corpus where each word is Input A training corpus where each word is correctly analyzedcorrectly analyzed

Run the word stage on the training corpusRun the word stage on the training corpus Generate all possible rulesGenerate all possible rules For each rule set the correction factor to be the For each rule set the correction factor to be the

minimum value that does more good than minimum value that does more good than damagedamage

Choose the rule that does the maximum benefitChoose the rule that does the maximum benefit Repeat until no rule improves the overall Repeat until no rule improves the overall

analyses of the training corpusanalyses of the training corpus

Evaluation and ComplexityEvaluation and Complexity

Training corpus 4892 word tokensTraining corpus 4892 word tokenslearned 93 ruleslearned 93 ruleserrors 145 errors 145 62 62

Complexity of the learning algorithm Complexity of the learning algorithm O(O(cc33) where ) where cc = size of the training = size of the training corpuscorpus

Complexity of the correction Complexity of the correction OO((rrnn) ) where where r = r = number of rulesnumber of rules n = n = size of trial textsize of trial text

The sentence stageThe sentence stage

Use a syntactic parser to rule out Use a syntactic parser to rule out improbable analysesimprobable analyses

The pair stage ndash adjacent words The pair stage ndash adjacent words the sentence stage ndash long term the sentence stage ndash long term dependencies dependencies

ExampleExample

מורה הכיתה הנמוכה נכנס לכיתהמורה הכיתה הנמוכה נכנס לכיתה MWRHMWRH HKITH HNMWKH NKNS LKITH HKITH HNMWKH NKNS LKITH

moremoramoremora ha-kitta ha-nmuka ha-kitta ha-nmuka niknasniknas hellip hellip

mascfem verb-mascmascfem verb-masc

more ha-kitta ha-nmuka niknasmore ha-kitta ha-nmuka niknas hellip hellip

Score of a syntax treeScore of a syntax tree

PREP NN J

S

NP VP

V COMPN COMP

more ha-kitta ha-nmuka niknas la-kitta

score(s) = score(more)score(ha-kitta) hellip score(la-kitta)

The challenge calculate the score of all syntax trees without enumerating all trees

Dynamic ProgrammingDynamic Programming

TableTable[[ijAijA] = the maximum score of all ] = the maximum score of all parsesparses

Fill table by incrasing values of Fill table by incrasing values of

i jA w w

max and im im im iTable i i A s A t G t T

[ ] max [ ] [ 1 ]A BC Gi k j

Table i j A Table i k B Table k j C

0

Time complexity 3O G n

0j i

EvaluationEvaluation

53

147

38

362120

14

Word Stage

Pair Stage

Sentence Stage

error rate

סיכום וחזוןסיכום וחזון

נביא את המחקר בעברית למצבן של שפות נביא את המחקר בעברית למצבן של שפותאחרותאחרות

נלמד מעברית על שפות דומות נלמד מעברית על שפות דומותנפתח שיטות לטיפול בשפה כלשהינפתח שיטות לטיפול בשפה כלשהינפתח מנגנונים להבנת טכסטיםנפתח מנגנונים להבנת טכסטיםנבין טוב יותר כיצד המוח האנושי פועלנבין טוב יותר כיצד המוח האנושי פועל לעבור את מבחן טיורינג לעבור את מבחן טיורינג

The ProblemThe ProblemWritten Hebrew texts are ambiguous Written Hebrew texts are ambiguous The reasonsThe reasons

The vowels and gemination are omitted The vowels and gemination are omitted ה JפKה קו Jהקופה = קופ JפKה קו Jקופה = קופ

small words are prependedsmall words are prependedand when you will goand when you will go וכשתלך = ו + כש + תלך = וכשתלך = ו + כש + תלך =

Hebrew morphology is complexHebrew morphology is complex

The structure of a Hebrew wordThe structure of a Hebrew word

the lexical lemma the lexical lemma short words such as determiners short words such as determiners

prepositions and conjunctions prepositions and conjunctions prepended to the word prepended to the word

suffixes for possessives and object suffixes for possessives and object cliticsclitics

The linguistic features mark part-of-The linguistic features mark part-of-speech (POS) tense person etc speech (POS) tense person etc

morphemesmorphemes

linguistic featureslinguistic features

ExampleExample

$QMTI $QMTI שקמתישקמתי

$iqmati ndash$iqmati ndash י Pת Jמ Qק Pיש Pת Jמ Qק Pש my sycamore my sycamore

$e-qamti ndash$e-qamti ndash י Pת Qמ Rק Sיש Pת Qמ Rק Sש that I got up that I got up

$e-qammati ndash$e-qammati ndash י Pת Jמ Rק Sיש Pת Jמ Rק Sש that my hey that my hey

noun sg possessive-1sgnoun sg possessive-1sg

connective+verb 1sg pastconnective+verb 1sg past

connective + noun sg possessive-1sgconnective + noun sg possessive-1sg

Previous workPrevious work

POS and Morphological POS and Morphological disambiguationdisambiguation

jhjh

Three stagesThree stages

11 Word stage ndashWord stage ndash find the most probable find the most probable reading of a word regardless of its contextreading of a word regardless of its context

22 Pair stagePair stage ndashcorrect the analysis of a word ndashcorrect the analysis of a word based on the analysis of its immediate based on the analysis of its immediate neighbors neighbors

33 Sentence stage ndashSentence stage ndash use a syntactic parser to use a syntactic parser to rule out improbable analysesrule out improbable analyses

Combining all three stages yielded the best results

The Word StageThe Word Stage Give each word its most probable analysisGive each word its most probable analysis

How to estimate the probability of each How to estimate the probability of each analysisanalysis

Estimate the probability of each analysis Estimate the probability of each analysis from a large analyzed corpusfrom a large analyzed corpus

A large enough corpus does not existA large enough corpus does not exist

Since each word has many forms the Since each word has many forms the number of word tokens is so large that many number of word tokens is so large that many word forms wonrsquot appear even in 10M word word forms wonrsquot appear even in 10M word corpuscorpus

The Word StageThe Word Stage Following the ldquoSimilar Words MethodrdquoFollowing the ldquoSimilar Words Methodrdquo

(Levinger Ornan and Itai 1995) estimate the (Levinger Ornan and Itai 1995) estimate the probability of each analysis of an ambiguous word probability of each analysis of an ambiguous word by changing a (single) feature of each analysis and by changing a (single) feature of each analysis and comparing the occurrences of the resultant words in comparing the occurrences of the resultant words in a large corpusa large corpus

ExampleExample HQPH HQPH הקפההקפה the coffee definite to indefinite the coffee definite to indefinite QPHQPH encirclement indefinite to definite encirclement indefinite to definite HHQPHHHQPH her perimeter feminine possessive to masculine her perimeter feminine possessive to masculine

possessive possessive HQPWHQPW Distribution QPH=180 HHQFH=18 HQPW=2Distribution QPH=180 HHQFH=18 HQPW=2

Our variation of the SW methodOur variation of the SW method

To overcome sparseness we assumed To overcome sparseness we assumed that the lemma and the other that the lemma and the other morphemeslinguistic features are morphemeslinguistic features are statistically independentstatistically independentNamely Namely P(the coffee) = P(the)P(the coffee) = P(the)P(coffee)P(coffee)

Even though the assumption is not Even though the assumption is not valid the resultant ranking is correctvalid the resultant ranking is correct

Evaluation and ComplexityEvaluation and Complexity

Errors 36 Errors 36 145 145

Complexity of algorithm O(Complexity of algorithm O(nn) where ) where nn is the size of the corpusis the size of the corpus

Keeping a copy of the corpus as an Keeping a copy of the corpus as an inverse file reduces the complexity to inverse file reduces the complexity to linear in the number of different linear in the number of different similar words similar words

The pair stage The pair stage

Following Brill we learned correction Following Brill we learned correction rules from a corpus rules from a corpus

The initial The initial morphological scoremorphological score of an of an analysis is its probability as obtained analysis is its probability as obtained at the word stageat the word stage

Correction rules modify the scores by Correction rules modify the scores by considering pairs of adjacent words considering pairs of adjacent words checking if the rule applies and if so checking if the rule applies and if so modify the scoresmodify the scores

Example of a correction ruleExample of a correction rule

If If the POS of the current tag of w the POS of the current tag of w11 is a is a proper-noun proper-noun

andand the POS of the current tag of w the POS of the current tag of w22 is is a noun a noun

and and ww22 has an analysis as a verb that has an analysis as a verb that

matches wmatches w11 by gender and number by gender and number

thenthen add 05 to the morphological add 05 to the morphological score of wscore of w22 as a verb and normalize as a verb and normalize the scores the scores

ExampleExample

YWSP ampDR YWSP ampDR יוסףיוסף עדרעדר

YWSPYWSP == proper noun masc(Joseph)proper noun masc(Joseph)

ampDR =ampDR = noun masc sg abs indefnoun masc sg abs indef (herd) score=07 (herd) score=07

ampDR =ampDR = verb past 3sg masc verb past 3sg masc (hoed) score= (hoed) score=0303

08

0467

0533

normalization

Learning the Rules from a training Learning the Rules from a training corpuscorpus

Input A training corpus where each word is Input A training corpus where each word is correctly analyzedcorrectly analyzed

Run the word stage on the training corpusRun the word stage on the training corpus Generate all possible rulesGenerate all possible rules For each rule set the correction factor to be the For each rule set the correction factor to be the

minimum value that does more good than minimum value that does more good than damagedamage

Choose the rule that does the maximum benefitChoose the rule that does the maximum benefit Repeat until no rule improves the overall Repeat until no rule improves the overall

analyses of the training corpusanalyses of the training corpus

Evaluation and ComplexityEvaluation and Complexity

Training corpus 4892 word tokensTraining corpus 4892 word tokenslearned 93 ruleslearned 93 ruleserrors 145 errors 145 62 62

Complexity of the learning algorithm Complexity of the learning algorithm O(O(cc33) where ) where cc = size of the training = size of the training corpuscorpus

Complexity of the correction Complexity of the correction OO((rrnn) ) where where r = r = number of rulesnumber of rules n = n = size of trial textsize of trial text

The sentence stageThe sentence stage

Use a syntactic parser to rule out Use a syntactic parser to rule out improbable analysesimprobable analyses

The pair stage ndash adjacent words The pair stage ndash adjacent words the sentence stage ndash long term the sentence stage ndash long term dependencies dependencies

ExampleExample

מורה הכיתה הנמוכה נכנס לכיתהמורה הכיתה הנמוכה נכנס לכיתה MWRHMWRH HKITH HNMWKH NKNS LKITH HKITH HNMWKH NKNS LKITH

moremoramoremora ha-kitta ha-nmuka ha-kitta ha-nmuka niknasniknas hellip hellip

mascfem verb-mascmascfem verb-masc

more ha-kitta ha-nmuka niknasmore ha-kitta ha-nmuka niknas hellip hellip

Score of a syntax treeScore of a syntax tree

PREP NN J

S

NP VP

V COMPN COMP

more ha-kitta ha-nmuka niknas la-kitta

score(s) = score(more)score(ha-kitta) hellip score(la-kitta)

The challenge calculate the score of all syntax trees without enumerating all trees

Dynamic ProgrammingDynamic Programming

TableTable[[ijAijA] = the maximum score of all ] = the maximum score of all parsesparses

Fill table by incrasing values of Fill table by incrasing values of

i jA w w

max and im im im iTable i i A s A t G t T

[ ] max [ ] [ 1 ]A BC Gi k j

Table i j A Table i k B Table k j C

0

Time complexity 3O G n

0j i

EvaluationEvaluation

53

147

38

362120

14

Word Stage

Pair Stage

Sentence Stage

error rate

ExampleExample

YWSP ampDR YWSP ampDR יוסףיוסף עדרעדר

YWSPYWSP == proper noun masc(Joseph)proper noun masc(Joseph)

ampDR =ampDR = noun masc sg abs indefnoun masc sg abs indef (herd) score=07 (herd) score=07

ampDR =ampDR = verb past 3sg masc verb past 3sg masc (hoed) score= (hoed) score=0303

08

0467

0533

normalization

Learning the Rules from a training Learning the Rules from a training corpuscorpus

Input A training corpus where each word is Input A training corpus where each word is correctly analyzedcorrectly analyzed

Run the word stage on the training corpusRun the word stage on the training corpus Generate all possible rulesGenerate all possible rules For each rule set the correction factor to be the For each rule set the correction factor to be the

minimum value that does more good than minimum value that does more good than damagedamage

Choose the rule that does the maximum benefitChoose the rule that does the maximum benefit Repeat until no rule improves the overall Repeat until no rule improves the overall

analyses of the training corpusanalyses of the training corpus

Evaluation and ComplexityEvaluation and Complexity

Training corpus 4892 word tokensTraining corpus 4892 word tokenslearned 93 ruleslearned 93 ruleserrors 145 errors 145 62 62

Complexity of the learning algorithm Complexity of the learning algorithm O(O(cc33) where ) where cc = size of the training = size of the training corpuscorpus

Complexity of the correction Complexity of the correction OO((rrnn) ) where where r = r = number of rulesnumber of rules n = n = size of trial textsize of trial text

The sentence stageThe sentence stage

Use a syntactic parser to rule out Use a syntactic parser to rule out improbable analysesimprobable analyses

The pair stage ndash adjacent words The pair stage ndash adjacent words the sentence stage ndash long term the sentence stage ndash long term dependencies dependencies

ExampleExample

מורה הכיתה הנמוכה נכנס לכיתהמורה הכיתה הנמוכה נכנס לכיתה MWRHMWRH HKITH HNMWKH NKNS LKITH HKITH HNMWKH NKNS LKITH

moremoramoremora ha-kitta ha-nmuka ha-kitta ha-nmuka niknasniknas hellip hellip

mascfem verb-mascmascfem verb-masc

more ha-kitta ha-nmuka niknasmore ha-kitta ha-nmuka niknas hellip hellip

Score of a syntax treeScore of a syntax tree

PREP NN J

S

NP VP

V COMPN COMP

more ha-kitta ha-nmuka niknas la-kitta

score(s) = score(more)score(ha-kitta) hellip score(la-kitta)

The challenge calculate the score of all syntax trees without enumerating all trees

Dynamic ProgrammingDynamic Programming

TableTable[[ijAijA] = the maximum score of all ] = the maximum score of all parsesparses

Fill table by incrasing values of Fill table by incrasing values of

i jA w w

max and im im im iTable i i A s A t G t T

[ ] max [ ] [ 1 ]A BC Gi k j

Table i j A Table i k B Table k j C

0

Time complexity 3O G n

0j i

EvaluationEvaluation

53

147

38

362120

14

Word Stage

Pair Stage

Sentence Stage

error rate

The structure of a Hebrew wordThe structure of a Hebrew word

the lexical lemma the lexical lemma short words such as determiners short words such as determiners

prepositions and conjunctions prepositions and conjunctions prepended to the word prepended to the word

suffixes for possessives and object suffixes for possessives and object cliticsclitics

The linguistic features mark part-of-The linguistic features mark part-of-speech (POS) tense person etc speech (POS) tense person etc

morphemesmorphemes

linguistic featureslinguistic features

ExampleExample

$QMTI $QMTI שקמתישקמתי

$iqmati ndash$iqmati ndash י Pת Jמ Qק Pיש Pת Jמ Qק Pש my sycamore my sycamore

$e-qamti ndash$e-qamti ndash י Pת Qמ Rק Sיש Pת Qמ Rק Sש that I got up that I got up

$e-qammati ndash$e-qammati ndash י Pת Jמ Rק Sיש Pת Jמ Rק Sש that my hey that my hey

noun sg possessive-1sgnoun sg possessive-1sg

connective+verb 1sg pastconnective+verb 1sg past

connective + noun sg possessive-1sgconnective + noun sg possessive-1sg

Previous workPrevious work

POS and Morphological POS and Morphological disambiguationdisambiguation

jhjh

Three stagesThree stages

11 Word stage ndashWord stage ndash find the most probable find the most probable reading of a word regardless of its contextreading of a word regardless of its context

22 Pair stagePair stage ndashcorrect the analysis of a word ndashcorrect the analysis of a word based on the analysis of its immediate based on the analysis of its immediate neighbors neighbors

33 Sentence stage ndashSentence stage ndash use a syntactic parser to use a syntactic parser to rule out improbable analysesrule out improbable analyses

Combining all three stages yielded the best results

The Word StageThe Word Stage Give each word its most probable analysisGive each word its most probable analysis

How to estimate the probability of each How to estimate the probability of each analysisanalysis

Estimate the probability of each analysis Estimate the probability of each analysis from a large analyzed corpusfrom a large analyzed corpus

A large enough corpus does not existA large enough corpus does not exist

Since each word has many forms the Since each word has many forms the number of word tokens is so large that many number of word tokens is so large that many word forms wonrsquot appear even in 10M word word forms wonrsquot appear even in 10M word corpuscorpus

The Word StageThe Word Stage Following the ldquoSimilar Words MethodrdquoFollowing the ldquoSimilar Words Methodrdquo

(Levinger Ornan and Itai 1995) estimate the (Levinger Ornan and Itai 1995) estimate the probability of each analysis of an ambiguous word probability of each analysis of an ambiguous word by changing a (single) feature of each analysis and by changing a (single) feature of each analysis and comparing the occurrences of the resultant words in comparing the occurrences of the resultant words in a large corpusa large corpus

ExampleExample HQPH HQPH הקפההקפה the coffee definite to indefinite the coffee definite to indefinite QPHQPH encirclement indefinite to definite encirclement indefinite to definite HHQPHHHQPH her perimeter feminine possessive to masculine her perimeter feminine possessive to masculine

possessive possessive HQPWHQPW Distribution QPH=180 HHQFH=18 HQPW=2Distribution QPH=180 HHQFH=18 HQPW=2

Our variation of the SW methodOur variation of the SW method

To overcome sparseness we assumed To overcome sparseness we assumed that the lemma and the other that the lemma and the other morphemeslinguistic features are morphemeslinguistic features are statistically independentstatistically independentNamely Namely P(the coffee) = P(the)P(the coffee) = P(the)P(coffee)P(coffee)

Even though the assumption is not Even though the assumption is not valid the resultant ranking is correctvalid the resultant ranking is correct

Evaluation and ComplexityEvaluation and Complexity

Errors 36 Errors 36 145 145

Complexity of algorithm O(Complexity of algorithm O(nn) where ) where nn is the size of the corpusis the size of the corpus

Keeping a copy of the corpus as an Keeping a copy of the corpus as an inverse file reduces the complexity to inverse file reduces the complexity to linear in the number of different linear in the number of different similar words similar words

The pair stage The pair stage

Following Brill we learned correction Following Brill we learned correction rules from a corpus rules from a corpus

The initial The initial morphological scoremorphological score of an of an analysis is its probability as obtained analysis is its probability as obtained at the word stageat the word stage

Correction rules modify the scores by Correction rules modify the scores by considering pairs of adjacent words considering pairs of adjacent words checking if the rule applies and if so checking if the rule applies and if so modify the scoresmodify the scores

Example of a correction ruleExample of a correction rule

If If the POS of the current tag of w the POS of the current tag of w11 is a is a proper-noun proper-noun

andand the POS of the current tag of w the POS of the current tag of w22 is is a noun a noun

and and ww22 has an analysis as a verb that has an analysis as a verb that

matches wmatches w11 by gender and number by gender and number

thenthen add 05 to the morphological add 05 to the morphological score of wscore of w22 as a verb and normalize as a verb and normalize the scores the scores

ExampleExample

YWSP ampDR YWSP ampDR יוסףיוסף עדרעדר

YWSPYWSP == proper noun masc(Joseph)proper noun masc(Joseph)

ampDR =ampDR = noun masc sg abs indefnoun masc sg abs indef (herd) score=07 (herd) score=07

ampDR =ampDR = verb past 3sg masc verb past 3sg masc (hoed) score= (hoed) score=0303

08

0467

0533

normalization

Learning the Rules from a training Learning the Rules from a training corpuscorpus

Input A training corpus where each word is Input A training corpus where each word is correctly analyzedcorrectly analyzed

Run the word stage on the training corpusRun the word stage on the training corpus Generate all possible rulesGenerate all possible rules For each rule set the correction factor to be the For each rule set the correction factor to be the

minimum value that does more good than minimum value that does more good than damagedamage

Choose the rule that does the maximum benefitChoose the rule that does the maximum benefit Repeat until no rule improves the overall Repeat until no rule improves the overall

analyses of the training corpusanalyses of the training corpus

Evaluation and ComplexityEvaluation and Complexity

Training corpus 4892 word tokensTraining corpus 4892 word tokenslearned 93 ruleslearned 93 ruleserrors 145 errors 145 62 62

Complexity of the learning algorithm Complexity of the learning algorithm O(O(cc33) where ) where cc = size of the training = size of the training corpuscorpus

Complexity of the correction Complexity of the correction OO((rrnn) ) where where r = r = number of rulesnumber of rules n = n = size of trial textsize of trial text

The sentence stageThe sentence stage

Use a syntactic parser to rule out Use a syntactic parser to rule out improbable analysesimprobable analyses

The pair stage ndash adjacent words The pair stage ndash adjacent words the sentence stage ndash long term the sentence stage ndash long term dependencies dependencies

ExampleExample

מורה הכיתה הנמוכה נכנס לכיתהמורה הכיתה הנמוכה נכנס לכיתה MWRHMWRH HKITH HNMWKH NKNS LKITH HKITH HNMWKH NKNS LKITH

moremoramoremora ha-kitta ha-nmuka ha-kitta ha-nmuka niknasniknas hellip hellip

mascfem verb-mascmascfem verb-masc

more ha-kitta ha-nmuka niknasmore ha-kitta ha-nmuka niknas hellip hellip

Score of a syntax treeScore of a syntax tree

PREP NN J

S

NP VP

V COMPN COMP

more ha-kitta ha-nmuka niknas la-kitta

score(s) = score(more)score(ha-kitta) hellip score(la-kitta)

The challenge calculate the score of all syntax trees without enumerating all trees

Dynamic ProgrammingDynamic Programming

TableTable[[ijAijA] = the maximum score of all ] = the maximum score of all parsesparses

Fill table by incrasing values of Fill table by incrasing values of

i jA w w

max and im im im iTable i i A s A t G t T

[ ] max [ ] [ 1 ]A BC Gi k j

Table i j A Table i k B Table k j C

0

Time complexity 3O G n

0j i

EvaluationEvaluation

53

147

38

362120

14

Word Stage

Pair Stage

Sentence Stage

error rate

ExampleExample

$QMTI $QMTI שקמתישקמתי

$iqmati ndash$iqmati ndash י Pת Jמ Qק Pיש Pת Jמ Qק Pש my sycamore my sycamore

$e-qamti ndash$e-qamti ndash י Pת Qמ Rק Sיש Pת Qמ Rק Sש that I got up that I got up

$e-qammati ndash$e-qammati ndash י Pת Jמ Rק Sיש Pת Jמ Rק Sש that my hey that my hey

noun sg possessive-1sgnoun sg possessive-1sg

connective+verb 1sg pastconnective+verb 1sg past

connective + noun sg possessive-1sgconnective + noun sg possessive-1sg

Previous workPrevious work

POS and Morphological POS and Morphological disambiguationdisambiguation

jhjh

Three stagesThree stages

11 Word stage ndashWord stage ndash find the most probable find the most probable reading of a word regardless of its contextreading of a word regardless of its context

22 Pair stagePair stage ndashcorrect the analysis of a word ndashcorrect the analysis of a word based on the analysis of its immediate based on the analysis of its immediate neighbors neighbors

33 Sentence stage ndashSentence stage ndash use a syntactic parser to use a syntactic parser to rule out improbable analysesrule out improbable analyses

Combining all three stages yielded the best results

The Word StageThe Word Stage Give each word its most probable analysisGive each word its most probable analysis

How to estimate the probability of each How to estimate the probability of each analysisanalysis

Estimate the probability of each analysis Estimate the probability of each analysis from a large analyzed corpusfrom a large analyzed corpus

A large enough corpus does not existA large enough corpus does not exist

Since each word has many forms the Since each word has many forms the number of word tokens is so large that many number of word tokens is so large that many word forms wonrsquot appear even in 10M word word forms wonrsquot appear even in 10M word corpuscorpus

The Word StageThe Word Stage Following the ldquoSimilar Words MethodrdquoFollowing the ldquoSimilar Words Methodrdquo

(Levinger Ornan and Itai 1995) estimate the (Levinger Ornan and Itai 1995) estimate the probability of each analysis of an ambiguous word probability of each analysis of an ambiguous word by changing a (single) feature of each analysis and by changing a (single) feature of each analysis and comparing the occurrences of the resultant words in comparing the occurrences of the resultant words in a large corpusa large corpus

ExampleExample HQPH HQPH הקפההקפה the coffee definite to indefinite the coffee definite to indefinite QPHQPH encirclement indefinite to definite encirclement indefinite to definite HHQPHHHQPH her perimeter feminine possessive to masculine her perimeter feminine possessive to masculine

possessive possessive HQPWHQPW Distribution QPH=180 HHQFH=18 HQPW=2Distribution QPH=180 HHQFH=18 HQPW=2

Our variation of the SW methodOur variation of the SW method

To overcome sparseness we assumed To overcome sparseness we assumed that the lemma and the other that the lemma and the other morphemeslinguistic features are morphemeslinguistic features are statistically independentstatistically independentNamely Namely P(the coffee) = P(the)P(the coffee) = P(the)P(coffee)P(coffee)

Even though the assumption is not Even though the assumption is not valid the resultant ranking is correctvalid the resultant ranking is correct

Evaluation and ComplexityEvaluation and Complexity

Errors 36 Errors 36 145 145

Complexity of algorithm O(Complexity of algorithm O(nn) where ) where nn is the size of the corpusis the size of the corpus

Keeping a copy of the corpus as an Keeping a copy of the corpus as an inverse file reduces the complexity to inverse file reduces the complexity to linear in the number of different linear in the number of different similar words similar words

The pair stage The pair stage

Following Brill we learned correction Following Brill we learned correction rules from a corpus rules from a corpus

The initial The initial morphological scoremorphological score of an of an analysis is its probability as obtained analysis is its probability as obtained at the word stageat the word stage

Correction rules modify the scores by Correction rules modify the scores by considering pairs of adjacent words considering pairs of adjacent words checking if the rule applies and if so checking if the rule applies and if so modify the scoresmodify the scores

Example of a correction ruleExample of a correction rule

If If the POS of the current tag of w the POS of the current tag of w11 is a is a proper-noun proper-noun

andand the POS of the current tag of w the POS of the current tag of w22 is is a noun a noun

and and ww22 has an analysis as a verb that has an analysis as a verb that

matches wmatches w11 by gender and number by gender and number

thenthen add 05 to the morphological add 05 to the morphological score of wscore of w22 as a verb and normalize as a verb and normalize the scores the scores

ExampleExample

YWSP ampDR YWSP ampDR יוסףיוסף עדרעדר

YWSPYWSP == proper noun masc(Joseph)proper noun masc(Joseph)

ampDR =ampDR = noun masc sg abs indefnoun masc sg abs indef (herd) score=07 (herd) score=07

ampDR =ampDR = verb past 3sg masc verb past 3sg masc (hoed) score= (hoed) score=0303

08

0467

0533

normalization

Learning the Rules from a training Learning the Rules from a training corpuscorpus

Input A training corpus where each word is Input A training corpus where each word is correctly analyzedcorrectly analyzed

Run the word stage on the training corpusRun the word stage on the training corpus Generate all possible rulesGenerate all possible rules For each rule set the correction factor to be the For each rule set the correction factor to be the

minimum value that does more good than minimum value that does more good than damagedamage

Choose the rule that does the maximum benefitChoose the rule that does the maximum benefit Repeat until no rule improves the overall Repeat until no rule improves the overall

analyses of the training corpusanalyses of the training corpus

Evaluation and ComplexityEvaluation and Complexity

Training corpus 4892 word tokensTraining corpus 4892 word tokenslearned 93 ruleslearned 93 ruleserrors 145 errors 145 62 62

Complexity of the learning algorithm Complexity of the learning algorithm O(O(cc33) where ) where cc = size of the training = size of the training corpuscorpus

Complexity of the correction Complexity of the correction OO((rrnn) ) where where r = r = number of rulesnumber of rules n = n = size of trial textsize of trial text

The sentence stageThe sentence stage

Use a syntactic parser to rule out Use a syntactic parser to rule out improbable analysesimprobable analyses

The pair stage ndash adjacent words The pair stage ndash adjacent words the sentence stage ndash long term the sentence stage ndash long term dependencies dependencies

ExampleExample

מורה הכיתה הנמוכה נכנס לכיתהמורה הכיתה הנמוכה נכנס לכיתה MWRHMWRH HKITH HNMWKH NKNS LKITH HKITH HNMWKH NKNS LKITH

moremoramoremora ha-kitta ha-nmuka ha-kitta ha-nmuka niknasniknas hellip hellip

mascfem verb-mascmascfem verb-masc

more ha-kitta ha-nmuka niknasmore ha-kitta ha-nmuka niknas hellip hellip

Score of a syntax treeScore of a syntax tree

PREP NN J

S

NP VP

V COMPN COMP

more ha-kitta ha-nmuka niknas la-kitta

score(s) = score(more)score(ha-kitta) hellip score(la-kitta)

The challenge calculate the score of all syntax trees without enumerating all trees

Dynamic ProgrammingDynamic Programming

TableTable[[ijAijA] = the maximum score of all ] = the maximum score of all parsesparses

Fill table by incrasing values of Fill table by incrasing values of

i jA w w

max and im im im iTable i i A s A t G t T

[ ] max [ ] [ 1 ]A BC Gi k j

Table i j A Table i k B Table k j C

0

Time complexity 3O G n

0j i

EvaluationEvaluation

53

147

38

362120

14

Word Stage

Pair Stage

Sentence Stage

error rate

Previous workPrevious work

POS and Morphological POS and Morphological disambiguationdisambiguation

jhjh

Three stagesThree stages

11 Word stage ndashWord stage ndash find the most probable find the most probable reading of a word regardless of its contextreading of a word regardless of its context

22 Pair stagePair stage ndashcorrect the analysis of a word ndashcorrect the analysis of a word based on the analysis of its immediate based on the analysis of its immediate neighbors neighbors

33 Sentence stage ndashSentence stage ndash use a syntactic parser to use a syntactic parser to rule out improbable analysesrule out improbable analyses

Combining all three stages yielded the best results

The Word StageThe Word Stage Give each word its most probable analysisGive each word its most probable analysis

How to estimate the probability of each How to estimate the probability of each analysisanalysis

Estimate the probability of each analysis Estimate the probability of each analysis from a large analyzed corpusfrom a large analyzed corpus

A large enough corpus does not existA large enough corpus does not exist

Since each word has many forms the Since each word has many forms the number of word tokens is so large that many number of word tokens is so large that many word forms wonrsquot appear even in 10M word word forms wonrsquot appear even in 10M word corpuscorpus

The Word StageThe Word Stage Following the ldquoSimilar Words MethodrdquoFollowing the ldquoSimilar Words Methodrdquo

(Levinger Ornan and Itai 1995) estimate the (Levinger Ornan and Itai 1995) estimate the probability of each analysis of an ambiguous word probability of each analysis of an ambiguous word by changing a (single) feature of each analysis and by changing a (single) feature of each analysis and comparing the occurrences of the resultant words in comparing the occurrences of the resultant words in a large corpusa large corpus

ExampleExample HQPH HQPH הקפההקפה the coffee definite to indefinite the coffee definite to indefinite QPHQPH encirclement indefinite to definite encirclement indefinite to definite HHQPHHHQPH her perimeter feminine possessive to masculine her perimeter feminine possessive to masculine

possessive possessive HQPWHQPW Distribution QPH=180 HHQFH=18 HQPW=2Distribution QPH=180 HHQFH=18 HQPW=2

Our variation of the SW methodOur variation of the SW method

To overcome sparseness we assumed To overcome sparseness we assumed that the lemma and the other that the lemma and the other morphemeslinguistic features are morphemeslinguistic features are statistically independentstatistically independentNamely Namely P(the coffee) = P(the)P(the coffee) = P(the)P(coffee)P(coffee)

Even though the assumption is not Even though the assumption is not valid the resultant ranking is correctvalid the resultant ranking is correct

Evaluation and ComplexityEvaluation and Complexity

Errors 36 Errors 36 145 145

Complexity of algorithm O(Complexity of algorithm O(nn) where ) where nn is the size of the corpusis the size of the corpus

Keeping a copy of the corpus as an Keeping a copy of the corpus as an inverse file reduces the complexity to inverse file reduces the complexity to linear in the number of different linear in the number of different similar words similar words

The pair stage The pair stage

Following Brill we learned correction Following Brill we learned correction rules from a corpus rules from a corpus

The initial The initial morphological scoremorphological score of an of an analysis is its probability as obtained analysis is its probability as obtained at the word stageat the word stage

Correction rules modify the scores by Correction rules modify the scores by considering pairs of adjacent words considering pairs of adjacent words checking if the rule applies and if so checking if the rule applies and if so modify the scoresmodify the scores

Example of a correction ruleExample of a correction rule

If If the POS of the current tag of w the POS of the current tag of w11 is a is a proper-noun proper-noun

andand the POS of the current tag of w the POS of the current tag of w22 is is a noun a noun

and and ww22 has an analysis as a verb that has an analysis as a verb that

matches wmatches w11 by gender and number by gender and number

thenthen add 05 to the morphological add 05 to the morphological score of wscore of w22 as a verb and normalize as a verb and normalize the scores the scores

ExampleExample

YWSP ampDR YWSP ampDR יוסףיוסף עדרעדר

YWSPYWSP == proper noun masc(Joseph)proper noun masc(Joseph)

ampDR =ampDR = noun masc sg abs indefnoun masc sg abs indef (herd) score=07 (herd) score=07

ampDR =ampDR = verb past 3sg masc verb past 3sg masc (hoed) score= (hoed) score=0303

08

0467

0533

normalization

Learning the Rules from a training Learning the Rules from a training corpuscorpus

Input A training corpus where each word is Input A training corpus where each word is correctly analyzedcorrectly analyzed

Run the word stage on the training corpusRun the word stage on the training corpus Generate all possible rulesGenerate all possible rules For each rule set the correction factor to be the For each rule set the correction factor to be the

minimum value that does more good than minimum value that does more good than damagedamage

Choose the rule that does the maximum benefitChoose the rule that does the maximum benefit Repeat until no rule improves the overall Repeat until no rule improves the overall

analyses of the training corpusanalyses of the training corpus

Evaluation and ComplexityEvaluation and Complexity

Training corpus 4892 word tokensTraining corpus 4892 word tokenslearned 93 ruleslearned 93 ruleserrors 145 errors 145 62 62

Complexity of the learning algorithm Complexity of the learning algorithm O(O(cc33) where ) where cc = size of the training = size of the training corpuscorpus

Complexity of the correction Complexity of the correction OO((rrnn) ) where where r = r = number of rulesnumber of rules n = n = size of trial textsize of trial text

The sentence stageThe sentence stage

Use a syntactic parser to rule out Use a syntactic parser to rule out improbable analysesimprobable analyses

The pair stage ndash adjacent words The pair stage ndash adjacent words the sentence stage ndash long term the sentence stage ndash long term dependencies dependencies

ExampleExample

מורה הכיתה הנמוכה נכנס לכיתהמורה הכיתה הנמוכה נכנס לכיתה MWRHMWRH HKITH HNMWKH NKNS LKITH HKITH HNMWKH NKNS LKITH

moremoramoremora ha-kitta ha-nmuka ha-kitta ha-nmuka niknasniknas hellip hellip

mascfem verb-mascmascfem verb-masc

more ha-kitta ha-nmuka niknasmore ha-kitta ha-nmuka niknas hellip hellip

Score of a syntax treeScore of a syntax tree

PREP NN J

S

NP VP

V COMPN COMP

more ha-kitta ha-nmuka niknas la-kitta

score(s) = score(more)score(ha-kitta) hellip score(la-kitta)

The challenge calculate the score of all syntax trees without enumerating all trees

Dynamic ProgrammingDynamic Programming

TableTable[[ijAijA] = the maximum score of all ] = the maximum score of all parsesparses

Fill table by incrasing values of Fill table by incrasing values of

i jA w w

max and im im im iTable i i A s A t G t T

[ ] max [ ] [ 1 ]A BC Gi k j

Table i j A Table i k B Table k j C

0

Time complexity 3O G n

0j i

EvaluationEvaluation

53

147

38

362120

14

Word Stage

Pair Stage

Sentence Stage

error rate

POS and Morphological POS and Morphological disambiguationdisambiguation

jhjh

Three stagesThree stages

11 Word stage ndashWord stage ndash find the most probable find the most probable reading of a word regardless of its contextreading of a word regardless of its context

22 Pair stagePair stage ndashcorrect the analysis of a word ndashcorrect the analysis of a word based on the analysis of its immediate based on the analysis of its immediate neighbors neighbors

33 Sentence stage ndashSentence stage ndash use a syntactic parser to use a syntactic parser to rule out improbable analysesrule out improbable analyses

Combining all three stages yielded the best results

The Word StageThe Word Stage Give each word its most probable analysisGive each word its most probable analysis

How to estimate the probability of each How to estimate the probability of each analysisanalysis

Estimate the probability of each analysis Estimate the probability of each analysis from a large analyzed corpusfrom a large analyzed corpus

A large enough corpus does not existA large enough corpus does not exist

Since each word has many forms the Since each word has many forms the number of word tokens is so large that many number of word tokens is so large that many word forms wonrsquot appear even in 10M word word forms wonrsquot appear even in 10M word corpuscorpus

The Word StageThe Word Stage Following the ldquoSimilar Words MethodrdquoFollowing the ldquoSimilar Words Methodrdquo

(Levinger Ornan and Itai 1995) estimate the (Levinger Ornan and Itai 1995) estimate the probability of each analysis of an ambiguous word probability of each analysis of an ambiguous word by changing a (single) feature of each analysis and by changing a (single) feature of each analysis and comparing the occurrences of the resultant words in comparing the occurrences of the resultant words in a large corpusa large corpus

ExampleExample HQPH HQPH הקפההקפה the coffee definite to indefinite the coffee definite to indefinite QPHQPH encirclement indefinite to definite encirclement indefinite to definite HHQPHHHQPH her perimeter feminine possessive to masculine her perimeter feminine possessive to masculine

possessive possessive HQPWHQPW Distribution QPH=180 HHQFH=18 HQPW=2Distribution QPH=180 HHQFH=18 HQPW=2

Our variation of the SW methodOur variation of the SW method

To overcome sparseness we assumed To overcome sparseness we assumed that the lemma and the other that the lemma and the other morphemeslinguistic features are morphemeslinguistic features are statistically independentstatistically independentNamely Namely P(the coffee) = P(the)P(the coffee) = P(the)P(coffee)P(coffee)

Even though the assumption is not Even though the assumption is not valid the resultant ranking is correctvalid the resultant ranking is correct

Evaluation and ComplexityEvaluation and Complexity

Errors 36 Errors 36 145 145

Complexity of algorithm O(Complexity of algorithm O(nn) where ) where nn is the size of the corpusis the size of the corpus

Keeping a copy of the corpus as an Keeping a copy of the corpus as an inverse file reduces the complexity to inverse file reduces the complexity to linear in the number of different linear in the number of different similar words similar words

The pair stage The pair stage

Following Brill we learned correction Following Brill we learned correction rules from a corpus rules from a corpus

The initial The initial morphological scoremorphological score of an of an analysis is its probability as obtained analysis is its probability as obtained at the word stageat the word stage

Correction rules modify the scores by Correction rules modify the scores by considering pairs of adjacent words considering pairs of adjacent words checking if the rule applies and if so checking if the rule applies and if so modify the scoresmodify the scores

Example of a correction ruleExample of a correction rule

If If the POS of the current tag of w the POS of the current tag of w11 is a is a proper-noun proper-noun

andand the POS of the current tag of w the POS of the current tag of w22 is is a noun a noun

and and ww22 has an analysis as a verb that has an analysis as a verb that

matches wmatches w11 by gender and number by gender and number

thenthen add 05 to the morphological add 05 to the morphological score of wscore of w22 as a verb and normalize as a verb and normalize the scores the scores

ExampleExample

YWSP ampDR YWSP ampDR יוסףיוסף עדרעדר

YWSPYWSP == proper noun masc(Joseph)proper noun masc(Joseph)

ampDR =ampDR = noun masc sg abs indefnoun masc sg abs indef (herd) score=07 (herd) score=07

ampDR =ampDR = verb past 3sg masc verb past 3sg masc (hoed) score= (hoed) score=0303

08

0467

0533

normalization

Learning the Rules from a training Learning the Rules from a training corpuscorpus

Input A training corpus where each word is Input A training corpus where each word is correctly analyzedcorrectly analyzed

Run the word stage on the training corpusRun the word stage on the training corpus Generate all possible rulesGenerate all possible rules For each rule set the correction factor to be the For each rule set the correction factor to be the

minimum value that does more good than minimum value that does more good than damagedamage

Choose the rule that does the maximum benefitChoose the rule that does the maximum benefit Repeat until no rule improves the overall Repeat until no rule improves the overall

analyses of the training corpusanalyses of the training corpus

Evaluation and ComplexityEvaluation and Complexity

Training corpus 4892 word tokensTraining corpus 4892 word tokenslearned 93 ruleslearned 93 ruleserrors 145 errors 145 62 62

Complexity of the learning algorithm Complexity of the learning algorithm O(O(cc33) where ) where cc = size of the training = size of the training corpuscorpus

Complexity of the correction Complexity of the correction OO((rrnn) ) where where r = r = number of rulesnumber of rules n = n = size of trial textsize of trial text

The sentence stageThe sentence stage

Use a syntactic parser to rule out Use a syntactic parser to rule out improbable analysesimprobable analyses

The pair stage ndash adjacent words The pair stage ndash adjacent words the sentence stage ndash long term the sentence stage ndash long term dependencies dependencies

ExampleExample

מורה הכיתה הנמוכה נכנס לכיתהמורה הכיתה הנמוכה נכנס לכיתה MWRHMWRH HKITH HNMWKH NKNS LKITH HKITH HNMWKH NKNS LKITH

moremoramoremora ha-kitta ha-nmuka ha-kitta ha-nmuka niknasniknas hellip hellip

mascfem verb-mascmascfem verb-masc

more ha-kitta ha-nmuka niknasmore ha-kitta ha-nmuka niknas hellip hellip

Score of a syntax treeScore of a syntax tree

PREP NN J

S

NP VP

V COMPN COMP

more ha-kitta ha-nmuka niknas la-kitta

score(s) = score(more)score(ha-kitta) hellip score(la-kitta)

The challenge calculate the score of all syntax trees without enumerating all trees

Dynamic ProgrammingDynamic Programming

TableTable[[ijAijA] = the maximum score of all ] = the maximum score of all parsesparses

Fill table by incrasing values of Fill table by incrasing values of

i jA w w

max and im im im iTable i i A s A t G t T

[ ] max [ ] [ 1 ]A BC Gi k j

Table i j A Table i k B Table k j C

0

Time complexity 3O G n

0j i

EvaluationEvaluation

53

147

38

362120

14

Word Stage

Pair Stage

Sentence Stage

error rate

Three stagesThree stages

11 Word stage ndashWord stage ndash find the most probable find the most probable reading of a word regardless of its contextreading of a word regardless of its context

22 Pair stagePair stage ndashcorrect the analysis of a word ndashcorrect the analysis of a word based on the analysis of its immediate based on the analysis of its immediate neighbors neighbors

33 Sentence stage ndashSentence stage ndash use a syntactic parser to use a syntactic parser to rule out improbable analysesrule out improbable analyses

Combining all three stages yielded the best results

The Word StageThe Word Stage Give each word its most probable analysisGive each word its most probable analysis

How to estimate the probability of each How to estimate the probability of each analysisanalysis

Estimate the probability of each analysis Estimate the probability of each analysis from a large analyzed corpusfrom a large analyzed corpus

A large enough corpus does not existA large enough corpus does not exist

Since each word has many forms the Since each word has many forms the number of word tokens is so large that many number of word tokens is so large that many word forms wonrsquot appear even in 10M word word forms wonrsquot appear even in 10M word corpuscorpus

The Word StageThe Word Stage Following the ldquoSimilar Words MethodrdquoFollowing the ldquoSimilar Words Methodrdquo

(Levinger Ornan and Itai 1995) estimate the (Levinger Ornan and Itai 1995) estimate the probability of each analysis of an ambiguous word probability of each analysis of an ambiguous word by changing a (single) feature of each analysis and by changing a (single) feature of each analysis and comparing the occurrences of the resultant words in comparing the occurrences of the resultant words in a large corpusa large corpus

ExampleExample HQPH HQPH הקפההקפה the coffee definite to indefinite the coffee definite to indefinite QPHQPH encirclement indefinite to definite encirclement indefinite to definite HHQPHHHQPH her perimeter feminine possessive to masculine her perimeter feminine possessive to masculine

possessive possessive HQPWHQPW Distribution QPH=180 HHQFH=18 HQPW=2Distribution QPH=180 HHQFH=18 HQPW=2

Our variation of the SW methodOur variation of the SW method

To overcome sparseness we assumed To overcome sparseness we assumed that the lemma and the other that the lemma and the other morphemeslinguistic features are morphemeslinguistic features are statistically independentstatistically independentNamely Namely P(the coffee) = P(the)P(the coffee) = P(the)P(coffee)P(coffee)

Even though the assumption is not Even though the assumption is not valid the resultant ranking is correctvalid the resultant ranking is correct

Evaluation and ComplexityEvaluation and Complexity

Errors 36 Errors 36 145 145

Complexity of algorithm O(Complexity of algorithm O(nn) where ) where nn is the size of the corpusis the size of the corpus

Keeping a copy of the corpus as an Keeping a copy of the corpus as an inverse file reduces the complexity to inverse file reduces the complexity to linear in the number of different linear in the number of different similar words similar words

The pair stage The pair stage

Following Brill we learned correction Following Brill we learned correction rules from a corpus rules from a corpus

The initial The initial morphological scoremorphological score of an of an analysis is its probability as obtained analysis is its probability as obtained at the word stageat the word stage

Correction rules modify the scores by Correction rules modify the scores by considering pairs of adjacent words considering pairs of adjacent words checking if the rule applies and if so checking if the rule applies and if so modify the scoresmodify the scores

Example of a correction ruleExample of a correction rule

If If the POS of the current tag of w the POS of the current tag of w11 is a is a proper-noun proper-noun

andand the POS of the current tag of w the POS of the current tag of w22 is is a noun a noun

and and ww22 has an analysis as a verb that has an analysis as a verb that

matches wmatches w11 by gender and number by gender and number

thenthen add 05 to the morphological add 05 to the morphological score of wscore of w22 as a verb and normalize as a verb and normalize the scores the scores

ExampleExample

YWSP ampDR YWSP ampDR יוסףיוסף עדרעדר

YWSPYWSP == proper noun masc(Joseph)proper noun masc(Joseph)

ampDR =ampDR = noun masc sg abs indefnoun masc sg abs indef (herd) score=07 (herd) score=07

ampDR =ampDR = verb past 3sg masc verb past 3sg masc (hoed) score= (hoed) score=0303

08

0467

0533

normalization

Learning the Rules from a training Learning the Rules from a training corpuscorpus

Input A training corpus where each word is Input A training corpus where each word is correctly analyzedcorrectly analyzed

Run the word stage on the training corpusRun the word stage on the training corpus Generate all possible rulesGenerate all possible rules For each rule set the correction factor to be the For each rule set the correction factor to be the

minimum value that does more good than minimum value that does more good than damagedamage

Choose the rule that does the maximum benefitChoose the rule that does the maximum benefit Repeat until no rule improves the overall Repeat until no rule improves the overall

analyses of the training corpusanalyses of the training corpus

Evaluation and ComplexityEvaluation and Complexity

Training corpus 4892 word tokensTraining corpus 4892 word tokenslearned 93 ruleslearned 93 ruleserrors 145 errors 145 62 62

Complexity of the learning algorithm Complexity of the learning algorithm O(O(cc33) where ) where cc = size of the training = size of the training corpuscorpus

Complexity of the correction Complexity of the correction OO((rrnn) ) where where r = r = number of rulesnumber of rules n = n = size of trial textsize of trial text

The sentence stageThe sentence stage

Use a syntactic parser to rule out Use a syntactic parser to rule out improbable analysesimprobable analyses

The pair stage ndash adjacent words The pair stage ndash adjacent words the sentence stage ndash long term the sentence stage ndash long term dependencies dependencies

ExampleExample

מורה הכיתה הנמוכה נכנס לכיתהמורה הכיתה הנמוכה נכנס לכיתה MWRHMWRH HKITH HNMWKH NKNS LKITH HKITH HNMWKH NKNS LKITH

moremoramoremora ha-kitta ha-nmuka ha-kitta ha-nmuka niknasniknas hellip hellip

mascfem verb-mascmascfem verb-masc

more ha-kitta ha-nmuka niknasmore ha-kitta ha-nmuka niknas hellip hellip

Score of a syntax treeScore of a syntax tree

PREP NN J

S

NP VP

V COMPN COMP

more ha-kitta ha-nmuka niknas la-kitta

score(s) = score(more)score(ha-kitta) hellip score(la-kitta)

The challenge calculate the score of all syntax trees without enumerating all trees

Dynamic ProgrammingDynamic Programming

TableTable[[ijAijA] = the maximum score of all ] = the maximum score of all parsesparses

Fill table by incrasing values of Fill table by incrasing values of

i jA w w

max and im im im iTable i i A s A t G t T

[ ] max [ ] [ 1 ]A BC Gi k j

Table i j A Table i k B Table k j C

0

Time complexity 3O G n

0j i

EvaluationEvaluation

53

147

38

362120

14

Word Stage

Pair Stage

Sentence Stage

error rate

The Word StageThe Word Stage Give each word its most probable analysisGive each word its most probable analysis

How to estimate the probability of each How to estimate the probability of each analysisanalysis

Estimate the probability of each analysis Estimate the probability of each analysis from a large analyzed corpusfrom a large analyzed corpus

A large enough corpus does not existA large enough corpus does not exist

Since each word has many forms the Since each word has many forms the number of word tokens is so large that many number of word tokens is so large that many word forms wonrsquot appear even in 10M word word forms wonrsquot appear even in 10M word corpuscorpus

The Word StageThe Word Stage Following the ldquoSimilar Words MethodrdquoFollowing the ldquoSimilar Words Methodrdquo

(Levinger Ornan and Itai 1995) estimate the (Levinger Ornan and Itai 1995) estimate the probability of each analysis of an ambiguous word probability of each analysis of an ambiguous word by changing a (single) feature of each analysis and by changing a (single) feature of each analysis and comparing the occurrences of the resultant words in comparing the occurrences of the resultant words in a large corpusa large corpus

ExampleExample HQPH HQPH הקפההקפה the coffee definite to indefinite the coffee definite to indefinite QPHQPH encirclement indefinite to definite encirclement indefinite to definite HHQPHHHQPH her perimeter feminine possessive to masculine her perimeter feminine possessive to masculine

possessive possessive HQPWHQPW Distribution QPH=180 HHQFH=18 HQPW=2Distribution QPH=180 HHQFH=18 HQPW=2

Our variation of the SW methodOur variation of the SW method

To overcome sparseness we assumed To overcome sparseness we assumed that the lemma and the other that the lemma and the other morphemeslinguistic features are morphemeslinguistic features are statistically independentstatistically independentNamely Namely P(the coffee) = P(the)P(the coffee) = P(the)P(coffee)P(coffee)

Even though the assumption is not Even though the assumption is not valid the resultant ranking is correctvalid the resultant ranking is correct

Evaluation and ComplexityEvaluation and Complexity

Errors 36 Errors 36 145 145

Complexity of algorithm O(Complexity of algorithm O(nn) where ) where nn is the size of the corpusis the size of the corpus

Keeping a copy of the corpus as an Keeping a copy of the corpus as an inverse file reduces the complexity to inverse file reduces the complexity to linear in the number of different linear in the number of different similar words similar words

The pair stage The pair stage

Following Brill we learned correction Following Brill we learned correction rules from a corpus rules from a corpus

The initial The initial morphological scoremorphological score of an of an analysis is its probability as obtained analysis is its probability as obtained at the word stageat the word stage

Correction rules modify the scores by Correction rules modify the scores by considering pairs of adjacent words considering pairs of adjacent words checking if the rule applies and if so checking if the rule applies and if so modify the scoresmodify the scores

Example of a correction ruleExample of a correction rule

If If the POS of the current tag of w the POS of the current tag of w11 is a is a proper-noun proper-noun

andand the POS of the current tag of w the POS of the current tag of w22 is is a noun a noun

and and ww22 has an analysis as a verb that has an analysis as a verb that

matches wmatches w11 by gender and number by gender and number

thenthen add 05 to the morphological add 05 to the morphological score of wscore of w22 as a verb and normalize as a verb and normalize the scores the scores

ExampleExample

YWSP ampDR YWSP ampDR יוסףיוסף עדרעדר

YWSPYWSP == proper noun masc(Joseph)proper noun masc(Joseph)

ampDR =ampDR = noun masc sg abs indefnoun masc sg abs indef (herd) score=07 (herd) score=07

ampDR =ampDR = verb past 3sg masc verb past 3sg masc (hoed) score= (hoed) score=0303

08

0467

0533

normalization

Learning the Rules from a training Learning the Rules from a training corpuscorpus

Input A training corpus where each word is Input A training corpus where each word is correctly analyzedcorrectly analyzed

Run the word stage on the training corpusRun the word stage on the training corpus Generate all possible rulesGenerate all possible rules For each rule set the correction factor to be the For each rule set the correction factor to be the

minimum value that does more good than minimum value that does more good than damagedamage

Choose the rule that does the maximum benefitChoose the rule that does the maximum benefit Repeat until no rule improves the overall Repeat until no rule improves the overall

analyses of the training corpusanalyses of the training corpus

Evaluation and ComplexityEvaluation and Complexity

Training corpus 4892 word tokensTraining corpus 4892 word tokenslearned 93 ruleslearned 93 ruleserrors 145 errors 145 62 62

Complexity of the learning algorithm Complexity of the learning algorithm O(O(cc33) where ) where cc = size of the training = size of the training corpuscorpus

Complexity of the correction Complexity of the correction OO((rrnn) ) where where r = r = number of rulesnumber of rules n = n = size of trial textsize of trial text

The sentence stageThe sentence stage

Use a syntactic parser to rule out Use a syntactic parser to rule out improbable analysesimprobable analyses

The pair stage ndash adjacent words The pair stage ndash adjacent words the sentence stage ndash long term the sentence stage ndash long term dependencies dependencies

ExampleExample

מורה הכיתה הנמוכה נכנס לכיתהמורה הכיתה הנמוכה נכנס לכיתה MWRHMWRH HKITH HNMWKH NKNS LKITH HKITH HNMWKH NKNS LKITH

moremoramoremora ha-kitta ha-nmuka ha-kitta ha-nmuka niknasniknas hellip hellip

mascfem verb-mascmascfem verb-masc

more ha-kitta ha-nmuka niknasmore ha-kitta ha-nmuka niknas hellip hellip

Score of a syntax treeScore of a syntax tree

PREP NN J

S

NP VP

V COMPN COMP

more ha-kitta ha-nmuka niknas la-kitta

score(s) = score(more)score(ha-kitta) hellip score(la-kitta)

The challenge calculate the score of all syntax trees without enumerating all trees

Dynamic ProgrammingDynamic Programming

TableTable[[ijAijA] = the maximum score of all ] = the maximum score of all parsesparses

Fill table by incrasing values of Fill table by incrasing values of

i jA w w

max and im im im iTable i i A s A t G t T

[ ] max [ ] [ 1 ]A BC Gi k j

Table i j A Table i k B Table k j C

0

Time complexity 3O G n

0j i

EvaluationEvaluation

53

147

38

362120

14

Word Stage

Pair Stage

Sentence Stage

error rate

The Word StageThe Word Stage Following the ldquoSimilar Words MethodrdquoFollowing the ldquoSimilar Words Methodrdquo

(Levinger Ornan and Itai 1995) estimate the (Levinger Ornan and Itai 1995) estimate the probability of each analysis of an ambiguous word probability of each analysis of an ambiguous word by changing a (single) feature of each analysis and by changing a (single) feature of each analysis and comparing the occurrences of the resultant words in comparing the occurrences of the resultant words in a large corpusa large corpus

ExampleExample HQPH HQPH הקפההקפה the coffee definite to indefinite the coffee definite to indefinite QPHQPH encirclement indefinite to definite encirclement indefinite to definite HHQPHHHQPH her perimeter feminine possessive to masculine her perimeter feminine possessive to masculine

possessive possessive HQPWHQPW Distribution QPH=180 HHQFH=18 HQPW=2Distribution QPH=180 HHQFH=18 HQPW=2

Our variation of the SW methodOur variation of the SW method

To overcome sparseness we assumed To overcome sparseness we assumed that the lemma and the other that the lemma and the other morphemeslinguistic features are morphemeslinguistic features are statistically independentstatistically independentNamely Namely P(the coffee) = P(the)P(the coffee) = P(the)P(coffee)P(coffee)

Even though the assumption is not Even though the assumption is not valid the resultant ranking is correctvalid the resultant ranking is correct

Evaluation and ComplexityEvaluation and Complexity

Errors 36 Errors 36 145 145

Complexity of algorithm O(Complexity of algorithm O(nn) where ) where nn is the size of the corpusis the size of the corpus

Keeping a copy of the corpus as an Keeping a copy of the corpus as an inverse file reduces the complexity to inverse file reduces the complexity to linear in the number of different linear in the number of different similar words similar words

The pair stage The pair stage

Following Brill we learned correction Following Brill we learned correction rules from a corpus rules from a corpus

The initial The initial morphological scoremorphological score of an of an analysis is its probability as obtained analysis is its probability as obtained at the word stageat the word stage

Correction rules modify the scores by Correction rules modify the scores by considering pairs of adjacent words considering pairs of adjacent words checking if the rule applies and if so checking if the rule applies and if so modify the scoresmodify the scores

Example of a correction ruleExample of a correction rule

If If the POS of the current tag of w the POS of the current tag of w11 is a is a proper-noun proper-noun

andand the POS of the current tag of w the POS of the current tag of w22 is is a noun a noun

and and ww22 has an analysis as a verb that has an analysis as a verb that

matches wmatches w11 by gender and number by gender and number

thenthen add 05 to the morphological add 05 to the morphological score of wscore of w22 as a verb and normalize as a verb and normalize the scores the scores

ExampleExample

YWSP ampDR YWSP ampDR יוסףיוסף עדרעדר

YWSPYWSP == proper noun masc(Joseph)proper noun masc(Joseph)

ampDR =ampDR = noun masc sg abs indefnoun masc sg abs indef (herd) score=07 (herd) score=07

ampDR =ampDR = verb past 3sg masc verb past 3sg masc (hoed) score= (hoed) score=0303

08

0467

0533

normalization

Learning the Rules from a training Learning the Rules from a training corpuscorpus

Input A training corpus where each word is Input A training corpus where each word is correctly analyzedcorrectly analyzed

1. Run the word stage on the training corpus.
2. Generate all possible rules.
3. For each rule, set the correction factor to the minimum value that does more good than damage.
4. Choose the rule that yields the maximum benefit.
5. Repeat until no rule improves the overall analyses of the training corpus.
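A sketch of this greedy loop, assuming hypothetical helpers generate_rules and errors_after for the rule templates and for counting errors against the correctly analyzed corpus; the per-rule search for the correction factor is elided:

```python
# Greedy Brill-style rule learning: repeatedly add the single rule that
# reduces the training-corpus error count the most, until none helps.
def learn_rules(corpus, generate_rules, errors_after, max_rules=100):
    rules = []
    current_errors = errors_after(corpus, rules)
    while len(rules) < max_rules:
        best_rule, best_gain = None, 0
        for rule in generate_rules(corpus, rules):
            gain = current_errors - errors_after(corpus, rules + [rule])
            if gain > best_gain:           # does more good than damage
                best_rule, best_gain = rule, gain
        if best_rule is None:              # no rule improves the analyses
            break
        rules.append(best_rule)
        current_errors -= best_gain
    return rules
```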

Evaluation and Complexity

Training corpus: 4,892 word tokens; 93 rules learned; errors: 14.5% → 6.2%.

Complexity of the learning algorithm: O(c³), where c = size of the training corpus.

Complexity of the correction: O(r·n), where r = number of rules and n = size of the trial text.

The sentence stage

Use a syntactic parser to rule out improbable analyses.

The pair stage captures adjacent words; the sentence stage captures long-distance dependencies.

Example

מורה הכיתה הנמוכה נכנס לכיתה — MWRH HKITH HNMWKH NKNS LKITH ("the teacher of the low class entered the classroom")

more/mora ha-kitta ha-nmuka niknas …
MWRH is ambiguous between more (masc.) and mora (fem.), but niknas is a masculine verb, so only the masculine reading survives:
more ha-kitta ha-nmuka niknas …

Score of a syntax tree

[Figure: parse tree of "more ha-kitta ha-nmuka niknas la-kitta" — S → NP VP, with node labels including NP, VP, N, V, J, PREP, COMP]

score(s) = score(more) · score(ha-kitta) · … · score(la-kitta)

The challenge: calculate the score of all syntax trees without enumerating all the trees.

Dynamic Programming

Table[i, j, A] = the maximum score of any parse of w_i … w_j as category A.

Base case: Table[i, i, A] = max { score(t) : A → t ∈ G and t is an analysis of w_i }.

Fill the table by increasing values of j − i:

Table[i, j, A] = max { Table[i, k, B] · Table[k+1, j, C] : A → B C ∈ G, i ≤ k < j }

Time complexity: O(|G| · n³).
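A sketch of this dynamic program as a Viterbi-style CKY over a binary grammar; grammar (category → binary rules) and word_scores (per-word analysis scores from the earlier stages) are assumed inputs, not the paper's actual interfaces:

```python
# Viterbi-style CKY: table[(i, j, A)] holds the maximum score of any parse
# of words i..j as category A, filled by increasing span length j - i.
from collections import defaultdict

def best_parse_scores(n, grammar, word_scores):
    # grammar: dict A -> list of (B, C) for binary rules A -> B C
    # word_scores: list of dicts, word_scores[i][A] = score of w_i as A
    table = defaultdict(float)
    for i in range(n):                       # base case: single words
        for A, s in word_scores[i].items():
            table[(i, i, A)] = s
    for span in range(1, n):                 # increasing j - i
        for i in range(n - span):
            j = i + span
            for A, rules in grammar.items():
                for B, C in rules:
                    for k in range(i, j):    # split point
                        cand = table[(i, k, B)] * table[(k + 1, j, C)]
                        if cand > table[(i, j, A)]:
                            table[(i, j, A)] = cand
    return table                             # e.g. table[(0, n - 1, "S")]
```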

Evaluation

[Figure: bar chart of the error rate after the Word Stage, Pair Stage, and Sentence Stage]
