53
My Grad School Experience By Shereen Khoja

My Grad School Experience

  • Upload
    uriel

  • View
    29

  • Download
    0

Embed Size (px)

DESCRIPTION

My Grad School Experience. By Shereen Khoja. Introduction. I will talk about: Research for my M.Sc. Research for my Ph.D. This will involve talking about: Stemming Tagging The Arabic Language. M.Sc. I started my M.Sc at the University of Essex in 1997. University of Essex. - PowerPoint PPT Presentation

Citation preview

Page 1: My Grad School Experience

My Grad School Experience

By

Shereen Khoja

Page 2: My Grad School Experience

February 7, 2005

Introduction

• I will talk about:– Research for my M.Sc.– Research for my Ph.D.

• This will involve talking about:– Stemming– Tagging– The Arabic Language

Page 3: My Grad School Experience

February 7, 2005

M.Sc

• I started my M.Sc at the University of Essex in 1997

Page 4: My Grad School Experience

February 7, 2005

University of Essex

• The University of Essex is located in Colchester in the south east of England

Page 5: My Grad School Experience

February 7, 2005

Colchester

• Colchester is Britain’s oldest recorded city

Page 6: My Grad School Experience

February 7, 2005

The University of Essex

• The university has 9,100 students, 25% in the graduate programs

• The computer science department has 34 faculty members

• There were 120 students on the M.Sc. program

Page 7: My Grad School Experience

February 7, 2005

M.Sc. Courses

• I completed 7 courses while I was at the University of Essex– Computer Networks– Computer Vision– Expert Systems– Distributed AI & Artificial Life– Machine Learning– Neural Networks– Natural Language Processing

Page 8: My Grad School Experience

February 7, 2005

M.Sc Research• I decided to do my research in the area of Natural

Language Processing

• Natural Language Processing is a broad area of AI that focuses on how computers process language

• NLP has many sub-areas such as:– Computational Linguistics– Speech Synthesis– Speech Recognition– Information Retrieval

Page 9: My Grad School Experience

February 7, 2005

M.Sc Research

• My M.Sc research was conducted under the supervision of Professor Anne De Roeck

• She was already supervising a Ph.D student who was working on an Arabic language system

• It was decided that I would work on an Arabic language stemmer

Page 10: My Grad School Experience

February 7, 2005

Stemming

• What is stemming?– Stemming is the process of removing a

words’ prefixes and suffixes to extract the root or stem

– Computers– Compute– Computing

Page 11: My Grad School Experience

February 7, 2005

Uses of Stemming

• Compression

• Text Searching

• Spell Checking

• Text Analysis

Page 12: My Grad School Experience

February 7, 2005

Arabic Language

• Arabic is an old language

• The language hasn’t changed much in 1400 years

• Arabic is written from right to left

Page 13: My Grad School Experience

February 7, 2005

Arabic Language

• Arabic is a cursive language

• The shape of the letters change depending on whether they are at the beginning, the middle or the end of the word

Page 14: My Grad School Experience

February 7, 2005

Arabic Language

• 28 consonants in Arabic

• 3 of these are used as long vowels

• A number of short vowels or diacritics– drs– darasa– durisa– dars

Page 15: My Grad School Experience

February 7, 2005

The Arabic Language

• Arabic is a Semitic language, so words are built up from, and can be analysed down to roots following fixed patterns

• Patterns add prefixes, suffixes and infixes to the roots.

• Examples of words following the pattern Ma12oo3:– ktb maktoob– drs madroos

Page 16: My Grad School Experience

February 7, 2005

The Arabic Stemmer

• I developed an Arabic stemmer

• The stemmer used language rules

• These rules are based on the Arabic grammar, which hasn’t changed for 1,400 years

• The stemmer was developed in Visual C++

Page 17: My Grad School Experience

February 7, 2005

Difficulties

• Words that do not have roots

• Root letters that are deleted

• Root letters that change

Page 18: My Grad School Experience

February 7, 2005

The Arabic Stemmer

• I released the stemmer under the GNU public licence

• The stemmer has been used by the following:– The University of Massachusetts– MitoSystems

Page 19: My Grad School Experience

February 7, 2005

Ph.D.

• I began my Ph.D. research at Lancaster University in 1998

Page 20: My Grad School Experience

February 7, 2005

Lancaster University

• Lancaster University is located in the city of Lancaster in the north west of England

Page 21: My Grad School Experience

February 7, 2005

Lancaster

• Lancaster also has a castle

• This castle is used as a prison

Page 22: My Grad School Experience

February 7, 2005

Corpus Linguistics

• For the last 25 years, professors at Lancaster University have been conducting research in the area of computer corpus linguistics

Page 23: My Grad School Experience

February 7, 2005

Computer Corpus Linguistics

• Computer Corpus Linguistics is the sub-discipline of Computational Linguistics that utilises large quantities of texts to heighten the understanding of linguistic phenomena.

• A corpus is a collection of texts that has been assembled for linguistic analysis

Page 24: My Grad School Experience

February 7, 2005

Annotated Corpora

• Corpora are not much use to linguists in their raw form

• Annotated corpora are richer and more useful

Page 25: My Grad School Experience

February 7, 2005

Uses of Corpora

• Speech Synthesis

• Speech Recognition

• Lexicography

• Machine-aided Translation

• Information Retrieval

• EFL

Page 26: My Grad School Experience

February 7, 2005

Types of Annotation

[[S[NP Pierre Vinken],[ADJP[NP 61 years] old,]]will[VP join[NP the board][PP as[NP a nonexecutive director]][NP Nov.29]]].]

• Grammatical Annotation

• Prosodic Annotation

• Syntactic Annotation• Semantic Annotation

The_AT0 cat_NN1 sat_VVD on_PRP the_AT0 chair_NN1

Joanna 231112stubbed 21072-31246out 21072-31246her 0cigarette 2111014with 0unnecessary 317fierceness 227052

231112 Personal Name21072 Physical Activity31246 Temperature0 Low Content Word2111014 Luxury Items317 Causality / Chance227052 Anger

Page 27: My Grad School Experience

February 7, 2005

Part-of-Speech Tagging

• POS tagging is the process of automatically assigning POS tags to words in running text

• An example:– Where are you from?– Where_AVQ are_VBB you_PNP from_PRP ?_PUN

• POS taggers have been developed for English and other Indo-European with varying degrees of success

Page 28: My Grad School Experience

February 7, 2005

Techniques Used in POS Tagging• Statistical Techniques

• Rule-Based Techniques

• Machine Learning

• Neural Networks

Page 29: My Grad School Experience

February 7, 2005

Arabic Corpus Linguistics

• Arabic Corpora– No publicly available Arabic corpora– No Arabic POS tagger

• Arabic POS Taggers– POS taggers are being developed at New

Mexico State University and Alexandria University

Page 30: My Grad School Experience

February 7, 2005

Developing the Arabic POS Tagger• Developing a manual tagger

– Used to create a training corpus

• Defining an Arabic tagset

• Developing the automatic Arabic POS Tagger

Page 31: My Grad School Experience

February 7, 2005

The Arabic Tagset

• Follows the tagging system that has been used for fourteen centuries

• Noun, verb and particle

• All subclasses of these three tags inherit properties from the parent class

• Many subclasses are described in Arabic such as nouns of instrument and nouns of place and time

Page 32: My Grad School Experience

PunctuationResidualParti

cleVerbNoun

Word

Page 33: My Grad School Experience

AdjectiveCommon Proper Pronoun Numeral

PunctuationResidualParti

cleVerbNoun

Word

Page 34: My Grad School Experience

DemonstrativePersonal Rel

ative

AdjectiveCommon Proper Pronoun Numeral

PunctuationResidualParti

cleVerbNoun

Word

Page 35: My Grad School Experience

Specific Common

DemonstrativePersonal Rel

ative

AdjectiveCommon Proper Pronoun Numeral

PunctuationResidualParti

cleVerbNoun

Word

Page 36: My Grad School Experience

Numerical

AdjectiveCardinal

Ordinal

AdjectiveCommon Proper Pronoun Numeral

PunctuationResidualParti

cleVerbNoun

Word

Page 37: My Grad School Experience

ImperativePerfect Imperfect

PunctuationResidualParti

cleVerbNoun

Word

Page 38: My Grad School Experience

Prepositions Explanations AnswersSubordinates Adverbi

al

PunctuationResidualParti

cleVerbNoun

Word

Page 39: My Grad School Experience

Negatives Exceptions Interjections Conjunctions

PunctuationResidualParti

cleVerbNoun

Word

Page 40: My Grad School Experience

Numerals Mathematical

Formulae Foreign

PunctuationResidualParti

cleVerbNoun

Word

Page 41: My Grad School Experience

CommaExclamation

MarkQuestion

Mark

PunctuationResidualParti

cleVerbNoun

Word

Page 42: My Grad School Experience

February 7, 2005

Arabic POS TaggerPlain Arabi

c Text

ManTagTrainin

g Corpus

Lexicons

Probability Matrix

APTUntagged

Arabic Corpus

Tagged

Corpus

DataExtract

Page 43: My Grad School Experience

February 7, 2005

DataExtract Process

• Takes in a tagged corpus and extracts various lexicons and the probability matrix– Lexicon that includes all clitics.

• (Sprout, 1992) defines a clitic as “a syntactically separate word that functions phonologically as an affix”

– Lexicon that removes all clitics before adding the word

Page 44: My Grad School Experience

February 7, 2005

DataExtract Process

• Produces a probability matrix for various levels of the tagset– Lexical probability: probability of a word

having a certain tag– Contextual probability: probability of a tag

following another tag

Page 45: My Grad School Experience

February 7, 2005

DataExtract Process

N V P No. Pu.

N 0.711 0.065 0.143 0.010 0.071

V 0.926 0.037 0.0 0.008 0.029

P 0.689 0.199 0.085 0.016 0.011

No. 0.509 0.06 0.098 0.009 0.324

Pu. 0.492 0.159 0.152 0.046 0.151

Page 46: My Grad School Experience

February 7, 2005

Arabic Corpora

• 59,040 words of the Saudi ``al-Jazirah'' newspaper, dated 03/03/1999

• 3,104 words of the Egyptian ``al-Ahram'' newspaper, date 25/01/2000

• 5,811 words of the Qatari ``al-Bayan'' newspaper, date 25/01/2000

• 17,204 words of al-Mishkat, an Egyptian published paper in social science, April 1999

Page 47: My Grad School Experience

February 7, 2005

APT: Arabic Part-of-speech Tagger

Arabic Words

LexiconLookup

Stemmer

StatisticalComponent

Words with

multiple tags

Words with a unique

tag

Page 48: My Grad School Experience

February 7, 2005

Result of Lexicon Lookup and Stemmer

• The Arabic word fhm– Would be assigned the tag of conjunction and

personal pronoun by the morphological analyser to mean “so, they”

– Would be found in the lexicon as the noun to mean “understand”

– Would be found in the lexicon as the conjunction and verb meaning “so, he prepared”

Page 49: My Grad School Experience

February 7, 2005

w0 w1 w2 w3 w4

t0 t11 t21 t31 t4

t12 t22 t32

t33

Statistical Component

Statistical Component. Hidden Markov Models.

The probability of the words w0-w4 been tagged with t0, t11, t21, t31 and t4 is:

P(w0 | t0) * P(t11 | t0….) * P(w1 | t11 ) * P(t21 | t11….) * P(w2 | t21 ) * P(t31 | t21….) * P(w3 | t31 ) * P(t4 | t31….) * P(w4 | t4 )

Page 50: My Grad School Experience

February 7, 2005

Initial Results11%

3%

22%64%

10%3%

23%64%

17%

1%

22%

60%

21%

2%

18%

59%

Stemmed

Unambiguous

Unknown

Ambiguous

Al-Jazirah

Al-Bayan Al-Mishkat

Al-Ahram

Page 51: My Grad School Experience

February 7, 2005

Other Findings

63%7%

25%

2%

3%

NounsVerbsParticlesPunctuationResidual

Distribution of Tags

Page 52: My Grad School Experience

February 7, 2005

Summary

• For my M.Sc. I developed an Arabic language stemmer

• For my Ph.D. I developed an Arabic part-of-speech tagger– The tagger used the stemmer

Page 53: My Grad School Experience

February 7, 2005

Advice

• Stay focused

• Do not procrastinate

• Make use of all the resources around you

• Write down everything

• More importantly, organise everything that you write down