
CSA3180: Natural Language Processing. Statistics 1: Empirical Approach, Historical Background, Fundamental Issues, Tokenisation and Preprocessing (November 2005)


  • Slide 1
  • CSA3180: Natural Language Processing. Statistics 1: Empirical Approach, Historical Background, Fundamental Issues, Tokenisation and Preprocessing
  • Slide 2
  • Introduction
      • Slides based on lectures by Mike Rosner (2003) and the BNC2 POS Tagging Manual (Leech and Smith, 2000)
      • Foundations of Statistical Natural Language Processing, Manning and Schütze, MIT Press, 1999
      • Resources for statistical/empirical NLP: http://nlp.stanford.edu/links/statnlp.html
      • McEnery & Wilson notes on Corpus Linguistics: http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/contents.htm
  • Slide 3
  • Historical Perspective
      • Pre-Chomsky linguistics (e.g. Boas 1940) was largely empirical
      • 1970s: rationalist approach to AI systems in restricted domains (e.g. Winograd 1972, Woods 1977, Waltz 1978)
      • 1980s: hand-coded grammars and knowledge bases (e.g. Allen 1987)
      • Hand-coded systems need a great deal of domain-specific/expert knowledge engineering
      • Such systems are brittle, unscalable and inflexible
      • Second half of the 1980s: focus shifted from rationalist methods to empirical/corpus-based methods
      • Development largely data driven
  • Slide 4
  • Historical Perspective
      • Linguistics research: automatic induction of lexical and syntactic information from corpora
      • Speech recognition: resulted in Hidden Markov Model (HMM) based methods (IBM Yorktown Heights) that outperformed previous knowledge-based approaches
      • Use of probabilistic finite state machines to model word pronunciations
      • Use of hill-climbing training algorithms to fit model parameters to actual speech data
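The idea of a probabilistic finite state machine for word pronunciations can be sketched as follows. This is a minimal illustration, not the IBM system: the word, the state names, the phone labels and the probabilities are all invented for the example, and the parameters are set by hand rather than fitted to speech data by a training algorithm.

```python
from collections import defaultdict

# Hypothetical pronunciation model for "tomato". States, phone labels and
# probabilities are illustrative assumptions, not values from the lecture.
# Each state maps to a list of (emitted phone, next state, probability).
TRANSITIONS = {
    "start": [("t", "s1", 1.0)],
    "s1": [("ow", "s2", 0.35), ("ax", "s2", 0.65)],
    "s2": [("m", "s3", 1.0)],
    "s3": [("ey", "s4", 0.6), ("aa", "s4", 0.4)],
    "s4": [("t", "s5", 1.0)],
    "s5": [("ow", "end", 1.0)],
}


def sequence_probability(phones, transitions=TRANSITIONS, start="start", end="end"):
    """Return P(phones) under the machine, summing over all matching paths."""
    # `current` maps each reachable state to the total probability of reaching it.
    current = {start: 1.0}
    for phone in phones:
        reached = defaultdict(float)
        for state, prob in current.items():
            for emitted, target, p in transitions.get(state, []):
                if emitted == phone:
                    reached[target] += prob * p
        current = reached
    # Only paths that finish in the end state count as complete pronunciations.
    return current.get(end, 0.0)
```

Calling `sequence_probability(["t", "ax", "m", "ey", "t", "ow"])` multiplies the probabilities along the single matching path (0.65 × 0.6); a phone sequence the machine cannot generate gets probability 0.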
  • Slide 5
  • Application Areas
      • Success of statistical methods in speech spread to other areas like POS tagging, spelling correction, and parsing
      • POS tagging: assigning appropriate syntactic class tags to words
      • Machine translation: training on bilingual corpora to extract word and contextual mappings
      • Parsing: models such as probabilistic CFGs (PCFGs), based on treebanks (large databases of sentences annotated with syntactic parse trees)
      • Word-sense disambiguation: attachment, anaphora resolution, discourse segmentation
      • Content-based document processing:
      • Information extraction: text → filled templates
      • Information retrieval: query text → set of relevant documents
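To make the treebank-based parsing point concrete, here is a small sketch of how PCFG rule probabilities might be estimated by relative frequency from annotated trees. The toy treebank, the tuple encoding of trees and the function names are assumptions made for illustration; real treebanks such as the Penn Treebank use their own bracketed file formats.

```python
from collections import Counter

# Toy treebank: each tree is (label, child, child, ...); leaves are plain strings.
# The sentences and bracketings are invented for illustration.
TREEBANK = [
    ("S", ("NP", "dogs"), ("VP", ("V", "bark"))),
    ("S", ("NP", "cats"), ("VP", ("V", "chase"), ("NP", "dogs"))),
]


def rules(tree):
    """Yield (lhs, rhs) for every internal node of a tree."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield label, rhs
    for child in children:
        if not isinstance(child, str):
            yield from rules(child)


def estimate_pcfg(treebank):
    """Relative-frequency estimate: P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)."""
    rule_counts = Counter(r for tree in treebank for r in rules(tree))
    lhs_counts = Counter()
    for (lhs, _), n in rule_counts.items():
        lhs_counts[lhs] += n
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}


PCFG = estimate_pcfg(TREEBANK)
```

In this toy data `NP` is expanded three times, twice as `dogs`, so `PCFG[("NP", ("dogs",))]` comes out as 2/3. With so few trees the estimates are of course meaningless; the point is only the counting scheme.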
  • Slide 6
  • Empirical Approach: Issues
      • Potential for solutions to old problems: knowledge acquisition, coverage, robustness, domain independence
      • Feasibility depends on data and computing resources
      • Pros: emphasis on applications and evaluation; scalability and applicability to real-life domains
      • Cons: results always corpus dependent
  • Slide 7
  • Corpus: Starting Point
      • A corpus (plural: corpora) is an organised body of language materials used as the basis for empirical studies.
      • Important corpus characteristics:
      • Statistical: representativeness/balance
      • Medium: printed, electronic text, speech, video, images
      • Language: monolingual/multilingual
      • Information content: plain text vs. tagged text
      • Structure: trees vs. sentences
      • Size
      • Standards
      • Quality
  • Slide 8
  • Corpora Examples
      • Project Gutenberg: collection of public domain texts (http://www.gutenberg.org)
      • Brown Corpus: tagged corpus of around 1 million words put together at Brown University in the 1960s and 70s; a balanced corpus of American English
      • British National Corpus: a balanced corpus of British English containing over 100 million words with morphosyntactic annotation (http://www.natcorp.ox.ac.uk)
      • Penn Treebank
      • WordNet
      • Canadian Hansards
      • LDC GigaWord
  • Slide 9
  • Tagset Example: some example POS tags from the BNC (CLAWS4 BNC Basic Tagset/C5 Tagset)
      • AJ0 Adjective (general or positive) (e.g. good, old, beautiful)
      • AJC Comparative adjective (e.g. better, older)
      • AJS Superlative adjective (e.g. best, oldest)
      • AT0 Article (e.g. the, a, an, no)
      • AV0 General adverb: an adverb not subclassified as AVP or AVQ (see below) (e.g. often, well, longer (adv.), furthest)
      • AVP Adverb particle (e.g. up, off, out)
  • Slide 10
  • Tagset Examples: some example POS tags from the BNC (CLAWS4 BNC Basic Tagset/C5 Tagset)
      • AVQ Wh-adverb (e.g. when, where, how, why, wherever)
      • CJC Coordinating conjunction (e.g. and, or, but)
      • CJS Subordinating conjunction (e.g. although, when)
      • CJT The subordinating conjunction "that"
      • CRD Cardinal number (e.g. one, 3, fifty-five, 3609)
      • DPS Possessive determiner-pronoun (e.g. your, their, his)
  • Slide 11
  • Tagset Examples: some example POS tags from the BNC (CLAWS4 BNC Basic Tagset/C5 Tagset)
      • DT0 General determiner-pronoun, i.e. a determiner-pronoun which is not a DTQ or an AT0
      • DTQ Wh-determiner-pronoun (e.g. which, what, whose, whichever)
      • EX0 Existential "there", i.e. "there" occurring in the "there is..." or "there are..." construction
      • ITJ Interjection or other isolate (e.g. oh, yes, mhm, wow)
      • NN0 Common noun, neutral for number (e.g. aircraft, data, committee)
  • Slide 12
  • Tagset Examples: some example POS tags from the BNC (CLAWS4 BNC Basic Tagset/C5 Tagset)
      • NN1 Singular common noun (e.g. pencil, goose, time, revelation)
      • NN2 Plural common noun (e.g. pencils, geese, times, revelations)
      • NP0 Proper noun (e.g. London, Michael, Mars, IBM)
      • ORD Ordinal numeral (e.g. first, sixth, 77th, last)
      • PNI Indefinite pronoun (e.g. none, everything, one [as pronoun], nobody)
      • PNP Personal pronoun (e.g. I, you, them, ours)
  • Slide 13
  • Tagset Examples: some example POS tags from the BNC (CLAWS4 BNC Basic Tagset/C5 Tagset)
      • PNQ Wh-pronoun (e.g. who, whoever, whom)
      • PNX Reflexive pronoun (e.g. myself, yourself, itself, ourselves)
      • POS The possessive or genitive marker 's or '
      • PRF The preposition "of"
      • PRP Preposition (except for "of") (e.g. about, at, in, on, on behalf of, with)
      • PUL Punctuation: left bracket, i.e. ( or [
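The tag listings in slides 9 to 13 can be turned into a simple lookup table for reading tagged text. The sketch below uses a subset of the C5 tags given there, with shortened glosses. The `word_TAG` input layout and the function names are assumptions made for this example; the BNC itself is distributed in its own markup.

```python
# A subset of the C5 tags listed in the slides, with shortened glosses.
C5_TAGS = {
    "AJ0": "adjective (general or positive)",
    "AT0": "article",
    "AV0": "general adverb",
    "CJC": "coordinating conjunction",
    "NN1": "singular common noun",
    "NN2": "plural common noun",
    "NP0": "proper noun",
    "PNP": "personal pronoun",
    "PRF": "the preposition 'of'",
    "PRP": "preposition (except 'of')",
}


def parse_tagged(text):
    """Split 'word_TAG word_TAG ...' into (word, tag) pairs.

    The word_TAG layout is an assumed input format for this sketch.
    A token without an underscore is kept as a word with an empty tag.
    """
    pairs = []
    for token in text.split():
        word, sep, tag = token.rpartition("_")
        if not sep:
            word, tag = token, ""
        pairs.append((word, tag))
    return pairs


def describe(text):
    """Attach a gloss to each tag, or 'unknown tag' when it is not in C5_TAGS."""
    return [(w, t, C5_TAGS.get(t, "unknown tag")) for w, t in parse_tagged(text)]
```

For example, `describe("the_AT0 old_AJ0 pencil_NN1")` pairs each word with its tag and gloss, using words that appear as examples in the tag listings.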