Page 1: From Textual Information to Numerical Vectors

From Textual Information to Numerical Vectors

Chapters 2.7-2.13 Presented by Aaron Hagan

Page 2: From Textual Information to Numerical Vectors

Text Mining

• Supplements the human reader with automatic systems undeterred by the text explosion. It involves analyzing a large collection of documents to discover previously unknown information.

• The information might be relationships or patterns that are buried in the document collection and which would otherwise be extremely difficult, if not impossible, to discover.

Page 3: From Textual Information to Numerical Vectors

What is Covered

• Part-of-speech tagging classifies words into categories such as noun, verb, or adjective.

• Word sense disambiguation identifies the meaning of a word, given its usage, from among the multiple meanings that the word may have.

• Parsing performs a grammatical analysis of a sentence. Shallow parsers identify only the main grammatical elements in a sentence, such as noun phrases and verb phrases, whereas deep parsers generate a complete representation of the grammatical structure of a sentence.

Page 4: From Textual Information to Numerical Vectors

Motivation

• Up until now we have been dealing with individual words and simple-minded (though useful) notions of which sequences of words are likely.

• Now we turn to the study of how words
  – are clustered into classes
  – group with their neighbors to form phrases and sentences
  – depend on other words

• Interesting notions:
  – word order
  – constituency
  – grammatical relations

• Today: syntactic word classes – part-of-speech tagging

Page 5: From Textual Information to Numerical Vectors

Part-Of-Speech Tagging

• At this step, the text has been broken into tokens and sentences.

• If no linguistic analysis is necessary, one might proceed directly to feature generation, in which the “features” will be obtained from the tokens.

• If the goal is more specific, such as recognizing names of people, places, and organizations, it is usually desirable to perform additional linguistic analysis of the text to extract more sophisticated features.

• Find the POS for each token.

• Words are organized into grammatical classes or parts of speech.

• English: nouns, verbs, adjectives, adverbs, prepositions, conjunctions.
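As a concrete illustration (not from the chapter), a minimal POS-tagging run with the NLTK library; it assumes NLTK is installed and the tokenizer/tagger resources have been downloaded (resource names can vary across NLTK versions):

    import nltk

    # One-time resource downloads (names may differ in newer NLTK releases).
    nltk.download("punkt")
    nltk.download("averaged_perceptron_tagger")

    text = "Book that flight."
    tokens = nltk.word_tokenize(text)   # break the text into tokens
    tagged = nltk.pos_tag(tokens)       # assign a Penn Treebank tag to each token
    print(tagged)                       # a list of (token, tag) pairs, e.g. ('flight', 'NN')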

Page 6: From Textual Information to Numerical Vectors

History of POS Tagging

• Research on part-of-speech tagging has been closely tied to corpus linguistics. The first major corpus of English for computer analysis was the Brown Corpus, developed at Brown University by Henry Kucera and W. Nelson Francis in the mid-1960s.

• It consists of about 1,000,000 words of running English prose text, made up of 500 samples from randomly chosen publications. Each sample is 2,000 words.

• In the mid-1980s, researchers in Europe began to use hidden Markov models (HMMs) to disambiguate parts of speech, when working to tag the Lancaster-Oslo-Bergen (LOB) Corpus of British English. HMMs involve counting cases (such as from the Brown Corpus) and making a table of the probabilities of certain sequences.

Page 7: From Textual Information to Numerical Vectors

CORPUS

• Corpus of Contemporary American English (COCA): the first large, balanced corpus of contemporary American English.

• The corpus contains more than 385 million words of text, including 20 million words each year from 1990-2008, and it is equally divided among spoken, fiction, popular magazines, newspapers, and academic texts.

• The interface allows you to search for exact words or phrases, wildcards, lemmas, part of speech, or any combination of these. You can search for surrounding words (collocates) within a ten-word window (e.g. all nouns somewhere near chain, all adjectives near woman, or all verbs near key).

• The corpus also allows you to easily limit searches by frequency and compare the frequency of words, phrases, and grammatical constructions, in at least two main ways:
  – By genre: comparisons between spoken, fiction, popular magazines, newspapers, and academic texts, or even between sub-genres (or domains), such as movie scripts, sports magazines, newspaper editorials, or scientific journals
  – Over time: compare different years from 1990 to the present

Page 8: From Textual Information to Numerical Vectors

Penn Treebank Tag Set

1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition or subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NP Proper noun, singular
15. NPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PP Personal pronoun
19. PP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative
23. RP Particle
24. SYM Symbol
25. TO to
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund or present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd person singular present
32. VBZ Verb, 3rd person singular present
33. WDT Wh-determiner
34. WP Wh-pronoun
35. WP$ Possessive wh-pronoun
36. WRB Wh-adverb

http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/CQP-HTMLDemo/PennTreebankTS.html

Page 9: From Textual Information to Numerical Vectors

Assigning POS to Tokens

• It is possible to tag POS manually; ideally we want an automated system to identify POS.

• The most successful databases are ones generated automatically by machine-learning algorithms from annotated corpora.
  – Example:
    • The Wall Street Journal corpus is well suited for certain types of data, but may not be ideal for something like email messages.
    • There is a lot of military funding for tasks such as processing voluminous news sources.
    • There is not much support for generating large training corpora in other domains.

Page 10: From Textual Information to Numerical Vectors

Part-Of-Speech Dictionaries

• Dictionaries showing word-POS correspondences can be useful.

• Difficult because several parts of speech can be tied to one word.
  – Example:
    • bore – noun – a tiresome person
    • bore – verb – to pierce with a turning or twisting movement of a tool
  – Example:
    • Book/VB that/DT flight/NN

• Tagging is a type of disambiguation:
  – Book can be NN or VB: Can I read a book on this flight?
  – That can be a DT or a complementizer: My travel agent said that there would be a meal on this flight.

• The goal of POS tagging is to determine which of these possibilities is realized in a particular text instance.

Page 11: From Textual Information to Numerical Vectors


Approaches to POS Tagging

• Rule-based Approach

– Uses handcrafted sets of rules to tag input sentences

• Statistical approaches

– Use training corpus to compute probability of a tag in a context

• Hybrid systems (e.g. Brill’s transformation-based learning)

Page 12: From Textual Information to Numerical Vectors


ENGTWOL (ENGlish TWO Level analysis) Rule-Based Tagger

A two-stage architecture:
• Use a lexicon FST (dictionary) to tag each word with all possible POS.
• Apply hand-written rules to eliminate tags. The rules eliminate tags that are inconsistent with the context, and should reduce the list of POS tags to a single POS per word.

Page 13: From Textual Information to Numerical Vectors


ENGTWOL Adverbial-that Rule

Given the input “that”:
• If the next word is an adjective, adverb, or quantifier, and following that is a sentence boundary, and the previous word is not a verb like “consider” which allows adjectives as object complements,
• then eliminate non-ADV tags,
• else eliminate the ADV tag.

• I consider that odd. (that is NOT an ADV)
• It isn’t that strange. (that is an ADV)
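A minimal sketch of this rule over a list of (word, candidate-tag-set) pairs. This is my own illustration, not ENGTWOL's actual rule formalism; the tag names and the "consider"-class verb list are assumptions:

    # Hypothetical tag names and verb class (not ENGTWOL's real inventory).
    ADJ_ADV_QUANT = {"ADJ", "ADV", "QUANT"}
    CONSIDER_VERBS = {"consider", "deem", "find"}

    def adverbial_that_rule(tokens, i):
        """Prune the candidate tags of 'that' at position i.

        tokens: list of (word, set-of-candidate-tags) pairs for one sentence.
        """
        word, tags = tokens[i]
        next_tags = tokens[i + 1][1] if i + 1 < len(tokens) else set()
        boundary_next = i + 2 >= len(tokens) or tokens[i + 2][0] in {".", "!", "?"}
        prev_word = tokens[i - 1][0].lower() if i > 0 else ""

        if next_tags & ADJ_ADV_QUANT and boundary_next and prev_word not in CONSIDER_VERBS:
            tokens[i] = (word, {"ADV"})           # eliminate non-ADV tags
        else:
            tokens[i] = (word, tags - {"ADV"})    # eliminate the ADV tag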

Page 14: From Textual Information to Numerical Vectors


Det-Noun Rule:

• If an ambiguous word follows a determiner, tag it as a noun
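This rule is simple enough to sketch in the same style as the previous one (again an illustration, with assumed tag names):

    def det_noun_rule(tokens, i):
        """If an ambiguous word follows a determiner, tag it as a noun."""
        word, tags = tokens[i]
        prev_tags = tokens[i - 1][1] if i > 0 else set()
        if len(tags) > 1 and prev_tags == {"DT"}:   # ambiguous word after a determiner
            tokens[i] = (word, {"NN"})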

Page 15: From Textual Information to Numerical Vectors


Does it work?

• This approach does work and produces accurate results.

• What are the drawbacks?
  – Extremely labor-intensive

Page 16: From Textual Information to Numerical Vectors


Statistical Tagging

• Statistical (or stochastic) taggers use a training corpus to compute the probability of a tag in a context.

• For a given word sequence, Hidden Markov Model (HMM) taggers choose the tag sequence that maximizes

P(word | tag) * P(tag | previous-n-tags)

• An HMM tagger chooses the tag t_i for word w_i that is most probable given the previous tag t_{i-1}:

t_i = argmax_j P(t_j | t_{i-1}, w_i)
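A sketch of this choice for a single word, using the factorization P(word | tag) * P(tag | previous tag) from above. Illustrative only: real HMM taggers run Viterbi decoding over the whole sentence, and these probability tables would be estimated from corpus counts (the numbers below are made up):

    def choose_tag(word, prev_tag, p_tag_given_prev, p_word_given_tag, tagset):
        """Pick t_i = argmax_j P(t_j | t_{i-1}) * P(w_i | t_j)."""
        return max(tagset,
                   key=lambda t: p_tag_given_prev.get((t, prev_tag), 0.0)
                                 * p_word_given_tag.get((word, t), 0.0))

    # Toy tables: P(tag | previous tag) and P(word | tag).
    p_tag_given_prev = {("NN", "DT"): 0.40, ("JJ", "DT"): 0.40,
                        ("MD", "DT"): 0.0001, ("VB", "DT"): 0.01}
    p_word_given_tag = {("can", "NN"): 0.001, ("can", "MD"): 0.20,
                        ("can", "VB"): 0.005}

    # After a determiner ("the"), "can" comes out as a noun, not a modal.
    print(choose_tag("can", "DT", p_tag_given_prev, p_word_given_tag,
                     ["NN", "JJ", "MD", "VB"]))    # -> NN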

Page 17: From Textual Information to Numerical Vectors

HMM Example

• For example, once you've seen an article such as 'the', perhaps the next word is a noun 40% of the time, an adjective 40%, and a number 20%.
  – A program can decide that "can" in "the can" is far more likely to be a noun than a verb or a modal. The same method can of course be used to benefit from knowledge about following words.

• More advanced ("higher order") HMMs learn the probabilities not only of pairs, but of triples or even larger sequences. So, for example, if you've just seen an article and a verb, the next item may very likely be a preposition, article, or noun, but much less likely another verb.

Page 18: From Textual Information to Numerical Vectors


Statistical POS Tagging (Example)

• Use probability theory for POS tagging.

• Suppose, with no context, we just want to know, given the word “flies”, whether it should be tagged as a noun or as a verb.

• We use conditional probability for this: we want to know which is greater,

PROB(N | flies) or PROB(V | flies)

• Note the definition of conditional probability:

PROB(a | b) = PROB(a & b) / PROB(b)

where PROB(a & b) is the probability of the two events a and b occurring simultaneously.

Page 19: From Textual Information to Numerical Vectors


Calculating POS for “flies”

We need to know which is greater:
• PROB(N | flies) = PROB(flies & N) / PROB(flies)
• PROB(V | flies) = PROB(flies & V) / PROB(flies)

• Use a corpus as the reference for estimating the probabilities.

Page 20: From Textual Information to Numerical Vectors


Corpus to Estimate

1,273,000 words; 1000 uses of flies; 400 flies in N sense; 600 flies in V sensePROB(flies) ≈ 1000/1,273,000 = .0008PROB(flies & N) ≈ 400/1,273,000 = .0003PROB(flies & V) ≈ 600/1,273,000 = .0005

Out best guess is that flies is a VPROB(V | flies) = PROB(V & flies) / PROB(flies)

= .0005/.0008 = .625
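The same estimate in a few lines of Python (counts taken straight from the slide; the exact ratio is 600/1000 = .6, and the .625 above comes from the rounded .0005 and .0008):

    total_words = 1_273_000
    flies_total, flies_n, flies_v = 1000, 400, 600

    p_flies = flies_total / total_words      # ~.0008
    p_flies_and_n = flies_n / total_words    # ~.0003
    p_flies_and_v = flies_v / total_words    # ~.0005

    # PROB(tag | flies) = PROB(flies & tag) / PROB(flies)
    p_n = p_flies_and_n / p_flies            # 0.4
    p_v = p_flies_and_v / p_flies            # 0.6
    print("best tag:", "V" if p_v > p_n else "N")   # -> best tag: V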

Page 21: From Textual Information to Numerical Vectors

Phrase Recognition

• Once tokens have been assigned POS tags, the next step is to group individual tokens into units, called phrases.

• The idea is to create a “partial parse” of a sentence, as a step toward identifying the “named entities” occurring in a sentence.

• Text parsing systems are supposed to scan a text and mark the beginning and end of phrases.

Page 22: From Textual Information to Numerical Vectors

Phrase Recognition

• There are a number of conventions for marking, but the most common (decoded in the sketch below):
  – Mark a word inside a phrase with I-.
    • Can be extended with a code for the phrase type: I-NP, I-VP, etc.
  – Mark a word at the beginning of a phrase adjacent to another phrase with B-.
    • Can be extended with a code for the phrase type: B-NP, B-VP, etc.
  – Mark a word outside any phrase with O.

• Look for a particular sequence of words that occurs frequently enough in the corpora.

• A simple statistical approach looks at multiword tokens.
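A sketch (my own illustration) of reading phrase spans back out of these markings; the example sentence and its tags are assumed:

    def extract_phrases(tokens, tags):
        """Group (token, IOB tag) pairs into (phrase-type, words) chunks."""
        phrases, cur_type, cur_words = [], None, []

        def close():
            nonlocal cur_type, cur_words
            if cur_words:
                phrases.append((cur_type, cur_words))
            cur_type, cur_words = None, []

        for token, tag in zip(tokens, tags):
            if tag == "O":                      # outside any phrase
                close()
                continue
            marker, ptype = tag.split("-", 1)   # e.g. "I", "NP"
            if marker == "B" or ptype != cur_type:
                close()                         # a new phrase starts here
            cur_type = ptype
            cur_words.append(token)
        close()
        return phrases

    tokens = ["Book", "that", "flight", "to", "Boston"]
    tags = ["I-VP", "I-NP", "I-NP", "I-PP", "I-NP"]
    print(extract_phrases(tokens, tags))
    # [('VP', ['Book']), ('NP', ['that', 'flight']), ('PP', ['to']), ('NP', ['Boston'])]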

Page 23: From Textual Information to Numerical Vectors

Named Entity Recognition

• A specialization of phrase finding.

• One particular kind of noun phrase finding is the recognition of particular types of proper noun phrases, specifically persons, organizations, and locations.

• These recognizers are important for intelligence applications.

• (More on this in chapter 6.)

Page 24: From Textual Information to Numerical Vectors

Parsing into Phrases

• A full parse of a sentence is usually done in the most sophisticated kinds of text processing.

• Each word in the sentence is related to all the other words and to the main functions (subject, object, etc.) in the sentence.

• There are many different kinds of parses, each associated with a linguistic theory of the language.

Page 25: From Textual Information to Numerical Vectors

Context-Free Parses

• A tree of nodes in which the leaf nodes are words of a sentence, the phrases into which the words are grouped are internal nodes, and there is one top node at the root of the tree, which has the label S.

• There are a number of algorithms for producing such a tree from the words of a sentence, with considerable research on constructing parsers from a statistical analysis of treebanks of sentences parsed by hand.

• Provides information that phrase identification or partial parsing cannot provide.

Page 26: From Textual Information to Numerical Vectors

Parse Tree Example

Johnson was replaced at XYZ Corp by Smith.

Bracketed form of the parse tree:

[S [NP [N Johnson]]
   [VP [VP [AUX was] [PPART replaced]]
       [PP [PREP at] [PNOUN [PNOUN XYZ] [PNOUN Corp]]]
       [PP [PREP by] [PNOUN Smith]]]]

From the linear order of phrases in a partial parse, one might wrongly conclude that Johnson replaced Smith; the full parse makes the actual relation clear.
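To reproduce such a tree yourself, NLTK can read the bracketed form directly (a small illustration; assumes NLTK is installed, and uses NLTK's round-bracket notation):

    from nltk import Tree

    t = Tree.fromstring(
        "(S (NP (N Johnson))"
        "   (VP (VP (AUX was) (PPART replaced))"
        "       (PP (PREP at) (PNOUN (PNOUN XYZ) (PNOUN Corp)))"
        "       (PP (PREP by) (PNOUN Smith))))"
    )
    t.pretty_print()   # renders the tree as ASCII art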

Page 27: From Textual Information to Numerical Vectors

Feature Generation

• The reason for the linguistic processing is to identify features that can be useful for text mining.

• Features that might be useful in identifying the POS include: whether the first letter is capitalized (indicating a proper noun), whether all the characters are digits, periods, or commas (marking a number), and whether the characters alternate case (usually an abbreviation); these are sketched in code below.

• A dictionary lookup can supply the possible parts of speech for a token.
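A sketch of these token-level features (an illustration; the feature names are my own):

    import re

    def token_features(token):
        """Surface features of a token that hint at its part of speech."""
        return {
            "first_cap": token[:1].isupper(),                     # proper-noun hint
            "numeric": bool(re.fullmatch(r"[\d.,]+", token)),     # number hint
            "mixed_case": bool(re.search(r"[a-z][A-Z]", token)),  # abbreviation hint
        }

    print(token_features("Johnson"))   # {'first_cap': True, 'numeric': False, 'mixed_case': False}
    print(token_features("1,273"))     # {'first_cap': False, 'numeric': True, 'mixed_case': False}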

Page 28: From Textual Information to Numerical Vectors

Feature Vector

• The feature vector for a document is assigned a set of classes.

• Feature vector examples:
  – Classifying periods as end-of-sentence markers.
  – Identifying tokens as instances of titles, such as “Doctor” or “President”.

Page 29: From Textual Information to Numerical Vectors

Summary

• Part-of-Speech Tagging
  – is an important step in Natural Language Analysis.
  – is robust and fast.
  – works with 95-97% accuracy.

• Parsing (= full syntax analysis)
  – is more error-prone than PoS tagging.
  – is important to get to the meaning of a sentence.

Page 30: From Textual Information to Numerical Vectors

References / Applications

• The Penn Treebank Project annotates naturally occurring text for linguistic structure, most notably producing skeletal parses showing rough syntactic and semantic information – a bank of linguistic trees. http://www.cis.upenn.edu/~treebank/

• Corpus of Contemporary American English: http://www.americancorpus.org/

• CLAWS part-of-speech tagger: http://ucrel.lancs.ac.uk/claws/

• Stanford Natural Language Processing Group tagger: http://nlp.stanford.edu/software/tagger.shtm