Page 1: From Textual Information to Numerical Vectors

From Textual Information to Numerical Vectors

Chapters 2.7-2.13 Presented by Aaron Hagan

Page 2: From Textual Information to Numerical Vectors

Text Mining

• Supplements the human reader with automatic systems undeterred by the text explosion. It involves analyzing a large collection of documents to discover previously unknown information.

• The information might be relationships or patterns that are buried in the document collection and which would otherwise be extremely difficult, if not impossible, to discover.

Page 3: From Textual Information to Numerical Vectors

What is Covered

• Part-of-speech tagging classifies words into categories such as noun, verb, or adjective.

• Word sense disambiguation identifies the meaning of a word, given its usage, from among the multiple meanings that the word may have.

• Parsing performs a grammatical analysis of a sentence. Shallow parsers identify only the main grammatical elements in a sentence, such as noun phrases and verb phrases, whereas deep parsers generate a complete representation of the grammatical structure of a sentence.

Page 4: From Textual Information to Numerical Vectors

Motivation

• Up until now we have been dealing with individual words and simple-minded (though useful) notions of which sequences of words are likely.

• Now we turn to the study of how words
  – are clustered into classes
  – group with their neighbors to form phrases and sentences
  – depend on other words

• Interesting notions:
  – word order
  – constituency
  – grammatical relations

• Today: syntactic word classes – part-of-speech tagging

Page 5: From Textual Information to Numerical Vectors

Part-Of-Speech Tagging

• At this step, the text has been broken into tokens and sentences.

• If no linguistic analysis is necessary, one might proceed directly to feature generation, in which the “features” will be obtained from the tokens.

• If the goal is more specific, such as recognizing names of people, places, and organizations, it is usually desirable to perform additional linguistic analysis of the text to extract more sophisticated features.

• Find the POS for each token.

• Words are organized into grammatical classes or parts of speech.

• English: nouns, verbs, adjectives, adverbs, prepositions, conjunctions.
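As a concrete illustration (not from the chapter), a minimal POS-tagging run with the NLTK library; it assumes NLTK is installed and the tokenizer/tagger resources have been downloaded (resource names can vary across NLTK versions):

    import nltk

    # One-time resource downloads (names may differ in newer NLTK releases).
    nltk.download("punkt")
    nltk.download("averaged_perceptron_tagger")

    text = "Book that flight."
    tokens = nltk.word_tokenize(text)   # break the text into tokens
    tagged = nltk.pos_tag(tokens)       # assign a Penn Treebank tag to each token
    print(tagged)                       # a list of (token, tag) pairs, e.g. ('flight', 'NN')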

Page 6: From Textual Information to Numerical Vectors

History of POS Tagging

• Research on part-of-speech tagging has been closely tied to corpus linguistics. The first major corpus of English for computer analysis was the Brown Corpus, developed at Brown University by Henry Kucera and W. Nelson Francis in the mid-1960s.

• It consists of about 1,000,000 words of running English prose text, made up of 500 samples from randomly chosen publications. Each sample is 2,000 words.

• In the mid-1980s, researchers in Europe began to use hidden Markov models (HMMs) to disambiguate parts of speech, when working to tag the Lancaster-Oslo-Bergen (LOB) Corpus of British English. HMMs involve counting cases (such as from the Brown Corpus) and making a table of the probabilities of certain sequences.

Page 7: From Textual Information to Numerical Vectors

CORPUS

• Corpus of Contemporary American English (COCA): the first large, balanced corpus of contemporary American English.

• The corpus contains more than 385 million words of text, including 20 million words each year from 1990-2008, and it is equally divided among spoken, fiction, popular magazines, newspapers, and academic texts.

• The interface allows you to search for exact words or phrases, wildcards, lemmas, part of speech, or any combination of these. You can search for surrounding words (collocates) within a ten-word window (e.g. all nouns somewhere near chain, all adjectives near woman, or all verbs near key).

• The corpus also allows you to easily limit searches by frequency and compare the frequency of words, phrases, and grammatical constructions, in at least two main ways:
  – By genre: comparisons between spoken, fiction, popular magazines, newspapers, and academic texts, or even between sub-genres (or domains), such as movie scripts, sports magazines, newspaper editorials, or scientific journals
  – Over time: compare different years from 1990 to the present

Page 8: From Textual Information to Numerical Vectors

Penn Treebank Tag Set

1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition or subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NP Proper noun, singular
15. NPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PP Personal pronoun
19. PP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative
23. RP Particle
24. SYM Symbol
25. TO to
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund or present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd person singular present
32. VBZ Verb, 3rd person singular present
33. WDT Wh-determiner
34. WP Wh-pronoun
35. WP$ Possessive wh-pronoun
36. WRB Wh-adverb

http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/CQP-HTMLDemo/PennTreebankTS.html

Page 9: From Textual Information to Numerical Vectors

Assigning POS to Tokens

• It is possible to tag POS manually; ideally we want an automated system to identify POS.

• The most successful databases are ones generated automatically by machine-learning algorithms from annotated corpora.
  – Example:
    • The Wall Street Journal corpus is well suited for certain types of data, but may not be ideal for something like email messages.
    • There is a lot of military funding for tasks such as processing voluminous news sources.
    • There is not much support for generating large training corpora in other domains.

Page 10: From Textual Information to Numerical Vectors

Part-Of-Speech Dictionaries

• Dictionaries showing word-POS correspondences can be useful.

• Difficult because several parts of speech can be tied to one word.
  – Example:
    • bore – noun – a tiresome person
    • bore – verb – to pierce with a turning or twisting movement of a tool
  – Example:
    • Book/VB that/DT flight/NN

• Tagging is a type of disambiguation:
  – Book can be NN or VB: Can I read a book on this flight?
  – That can be a DT or a complementizer: My travel agent said that there would be a meal on this flight.

• The goal of POS tagging is to determine which of these possibilities is realized in a particular text instance.

Page 11: From Textual Information to Numerical Vectors


Approaches to POS Tagging

• Rule-based Approach

– Uses handcrafted sets of rules to tag input sentences

• Statistical approaches

– Use training corpus to compute probability of a tag in a context

• Hybrid systems (e.g. Brill’s transformation-based learning)

Page 12: From Textual Information to Numerical Vectors


ENGTWOL (ENGlish TWO Level analysis) Rule-Based Tagger

A two-stage architecture:
• Use a lexicon FST (dictionary) to tag each word with all possible POS.
• Apply hand-written rules to eliminate tags. The rules eliminate tags that are inconsistent with the context, and should reduce the list of POS tags to a single POS per word.

Page 13: From Textual Information to Numerical Vectors


ENGTWOL Adverbial-that Rule

Given the input “that”:
• If the next word is an adjective, adverb, or quantifier, and following that is a sentence boundary, and the previous word is not a verb like “consider” which allows adjectives as object complements,
• then eliminate non-ADV tags,
• else eliminate the ADV tag.

• I consider that odd. (that is NOT an ADV)
• It isn’t that strange. (that is an ADV)
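A minimal sketch of this rule over a list of (word, candidate-tag-set) pairs. This is my own illustration, not ENGTWOL's actual rule formalism; the tag names and the "consider"-class verb list are assumptions:

    # Hypothetical tag names and verb class (not ENGTWOL's real inventory).
    ADJ_ADV_QUANT = {"ADJ", "ADV", "QUANT"}
    CONSIDER_VERBS = {"consider", "deem", "find"}

    def adverbial_that_rule(tokens, i):
        """Prune the candidate tags of 'that' at position i.

        tokens: list of (word, set-of-candidate-tags) pairs for one sentence.
        """
        word, tags = tokens[i]
        next_tags = tokens[i + 1][1] if i + 1 < len(tokens) else set()
        boundary_next = i + 2 >= len(tokens) or tokens[i + 2][0] in {".", "!", "?"}
        prev_word = tokens[i - 1][0].lower() if i > 0 else ""

        if next_tags & ADJ_ADV_QUANT and boundary_next and prev_word not in CONSIDER_VERBS:
            tokens[i] = (word, {"ADV"})           # eliminate non-ADV tags
        else:
            tokens[i] = (word, tags - {"ADV"})    # eliminate the ADV tag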

Page 14: From Textual Information to Numerical Vectors


Det-Noun Rule:

• If an ambiguous word follows a determiner, tag it as a noun
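This rule is simple enough to sketch in the same style as the previous one (again an illustration, with assumed tag names):

    def det_noun_rule(tokens, i):
        """If an ambiguous word follows a determiner, tag it as a noun."""
        word, tags = tokens[i]
        prev_tags = tokens[i - 1][1] if i > 0 else set()
        if len(tags) > 1 and prev_tags == {"DT"}:   # ambiguous word after a determiner
            tokens[i] = (word, {"NN"})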

Page 15: From Textual Information to Numerical Vectors


Does it work?

• This approach does work and produces accurate results.

• What are the drawbacks?
  – Extremely labor-intensive

Page 16: From Textual Information to Numerical Vectors


Statistical Tagging

• Statistical (or stochastic) taggers use a training corpus to compute the probability of a tag in a context.

• For a given word sequence, Hidden Markov Model (HMM) taggers choose the tag sequence that maximizes

P(word | tag) * P(tag | previous-n-tags)

• An HMM tagger chooses the tag t_i for word w_i that is most probable given the previous tag t_{i-1}:

t_i = argmax_j P(t_j | t_{i-1}, w_i)
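A sketch of this choice for a single word, using the factorization P(word | tag) * P(tag | previous tag) from above. Illustrative only: real HMM taggers run Viterbi decoding over the whole sentence, and these probability tables would be estimated from corpus counts (the numbers below are made up):

    def choose_tag(word, prev_tag, p_tag_given_prev, p_word_given_tag, tagset):
        """Pick t_i = argmax_j P(t_j | t_{i-1}) * P(w_i | t_j)."""
        return max(tagset,
                   key=lambda t: p_tag_given_prev.get((t, prev_tag), 0.0)
                                 * p_word_given_tag.get((word, t), 0.0))

    # Toy tables: P(tag | previous tag) and P(word | tag).
    p_tag_given_prev = {("NN", "DT"): 0.40, ("JJ", "DT"): 0.40,
                        ("MD", "DT"): 0.0001, ("VB", "DT"): 0.01}
    p_word_given_tag = {("can", "NN"): 0.001, ("can", "MD"): 0.20,
                        ("can", "VB"): 0.005}

    # After a determiner ("the"), "can" comes out as a noun, not a modal.
    print(choose_tag("can", "DT", p_tag_given_prev, p_word_given_tag,
                     ["NN", "JJ", "MD", "VB"]))    # -> NN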

Page 17: From Textual Information to Numerical Vectors

HMM Example

• For example, once you've seen an article such as 'the', perhaps the next word is a noun 40% of the time, an adjective 40%, and a number 20%.
  – A program can decide that "can" in "the can" is far more likely to be a noun than a verb or a modal. The same method can of course be used to benefit from knowledge about following words.

• More advanced ("higher order") HMMs learn the probabilities not only of pairs, but of triples or even larger sequences. So, for example, if you've just seen an article and a verb, the next item may very likely be a preposition, article, or noun, but much less likely another verb.

Page 18: From Textual Information to Numerical Vectors


Statistical POS Tagging (Example)

• Use probability theory for POS tagging.

• Suppose, with no context, we just want to know, given the word “flies”, whether it should be tagged as a noun or as a verb.

• We use conditional probability for this: we want to know which is greater,

PROB(N | flies) or PROB(V | flies)

• Note the definition of conditional probability:

PROB(a | b) = PROB(a & b) / PROB(b)

where PROB(a & b) is the probability of the two events a and b occurring simultaneously.

Page 19: From Textual Information to Numerical Vectors


Calculating POS for “flies”

We need to know which is greater:
• PROB(N | flies) = PROB(flies & N) / PROB(flies)
• PROB(V | flies) = PROB(flies & V) / PROB(flies)

• Use a corpus as the reference for estimating the probabilities.

Page 20: From Textual Information to Numerical Vectors


Corpus to Estimate

1,273,000 words; 1000 uses of flies; 400 flies in N sense; 600 flies in V sensePROB(flies) ≈ 1000/1,273,000 = .0008PROB(flies & N) ≈ 400/1,273,000 = .0003PROB(flies & V) ≈ 600/1,273,000 = .0005

Out best guess is that flies is a VPROB(V | flies) = PROB(V & flies) / PROB(flies)

= .0005/.0008 = .625
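The same estimate in a few lines of Python (counts taken straight from the slide; the exact ratio is 600/1000 = .6, and the .625 above comes from the rounded .0005 and .0008):

    total_words = 1_273_000
    flies_total, flies_n, flies_v = 1000, 400, 600

    p_flies = flies_total / total_words      # ~.0008
    p_flies_and_n = flies_n / total_words    # ~.0003
    p_flies_and_v = flies_v / total_words    # ~.0005

    # PROB(tag | flies) = PROB(flies & tag) / PROB(flies)
    p_n = p_flies_and_n / p_flies            # 0.4
    p_v = p_flies_and_v / p_flies            # 0.6
    print("best tag:", "V" if p_v > p_n else "N")   # -> best tag: V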

Page 21: From Textual Information to Numerical Vectors

Phrase Recognition

• Once tokens have been assigned POS tags, the next step is to group individual tokens into units, called phrases.

• The idea is to create a “partial parse” of a sentence, as a step toward identifying the “named entities” occurring in a sentence.

• Text parsing systems are supposed to scan a text and mark the beginning and end of phrases.

Page 22: From Textual Information to Numerical Vectors

Phrase Recognition

• There are a number of conventions for marking, but the most common (decoded in the sketch below):
  – Mark a word inside a phrase with I-.
    • Can be extended with a code for the phrase type: I-NP, I-VP, etc.
  – Mark a word at the beginning of a phrase adjacent to another phrase with B-.
    • Can be extended with a code for the phrase type: B-NP, B-VP, etc.
  – Mark a word outside any phrase with O.

• Look for a particular sequence of words that occurs frequently enough in the corpora.

• A simple statistical approach looks at multiword tokens.
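A sketch (my own illustration) of reading phrase spans back out of these markings; the example sentence and its tags are assumed:

    def extract_phrases(tokens, tags):
        """Group (token, IOB tag) pairs into (phrase-type, words) chunks."""
        phrases, cur_type, cur_words = [], None, []

        def close():
            nonlocal cur_type, cur_words
            if cur_words:
                phrases.append((cur_type, cur_words))
            cur_type, cur_words = None, []

        for token, tag in zip(tokens, tags):
            if tag == "O":                      # outside any phrase
                close()
                continue
            marker, ptype = tag.split("-", 1)   # e.g. "I", "NP"
            if marker == "B" or ptype != cur_type:
                close()                         # a new phrase starts here
            cur_type = ptype
            cur_words.append(token)
        close()
        return phrases

    tokens = ["Book", "that", "flight", "to", "Boston"]
    tags = ["I-VP", "I-NP", "I-NP", "I-PP", "I-NP"]
    print(extract_phrases(tokens, tags))
    # [('VP', ['Book']), ('NP', ['that', 'flight']), ('PP', ['to']), ('NP', ['Boston'])]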

Page 23: From Textual Information to Numerical Vectors

Named Entity Recognition

• A specialization of phrase finding.

• One particular kind of noun phrase finding is the recognition of particular types of proper noun phrases, specifically persons, organizations, and locations.

• These recognizers are important for intelligence applications.

• (More on this in chapter 6.)

Page 24: From Textual Information to Numerical Vectors

Parsing into Phrases

• A full parse of a sentence is usually done in the most sophisticated kinds of text processing.

• Each word in the sentence is related to all the other words and to the main functions (subject, object, etc.) in the sentence.

• There are many different kinds of parses, each associated with a linguistic theory of the language.

Page 25: From Textual Information to Numerical Vectors

Context-Free Parses

• A tree of nodes in which the leaf nodes are words of a sentence, the phrases into which the words are grouped are internal nodes, and there is one top node at the root of the tree, which has the label S.

• There are a number of algorithms for producing such a tree from the words of a sentence, with considerable research on constructing parsers from a statistical analysis of treebanks of sentences parsed by hand.

• Provides information that phrase identification or partial parsing cannot provide.

Page 26: From Textual Information to Numerical Vectors

Parse Tree Example

Johnson was replaced at XYZ Corp by Smith.

Bracketed form of the parse tree:

[S [NP [N Johnson]]
   [VP [VP [AUX was] [PPART replaced]]
       [PP [PREP at] [PNOUN [PNOUN XYZ] [PNOUN Corp]]]
       [PP [PREP by] [PNOUN Smith]]]]

From the linear order of phrases in a partial parse, one might wrongly conclude that Johnson replaced Smith; the full parse makes the actual relation clear.
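To reproduce such a tree yourself, NLTK can read the bracketed form directly (a small illustration; assumes NLTK is installed, and uses NLTK's round-bracket notation):

    from nltk import Tree

    t = Tree.fromstring(
        "(S (NP (N Johnson))"
        "   (VP (VP (AUX was) (PPART replaced))"
        "       (PP (PREP at) (PNOUN (PNOUN XYZ) (PNOUN Corp)))"
        "       (PP (PREP by) (PNOUN Smith))))"
    )
    t.pretty_print()   # renders the tree as ASCII art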

Page 27: From Textual Information to Numerical Vectors

Feature Generation

• The reason for the linguistic processing is to identify features that can be useful for text mining.

• Features that might be useful in identifying the POS include: whether the first letter is capitalized (indicating a proper noun), whether all the characters are digits, periods, or commas (marking a number), and whether the characters alternate case (usually an abbreviation); these are sketched in code below.

• A dictionary lookup can supply the possible parts of speech for a token.
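A sketch of these token-level features (an illustration; the feature names are my own):

    import re

    def token_features(token):
        """Surface features of a token that hint at its part of speech."""
        return {
            "first_cap": token[:1].isupper(),                     # proper-noun hint
            "numeric": bool(re.fullmatch(r"[\d.,]+", token)),     # number hint
            "mixed_case": bool(re.search(r"[a-z][A-Z]", token)),  # abbreviation hint
        }

    print(token_features("Johnson"))   # {'first_cap': True, 'numeric': False, 'mixed_case': False}
    print(token_features("1,273"))     # {'first_cap': False, 'numeric': True, 'mixed_case': False}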

Page 28: From Textual Information to Numerical Vectors

Feature Vector

• The feature vector for a document is assigned a set of classes.

• Feature vector examples:
  – Classifying periods as end-of-sentence markers.
  – Identifying tokens as instances of titles, such as “Doctor” or “President”.

Page 29: From Textual Information to Numerical Vectors

Summary

• Part-of-Speech Tagging
  – is an important step in Natural Language Analysis.
  – is robust and fast.
  – works with 95-97% accuracy.

• Parsing (= full syntax analysis)
  – is more error-prone than PoS tagging.
  – is important to get to the meaning of a sentence.

Page 30: From Textual Information to Numerical Vectors

References / Applications

• The Penn Treebank Project annotates naturally occurring text for linguistic structure, most notably producing skeletal parses showing rough syntactic and semantic information – a bank of linguistic trees. http://www.cis.upenn.edu/~treebank/

• Corpus of Contemporary American English: http://www.americancorpus.org/

• CLAWS part-of-speech tagger: http://ucrel.lancs.ac.uk/claws/

• Stanford Natural Language Processing Group tagger: http://nlp.stanford.edu/software/tagger.shtm