Introduction to Text Mining
Mandar Mitra
Indian Statistical Institute
M. Mitra (ISI) Text Mining 1 / 29
Outline
1 Preliminaries
2 Preprocessing
3 Mining word associations
4 Opinion mining
What is Text Mining?
Strict definition: the nontrivial extraction of implicit, previously unknown, and potentially useful information from [textual] data
OR
Loose definition: the science of extracting useful information from large [textual] datasets
Old wine in a new bottle? Text mining = information retrieval + statistics + artificial intelligence (natural language processing, machine learning / pattern recognition)
Why is it interesting?
Growth of Web / electronic information sources
Multidisciplinary nature
E-commerce potential
“Electronic commerce is emerging as the killer domain for data-mining technology” (Ronny Kohavi)
Data sources
World Wide Web
  unstructured and semi-structured text
  “deep” web: pages that do not exist until they are created dynamically as the result of a specific search
  social networks
Intranet
  internal correspondence, memos, presentations
  white papers, technical reports
  customer email, customer forums, product reviews
News wires
...
No structure / general schema / tabular form that fits text
Indexing
Any text item (“document”) is represented as a list of terms and associated weights
\[ D = (\langle t_1, w_1 \rangle, \ldots, \langle t_n, w_n \rangle) \]
Term = keywords or content-descriptors
Weight = measure of the importance of a term in representing the information contained in the document
Indexing
Tokenization: identify individual words
Sachin Tendulkar made a tearful but self-effacing farewell as his glittering 24-year career came to an end on Saturday at his home ground of Wankhede Stadium.
⇓
Sachin | Tendulkar | made | a | tearful | but | ...
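A minimal sketch of tokenization on the example sentence, assuming a simple regular-expression tokenizer that keeps hyphenated words such as "self-effacing" together. The function name `tokenize` and the hyphen rule are illustrative assumptions, not something the slides prescribe.

```python
import re

def tokenize(text: str) -> list[str]:
    """Split text into word tokens; internal hyphens are kept (an assumed rule)."""
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)

sentence = (
    "Sachin Tendulkar made a tearful but self-effacing farewell as his "
    "glittering 24-year career came to an end on Saturday at his home "
    "ground of Wankhede Stadium."
)
tokens = tokenize(sentence)
print(tokens[:6])  # ['Sachin', 'Tendulkar', 'made', 'a', 'tearful', 'but']
```

Punctuation is dropped and case is preserved here; whether to lowercase or split hyphenated words is a design choice that depends on the collection.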
Indexing
Stopword removal: eliminate common words, e.g. and, of, the, etc.
Sachin Tendulkar made a tearful but self-effacing farewell as his glittering 24-year career came to an end on Saturday at his home ground of Wankhede Stadium.
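As a rough illustration of stopword removal on the same sentence, the sketch below filters tokens against a tiny hand-picked stopword set. The set `STOPWORDS` and the function `remove_stopwords` are assumptions made for this example; real stopword lists are much longer.

```python
import re

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "and", "as", "at", "but", "his", "of", "on", "the", "to"}

def remove_stopwords(text: str) -> list[str]:
    """Tokenize the text and drop tokens whose lowercase form is a stopword."""
    tokens = re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)
    return [t for t in tokens if t.lower() not in STOPWORDS]

sentence = (
    "Sachin Tendulkar made a tearful but self-effacing farewell as his "
    "glittering 24-year career came to an end on Saturday at his home "
    "ground of Wankhede Stadium."
)
content_words = remove_stopwords(sentence)
print(content_words)
```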
Indexing
Stemming: reduce words to a common root
  e.g. resignation, resigned, resigns → resign
       analysis, analyze, analyzing → analy
use standard algorithms (Porter)
Indexing
Thesaurus: find synonyms for words in the document
Phrases: find multi-word terms, e.g. computer science, data mining
  use syntax/linguistic methods or “statistical” methods
Named entities: identify names of people, organizations, places; dates; monetary or other amounts, etc.
Sachin Tendulkar made a tearful but self-effacing farewell as his glittering 24-year career came to an end on Saturday at his home ground of Wankhede Stadium.
Indexing: Term Weights
Term frequency (tf): repeated words are strongly related to content
Inverse document frequency (idf): an uncommon term is more important
Normalization by document length
  long docs. contain many distinct words
  long docs. contain same word many times
  term-weights for long documents should be reduced
  use # bytes, # distinct words, Euclidean length, etc.
Weight = tf × idf / normalization
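The weighting rule above can be sketched as follows, assuming raw term counts for tf, idf = log(N / df), and Euclidean length as the normalization factor. These three choices and the name `tfidf_vectors` are assumptions for illustration; the slide lists several alternative normalizations.

```python
import math
from collections import Counter

def tfidf_vectors(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: weight} vector per document, with weight = tf * idf / norm."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        raw = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        # Euclidean length of the raw vector; fall back to 1.0 if all weights are 0.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors

docs = [
    ["cricket", "career", "cricket", "stadium"],
    ["stadium", "concert", "music"],
    ["career", "music", "music", "award"],
]
vectors = tfidf_vectors(docs)
print(vectors[0])
```

In the first document, "cricket" is both repeated and rare in the collection, so it receives the largest weight.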
Commonly used weighting schemes
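Pivoted normalization [Singhal et al., SIGIR 96]
\[ \frac{\dfrac{1 + \log(tf)}{1 + \log(\text{average } tf)} \times \log\left(\dfrac{N}{df}\right)}{(1.0 - slope) \times pivot + slope \times \#\,\text{unique terms}} \]
BM25 (probabilistic model) [Robertson and Zaragoza, FTIR 2009]
\[ \frac{tf \times \log\left(\dfrac{N - df + 0.5}{df + 0.5}\right)}{k_1\left((1 - b) + b\,\dfrac{dl}{avdl}\right) + tf} \]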
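The two formulas can be written as small functions, shown here as a sketch. The default parameter values (`k1=1.2`, `b=0.75`, `slope=0.2`) are commonly quoted settings and are assumptions here, as are the function and argument names; the slides do not fix them.

```python
import math

def bm25_weight(tf, df, n_docs, dl, avdl, k1=1.2, b=0.75):
    """BM25 weight of one term in one document (dl = document length, avdl = average length)."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    return tf * idf / (k1 * ((1 - b) + b * dl / avdl) + tf)

def pivoted_weight(tf, avg_tf, df, n_docs, n_unique, pivot, slope=0.2):
    """Pivoted-normalization weight; n_unique = number of unique terms in the document."""
    if tf == 0:
        return 0.0
    numerator = (1 + math.log(tf)) / (1 + math.log(avg_tf)) * math.log(n_docs / df)
    return numerator / ((1.0 - slope) * pivot + slope * n_unique)

# Same term statistics, but the second document is twice the average length.
print(bm25_weight(tf=3, df=10, n_docs=1000, dl=100, avdl=100))
print(bm25_weight(tf=3, df=10, n_docs=1000, dl=200, avdl=100))
```

Both schemes reduce the weight of a term as the document gets longer, and both damp the effect of repeated occurrences rather than using tf linearly.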
Searching
Measure vocabulary overlap between user query and documents.
      t_1 ... t_n
Q  =  q_1 ... q_n
D  =  d_1 ... d_n
\[ Sim(Q, D) = \vec{Q} \cdot \vec{D} = \sum_i q_i \times d_i \]
Use inverted list (index).
\[ Term_i \rightarrow (D_{i_1}, w_{i_1}), \ldots, (D_{i_k}, w_{i_k}) \]
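A minimal sketch of how an inverted list supports the dot-product similarity: only documents that share at least one term with the query are ever touched. The document IDs, weights, and function names below are hypothetical values chosen for illustration.

```python
from collections import defaultdict

def build_inverted_index(doc_vectors):
    """Map each term to its postings list [(doc_id, weight), ...]."""
    index = defaultdict(list)
    for doc_id, vector in doc_vectors.items():
        for term, weight in vector.items():
            index[term].append((doc_id, weight))
    return index

def search(query_vector, index):
    """Accumulate sum_i q_i * d_i per document by walking the query terms' postings."""
    scores = defaultdict(float)
    for term, q_weight in query_vector.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical weighted document vectors.
doc_vectors = {
    "D1": {"cricket": 0.8, "career": 0.4, "stadium": 0.2},
    "D2": {"stadium": 0.5, "concert": 0.7},
    "D3": {"career": 0.6, "award": 0.6},
}
index = build_inverted_index(doc_vectors)
ranking = search({"cricket": 1.0, "stadium": 0.5}, index)
print(ranking)
```

D3 shares no term with the query, so it never appears in the ranking.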
Stemming
YASS [Majumder et al., ACM TOIS 25(4), 2007]
Stemming ≡ grouping morphologically related words together, e.g. { analysis, analyze, analyzing }
Try clustering
  distance measure: edit distance, or
  \[ D(X, Y) = \frac{n - m + 1}{m} \times \sum_{i=m}^{n} \frac{1}{2^{i-m}} \quad \text{if } m > 0, \qquad \infty \text{ otherwise} \]
  clustering algorithm: hierarchical agglomerative (single link / complete link / average link)
Stemming
0  1  2  3  4  5  6  7  8  9  10 11 12 13
a  s  t  r  o  n  o  m  i  c  a  l  l  y
a  s  t  r  o  n  o  m  e  r  x  x  x  x
Edit distance = 6
\[ D = \frac{6}{8} \times \left( \frac{1}{2^0} + \ldots + \frac{1}{2^{13-8}} \right) = 1.4766 \]

0  1  2  3  4  5  6  7  8  9
a  s  t  o  n  i  s  h  x  x
a  s  t  r  o  n  o  m  e  r
Edit distance = 5
\[ D = \frac{7}{3} \times \left( \frac{1}{2^0} + \ldots + \frac{1}{2^{9-3}} \right) = 4.6302 \]
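The two worked examples can be checked with a short sketch of the distance D. It assumes, consistently with the numbers above, that m is the position of the first mismatch and n is the last position of the longer word (the shorter word being padded). Returning 0 for identical strings is an extra assumption of this sketch, and the name `yass_distance` is illustrative.

```python
import math

def yass_distance(x: str, y: str) -> float:
    """D(X, Y) = (n - m + 1) / m * sum_{i=m}^{n} 1 / 2^(i - m), or infinity if m == 0."""
    n = max(len(x), len(y)) - 1  # last position after padding the shorter word
    # First mismatch position; if one word is a prefix of the other, the
    # mismatch is where the padding starts.
    m = next(
        (i for i, (a, b) in enumerate(zip(x, y)) if a != b),
        min(len(x), len(y)),
    )
    if m > n:  # identical strings: no mismatch at all (assumed distance 0)
        return 0.0
    if m == 0:
        return math.inf
    return (n - m + 1) / m * sum(1 / 2 ** (i - m) for i in range(m, n + 1))

print(round(yass_distance("astronomically", "astronomer"), 4))  # 1.4766
print(round(yass_distance("astonish", "astronomer"), 4))        # 4.6302
```

A long shared prefix gives a small distance, which is why "astronomically" ends up closer to "astronomer" than "astonish" does even though its edit distance is larger.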
Stemming
Clustering:
[Courtesy: http://espin086.files.wordpress.com/2011/02/2-variable-clustering.png]
Word Relations
Motivation: manual thesauri are
  general purpose (Roget’s Thesaurus, WordNet) – difficult to use for document retrieval
  retrieval-oriented (INSPEC, MeSH) – expensive to build and maintain
Construct an automatic thesaurus (based on information about co-occurrence of words in a collection)
Word Relations
Association: if two terms co-occur within the same paragraph, they constitute an association
  ⟨term1, term2, assoc. frequency⟩
Gather data about term-associations over a large amount of text
Refine associations:
  Discard associations with frequency 1
  Discard terms that are associated with too many other terms (people, state, company, etc.)
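The gathering and refinement steps above might look like the sketch below. It assumes each paragraph has already been reduced to a list of terms, and that "too many other terms" is expressed as a threshold `max_partners`; that parameter, the function name, and the order of the two pruning steps are assumptions of this example.

```python
from collections import Counter
from itertools import combinations

def term_associations(paragraphs, max_partners):
    """Return {(term1, term2): frequency} after the two refinement steps."""
    pair_freq = Counter()
    for terms in paragraphs:
        # Each unordered pair of distinct terms in a paragraph is one co-occurrence.
        for t1, t2 in combinations(sorted(set(terms)), 2):
            pair_freq[(t1, t2)] += 1
    # Refinement 1: discard associations with frequency 1.
    kept = {pair: f for pair, f in pair_freq.items() if f > 1}
    # Refinement 2: discard terms associated with too many other terms.
    partners = Counter()
    for t1, t2 in kept:
        partners[t1] += 1
        partners[t2] += 1
    too_general = {t for t, count in partners.items() if count > max_partners}
    return {pair: f for pair, f in kept.items() if not (set(pair) & too_general)}

paragraphs = [
    ["immigration", "amnesty", "people"],
    ["immigration", "amnesty", "people"],
    ["company", "profit", "people"],
    ["company", "profit", "people"],
    ["immigration", "law"],
]
associations = term_associations(paragraphs, max_partners=2)
print(associations)
```

Here "people" co-occurs with four different terms and is dropped as too general, while the single "immigration"/"law" co-occurrence is dropped for having frequency 1.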
Word Relations
Each term is represented by a vector of associated terms
\[ T = (\langle t_1, w_1 \rangle, \ldots, \langle t_n, w_n \rangle) \]
⇒ term = pseudo document
Compare query to the term vectors (instead of document vectors)
\[ Sim(Q, T) = \sum_i wt(q_i) \times wt(t_i) \]
Most “similar” terms are added to the query
Example: 1986 US Immigration Law
  similar terms: illegal immigration, amnesty program, simpson-mazzoli
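Query expansion with term vectors can be sketched as below. The association weights in `term_vectors` are invented for illustration and do not come from the experiments; `expand_query` and `top_k` are likewise hypothetical names.

```python
def expand_query(query_weights, term_vectors, top_k=2):
    """Rank candidate terms by Sim(Q, T) and return the top_k with a positive score."""
    scores = {}
    for term, vector in term_vectors.items():
        if term in query_weights:
            continue  # do not re-add terms already in the query
        scores[term] = sum(w * vector.get(t, 0.0) for t, w in query_weights.items())
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [t for t in ranked if scores[t] > 0][:top_k]

# Hypothetical term vectors: each thesaurus term is a pseudo document of associated terms.
term_vectors = {
    "amnesty program": {"immigration": 0.9, "law": 0.4},
    "illegal immigration": {"immigration": 0.8, "law": 0.5, "us": 0.3},
    "stock market": {"company": 0.9, "profit": 0.7},
}
query = {"immigration": 1.0, "law": 1.0, "us": 1.0}
expansion = expand_query(query, term_vectors)
print(expansion)  # ['illegal immigration', 'amnesty program']
```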
Word Relations
Experimental results:
  Data: 500,000 documents (news, computer abstracts, govt. documents); 50 queries
  Baseline average precision: 37%
  Improves by 6 - 30% by using thesaurus
  2 weeks to generate association data!
  Processing time can be reduced without major loss in performance by using a subset of the document collection
Challenges
Does a document contain an opinion? In which portion?
  sites with a review component (easy), e.g. CNET, Amazon, Epinions
  blogs (harder)
Sentiment classification
  overall (polarity) / specific
  free form / grades or stars
  quotations
Presentation
  highlighting
  aggregation
  community identification
  estimating reliability
Query classification: is the user looking for an opinion?
Opinion Mining
Feature-based opinion summarization
  Identify the features of the product that customers have expressed opinions on (called opinion features)
  For each feature, identify how many customer reviews are positive / negative
Examples:
The pictures are very clear.
Overall a fantastic, very compact, camera.
While light, it will not easily fit in pockets. (HARD!)
Opinion Mining
Feature identification
  1 POS tagging + chunking: identify nouns, verbs, adjectives, simple noun groups, verb groups
  2 Transaction creation for each sentence: item ≡ normalized nouns / noun phrases
  3 Association rule mining: all itemsets with > 1% support are candidate frequent features
  4 Feature pruning:
      keep features that have some compact occurrences
      keep singleton itemsets only if they occur enough times in isolation, e.g. manual vs. manual mode, manual setting
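Step 3 can be illustrated with a brute-force support count over small itemsets. This is a sketch, not a full association rule miner such as Apriori: it assumes the transactions from step 2 are already available, limits itemsets to `max_size` items, and uses a much higher support threshold than 1% because the toy data has only five sentences. All names and data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_features(transactions, min_support=0.01, max_size=2):
    """Return {itemset: support} for itemsets whose support exceeds min_support."""
    n = len(transactions)
    counts = Counter()
    for items in transactions:
        unique = sorted(set(items))
        for size in range(1, max_size + 1):
            for itemset in combinations(unique, size):
                counts[itemset] += 1
    return {itemset: c / n for itemset, c in counts.items() if c / n > min_support}

# One transaction per review sentence: its normalized nouns / noun phrases.
transactions = [
    ["picture", "quality"],
    ["picture"],
    ["battery", "life"],
    ["battery", "life", "picture"],
    ["size"],
]
candidates = frequent_features(transactions, min_support=0.3)
print(sorted(candidates))
```

The pruning rules of step 4 (compactness, isolated occurrences of singletons) would then be applied to these candidates.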
Opinion Mining
Sentiment / orientation identification
  1 Examine each sentence in the review database
  2 If it contains a frequent feature, extract all the adjective words as opinion words
  3 For each feature in the sentence, the nearby adjective is recorded as its effective opinion
  4 Look up adjective in a list of adjectives with known orientation, or consult WordNet (discard unknowns)
      adjectives arranged in bipolar structures
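Steps 2 to 4 can be sketched for a single sentence as follows. The example assumes the sentence has already been POS-tagged (the tag names `ADJ`, `NOUN`, etc. are placeholders), uses a small hypothetical lexicon `ORIENTATION` in place of a real adjective list or WordNet lookup, and takes "nearby" to mean the adjective closest in token position.

```python
# Hypothetical seed lexicon: +1 = positive orientation, -1 = negative.
ORIENTATION = {"clear": +1, "fantastic": +1, "compact": +1, "blurry": -1, "heavy": -1}

def feature_opinions(tagged_sentence, features):
    """Map each frequent feature in the sentence to (nearest adjective, orientation)."""
    adjectives = [
        (i, word.lower())
        for i, (word, tag) in enumerate(tagged_sentence)
        if tag == "ADJ"
    ]
    results = {}
    for i, (word, _tag) in enumerate(tagged_sentence):
        feature = word.lower()
        if feature not in features or not adjectives:
            continue
        # Effective opinion: the adjective closest to the feature by position.
        _, nearest = min(adjectives, key=lambda pair: abs(pair[0] - i))
        if nearest in ORIENTATION:  # adjectives of unknown orientation are discarded
            results[feature] = (nearest, ORIENTATION[nearest])
    return results

tagged = [("The", "DET"), ("pictures", "NOUN"), ("are", "VERB"),
          ("very", "ADV"), ("clear", "ADJ")]
print(feature_opinions(tagged, {"pictures"}))  # {'pictures': ('clear', 1)}
```

A sentence like "While light, it will not easily fit in pockets" defeats this kind of lookup, since the opinion is carried by negation and an implicit feature rather than by a nearby adjective.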
Datasets
Blog06 (25GB), University of Glasgow: http://ir.dcs.gla.ac.uk/test_collections/access_to_data.htm
Congressional floor-debate transcripts: http://www.cs.cornell.edu/home/llee/data/convote.html
Cornell movie-review datasets: http://www.cs.cornell.edu/people/pabo/movie-review-data/
References
Untangling Text Data Mining. M. Hearst. Proceedings of ACL’99. www.ischool.berkeley.edu/~hearst/papers/acl99/acl99-tdm.html
An Introduction to Information Retrieval. Manning, Raghavan, Schütze. www-csli.stanford.edu/~schuetze/information-retrieval-book.html
Tutorial on Web Content Mining. Bing Liu. WWW 2005. www.cs.uic.edu/~liub
Web Data Mining. Bing Liu. Springer, 2006.
Opinion Mining and Sentiment Analysis. B. Pang and L. Lee. Foundations and Trends in Information Retrieval, 2(1-2), 2008.
Sentiment Analysis and Opinion Mining. Bing Liu. Morgan & Claypool, 2012. www.morganclaypool.com/doi/abs/10.2200/S00416ED1V01Y201204HLT016?journalCode=hlt