Introduction to Text Mining - Indian Statistical …acmsc/TMW2014/M_mitra.pdfIntroduction to Text...

Introduction to Text Mining

Mandar Mitra

Indian Statistical Institute

M. Mitra (ISI) Text Mining 1 / 29

Outline

1 Preliminaries

2 Preprocessing

3 Mining word associations

4 Opinion mining

What is Text Mining?

Strict definition: the nontrivial extraction of implicit, previously unknown, and potentially useful information from [textual] data

OR

Loose definition: the science of extracting useful information from large [textual] datasets

Old wine in a new bottle? Text mining = information retrieval + statistics + artificial intelligence (natural language processing, machine learning / pattern recognition)

Why is it interesting?

Growth of Web / electronic information sources

Multidisciplinary nature

E-commerce potential

“Electronic commerce is emerging as the killer domain for data-mining technology” (Ronny Kohavi)

Data sources

World Wide Web: unstructured and semi-structured text

“deep” web: pages that do not exist until they are created dynamically as the result of a specific search

social networks

Intranet: internal correspondence, memos, presentations

white papers, technical reports

customer email, customer forums, product reviews

news wires ...

No structure / general schema / tabular form that fits text

Indexing

Any text item (“document”) is represented as a list of terms and associated weights

D = (⟨t1, w1⟩, . . . , ⟨tn, wn⟩)

Term = keywords or content-descriptors

Weight = measure of the importance of a term in representing the information contained in the document

Tokenization: identify individual words

Sachin Tendulkar made a tearful but self-effacing farewell as his glittering 24-year career came to an end on Saturday at his home ground of Wankhede Stadium.

→ Sachin, Tendulkar, ..., tearful, but, ...
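A minimal tokenization sketch for the sentence above. The regular expression (runs of letters and digits, optionally joined by hyphens) is an assumption made for illustration; the slides do not prescribe a tokenization rule.

```python
import re

# Assumed token pattern: alphanumeric runs, with internal hyphens kept
# so that "self-effacing" and "24-year" stay single tokens.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")

def tokenize(text):
    """Return the list of word tokens found in text, in order."""
    return TOKEN_PATTERN.findall(text)

sentence = ("Sachin Tendulkar made a tearful but self-effacing farewell as his "
            "glittering 24-year career came to an end on Saturday at his home "
            "ground of Wankhede Stadium.")
tokens = tokenize(sentence)
```

Whether hyphenated words should be one token or two depends on the application; this sketch keeps them whole.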

Stopword removal: eliminate common words, e.g. and, of, the, etc.
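Stopword removal can be sketched as a simple filter. The stopword set below is a tiny illustrative list chosen for this example, not a standard one; real systems typically use lists of a few hundred words.

```python
# Illustrative stopword list (an assumption, not taken from the slides).
STOPWORDS = {"a", "an", "and", "as", "at", "but", "his", "of", "on", "the", "to"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens whose lowercase form is not a stopword."""
    return [t for t in tokens if t.lower() not in stopwords]

kept = remove_stopwords(
    ["Sachin", "Tendulkar", "made", "a", "tearful", "but", "self-effacing", "farewell"]
)
```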

Stemming: reduce words to a common root

  e.g. resignation, resigned, resigns → resign

  analysis, analyze, analyzing → analy

use standard algorithms (Porter)
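To make the idea concrete, here is a deliberately crude suffix-stripping sketch. It is not the Porter algorithm: the suffix list and the minimum stem length are hypothetical choices that happen to handle the "resign" example, and a real stemmer applies many more rules and conditions.

```python
# Hypothetical suffix list, far smaller than Porter's rule set.
SUFFIXES = ("ation", "ing", "ed", "s")

def crude_stem(word, min_stem=3):
    """Strip the longest matching suffix, keeping at least min_stem characters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

stems = [crude_stem(w) for w in ("resignation", "resigned", "resigns")]
```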

Thesaurus: find synonyms for words in the document

Phrases: find multi-word terms, e.g. computer science, data mining

  use syntax/linguistic methods or “statistical” methods

Named entities: identify names of people, organizations, places; dates; monetary or other amounts, etc.

Indexing: Term Weights

Term frequency (tf): repeated words are strongly related to content

Inverse document frequency (idf): an uncommon term is more important

Normalization by document length:

  long docs. contain many distinct words

  long docs. contain the same word many times

  term weights for long documents should be reduced

  use # bytes, # distinct words, Euclidean length, etc.

Weight = tf × idf / normalization
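The weight formula above can be sketched as follows. Two choices here are assumptions made for illustration: idf is taken as log(N / df), and the normalization is the Euclidean length of the document's tf × idf vector (one of the options listed on the slide).

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        raw = {t: f * math.log(n_docs / df[t]) for t, f in tf.items()}
        # Euclidean length; fall back to 1.0 if every raw weight is zero.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        weighted.append({t: w / norm for t, w in raw.items()})
    return weighted

example = tfidf_weights([
    ["text", "mining", "text"],
    ["data", "mining"],
    ["text", "retrieval"],
])
```

In the first toy document, "text" occurs twice and ends up with a higher weight than "mining", which has the same document frequency but occurs once.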

Commonly used weighting schemes

Pivoted normalization [Singhal et al., SIGIR 96]

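$$\frac{\dfrac{1 + \log(tf)}{1 + \log(\text{average } tf)} \times \log\left(\dfrac{N}{df}\right)}{(1.0 - slope) \times pivot + slope \times \#\,\text{unique terms}}$$

BM25 (probabilistic model) [Robertson and Zaragoza, FTIR 2009]

$$\frac{tf \times \log\left(\dfrac{N - df + 0.5}{df + 0.5}\right)}{k_1\left((1 - b) + b\,\dfrac{dl}{avdl}\right) + tf}$$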
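Both weighting schemes translate directly into code. The sketch below follows the two formulas term by term; the default values k1 = 1.2, b = 0.75 and slope = 0.2 are commonly quoted settings and are assumptions here, not values given on the slides.

```python
import math

def pivoted_weight(tf, avg_tf, df, n_docs, unique_terms, pivot, slope=0.2):
    """Pivoted-normalization term weight; tf and avg_tf must be positive."""
    numerator = (1 + math.log(tf)) / (1 + math.log(avg_tf)) * math.log(n_docs / df)
    return numerator / ((1.0 - slope) * pivot + slope * unique_terms)

def bm25_weight(tf, df, n_docs, dl, avdl, k1=1.2, b=0.75):
    """BM25 term weight; dl is the document length, avdl the average length."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    return tf * idf / (k1 * ((1 - b) + b * dl / avdl) + tf)

# Same term statistics, documents of average and of double length.
w_average = bm25_weight(tf=3, df=10, n_docs=1000, dl=100, avdl=100)
w_long = bm25_weight(tf=3, df=10, n_docs=1000, dl=200, avdl=100)
```

With identical tf and df, the longer document receives the smaller weight, which is the length normalization at work.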

Searching

Measure vocabulary overlap between user query and documents.

        t1 . . . tn
    Q = q1 . . . qn
    D = d1 . . . dn

$$Sim(Q, D) = \vec{Q} \cdot \vec{D} = \sum_i q_i \times d_i$$

Use inverted list (index).

$$Term_i \rightarrow (D_{i_1}, w_{i_1}), \ldots, (D_{i_k}, w_{i_k})$$
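A small sketch of how an inverted list supports this similarity computation: only documents that share at least one term with the query are ever touched. The function names and the toy weights are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(doc_vectors):
    """doc_vectors: {doc_id: {term: weight}} -> {term: [(doc_id, weight), ...]}."""
    index = defaultdict(list)
    for doc_id, vector in doc_vectors.items():
        for term, weight in vector.items():
            index[term].append((doc_id, weight))
    return index

def search(query, index):
    """query: {term: weight}. Returns (doc_id, score) pairs, best score first."""
    scores = defaultdict(float)
    for term, q_weight in query.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight  # accumulates sum_i q_i * d_i
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

index = build_inverted_index({
    "d1": {"text": 0.8, "mining": 0.6},
    "d2": {"data": 0.7, "mining": 0.7},
    "d3": {"cricket": 1.0},
})
results = search({"text": 1.0, "mining": 1.0}, index)
```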

Stemming

YASS [Majumder et al., ACM TOIS 25(4), 2007]

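Stemming ≡ grouping morphologically related words together, e.g. { analysis, analyze, analyzing }

Try clustering:

  distance measure: edit distance, or

$$D(X, Y) = \frac{n - m + 1}{m} \times \sum_{i=m}^{n} \frac{1}{2^{i-m}} \quad \text{if } m > 0, \qquad \infty \text{ otherwise}$$

  where positions are counted from 0, m is the position of the first mismatch between X and Y, and n is the last position of the longer string (the shorter string is padded)

  clustering algorithm: hierarchical agglomerative (single link / complete link / average link)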

    0 1 2 3 4 5 6 7 8 9 10 11 12 13
    a s t r o n o m i c a  l  l  y
    a s t r o n o m e r x  x  x  x

Edit distance = 6

$$D = \frac{6}{8} \times \left(\frac{1}{2^0} + \ldots + \frac{1}{2^{13-8}}\right) = 1.4766$$

    0 1 2 3 4 5 6 7 8 9
    a s t o n i s h x x
    a s t r o n o m e r

Edit distance = 5

$$D = \frac{7}{3} \times \left(\frac{1}{2^0} + \ldots + \frac{1}{2^{9-3}}\right) = 4.6302$$
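The two worked examples can be checked with a short sketch of the distance D. The handling of identical strings (distance 0) is an assumption added here so that the function is total; the slides only cover the mismatch case.

```python
import math

def yass_distance(x, y):
    """Distance D from the slide: penalizes early mismatches more than late ones."""
    length = max(len(x), len(y))
    n = length - 1  # last position of the longer (padded) string
    # m = position of the first mismatch; running off the shorter string counts
    # as a mismatch, which plays the role of the padding characters.
    m = 0
    while m < len(x) and m < len(y) and x[m] == y[m]:
        m += 1
    if m == length:
        return 0.0  # identical strings (assumption, see lead-in)
    if m == 0:
        return math.inf
    return (n - m + 1) / m * sum(1 / 2 ** (i - m) for i in range(m, n + 1))

d1 = yass_distance("astronomically", "astronomer")
d2 = yass_distance("astonish", "astronomer")
```

The pair that shares the long prefix "astronom" comes out much closer than the pair that only shares "ast", even though their edit distances (6 and 5) suggest the opposite.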

Clustering:

[Figure: clustering illustration. Courtesy: http://espin086.files.wordpress.com/2011/02/2-variable-clustering.png]
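A hedged sketch of single-link hierarchical agglomerative clustering over words, using the distance D defined earlier. The stopping threshold of 2.0 and the tiny word list are assumptions chosen for the example; the quadratic pair search is kept simple rather than efficient.

```python
import math
from itertools import combinations

def yass_distance(x, y):
    """Compact version of the distance D (first mismatch at m, last position n)."""
    length = max(len(x), len(y))
    m = 0
    while m < len(x) and m < len(y) and x[m] == y[m]:
        m += 1
    if m == length:
        return 0.0
    if m == 0:
        return math.inf
    return (length - m) / m * sum(1 / 2 ** k for k in range(length - m))

def single_link_clusters(words, distance, threshold):
    """Repeatedly merge the two closest clusters until none is within threshold."""
    clusters = [{w} for w in words]
    while len(clusters) > 1:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Single link: cluster distance = distance of the closest pair.
            d = min(distance(u, v) for u in clusters[a] for v in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        if d > threshold:
            break
        clusters[a] |= clusters[b]
        del clusters[b]  # safe because a < b
    return clusters

words = ["astronomer", "astronomically", "astonish", "astonished"]
clusters = single_link_clusters(words, yass_distance, threshold=2.0)
```

Each resulting cluster would then be treated as one stem class.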

Word Relations

Motivation: manual thesauri are

  general purpose (Roget’s Thesaurus, WordNet) – difficult to use for document retrieval

  retrieval-oriented (INSPEC, MeSH) – expensive to build and maintain

Construct an automatic thesaurus (based on information about co-occurrence of words in a collection)

Association: if two terms co-occur within the same paragraph, they constitute an association

⟨term1, term2, assoc. frequency⟩

Gather data about term-associations over a large amount of text

Refine associations:

  Discard associations with frequency 1

  Discard terms that are associated with too many other terms (people, state, company, etc.)
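The gathering and refinement steps above might be sketched like this. The cut-off `max_partners` and the order in which the two refinements are applied are assumptions; the slides give no concrete thresholds.

```python
from collections import Counter, defaultdict
from itertools import combinations

def build_associations(paragraphs, max_partners):
    """paragraphs: list of token lists. Returns {(term1, term2): frequency}."""
    freq = Counter()
    for tokens in paragraphs:
        # Each unordered pair of distinct terms in a paragraph is one association.
        for pair in combinations(sorted(set(tokens)), 2):
            freq[pair] += 1
    # Refinement 1: discard associations seen only once.
    kept = {pair: f for pair, f in freq.items() if f > 1}
    # Refinement 2: discard terms associated with too many other terms.
    partners = defaultdict(set)
    for a, b in kept:
        partners[a].add(b)
        partners[b].add(a)
    too_general = {t for t, p in partners.items() if len(p) > max_partners}
    return {pair: f for pair, f in kept.items()
            if pair[0] not in too_general and pair[1] not in too_general}

associations = build_associations(
    [
        ["immigration", "amnesty", "people"],
        ["immigration", "amnesty", "people"],
        ["people", "state"],
        ["people", "state"],
        ["people", "company"],
        ["people", "company"],
        ["immigration", "law"],
    ],
    max_partners=3,
)
```

In this toy input, "people" co-occurs with four different terms and is dropped as too general, while the one-off pair (immigration, law) is dropped for its frequency.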

Each term is represented by a vector of associated terms

T = (⟨t1, w1⟩, . . . , ⟨tn, wn⟩)

⇒ term = pseudo document

Compare query to the term vectors (instead of document vectors)

$$Sim(Q, T) = \sum_i wt(q_i) \times wt(t_i)$$

Most “similar” terms are added to the query

Example: 1986 US Immigration Law

  similar terms: illegal immigration, amnesty program, simpson-mazzoli
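Query expansion with term vectors can be sketched as below. The term vectors and weights are invented toy values (the slide's example terms are reused only as labels), and `k`, the number of terms to add, is a hypothetical parameter.

```python
def expand_query(query, term_vectors, k):
    """query: {term: weight}; term_vectors: {term: {associated term: weight}}.

    Returns the original query terms followed by the k most similar terms.
    """
    scores = {}
    for term, vector in term_vectors.items():
        if term in query:
            continue
        # Sim(Q, T) = sum_i wt(q_i) * wt(t_i), over the shared terms.
        sim = sum(w * vector.get(q, 0.0) for q, w in query.items())
        if sim > 0:
            scores[term] = sim
    best = sorted(scores, key=lambda t: (-scores[t], t))[:k]
    return list(query) + best

toy_vectors = {
    "amnesty program": {"immigration": 3.0, "law": 2.0},
    "simpson-mazzoli": {"immigration": 2.0, "law": 2.0},
    "cricket": {"stadium": 5.0},
}
expanded = expand_query({"immigration": 1.0, "law": 1.0}, toy_vectors, k=2)
```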

Experimental results:

  Data: 500,000 documents (news, computer abstracts, govt. documents); 50 queries

  Baseline average precision: 37%

  Improves by 6 - 30% by using thesaurus

  2 weeks to generate association data!

  Processing time can be reduced without major loss in performance by using a subset of the document collection

Challenges

Does a document contain an opinion? In which portion?

  sites with a review component: easy, e.g. CNET, Amazon, Epinions

  blogs: harder

Sentiment classification

  overall (polarity) / specific

  free form / grades or stars

  quotations

Presentation

  highlighting

  aggregation

  community identification

  estimating reliability

Query classification: is the user looking for an opinion?

Opinion Mining

Feature-based opinion summarization

  Identify the features of the product that customers have expressed opinions on (called opinion features)

  For each feature, identify how many customer reviews are positive / negative

Examples:

The pictures are very clear.

Overall a fantastic, very compact, camera.

While light, it will not easily fit in pockets. (HARD!)

Feature identification

1 POS tagging + chunking: identify nouns, verbs, adjectives, simple noun groups, verb groups

2 Transaction creation for each sentence: item ≡ normalized nouns / noun phrases

3 Association rule mining: all itemsets with > 1% support are candidate frequent features

4 Feature pruning:

  keep features that have some compact occurrences

  keep singleton itemsets only if they occur enough times in isolation, e.g. manual vs. manual mode, manual setting
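Step 3 can be illustrated with a brute-force support count. This is not an Apriori implementation: it simply enumerates all itemsets up to a small size, which is an assumption that is only reasonable because sentences contain few nouns. Steps 1, 2 and 4 are assumed to happen elsewhere.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support=0.01, max_size=2):
    """transactions: one set of normalized nouns / noun phrases per sentence.

    Returns {itemset (sorted tuple): count} for itemsets above min_support.
    """
    counts = Counter()
    for items in transactions:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(items), size):
                counts[itemset] += 1
    threshold = min_support * len(transactions)
    return {itemset: c for itemset, c in counts.items() if c > threshold}

# Toy transactions; 30% support is used only because the example is tiny.
candidates = frequent_itemsets(
    [{"picture"}, {"picture", "quality"}, {"picture", "quality"}, {"battery"}],
    min_support=0.3,
)
```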

Sentiment / orientation identification

1 Examine each sentence in the review database

2 If it contains a frequent feature, extract all the adjective words as opinion words

3 For each feature in the sentence, the nearby adjective is recorded as its effective opinion

4 Look up adjective in a list of adjectives with known orientation, or consult WordNet (discard unknowns)

  adjectives arranged in bipolar structures
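Steps 2 to 4 can be sketched for a single sentence as follows. The input is assumed to be POS-tagged already, with "JJ" marking adjectives; "nearby" is interpreted as the adjective closest in token position; and the orientation lexicon is a hypothetical seed list standing in for the adjective list or WordNet lookup.

```python
def feature_orientations(tagged_sentence, features, lexicon):
    """tagged_sentence: list of (word, POS tag) pairs.

    Returns {feature: orientation} for features whose nearest adjective
    has a known orientation in lexicon.
    """
    adjectives = [(i, word.lower())
                  for i, (word, tag) in enumerate(tagged_sentence) if tag == "JJ"]
    result = {}
    for i, (word, _) in enumerate(tagged_sentence):
        if word.lower() not in features or not adjectives:
            continue
        # Effective opinion = adjective closest to the feature in the sentence.
        _, nearest = min(adjectives, key=lambda a: abs(a[0] - i))
        if nearest in lexicon:  # unknown adjectives are discarded
            result[word.lower()] = lexicon[nearest]
    return result

sentence = [("The", "DT"), ("pictures", "NNS"), ("are", "VBP"),
            ("very", "RB"), ("clear", "JJ")]
opinions = feature_orientations(sentence, {"pictures"}, {"clear": "positive"})
```

A sentence such as "While light, it will not easily fit in pockets" shows why this is only a first approximation: negation and implicit features are not handled.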

Datasets

Blog06 (25GB), University of Glasgow: http://ir.dcs.gla.ac.uk/test_collections/access_to_data.htm

Congressional floor-debate transcripts: http://www.cs.cornell.edu/home/llee/data/convote.html

Cornell movie-review datasets: http://www.cs.cornell.edu/people/pabo/movie-review-data/
