Text mining

Text mining. The Standard Data Mining process Text Mining Machine learning on text data Text Data mining Text analysis Part of Web mining Typical tasks


The Standard Data Mining process


Text Mining

• Machine learning on text data
• Text data mining
• Text analysis
• Part of Web mining
• Typical tasks include:
  – Text categorization (document classification)
  – Text clustering
  – Text summarization
  – Opinion mining
  – Entity/concept extraction
  – Information retrieval: search engines
  – Information extraction: question answering


Supervised learning algorithms

– Decision tree learning
– Naïve Bayes
– K-nearest neighbour
– Support Vector Machines
– Neural networks
– Genetic algorithms


Supervised Machine learning

1. Build or get a representative corpus
2. Label it
3. Define features
4. Represent documents
5. Learn and analyse
6. Go to 3 until accuracy is acceptable

First test features: stemmed words
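The loop above can be sketched with a small multinomial Naïve Bayes classifier, one of the supervised algorithms listed earlier. This is only an illustration under stated assumptions: the four-document corpus, its "spam"/"ham" labels, and the function names are hypothetical, and the features are plain lowercase words rather than stemmed words.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Step 3 (define features): lowercase words; a real system would also stem them.
    return text.lower().split()

def train_naive_bayes(labelled_docs):
    # Steps 4-5: represent each document as word counts and accumulate them per class.
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labelled_docs:
        class_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary

def classify(text, model):
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior of the class
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            p = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Steps 1-2: a tiny, hypothetical labelled corpus.
corpus = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project meeting notes attached", "ham"),
]
model = train_naive_bayes(corpus)
print(classify("buy cheap pills", model))         # spam
print(classify("monday project meeting", model))  # ham
```

Step 6 would correspond to changing `tokenize` (for example, adding stemming) and retraining until accuracy on held-out documents is acceptable.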


Unsupervised Learning

• Document clustering
  – k-means
  – Hierarchical Agglomerative Clustering (HAC)
  – ….
  – BIRCH
  – Association Rule Hypergraph Partitioning (ARHP)
  – Categorical clustering (CACTUS, STIRR)
  – ……
  – STC
  – QDC
• Interactive learning
• Learning from unlabelled data
• Learning to label
• Two systems that teach each other


Similarity measure

There are many different ways to measure how similar two documents are, or how similar a document is to a query.

• The result depends heavily on the choice of terms used to represent the text documents
  – Euclidean distance (L2 norm)
  – L1 norm
  – Cosine similarity
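The three measures can be sketched as follows, assuming each document (or query) has already been turned into a list of term weights of equal length. The function names and the two example vectors are illustrative only.

```python
import math

def euclidean_distance(a, b):
    # L2 norm of the difference vector.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def l1_distance(a, b):
    # L1 norm of the difference vector.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Cosine of the angle between the vectors; defined here as 0.0 for an all-zero vector.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

doc = [1, 2, 0]
query = [2, 4, 0]  # same direction as doc, twice the length
print(euclidean_distance(doc, query))
print(l1_distance(doc, query))        # 3
print(cosine_similarity(doc, query))
```

The two distances grow with document length, while cosine similarity depends only on the direction of the vectors, so `doc` and `query` above are maximally similar under cosine even though their distance is non-zero.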


Document Similarity Measures



Feature Extraction: Task (1)

Task: Extract a good subset of words to represent documents

Document collection → All unique words/phrases → Feature Extraction → All good words/phrases

Some slides by Huaizhong Kou


Feature Extraction: Task

• Indexing
• Weighting Model
• Dimensionality Reduction


Feature Extraction: Task (2)

Example text: "While more and more textual information is available online, effective retrieval is difficult without good indexing of text content."

All unique words (16): While-more-and-textual-information-is-available-online-effective-retrieval-difficult-without-good-indexing-text-content

After feature extraction (5 terms): Text-information-online-retrieval-index

Term counts: text 2, information 1, online 1, retrieval 1, index 1


Feature Extraction: Indexing (1)

Training documents

1. Identification of all unique words
2. Removal of stop words
   · non-informative words
   · e.g. {the, and, when, more}
3. Word stemming
   · removal of suffixes to generate word stems
   · groups related words
   · increases relevance
   · e.g. {walker, walking} → walk
4. Term weighting
   · naive terms
   · importance of a term in a document
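The first three indexing steps can be sketched as below. The stop-word set holds only the slide's four examples, and the suffix list is a hypothetical, deliberately crude stand-in for a real stemmer such as Porter's; both are assumptions made for illustration.

```python
import re

# Only the slide's examples; real stop-word lists are much longer.
STOP_WORDS = {"the", "and", "when", "more"}
# Hypothetical suffix list, far simpler than a real stemming algorithm.
SUFFIXES = ("ing", "er", "s")

def unique_words(text):
    # Step 1: identify all unique (lowercased) words.
    return sorted(set(re.findall(r"[a-z]+", text.lower())))

def remove_stop_words(words):
    # Step 2: drop non-informative words.
    return [w for w in words if w not in STOP_WORDS]

def stem(word):
    # Step 3: strip one known suffix, keeping a stem of at least three letters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    stems = {stem(w) for w in remove_stop_words(unique_words(text))}
    return sorted(stems)

print(index_terms("The walker and the walking dogs"))  # ['dog', 'walk']
```

Both "walker" and "walking" collapse to the single term "walk", which is the grouping effect the slide describes.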


Feature Extraction: Indexing(2)

The Vector Space Model (VSM) is one of the most commonly used text data models.

Any text document is represented by a vector of terms
• Terms are typically words and/or phrases
• Every term in the vocabulary becomes an independent dimension
• Each term that occurs in the document is represented by a non-zero value in the corresponding dimension
• A document collection is represented as a matrix $X = (x_{ji})$, with one row per document and one column per term
• where $x_{ji}$ represents the weight of the ith term in the jth document
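A minimal sketch of building such a matrix, assuming documents are already tokenized into term lists and using raw term counts as the weights. The function name and the two toy documents are illustrative.

```python
def build_matrix(documents):
    # One dimension (column) per vocabulary term, in sorted order.
    vocabulary = sorted({term for doc in documents for term in doc})
    column = {term: i for i, term in enumerate(vocabulary)}
    matrix = []
    for doc in documents:
        row = [0] * len(vocabulary)   # one row per document
        for term in doc:
            row[column[term]] += 1    # weight = raw count of the term
        matrix.append(row)
    return vocabulary, matrix

docs = [["text", "mining", "text"], ["web", "mining"]]
vocabulary, X = build_matrix(docs)
print(vocabulary)  # ['mining', 'text', 'web']
print(X)           # [[1, 2, 0], [1, 0, 1]]
```

Here `X[j][i]` plays the role of the weight of the ith term in the jth document (zero-based in the code).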


Feature Extraction: Weighting Model (1)

• tf - Term Frequency weighting

$w_{ij} = \mathrm{Freq}_{ij}$

$\mathrm{Freq}_{ij}$ ::= the number of times the jth term occurs in document $D_i$.

Drawback: does not reflect how important a term is for discriminating between documents.

• Ex.

D1 = ABRTSAQWAXAO
D2 = RTABBAXAQSAK

     A  B  K  O  Q  R  S  T  W  X
D1   4  1  0  1  1  1  1  1  1  1
D2   4  2  1  0  1  1  1  1  0  1


Feature Extraction: Weighting Model (2)

• tf-idf - Term Frequency × Inverse Document Frequency weighting

$w_{ij} = \mathrm{Freq}_{ij} \cdot \log(N / \mathrm{DocFreq}_j)$

N ::= the number of documents in the training document collection.
$\mathrm{DocFreq}_j$ ::= the number of documents in which the jth term occurs.

Advantage: reflects how important a term is for discriminating between documents.
Assumption: terms with low DocFreq are better discriminators than terms with high DocFreq in the document collection.

• Ex. (documents D1 and D2 from the tf example, N = 2, base-10 logarithm)

     A  B  K    O    Q  R  S  T  W    X
D1   0  0  0    0.3  0  0  0  0  0.3  0
D2   0  0  0.3  0    0  0  0  0  0    0
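The example can be recomputed with a short sketch, treating each letter of D1 and D2 as one term. The base-10 logarithm is an assumption inferred from the 0.3 entries in the table (log10(2/1) ≈ 0.301), and the function name is illustrative.

```python
import math
from collections import Counter

def tf_idf(documents):
    n_docs = len(documents)
    # Term frequency: how often each term occurs in each document.
    tf = {name: Counter(terms) for name, terms in documents.items()}
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for counts in tf.values():
        doc_freq.update(counts.keys())
    weights = {}
    for name, counts in tf.items():
        weights[name] = {
            term: freq * math.log10(n_docs / doc_freq[term])
            for term, freq in counts.items()
        }
    return weights

docs = {"D1": "ABRTSAQWAXAO", "D2": "RTABBAXAQSAK"}
weights = tf_idf(docs)
print(round(weights["D1"]["O"], 2))  # 0.3
print(round(weights["D2"]["K"], 2))  # 0.3
print(weights["D1"]["A"])            # 0.0
```

Terms that occur in both documents, such as A, get weight 0 however frequent they are, while terms unique to one document (O and W in D1, K in D2) keep a positive weight.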


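Feature Extraction: Weighting Model

• tf-idf weighting

• Entropy weighting

$$w_{ij} = \log(\mathrm{FREQ}_{ij} + 1.0) \cdot \bigl(1 + \mathrm{entropy}(w_i)\bigr)$$

where

$$\mathrm{entropy}(w_i) = \frac{1}{\log N} \sum_{j=1}^{N} \frac{\mathrm{FREQ}_{ij}}{\mathrm{DOCFREQ}_i} \log \frac{\mathrm{FREQ}_{ij}}{\mathrm{DOCFREQ}_i}$$

is the average entropy of the ith term. On this slide i indexes the term and j the document, the reverse of the tf and tf-idf slides. The entropy is
  -1 if the word occurs exactly once in every document
  0 if the word occurs in only one document

Ref: [13]; Ref: [11][22]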
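A hedged sketch of entropy weighting for a single term follows. It assumes the normalizer (DOCFREQ on the slide) is the term's total count over the whole collection, as in the usual log-entropy formulation; that reading is an assumption, chosen because it reproduces the slide's two boundary cases. The function names are illustrative.

```python
import math

def term_entropy(freqs):
    # freqs: counts of one term in each of the N documents of the collection.
    n_docs = len(freqs)
    total = sum(freqs)  # assumed normalizer: total count of the term
    if n_docs < 2 or total == 0:
        return 0.0
    acc = 0.0
    for f in freqs:
        if f > 0:  # a zero count contributes nothing to the sum
            p = f / total
            acc += p * math.log(p)
    return acc / math.log(n_docs)

def entropy_weight(freq, freqs):
    # freq: count of the term in the document being weighted.
    return math.log(freq + 1.0) * (1.0 + term_entropy(freqs))

print(term_entropy([1, 1, 1, 1]))  # close to -1: once in every document
print(term_entropy([5, 0, 0, 0]))  # 0.0: occurs in only one document
```

A term spread evenly over all documents therefore gets a weight near zero, while a term concentrated in one document keeps its full log-frequency weight.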


Feature Extraction: Dimension Reduction

• Document Frequency Thresholding
• χ²-statistic
• Latent Semantic Indexing
• Information Gain
• Mutual Information


Dimension Reduction: DocFreq Thresholding

• Document Frequency Thresholding

Training documents D → Naive terms

1. Calculates DocFreq(w)
2. Sets threshold
3. Removes all words with DocFreq < threshold

→ Feature terms
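Document frequency thresholding can be sketched in a few lines, assuming tokenized documents and a threshold chosen by the user. The function name, the toy documents, and the threshold value of 2 are illustrative.

```python
from collections import Counter

def doc_freq_threshold(documents, threshold):
    # DocFreq(w): number of documents in which term w occurs at least once.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each term once per document
    # Keep only terms whose document frequency reaches the threshold.
    return sorted(term for term, df in doc_freq.items() if df >= threshold)

docs = [["text", "mining", "web"], ["text", "mining"], ["text", "search"]]
print(doc_freq_threshold(docs, threshold=2))  # ['mining', 'text']
```

The rare terms "web" and "search" are dropped, shrinking the vocabulary from four naive terms to two feature terms.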