Text mining
The Standard Data Mining process
Text Mining
• Machine learning on text data
• Text data mining
• Text analysis
• Part of Web mining
• Typical tasks include:
– Text categorization (document classification)
– Text clustering
– Text summarization
– Opinion mining
– Entity/concept extraction
– Information retrieval: search engines
– Information extraction: question answering
Supervised learning algorithms
– Decision tree learning
– Naïve Bayes
– K-nearest neighbour
– Support Vector Machines
– Neural Networks
– Genetic algorithms
Supervised Machine learning
1. Build or get a representative corpus
2. Label it
3. Define features
4. Represent documents
5. Learn and analyse
6. Go to 3 until accuracy is acceptable
First test features: stemmed words
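The six steps above can be sketched end to end. A minimal illustration using multinomial Naive Bayes, one of the supervised algorithms listed below; the toy corpus and its spam/ham labels are invented for the example, and for brevity the features are raw words rather than stems:

```python
import math
from collections import Counter, defaultdict

# Steps 1-2: a tiny labelled corpus (invented toy data for illustration)
corpus = [
    ("cheap pills buy now", "spam"),
    ("meeting agenda attached", "ham"),
    ("buy cheap watches now", "spam"),
    ("project meeting tomorrow", "ham"),
]

# Steps 3-4: features are words; each document becomes a bag of words
def featurize(text):
    return Counter(text.lower().split())

# Step 5: learn a multinomial Naive Bayes model
class NaiveBayes:
    def fit(self, docs):
        self.class_counts = Counter(label for _, label in docs)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in docs:
            bag = featurize(text)
            self.word_counts[label].update(bag)
            self.vocab.update(bag)
        return self

    def predict(self, text):
        scores = {}
        n_docs = sum(self.class_counts.values())
        for label, n in self.class_counts.items():
            score = math.log(n / n_docs)  # class prior
            total = sum(self.word_counts[label].values())
            for word, freq in featurize(text).items():
                # Laplace smoothing avoids zero probabilities for unseen words
                p = (self.word_counts[label][word] + 1) / (total + len(self.vocab))
                score += freq * math.log(p)
            scores[label] = score
        return max(scores, key=scores.get)

model = NaiveBayes().fit(corpus)
print(model.predict("buy pills now"))      # classified as spam
print(model.predict("agenda for meeting")) # classified as ham
```

Step 6 would then iterate on the feature definition (e.g. switching to stemmed words) until accuracy on held-out data is acceptable.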
Unsupervised Learning
• Document clustering
• k-means
• Hierarchical Agglomerative Clustering (HAC)
• …
• BIRCH
• Association Rule Hypergraph Partitioning (ARHP)
• Categorical clustering (CACTUS, STIRR)
• …
• STC
• QDC
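As an illustration of the first algorithm in the list, a minimal k-means sketch; the two-dimensional toy vectors are invented, whereas real document vectors would be high-dimensional tf or tf-idf vectors:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means with squared Euclidean distance on dense vectors."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its cluster
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return clusters, centroids

# Toy term-frequency vectors (invented): two obvious groups
docs = [(9, 1), (8, 2), (10, 0), (1, 9), (0, 8), (2, 10)]
clusters, _ = kmeans(docs, k=2)
print(sorted(map(sorted, clusters)))
```

HAC, by contrast, would start from one cluster per document and repeatedly merge the closest pair, so it needs no choice of k up front.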
• Interactive learning
• Learning from unlabelled data
• Learning to label
• Two systems that teach each other
Similarity measure
There are many ways to measure how similar two documents are, or how similar a document is to a query.
• The result depends heavily on the choice of terms used to represent the documents:
– Euclidean distance (L2 norm)
– L1 norm
– Cosine similarity
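The three measures can be sketched directly; the toy term vectors below are invented for illustration:

```python
import math

def euclidean(a, b):
    """L2 distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def l1(a, b):
    """L1 (Manhattan) distance."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy term-frequency vectors (invented): d2 repeats d1's terms twice
d1, d2 = [1, 2, 0], [2, 4, 0]
print(euclidean(d1, d2))  # 2.236...
print(cosine(d1, d2))     # ≈ 1.0: cosine ignores document length
```

The example shows why cosine similarity is popular for text: a long document and a short one about the same topic point in the same direction even though their Euclidean distance is large.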
Document Similarity Measures
Feature Extraction: Task(1)
Task: Extract a good subset of words to represent documents
Document collection → All unique words/phrases → Feature Extraction → All good words/phrases
Some slides by Huaizhong Kou
Feature Extraction Task: Indexing → Weighting Model → Dimensionality Reduction
Feature Extraction: Task(2)While more and more textual information is available online, effective retrieval is difficult without good indexing of text content.
Naive terms (16): while-more-and-textual-information-is-available-online-effective-retrieval-difficult-without-good-indexing-text-content
→ Feature Extraction →
Index terms (5): text-information-online-retrieval-index, with term frequencies 2 1 1 1 1
Feature Extraction: Indexing(1)
Pipeline over the training documents:
1. Identification of all unique words
2. Removal of stop words
· non-informative words
· ex. {the, and, when, more}
3. Word stemming
· removal of suffixes to generate word stems
· groups related words, increasing the relevance
· ex. {walker, walking} → walk
4. Term weighting
· from naive terms to the importance of each term in a document
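The stop-word and stemming steps can be sketched as follows; the stop-word list and suffix rules below are illustrative stand-ins, not a real stemmer such as Porter's:

```python
# Tiny illustrative stop-word list (a real system would use a larger one)
STOP_WORDS = {"the", "and", "when", "more", "is", "a", "of", "to"}
# Crude suffix stripping, not a real Porter stemmer
SUFFIXES = ("ing", "er", "ed", "s")

def stem(word):
    """Strip the first matching suffix, keeping a minimal stem length."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def index_terms(text):
    """Lowercase, drop stop words, and stem: steps 1-3 of the pipeline."""
    words = [w for w in text.lower().split() if w.isalpha()]
    return [stem(w) for w in words if w not in STOP_WORDS]

print(index_terms("the walker and walking"))  # ['walk', 'walk']
```

As in the slide's example, {walker, walking} collapse to the single stem walk, so both surface forms contribute to one index term.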
Feature Extraction: Indexing(2)
Vector Space Model (VSM) is one of the most commonly used text data models:
• Any text document is represented by a vector of terms
• Terms are typically words and/or phrases
• Every term in the vocabulary becomes an independent dimension
• Each term occurring in a document contributes a non-zero value in the corresponding dimension
• A document collection is represented as a matrix X = [x_ji], where x_ji is the weight of the ith term in the jth document
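A minimal sketch of building the VSM matrix from raw frequencies; the two toy documents are invented:

```python
def term_document_matrix(docs):
    """Build the VSM matrix X where X[j][i] is the weight (here, the raw
    frequency) of the ith vocabulary term in the jth document."""
    vocab = sorted({w for doc in docs for w in doc.split()})
    index = {w: i for i, w in enumerate(vocab)}
    X = [[0] * len(vocab) for _ in docs]
    for j, doc in enumerate(docs):
        for w in doc.split():
            X[j][index[w]] += 1
    return vocab, X

vocab, X = term_document_matrix(["a b a", "b c"])
print(vocab)  # ['a', 'b', 'c']
print(X)      # [[2, 1, 0], [0, 1, 1]]
```

The raw frequencies in X are exactly the tf weights described next; the later weighting models only change how each cell is scaled.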
Feature Extraction: Weighting Model(1)
• tf – Term Frequency weighting:
w_ij = Freq_ij
where Freq_ij is the number of times the jth term occurs in document D_i.
Drawback: does not reflect how well a term discriminates between documents.
• Ex. D1 = ABRTSAQWAXO, D2 = RTABBAXAQSK (each letter is a term)

Term  A B K O Q R S T W X
D1    3 1 0 1 1 1 1 1 1 1
D2    3 2 1 0 1 1 1 1 0 1
Feature Extraction: Weighting Model(2)
• tf-idf – Term Frequency × Inverse Document Frequency weighting:
w_ij = Freq_ij * log(N / DocFreq_j)
where N is the number of documents in the training collection and DocFreq_j is the number of documents in which the jth term occurs.
Advantage: reflects how well a term discriminates between documents.
Assumption: terms with low DocFreq are better discriminators than terms with high DocFreq in the collection.
• Ex. (log base 10, so log(2/1) ≈ 0.3)

Term  A B K   O   Q R S T W   X
D1    0 0 0   0.3 0 0 0 0 0.3 0
D2    0 0 0.3 0   0 0 0 0 0   0
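The tf-idf formula can be sketched directly on the worked example, using base-10 logarithms to match the 0.3 ≈ log(2) values in the table:

```python
import math
from collections import Counter

def tfidf(docs):
    """w_ij = Freq_ij * log10(N / DocFreq_j), as in the slide's formula."""
    N = len(docs)
    tfs = [Counter(doc) for doc in docs]
    # DocFreq: in how many documents each term occurs
    docfreq = Counter(term for tf in tfs for term in tf)
    return [{t: f * math.log10(N / docfreq[t]) for t, f in tf.items()}
            for tf in tfs]

# The example documents, with each character treated as a term
weights = tfidf([list("ABRTSAQWAXO"), list("RTABBAXAQSK")])
print(round(weights[0]["O"], 1))  # 0.3: 'O' occurs in only one of the two docs
print(weights[0]["A"])            # 0.0: 'A' occurs in every document
```

With only two documents, every shared term gets log(2/2) = 0 and vanishes, which is why the table above is so sparse; in a larger collection the weights are graded rather than all-or-nothing.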
Feature Extraction: Weighting Model(3)
• Entropy weighting:

w_ij = log(FREQ_ij + 1.0) * (1 + entropy_i(w))

where

entropy_i(w) = (1 / log N) * Σ_{j=1..N} (FREQ_ij / DOCFREQ_i) * log(FREQ_ij / DOCFREQ_i)

is the average entropy of the ith term (DOCFREQ_i = Σ_j FREQ_ij, the total frequency of the ith term in the collection):
-1: if the word occurs once in every document
0: if the word occurs in only one document
Refs: [11], [13], [22]
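A minimal sketch of entropy weighting, consistent with the two limiting cases stated above (-1 for a word occurring once in every document, 0 for a word confined to one document); the frequency rows are invented:

```python
import math

def entropy_weights(freq):
    """Entropy weighting: freq[i][j] is FREQ_ij, the frequency of term i
    in document j (one row per term)."""
    N = len(freq[0])
    weights = []
    for row in freq:
        total = sum(row)  # DOCFREQ_i: total occurrences of term i
        entropy = sum((f / total) * math.log(f / total)
                      for f in row if f) / math.log(N)
        weights.append([math.log(f + 1.0) * (1 + entropy) for f in row])
    return weights

# A term occurring once in every document has entropy -1, so its weight
# vanishes; a term confined to one document has entropy 0.
uniform = [1, 1, 1, 1]
focused = [4, 0, 0, 0]
w = entropy_weights([uniform, focused])
print(w[0])     # ≈ [0, 0, 0, 0]: the (1 + entropy) factor is 0
print(w[1][0])  # log(5): full weight, maximally informative term
```

The intuition mirrors tf-idf: a term spread evenly over the collection carries no information about any particular document, so its weight is driven to zero.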
Feature Extraction: Dimension Reduction
• Document Frequency Thresholding
• χ²-statistic
• Latent Semantic Indexing
• Information Gain
• Mutual Information
Dimension Reduction: DocFreq Thresholding
• Document Frequency Thresholding, applied to the naive terms of the training documents D:
1. Calculate DocFreq(w) for every word w
2. Set a threshold θ
3. Remove all words with DocFreq(w) < θ
The words that survive are the feature terms.
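A minimal sketch of the thresholding step; the documents and threshold value are invented:

```python
from collections import Counter

def docfreq_threshold(docs, theta):
    """Keep only words appearing in at least `theta` documents
    (i.e. remove all words with DocFreq(w) < theta)."""
    # set(...) ensures each document counts a word at most once
    docfreq = Counter(w for doc in docs for w in set(doc.split()))
    return {w for w, df in docfreq.items() if df >= theta}

docs = ["apple banana", "apple cherry", "apple banana date"]
print(docfreq_threshold(docs, theta=2))  # the feature terms: apple, banana
```

Note the trade-off: a high θ shrinks the vocabulary aggressively but, by the tf-idf assumption above, risks discarding rare terms that are precisely the best discriminators.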