B S Harish
Department of Information Science & Engineering
SJCE, Mysore, INDIA
Mail: [email protected]
https://sites.google.com/site/bsharishsjce/
Data explosion problem
Automated data collection tools, widely used database systems, a computerized society, and the Internet have led to tremendous amounts of data accumulating, and waiting to be analyzed, in databases, data warehouses, the WWW, and other information repositories.
We are drowning in data, but starving for knowledge!
Solution: Data warehousing and data mining
Data warehousing and on-line analytical processing
Mining interesting knowledge (rules, regularities, patterns,
constraints) from data in large databases
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, information harvesting, business intelligence, etc.
What Is Data Mining?
Selection: Obtain data from various sources
Preprocessing: Cleanse data
Transformation: Convert to a common format; transform to the new format
Data Mining: Obtain desired results
Interpretation/Evaluation: Present results to the user in a meaningful manner
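To make the flow concrete, here is a minimal Python sketch of these five stages as composable functions. The function names and the toy "pattern" (a mean value) are illustrative assumptions, not any standard API.

```python
# A toy KDD pipeline: each stage is a plain function, chained at the end.
def select(sources):                      # Selection: gather raw records
    return [row for src in sources for row in src]

def preprocess(rows):                     # Preprocessing: drop incomplete rows
    return [r for r in rows if None not in r.values()]

def transform(rows):                      # Transformation: common (numeric) format
    return [(r["id"], float(r["value"])) for r in rows]

def mine(records):                        # Data mining: trivial "pattern" = mean value
    vals = [v for _, v in records]
    return {"mean_value": sum(vals) / len(vals)}

def interpret(result):                    # Interpretation: present to the user
    print(f"Discovered pattern: {result}")

sources = [[{"id": 1, "value": "3.5"}, {"id": 2, "value": None}],
           [{"id": 3, "value": "2.5"}]]
interpret(mine(transform(preprocess(select(sources)))))
```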
Different views lead to different classifications
Kinds of data to be mined
Kinds of knowledge to be discovered
Kinds of techniques utilized
Kinds of applications adapted
Applications of Data Mining
A case study on Text Mining
Introduction
Text mining is an emerging technology for handling the ever-increasing volume of text data
Text classification/clustering is one of the fundamental functions in text mining
Text classification/clustering divides a collection of text documents into category groups so that documents in the same group describe the same topic, such as classical music
Unlike structured data, text data poses a number of new challenges:
Volume
Dimensionality
Sparsity
Complex semantics
Security applications
Search Engines
Online Media applications
Marketing applications
Sentiment analysis
Academic applications
Text Mining Tasks
Text classification
Text clustering
Concept/entity extraction
Sentiment analysis
Document summarization
Entity relation modeling
Text Classification
Text Classification Applications
E-mail spam filtering
Categorize newspaper articles and newswires into topics
Organize Web pages into hierarchical categories
Sort journals and abstracts by subject categories (e.g., MEDLINE)
Assign international clinical codes to patient clinical records
Block Diagram of Text Classification Model
Pool of Documents → Pre-Processing (Stemming / Stopword Elimination / Linguistic Preprocessing) → Representation (BOW/VSM, Probabilistic, Ontology, Term Sequence, N-Grams, Phrase Based, Symbolic, UNL) → Classification (Supervised) or Clustering (Unsupervised) → Evaluating Results
Representation Methods
VSM/BOW Representation
Binary Representation
Ontology Representation
N-Grams
Multi-Word Terms
Latent Semantic Indexing
Locality Preserving Indexing
Universal Networking Language
Classifiers
Naïve Bayes Classifier
K-NN Classifier
Centroid Classifier
Decision Trees
Rocchio Classifier
Neural Networks
SVM
Symbolic Representation (Our Method); ACM Compute, 2010
Preprocessing and representation pipeline: Input Documents → Collection Reader → Detagger → Tokenizer → Stopword Removal → Stemming → Pruning → Representation (VSM with the TF-IDF weighting scheme, or Ontology Based)
Text Data and Representation Models
Different representation formats offer different capabilities to describe context, structure, semantics, and presentation of text content
Text Conversion
Representation Model
In information retrieval and text mining, text data of different formats is represented in a common representation model,
e.g., the Vector Space Model
Text data is converted to the model representation
Vector Space Model (VSM)
The most popular representation model used in information
retrieval and text mining.
In VSM, a text document is represented as a vector of terms <t1, t2, …, ti, …, tn>.
Each term ti represents a word or a phrase.
The set of all n unique terms in a set of text documents forms the vocabulary for the set of documents.
A set of documents is represented as a set of vectors, which can be written as a matrix.
Graphical Representation
Example (vectors in the 3-term space spanned by axes T1, T2, T3):
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q = 0T1 + 0T2 + 2T3
• Is D1 or D2 more similar to Q?
• How to measure the degree of similarity? Distance? Angle? Projection?
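A quick Python check of the example above, using the angle (cosine) as the similarity measure; a minimal sketch in plain Python:

```python
import math

D1 = [2, 3, 5]
D2 = [3, 7, 1]
Q  = [0, 0, 2]

def cosine(a, b):
    # cos(angle) = (a . b) / (|a| |b|)
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

print(cosine(D1, Q))  # ~0.81: D1 points much closer to Q's direction
print(cosine(D2, Q))  # ~0.13
```

By the angle measure, D1 is the document more similar to Q.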
Document Collection
A collection of n documents can be represented in the vector
space model by a term-document matrix.
An entry in the matrix corresponds to the “weight” of a term
in the document; zero means the term has no significance in
the document or it simply doesn’t exist in the document.
T1 T2 …. Tt
D1 w11 w21 … wt1
D2 w12 w22 … wt2
: : : :
: : : :
Dn w1n w2n … wtn
Term Weights: Term Frequency
More frequent terms in a document are more important, i.e. more indicative of the topic.
f_ij = frequency of term i in document j
May want to normalize term frequency (tf) across the entire corpus:
tf_ij = f_ij / max{f_ij}
Term Weights: Inverse Document Frequency
Terms that appear in many different documents are less indicative of the overall topic.
df_i = document frequency of term i = number of documents containing term i
idf_i = inverse document frequency of term i = log2(N / df_i)
(N: total number of documents)
An indication of a term's discrimination power.
The log is used to dampen the effect relative to tf.
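A minimal sketch of the tf and idf formulas above in plain Python; the two-document corpus is an illustrative toy:

```python
import math
from collections import Counter

docs = [["text", "clustering", "text"], ["cricket", "game", "cricket", "cricket"]]
N = len(docs)

def tf_idf(doc, term):
    counts = Counter(doc)
    tf = counts[term] / max(counts.values())  # tf_ij = f_ij / max{f_ij}
    df = sum(1 for d in docs if term in d)    # document frequency of the term
    idf = math.log2(N / df)                   # idf_i = log2(N / df_i)
    return tf * idf

print(tf_idf(docs[0], "text"))  # 1.0 * log2(2/1) = 1.0
```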
Advantages and Disadvantages of VSM
Advantages
Simple
Easy to calculate similarity between two documents
Data mining algorithms can be applied directly to text
data
Disadvantages
Terms are assumed independent (which is not true in real text documents)
Lack of semantics
High dimensionality and sparsity
To classify documents into 4 classes:
economics, sports, science, life.
There are two approaches:
Rule-based approach
Write a set of rules that classify documents
Machine learning-based approach
Using a set of sample documents that are classified into the classes (training data), automatically create classifiers based on the training data
Rule-based classification
Pros:
Very accurate when rules are written by experts
Classification criteria can be easily controlled when the number of rules is small
Cons:
Sometimes rules conflict with each other
Maintenance of rules becomes more difficult as the number of rules increases
The rules have to be reconstructed when the target domain changes
Low coverage because of the wide variety of expressions
Machine Learning-based approach
Pros:
Domain independent
High predictive performance
Cons:
Classification results are not easily explainable (not accountable)
Training data required
1. Prepare a set of training data
   Attach topic information to the documents in a target domain
2. Create a classifier (model)
   Apply a machine learning tool to the data, e.g., Support Vector Machine (SVM), Maximum Entropy Models (MEM)
3. Classify new documents with the classifier
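As a sketch of this three-step workflow, assuming scikit-learn is available: the toy training set is illustrative, and LinearSVC stands in for the SVM mentioned in step 2.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Step 1: training data with topic labels attached.
train_docs = ["the team won the final game", "electrons orbit the nucleus",
              "healthy recipes for daily life", "players scored in the match"]
train_labels = ["sports", "science", "life", "sports"]

# Step 2: create a classifier (tf-idf features + linear SVM).
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_docs, train_labels)

# Step 3: classify new documents with the trained classifier.
print(model.predict(["the match ended with a late goal"]))  # likely ['sports']
```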
(Diagram: training documents labeled with topics (sports, science, life, …) are fed to a learner, which builds a classifier; the classifier then assigns topics such as sports, science, and life to new documents.)
(Diagram: a document d passes through feature extraction to produce a feature vector x = (f1, f2, f3, f4), e.g. features for the words "game", "play", "ball", "dance". A classifier c(·|·) scores each class, giving c(sports|x), c(science|x), c(economics|x), c(life|x), and the best classification result c(y|x) is selected.)
Case Study
Doc 1: This is a document which contains information on text clustering
Doc 2: Text clustering is the information which is presented in this document
Doc 3: Cricket is an interesting game which is played by eleven players
Doc 4: The maximum numbers of players in the game cricket are eleven
Query: Eleven players will be playing cricket game
Elimination of Stop Words
Doc 1: Document contains information text clustering
Doc 2: Text clustering information presented document
Doc 3: Cricket interesting game played eleven players
Doc 4: Maximum numbers players game cricket eleven
Query: Eleven players playing cricket game
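A minimal sketch of this step; the stop-word list here is a small illustrative subset, not any standard list:

```python
STOPWORDS = {"this", "is", "a", "which", "on", "the", "in", "an", "by", "of",
             "are", "will", "be"}

def remove_stopwords(text):
    # Keep only the words that are not in the stop-word list.
    return [w for w in text.lower().split() if w not in STOPWORDS]

print(remove_stopwords("Eleven players will be playing cricket game"))
# ['eleven', 'players', 'playing', 'cricket', 'game']
```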
BOW for Document 1: document, contains, information, text, clustering, text, text, document, information
BOW for Document 2: text, clustering, information, presented, document, text, clustering, document
BOW for Document 3: cricket, interesting, game, played, eleven, players, cricket, game, cricket
BOW for Document 4: maximum, numbers, players, game, cricket, eleven, eleven, cricket, numbers
*Assumption: extra occurrences of terms were added to the documents so that term frequencies greater than one appear.
BOW for the Corpus
T1 : Document
T2 : contains
T3 : information
T4 : text
T5 : clustering
T6 : presented
T7 : Cricket
T8 : interesting
T9 : game
T10 : played
T11 : eleven
T12 : players
T13 : maximum
T14 : numbers
T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12 T13 T14
D1 2 1 2 3 1 0 0 0 0 0 0 0 0 0
D2 2 0 1 2 2 1 0 0 0 0 0 0 0 0
D3 0 0 0 0 0 0 3 1 2 1 1 1 0 0
D4 0 0 0 0 0 0 2 0 1 0 1 1 1 2
Document vs. Term Matrix for the Training Documents
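To make the construction concrete, a small sketch that rebuilds count rows from bags of words like the ones above (Counter does the tallying; the vocabulary order follows T1..T14):

```python
from collections import Counter

vocab = ["document", "contains", "information", "text", "clustering", "presented",
         "cricket", "interesting", "game", "played", "eleven", "players",
         "maximum", "numbers"]

bows = [
    "document contains information text clustering text text document information",
    "text clustering information presented document text clustering document",
    "cricket interesting game played eleven players cricket game cricket",
    "maximum numbers players game cricket eleven eleven cricket numbers",
]

# Each row counts how often each vocabulary term occurs in one document's bag.
matrix = [[Counter(bow.split())[t] for t in vocab] for bow in bows]
for i, row in enumerate(matrix, 1):
    print(f"D{i}", row)
# D1 [2, 1, 2, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```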
BOW for Document Query
Query: Eleven players will be playing cricket game
BOW: eleven, players, playing, cricket, game, players, players, cricket, cricket
T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12 T13 T14
Query 0 0 0 0 0 0 3 0 1 0 0 3 0 0
Document vs. Term Matrix for the Query Document
Matching
T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12 T13 T14
D1 2 1 2 3 1 0 0 0 0 0 0 0 0 0
D2 2 0 1 2 2 1 0 0 0 0 0 0 0 0
D3 0 0 0 0 0 0 3 1 2 1 1 1 0 0
D4 0 0 0 0 0 0 2 0 1 0 1 1 1 2
Query 0 0 0 0 0 0 3 0 1 0 0 3 0 0
CosSim(D1, Q) = 0 (no shared terms)
CosSim(D2, Q) = 0 (no shared terms)
CosSim(D3, Q) = 14 / (√17 × √19) ≈ 0.78
CosSim(D4, Q) = 10 / (√12 × √19) ≈ 0.66
Ranking of Documents
Sort the similarity values in descending order and rank the documents
Sorted sequence: D3 (≈0.78), D4 (≈0.66), D2 (0), D1 (0)
Documents related to the query document, in ranked order: D3, D4, D2, D1
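A sketch that reproduces this matching and ranking in plain Python, with the vectors taken from the matrices above:

```python
import math

docs = {
    "D1": [2, 1, 2, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "D2": [2, 0, 1, 2, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    "D3": [0, 0, 0, 0, 0, 0, 3, 1, 2, 1, 1, 1, 0, 0],
    "D4": [0, 0, 0, 0, 0, 0, 2, 0, 1, 0, 1, 1, 1, 2],
}
query = [0, 0, 0, 0, 0, 0, 3, 0, 1, 0, 0, 3, 0, 0]

def cos_sim(a, b):
    # Cosine of the angle between vectors a and b.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

ranked = sorted(docs, key=lambda d: cos_sim(docs[d], query), reverse=True)
for d in ranked:
    print(d, round(cos_sim(docs[d], query), 2))
# D3 0.78, D4 0.66, D1 0.0, D2 0.0
```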
Applications in several areas: querying, clustering, identifying topics
Others: synonym recognition (TOEFL), psychology tests, essay scoring
Latent Semantic Indexing is
Latent: captures associations which are not explicit
Semantic: represents meaning as a function of similarity to other entities
Cool: lots of spiffy applications, and the potential for some good theory too
Text corpus with many documents (docs)
Given a query, find relevant docs
Classical problems:
synonymy: missing docs that refer to "automobile" when querying on "car"
polysemy: retrieving docs about the internet when querying on "surfing"
Solution: represent docs (and queries) by their underlying latent concepts
Represent each document as a word vector
Represent the corpus as a term-document matrix (T-D matrix)
A classical method: create a new vector from the query terms and find the documents with the highest dot-product
(Diagram: query and document vectors plotted in a two-word space, axes Word 1 and Word 2.)
Process the term-document (T-D) matrix to expose its statistical structure
Convert the high-dimensional space to a lower-dimensional space: throw out the noise, keep the good stuff
Related to principal component analysis (PCA) and multidimensional scaling (MDS)
U = universe of terms
n = number of terms
m = number of docs
A = n × m matrix of rank r; columns represent docs, rows represent terms
A (n × m) = U (n × r) · D (r × r) · V^T (r × m)
• LSI uses SVD, a linear analysis method:
r is the rank of A
D is the diagonal matrix of the r singular values
U and V are matrices composed of orthonormal columns
SVD is always possible, and numerical methods for SVD exist
run time: O(mnc), where c denotes the average number of words per document
A'_k (n × m) = U_k (n × k) · D_k (k × k) · V_k^T (k × m)
Keeping only the k largest singular values gives A'_k, the best rank-k approximation of the term-document matrix A.
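A minimal LSI sketch with NumPy, assuming it is available; A is a small random stand-in for a real term-document matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(0, 3, size=(8, 5)).astype(float)  # 8 terms x 5 docs

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) Vt

k = 2                                              # keep the k largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # rank-k approximation A'_k

# Documents (and queries) can now be compared in the k-dim concept space:
doc_concepts = np.diag(s[:k]) @ Vt[:k, :]          # k x 5: one column per doc
print(np.linalg.norm(A - A_k))                     # reconstruction error
```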
Landauer & Dumais
Perform LSI on 30,000 encyclopedia articles
Take the synonym test from TOEFL: choose the most similar word
LSI - 64.4% (52.2% corrected for guessing)
People - 64.5% (52.7% corrected for guessing)
Correlated .44 with incorrect alternatives
The model: topics are sufficiently disjoint
Each doc is drawn from a single (random) topic
Result: with high probability (whp):
Docs from the same topic will be similar
Docs from different topics will be dissimilar
K topics, each corresponding to a set of words
The sets are mutually disjoint
Below, all random choices are made uniformly at random
A corpus of m docs, each doc created as follows:
choose the length l of the doc
choose a topic T
repeat l times: with probability 1 − ε choose a word from topic T; with probability ε choose a word from the other topics
Theory should go beyond explaining (ideally)
Potential for speed-up: project the doc vectors onto a suitably small space, then perform LSI on this space
Yields O(m(n + c log² n)) compared to O(mnc)
Symbolic Representation of Text Documents
D S Guru, B S Harish and S Manjunath
Proceedings of the Third Annual ACM Bangalore Compute 2010
http://portal.acm.org/citation.cfm?id=1754306
Objective of the paper
A novel method of representing a text document by the use of interval-valued symbolic features
Classification of text documents based on the new representation
Extensive experimentation on standard datasets
Comparison with the existing models
Overview of (Isa et al., 2008)
A naïve Bayes classifier is used to vectorize the given text document
Text documents contain lists of words
The probabilistic classifier calculates the posterior probability of a particular document being annotated to a particular category by using the equation
Pr(Category | Word) = Pr(Word | Category) · Pr(Category) / Pr(Word)
Isa, D., Lee, L. H., Kallimani, V. P., and Rajkumar, R. 2008. Text document preprocessing with the Bayes formula for classification using the support vector machine. IEEE Transactions on Knowledge and Data Engineering, vol. 20, pp. 23-31.
The table below shows the occurrence of words in a document and their probability distribution.

Word   Pr(Category 1)  Pr(Category 2)  Pr(Category 3)  …  Pr(Category k-1)  Pr(Category k)
w1     Pr(C1|w1)       Pr(C2|w1)       Pr(C3|w1)       …  Pr(Ck-1|w1)       Pr(Ck|w1)
w2     Pr(C1|w2)       Pr(C2|w2)       Pr(C3|w2)       …  Pr(Ck-1|w2)       Pr(Ck|w2)
w3     Pr(C1|w3)       Pr(C2|w3)       Pr(C3|w3)       …  Pr(Ck-1|w3)       Pr(Ck|w3)
…      …               …               …               …  …                 …
wn-1   Pr(C1|wn-1)     Pr(C2|wn-1)     Pr(C3|wn-1)     …  Pr(Ck-1|wn-1)     Pr(Ck|wn-1)
wn     Pr(C1|wn)       Pr(C2|wn)       Pr(C3|wn)       …  Pr(Ck-1|wn)       Pr(Ck|wn)

Total: Σi Pr(C1|wi), Σi Pr(C2|wi), Σi Pr(C3|wi), …, Σi Pr(Ck-1|wi), Σi Pr(Ck|wi)
Probability of Input Document: Σi Pr(C1|wi)/n, Σi Pr(C2|wi)/n, Σi Pr(C3|wi)/n, …, Σi Pr(Ck-1|wi)/n, Σi Pr(Ck|wi)/n
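A hedged sketch of this vectorization step: the per-word posteriors below are made-up toy numbers standing in for values learned from training counts.

```python
# Posterior Pr(Category | word) for each word (toy numbers), following the
# table above: one row per word, one entry per category (C1, C2, C3).
posteriors = {
    "ball": (0.7, 0.1, 0.2),
    "game": (0.6, 0.2, 0.2),
    "atom": (0.1, 0.8, 0.1),
}

doc = ["ball", "game", "game"]
n = len(doc)
k = 3

# "Probability of input document" per category = column totals divided by n.
totals = [sum(posteriors[w][c] for w in doc) for c in range(k)]
doc_vector = [t / n for t in totals]
print(doc_vector)  # [0.633..., 0.166..., 0.2] -> the document's feature vector
```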
Limitations
The feature values vary considerably across the training documents
Solution
Assimilating these variations together gives a good representative of each class
Thus we use Symbolic Data Analysis (SDA) to represent a text document
Advantages of using SDA:
Real-life objects can be better described
Symbolic data appear in the form of continuous ratio, discrete absolute interval, multivalued, multivalued with weightage, quantitative, categorical, etc.
Proposed Method
Let D^j = [D1^j, D2^j, D3^j, ..., Dn^j] be a set of n training documents of class Cj, j = 1, 2, 3, ..., k (k denotes the number of categories), and let Wi^j = [wi1^j, wi2^j, wi3^j, ..., wim^j] be the set of m words characterizing the document Di^j of class Cj.
Compute the probability of each word in each category
The probabilities of all words w.r.t. each category are combined to form a feature vector of length k
The process is repeated for all the documents in class Cj and for all the training documents of all other classes
The method has two stages: a Representation Stage and a Classification Stage
Symbolic Representation
To capture the intra-class variations of the class conditional probability, we consider the minimum and maximum class probability values of each class in the form of interval-type features.
The interval-valued features of class Cj are represented as [fjm−, fjm+], giving the lower and upper limits of the m-th feature value of a document class in the knowledge base, where
fjm− = min over i of {Pr(Cm | Di)}, 1 ≤ i ≤ n, and fjm+ = max over i of {Pr(Cm | Di)}, 1 ≤ i ≤ n.
The representative vector for the class Cj is formed by representing each of the k features in the form of an interval and is given by
RFj = {[fj1−, fj1+], [fj2−, fj2+], ..., [fjk−, fjk+]}
This symbolic feature vector is stored in the knowledge base as the representative of the j-th document class.
Similarly, we compute symbolic feature vectors for all the classes and store them in the knowledge base for classification.
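A sketch of building RFj, assuming each training document of the class has already been turned into a length-k probability vector (the values are illustrative):

```python
# Each training document of class Cj yields a length-k probability vector.
class_vectors = [
    [0.61, 0.20, 0.19],
    [0.55, 0.25, 0.20],
    [0.70, 0.12, 0.18],
]

# RF_j: for each of the k features, an interval [min, max] over the n docs.
RF_j = [(min(col), max(col)) for col in zip(*class_vectors)]
print(RF_j)  # [(0.55, 0.70), (0.12, 0.25), (0.18, 0.20)]
```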
Document Classification
A test document is described by a set of k crisp feature values, which are compared with the corresponding interval-type feature values of the respective classes stored in the knowledge base.
Each m-th feature value of the test document is compared with the corresponding interval in RFj to examine whether it lies within that interval.
We use the Belongingness Count Bc as a measure of the degree of belongingness of the test document, to decide whether it belongs to a class or not.
Bc = Σ (m = 1 to k) C(ftm, [fjm−, fjm+]), where
C(ftm, [fjm−, fjm+]) = 1 if (ftm ≥ fjm− and ftm ≤ fjm+), and 0 otherwise.
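A sketch of the belongingness count, reusing the RF_j intervals built above; the test vector is illustrative:

```python
RF_j = [(0.55, 0.70), (0.12, 0.25), (0.18, 0.20)]
test = [0.60, 0.30, 0.19]  # crisp feature values of the test document

# B_c: number of test features falling inside the class's intervals.
B_c = sum(1 for f, (lo, hi) in zip(test, RF_j) if lo <= f <= hi)
print(B_c)  # 2: features 1 and 3 lie within their intervals, feature 2 does not
```

The test document is assigned to the class with the highest belongingness count.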
Experimental Settings
Classification accuracy with varying training and testing sets
Comparative analysis of the proposed method with that of [2]
Text Classification
There are 3 stages: Symbolic Representation, Symbolic Feature Selection, and Document Classification
Symbolic Representation
Number of classes: 3
Number of documents per class: 4
Step 1: Represent the documents in a "term-document matrix"
'X' represents the "term-document matrix" of size (kN × t), where
k is the number of classes present (in this case k = 3),
N is the number of training documents present per class (here N = 5), and
t is the number of terms present in the training set (i.e., 15 terms).
Step 2: Apply a dimensionality reduction technique (RLPI) to the "term-document matrix"
Here 'Y' represents the "reduced term-document matrix" of size (kN × m), where
m is the number of reduced dimensions of the training set (i.e., 5 dimensions).
Step 3: Compute an interval-valued reference term frequency vector for each class on matrix Y. The class representative vectors are stored in a matrix F of size (k × m), where each feature is of type interval.
Step 4: Symbolic Feature Selection
Employ the Symbolic Feature Selection (SFS) scheme on matrix F. After employing SFS we selected Feature 3 and Feature 5 as the best features. These are stored in the knowledge base.
Step 5: Document Classification
We make use of the similarity measure proposed by Guru and Nagendraswamy (2007) to measure the similarity between the test feature vector FDQ and the j-th class representative Rj.
Results
Introduction to Data Mining
Text Mining: Issues
Text Classification: Issues
A Case Study on Text Classification
Symbolic Representation for Text Documents