
Page 1: Recent Trends in Text Mining

Girish Keswani
gkeswani@micron.com

Page 2: Text Mining?

What?
  Data Mining on Text Data
Why?
  Information Retrieval
  Confusion Set Disambiguation
  Topic Distillation
How?
  Data Mining

Page 3: Organization

Text Mining Algorithms
Jargon Used
Background
  Data Modeling,
  Text Classification, and
  Text Clustering
Applications
Experiments {NBC, NN and ssFCM}
Further work
References

Page 4: Text Mining Algorithms

Classification Algorithms
  Naïve Bayes Classifier
  Decision Trees
  Neural Networks
Clustering Algorithms
  EM Algorithms
  Fuzzy

Page 5: Jargon

DM: Data Mining
IR: Information Retrieval
NBC: Naïve Bayes Classifier
EM: Expectation Maximization
NN: Neural Networks
ssFCM: Semi-Supervised Fuzzy C-Means
Labeled Data (Training Data)
Unlabeled Data
Test Data

Page 6: Background: Modeling

Vector Space Model
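
As a rough illustration of the vector space model, each document can be mapped to a vector of term counts over a shared vocabulary, and documents can then be compared by the angle between their vectors. The sketch below assumes whitespace tokenization, raw term counts, and cosine similarity; the function names and the toy documents are made up for illustration and are not taken from the talk.

```python
import math
from collections import Counter

def build_vocabulary(docs):
    """Return a sorted list of every distinct token in the corpus."""
    return sorted({tok for doc in docs for tok in doc.lower().split()})

def to_vector(doc, vocab):
    """Map one document to a term-count vector aligned with vocab."""
    counts = Counter(doc.lower().split())
    return [counts[term] for term in vocab]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy corpus (hypothetical, for illustration only)
docs = ["text mining on text data", "data mining", "neural networks"]
vocab = build_vocabulary(docs)
vectors = [to_vector(d, vocab) for d in docs]

print(vocab)
print(vectors[0])
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

Documents that share terms get a similarity above zero, while documents with no terms in common get exactly zero. Weighting schemes such as TF-IDF would replace the raw counts in `to_vector` without changing the rest.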

Page 7: Background: Modeling

Generative Models of Data [13]: Probabilistic

“to generate a document, a class is first selected based on its prior probability and then a document is generated using the parameters of the chosen class distribution”

NBC and EM Algorithms are based on this model
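
The generative view translates fairly directly into a classifier: estimate a prior for each class and a word distribution per class, then pick the class under which a document is most probable. The sketch below assumes a multinomial word model with add-one smoothing, which the slides do not specify; the function names and toy data are illustrative only.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Estimate class priors P(c) and word probabilities P(w | c) with add-one smoothing."""
    vocab = {tok for doc in docs for tok in doc.split()}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    priors = {c: n / len(docs) for c, n in class_docs.items()}
    cond = {}
    for c in class_docs:
        total = sum(word_counts[c].values())
        cond[c] = {w: (word_counts[c][w] + 1) / (total + len(vocab)) for w in vocab}
    return priors, cond

def classify(doc, priors, cond):
    """Pick the class maximizing log P(c) + sum of log P(w | c); unseen words are skipped."""
    def score(c):
        return math.log(priors[c]) + sum(
            math.log(cond[c][w]) for w in doc.split() if w in cond[c])
    return max(priors, key=score)

# Toy training set (hypothetical, for illustration only)
docs = ["goal match team", "team win goal", "chip memory wafer", "memory chip design"]
labels = ["sport", "sport", "tech", "tech"]
priors, cond = train_nb(docs, labels)

print(classify("memory chip", priors, cond))
print(classify("goal team", priors, cond))
```

Working in log space avoids underflow when documents are long, since the product of many small word probabilities quickly falls below floating-point range.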

Page 8: Importance of Unlabeled Data?

[Figure: diagram of regions A to G across the Labeled Data, Unlabeled Data, and Test Data sets]

Provides access to feature distribution in set F using joint probability distributions

Page 9: How to make use of Unlabeled Data?

Page 10: How to make use of Unlabeled Data?
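
One way to use unlabeled data with the generative model is to combine a naïve Bayes classifier with EM: train on the labeled documents, estimate soft class memberships for the unlabeled documents (E-step), re-estimate the parameters from all documents (M-step), and repeat. The sketch below is a minimal version under stated assumptions (multinomial word model, add-one smoothing, a fixed number of iterations); it is not a reproduction of the procedures in [2] or [3], and all names and data are hypothetical.

```python
import math
from collections import Counter

def m_step(docs, resp, classes, vocab):
    """Re-estimate priors and word probabilities from (soft) class memberships."""
    priors, cond = {}, {}
    for c in classes:
        weight = sum(r[c] for r in resp)
        priors[c] = (weight + 1) / (len(docs) + len(classes))
        counts = Counter()
        for doc, r in zip(docs, resp):
            for w in doc.split():
                counts[w] += r[c]          # fractional counts for soft labels
        total = sum(counts.values())
        cond[c] = {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}
    return priors, cond

def e_step(doc, priors, cond):
    """Return P(c | doc) under the current parameters; unseen words are skipped."""
    logs = {c: math.log(priors[c]) + sum(
                math.log(cond[c][w]) for w in doc.split() if w in cond[c])
            for c in priors}
    top = max(logs.values())
    exp = {c: math.exp(v - top) for c, v in logs.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}

def em_nb(labeled, unlabeled, iterations=10):
    """Initialize from labeled docs, then alternate E and M steps over all docs."""
    classes = sorted({y for _, y in labeled})
    labeled_docs = [d for d, _ in labeled]
    docs = labeled_docs + list(unlabeled)
    vocab = {w for d in docs for w in d.split()}
    hard = [{c: 1.0 if c == y else 0.0 for c in classes} for _, y in labeled]
    priors, cond = m_step(labeled_docs, hard, classes, vocab)
    for _ in range(iterations):
        soft = [e_step(d, priors, cond) for d in unlabeled]        # E-step
        priors, cond = m_step(docs, hard + soft, classes, vocab)   # M-step
    return priors, cond

# Toy data (hypothetical): "match" and "win" never appear in a labeled document
labeled = [("goal team", "sport"), ("chip memory", "tech")]
unlabeled = ["team match win", "match win", "memory wafer design", "wafer design"]
priors, cond = em_nb(labeled, unlabeled)

print(e_step("match win", priors, cond))
```

In this toy setup the words "match" and "win" occur only in unlabeled documents, yet they end up associated with the sport class because they co-occur with "team", which is the kind of feature information that unlabeled data can contribute.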

Page 11: Experimental Results [1]

Using NBC, EM and ssFCM

Page 12: Experimental Results [2]

Using NBC and EM

Page 13: Extensions and Variants of these approaches

Authors in [6] propose a concept of Class Distribution Constraint matrix
  Results on Confusion Set Disambiguation
Automatic Title Generation [7]:
  Using EM Algorithm
  Non-extractive approach

Page 14: Relational Data [9]

A collection of data in which the relations between entities are explicitly described is known as relational data

Probabilistic Relational Models

Page 15: Commercial Use/Products

IBM Text Analyzer [11]
  Decision Tree Based
SAS Text Miner [12]
  Singular Value Decomposition
Filtering Junk Email
  Hotmail, Yahoo
Advanced Search Engines

Page 16: Applications: Search Engines

Page 17: Vivisimo Search Engine: (www.vivisimo.com)

Page 18: Experiments

NBC
  Naïve Bayes Classifier
  Probabilistic
NN
  Neural Networks
ssFCM
  Semi-Supervised Fuzzy Clustering
  Fuzzy
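
For reference, fuzzy clustering methods of this family build on the standard fuzzy c-means updates, in which every point belongs to every cluster with some degree of membership. The notation below is assumed for illustration and does not come from the slides: $u_{ik}$ is the membership of point $x_k$ in cluster $i$, $v_i$ is the center of cluster $i$, $c$ is the number of clusters, $n$ the number of points, and $m > 1$ the fuzzifier. How ssFCM in [1] treats the labeled points is not covered here.

```latex
u_{ik} = \left[ \sum_{j=1}^{c}
  \left( \frac{\lVert x_k - v_i \rVert}{\lVert x_k - v_j \rVert} \right)^{2/(m-1)}
\right]^{-1},
\qquad
v_i = \frac{\sum_{k=1}^{n} u_{ik}^{m}\, x_k}{\sum_{k=1}^{n} u_{ik}^{m}}
```

The two updates are alternated until the memberships stop changing; as $m$ approaches 1 the memberships become nearly crisp, and larger values of $m$ make the clusters fuzzier.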

Page 19: Datasets (20 Newsgroups Data)

Sampling I:

Dataset       min2    min4    min6
# Features    --      9467    5685

Sampling II:

Dataset     Sampling Percentage   Number of Features
Sample25    25%                   13925
Sample30    30%                   15067
Sample35    35%                   16737
Sample40    40%                   16871
Sample45    45%                   17712
Sample50    50%                   19135

[Figure: diagram relating Raw Data, Sampling I, Sampling II, and Vectors]

Page 20: Naïve Bayes Classifier

SAMPLE      % TRAINING   % TEST   ACCURACY %
Sample25    20           80       34.4637
            63           36       48.4945
            76           23       50.9322
            82           17       47.7728
            86           13       48.9971
            20           80       31.5436
            63           36       48.0729
            76           23       47.8661
            82           17       50.5568
            86           13       50.4587
Sample30    33           66       39.1137
            66           33       46.4233
            77           22       48.5528
            83           16       52.7383
            86           13       51.2136
            33           66       39.26
            66           33       47.0192
            77           22       48.8439
            83           16       49.6907
            86           13       51.6169

Page 21: Naïve Bayes Classifier

[Figure: Accuracy % by Sample (Sample25, Sample30), with a normal quantile plot]

Page 22: NBC

[Figure: four panels plotting Accuracy % against % TRAINING and against % TEST, for Sample25 and Sample30]

Page 23: ssFCM

[Figure: two panels. Effect of Labeled Data: ACCURACY % against % LABELED. Effect of Unlabeled Data: ACCURACY % against % UNLABELED]

Page 24: ssFCM

[Figure: ACCURACY % by Sample (sample25, sample30, sample35, sample40, sample45, sample50)]

Page 25: Further Work

Ensemble of Classifiers [16]
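
As a small illustration of the ensemble idea, several classifiers can each predict a label and the ensemble returns the label with the most votes. The sketch below assumes simple unweighted majority voting, which is only one of many combination schemes and not necessarily one discussed in [16]; the three keyword classifiers are hypothetical stand-ins for trained models.

```python
from collections import Counter

def majority_vote(classifiers, doc):
    """Return the label predicted by the largest number of classifiers."""
    votes = Counter(clf(doc) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Hypothetical stand-ins for trained classifiers: each looks at one keyword only
def by_chip(doc):
    return "tech" if "chip" in doc.split() else "sport"

def by_goal(doc):
    return "sport" if "goal" in doc.split() else "tech"

def by_memory(doc):
    return "tech" if "memory" in doc.split() else "sport"

ensemble = [by_chip, by_goal, by_memory]

print(majority_vote(ensemble, "memory chip design"))
print(majority_vote(ensemble, "chip shot goal"))
```

In the second call one member is misled by the word "chip", but the other two outvote it, which is the basic reason an ensemble can be more accurate than its individual members when their errors are not strongly correlated.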

Page 26: Further Work

Knowledge Gathering from Experts
  E.g. 3 class Data:

[Figure: Input Data {C1,C2,C3}, classes C1, C2, C3, a Classifier, and Test Data?]

Page 27: References

[1] “Text Classification using Semi-Supervised Fuzzy Clustering,” Girish Keswani and L.O. Hall, appeared in IEEE WCCI 2002 conference.
[2] “Using Unlabeled Data to Improve Text Classification,” Kamal Paul Nigam.
[3] “Text Classification from Labeled and Unlabeled Documents using EM,” Kamal Paul Nigam et al.
[4] “The Value of Unlabeled Data for Classification Problems,” Tong Zhang.
[5] “Learning from Partially Labeled Data,” Martin Szummer et al.
[6] “Training a Naïve Bayes Classifier via the EM Algorithm with a Class Distribution Constraint,” Yoshimasa Tsuruoka and Jun’ichi Tsujii.
[7] “Automatic Title Generation using EM,” Paul E. Kennedy and Alexander G. Hauptmann.
[8] “Unlabeled Data can degrade Classification Performance of Generative Classifiers,” Fabio G. Cozman and Ira Cohen.
[9] “Probabilistic Classification and Clustering in Relational Data,” Ben Taskar et al.
[10] “Using Clustering to Boost Text Classification,” Y.C. Fang et al.
[11] IBM Text Analyzer: “A decision-tree-based symbolic rule induction system for text categorization,” D.E. Johnson et al.
[12] “SAS Text Miner,” Reincke.
[13] “Pattern Recognition,” Duda and Hart, 2000.
[14] “Machine Learning,” Tom Mitchell.
[15] “Data Mining,” Margaret Dunham.
[16] http://www-2.cs.cmu.edu/afs/cs/project/jair/pub/volume11/opitz99a-html/