Clustering
CS5604 - Information Storage and Retrieval
Spring 2015
Virginia Tech
5/5/2015
Sujit Reddy Thumma
Rubasri Kalidas
Hanaa Torkey
Agenda
• Project Description
• System Design
• Cluster Evaluation
Project Motivations
• Problem 1: A query word can be ambiguous.
  • E.g., the Solr query “Star” retrieves documents about astronomy, plants, animals, etc.
  • Solution: Cluster the documents returned for a query along the lines of different topics.
• Problem 2: Construction of topic hierarchies and taxonomies.
  • Solution: Preliminary clustering of large samples of web documents.
• Problem 3: Speeding up similarity search.
  • Solution: Restrict the search for documents similar to a query to the most representative cluster(s).
Project Overview
• Data Preparation:
  • Converting Avro data files to sequence files with the document ID as key and the cleaned text as value.
• Data Clustering:
  • Flat clustering of tweet and webpage collections
  • Hierarchical clustering of tweet and webpage collections
• Cluster Labeling:
  • Labeling the output clusters using the top terms in the clustering result.
• Analysis of Output:
  • Statistics and evaluation for the clustering result
Workflow
Diagram elements:
• Input tweets
• Input webpages
• Avro => sequence file
• K-means clustering
• Label extraction
• Divide input based on results
• Hierarchical clustering
• Merge results from previous stage
• Load data into HDFS
Clustering Process
• Cluster the documents using the K-Means algorithm based on:
  • Similarity measure: cosine similarity.
  • Clustering:
    • Randomly choose k instances as seeds, one per cluster.
    • Form initial clusters based on these seeds.
    • Iterate, repeatedly reallocating instances to different clusters to improve the overall clustering.
    • Stop when the clustering converges, i.e., there is no change in the clusters.
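The K-Means steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the Mahout implementation used in the project: the function names (`normalize`, `cosine`, `cosine_kmeans`), the `max_iter` and `seed` parameters, and the toy document vectors are all assumptions made for the example. Vectors are scaled to unit length so that cosine similarity reduces to a dot product.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine(u, v):
    """Cosine similarity of two unit-length vectors is their dot product."""
    return sum(a * b for a, b in zip(u, v))


def cosine_kmeans(docs, k, max_iter=100, seed=0):
    """Cluster document vectors with K-Means using cosine similarity."""
    rng = random.Random(seed)
    docs = [normalize(d) for d in docs]
    # Randomly choose k instances as seeds, one per cluster.
    centroids = [list(d) for d in rng.sample(docs, k)]
    labels = None
    for _ in range(max_iter):
        # (Re)allocate every instance to its most similar centroid.
        new_labels = [
            max(range(k), key=lambda j: cosine(d, centroids[j])) for d in docs
        ]
        # Stop when the clustering converges (no change in clusters).
        if new_labels == labels:
            break
        labels = new_labels
        # Recompute each centroid as the normalized mean of its members.
        for j in range(k):
            members = [d for d, lab in zip(docs, labels) if lab == j]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[j] = normalize(mean)
    return labels, centroids


# Toy term-weight vectors: two documents per topic (hypothetical data).
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels, centroids = cosine_kmeans(docs, k=2)
print(labels)
```

Because the seeds are chosen at random, the cluster numbering can differ between runs with different `seed` values, even when the grouping itself is the same.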
System Design
Diagram elements:
• Tweets / Webpages
• K-means clustering
• Label extraction
• Hierarchical clustering
• Merge results
• Load data in HBase
• Data extraction
• Evaluation
Tools labeled in the diagram: Mahout, Java/Python
Mahout K-Means Clustering
$MAHOUT clusterdump \
  -i `hadoop fs -ls -d ${OUTPUT_DIR}/output-kmeans/clusters-*-final | awk '{print $8}'` \
  -o ${OUTPUT_DIR}/output-kmeans/clusterdump \
  -d ${OUTPUT_DIR}/output-seqdir-sparse-kmeans/dictionary.file-0 \
  -dt sequencefile
Typical output example:
Cluster 1: Top Terms:
  ebola    => 0.07152884460426318
  rt       => 0.058896787550590086
  outbreak => 0.021403831817739378
  viru     => 0.014907624119456567
  africa   => 0.013911333665597606
  patient  => 0.012710181097074835
  liberia  => 0.012436799376831979
  health   => 0.01211349267884965
  spread   => 0.011954053181191696
  fight    => 0.011414421218770166
Cluster 2: Top Terms:
  drug     => 0.16092978007424608
  experim  => 0.10931820179530365
  make     => 0.068495523952033
  ebola    => 0.05933505902257844
  vaccin   => 0.0446040908953977
  rt       => 0.038446578081122056
  zmapp    => 0.03291364296270642
  trial    => 0.03204720309608251
  develo   => 0.030003011526946913
  monkei   => 0.02626380297639355
Cluster 3: Top Terms:
  death    => 0.1090171554604992
  kill     => 0.08483980316278365
  toll     => 0.07380344646424317
  ebola    => 0.0630359353151828
  dai      => 0.06035789606747709
  rt       => 0.052361887019333926
  peopl    => 0.0515136862735416
  break    => 0.028802429266523474
  africa   => 0.024238362492392238
  rise     => 0.021512270038958534
Cluster Labeling
Cluster  Top Terms
1        boston bomber bomb marathon suspect love stop rt know talk
2        ass bomb feel rt made make sex food breakfast nap
3        bomb rt da sound dick time drop af photo fuck
4        gui hit follow bomb rt da diggiti love car meet
5        get run bomb rt goofi jealou put trust listen wife
6        dai bomb rt end happi ass birthdai on todai hope
7        sai suspect set boston bomb airport rt break polic marathon
8        lol come bomb rt ass shit know sound love make
9        yogurt frozen sound greek bomb grape strawberri rn land rt
10       amp bomb rt listen ass beauti cook commun goal pussi
Labeling result for Boston bomb data set
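Labeling a cluster by its top terms amounts to ranking the terms of the cluster centroid by weight. The sketch below illustrates that idea only; it is not the project's label extraction code. The function name `top_terms`, the sparse-dictionary representation of the centroid, and the toy weights are assumptions made for the example.

```python
def top_terms(centroid, dictionary, n=10):
    """Return the n highest-weighted (term, weight) pairs of a centroid.

    centroid   -- dict mapping term index -> weight (sparse vector)
    dictionary -- dict mapping term index -> term string
    """
    ranked = sorted(centroid.items(), key=lambda item: item[1], reverse=True)
    return [(dictionary[idx], weight) for idx, weight in ranked[:n]]


# Hypothetical dictionary and centroid with illustrative weights.
dictionary = {0: "ebola", 1: "drug", 2: "experim", 3: "vaccin"}
centroid = {0: 0.06, 1: 0.16, 2: 0.11, 3: 0.04}

label = " ".join(term for term, _ in top_terms(centroid, dictionary, n=3))
print(label)
```

Stemmed forms such as "experim" appear in the labels because ranking happens on the analyzed (stemmed) terms, not on the original words.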
Evaluation
• Silhouette Scores
• Confusion Matrix
• Human Judgement
Silhouette
• Goal: To measure intra-cluster similarity and inter-cluster dissimilarity
• The Silhouette score is calculated as follows:
• For each document:
  • a = mean intra-cluster distance
  • b = mean nearest-cluster distance
  • Silhouette Coefficient = (b - a) / max(a, b)
• Silhouette Score = mean of the coefficients of all documents in a collection
• Scores range from -1 (very bad) to +1 (very good); anything above zero is good
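The coefficient defined above can be computed directly from pairwise distances. The following is a minimal sketch under stated assumptions: it uses Euclidean distance (the slides do not say which distance was used), assigns a coefficient of 0 to a document that is alone in its cluster, and requires at least two clusters. The names `euclidean` and `silhouette_score` and the toy points are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def silhouette_score(points, labels, distance=euclidean):
    """Mean silhouette coefficient over all points (needs >= 2 clusters)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    coefficients = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            # Assumption: a singleton cluster contributes a coefficient of 0.
            coefficients.append(0.0)
            continue
        # a = mean intra-cluster distance
        a = sum(distance(points[i], points[j]) for j in own) / len(own)
        # b = mean distance to the nearest other cluster
        b = min(
            sum(distance(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        coefficients.append((b - a) / denom if denom > 0 else 0.0)
    return sum(coefficients) / len(coefficients)


# Two well-separated groups (hypothetical data).
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
good = silhouette_score(points, [0, 0, 1, 1])
bad = silhouette_score(points, [0, 1, 0, 1])
print(good, bad)
```

A grouping that matches the geometry scores close to +1, while one that splits each natural group across clusters scores below zero.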
Silhouette Scores for Webpages
Data Set                                       Silhouette Score
classification_small_00000_v2 (plane_crash_S)  0.0239099
classification_small_00001_v2 (plane_crash_S)  0.296624
clustering_large_00000_v1 (diabetes_B)         0.124263
clustering_large_00001_v1 (diabetes_B)         0.0284772
clustering_small_00000_v2 (ebola_S)            0.0407911
clustering_small_00001_v2 (ebola_S)            0.0163434
hadoop_small_00000 (egypt_B)                   0.206282
hadoop_small_00001 (egypt_B)                   0.264068
ner_small_00000_v2 (storm_B)                   0.0237915
ner_small_00000_v2 (storm_B)                   0.219972
noise_large_00000_v1 (shooting_B)              0.027601
Silhouette Scores for Webpages (2)
Data Set                                Silhouette Score
noise_large_00000_v1 (shooting_B)       0.0505734
noise_large_00001_v1 (shooting_B)       0.0329083
noise_small_00000_v2 (charlie_hebdo_S)  0.0156003
social_00000_v2 (police)                0.0139787
solr_large_00000_v1 (tunisia_B)         0.467372
solr_large_00001_v1 (tunisia_B)         0.0242648
solr_small_00000_v2 (election_S)        0.0165125
solr_small_00001_v2 (election_S)        0.0639537
Silhouette Scores for Tweets (Small Collections)
Data Set               Silhouette Score
charlie_hebdo_S        0.0106168
ebola_S                0.00891114
Jan.25_S               0.112778
plane_crash_S          0.0219601
winter_storm_S         0.00836856
suicide_bomb_attack_S  0.0492852
election_S             0.00522204
Silhouette Scores for Tweets (Big Collections)
Data Set             Silhouette Score
bomb_B               0.0114236
diabetes_B           0.014169
egypt_B              0.0778305
Malaysia_Airlines_B  0.0993336
shooting_B           0.00939293
storm_B              0.011786
tunisia_B            0.0310645
Confusion Matrix
• The data sets we already have are categorized into various clusters
  • Small Collections
    • Tweets related to ebola, elections, plane crash, etc.
  • Big Collections
    • Tweets related to diabetes, shooting, storm, etc.
• We treated these collections as a labeled training set for clustering validation
• Used K-Means with various tunables
  • Changing the distance measure
  • Iterations until convergence
  • Feature selection: pruning out high-frequency words
  • Changed the analyzer to MailArchivesClusteringAnalyzer
  • Random initialization of centroids
Confusion Matrix for Small Tweet Collection Concatenation (7)
A --> suicide_bomb_attack_S
B --> charlie_hebdo_S
C --> ebola_S
D --> plane_crash_S
E --> election_S
F --> winter_storm_S
G --> Jan.25_S
Confusion Matrix for Big Tweet Collection Concatenation (7)
A --> diabetes_B
B --> bomb_B
C --> storm_B
D --> tunisia_B
E --> Malaysia_Airlines_B
F --> egypt_B
G --> shooting_B
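A confusion matrix for this kind of validation counts, for each source collection, how many of its documents landed in each cluster. The sketch below shows one way to build it, assuming each document carries the name of the collection it came from and the cluster ID assigned by K-Means. The function name `confusion_matrix` and the toy labels are hypothetical, and this is not the project's evaluation code.

```python
from collections import Counter


def confusion_matrix(true_labels, cluster_ids):
    """Count documents per (source collection, assigned cluster) pair.

    Returns (rows, cols, matrix): rows are the sorted collection names,
    cols are the sorted cluster IDs, and matrix[r][c] is the number of
    documents from collection rows[r] assigned to cluster cols[c].
    """
    counts = Counter(zip(true_labels, cluster_ids))
    rows = sorted(set(true_labels))
    cols = sorted(set(cluster_ids))
    matrix = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, matrix


# Hypothetical example: five tweets from two collections, two clusters.
true_labels = ["ebola_S", "ebola_S", "election_S", "election_S", "election_S"]
cluster_ids = [0, 0, 1, 1, 0]

rows, cols, matrix = confusion_matrix(true_labels, cluster_ids)
for name, row in zip(rows, matrix):
    print(name, row)
```

A clustering that recovers the original collections concentrates each row's counts in a single column.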
Silhouette Scores for Concatenated Data Sets
Data Set                Silhouette Score
Small Tweet Collection  0.0190188
Big Tweet Collection    0.0166863
Clustering Statistics
Data Set               Dictionary Size  Sparsity Index  Clustering Time (K-means) (Minutes)
charlie_hebdo_S        13452            99.8591179858   5.39
ebola_S                25648            99.8957347915   6.37
Jan.25_S               17639            99.9185522711   5.45
plane_crash_S          16725            99.8628283162   5.82
winter_storm_S         23717            99.8690672714   6.28
suicide_bomb_attack_S  3748             99.7262108164   5.45
election_S             59643            99.9231350261   6.93
Concat_S               100548           99.9250943396   NA
Clustering Statistics
Data Set             Dictionary Size  Sparsity Index  Clustering Time (K-means)
bomb_B               508347           99.8591179858   14.68
diabetes_B           224233           99.8957347915   15.32
egypt_B              159347           99.9185522711   8.87
Malaysia_Airlines_B  18498            99.8628283162   15.6
shooting_B           600305           99.8690672714   16.6
storm_B              516101           99.7262108164   16.22
tunisia_B            175966           99.9231350261   7.34
Concat_B             1559466          99.9460551033   NA
Clustering Statistics
Data Set                       Dictionary Size  Sparsity Index  Clustering Time (Minutes)
classification_small_00000_v2  3839             97.3941205733   5.1
classification_small_00001_v2  536              97.6227262127   5.1
clustering_large_00000_v1      11374            99.1809722871   5.3
clustering_large_00001_v1      15593            99.2333229766   5.2
clustering_small_00000_v2      385              97.9647495362   4.9
clustering_small_00001_v2      384              96.7801716442   5.1
hadoop_small_00000             4387             99.1756580464   5.4
hadoop_small_00001             4387             99.1756165435   5.1
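The slides do not define the sparsity index. One plausible reading, assumed here, is the percentage of zero entries in the document-term matrix, which would explain values near 100 for short texts over large dictionaries. The sketch below computes that quantity from sparse document vectors; the function name `sparsity_index` and the toy vectors are hypothetical.

```python
def sparsity_index(doc_vectors, dictionary_size):
    """Percentage of zero cells in the document-term matrix.

    doc_vectors     -- list of dicts mapping term index -> nonzero weight
    dictionary_size -- number of distinct terms (matrix columns)
    """
    total_cells = len(doc_vectors) * dictionary_size
    nonzero_cells = sum(len(vec) for vec in doc_vectors)
    return 100.0 * (1 - nonzero_cells / total_cells)


# Hypothetical example: 2 documents, 5 dictionary terms, 3 nonzero weights.
doc_vectors = [{0: 0.4}, {1: 0.2, 4: 0.7}]
sparsity = sparsity_index(doc_vectors, dictionary_size=5)
print(sparsity)
```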
Human Judgement
• Ebola_S data set
• # of Clusters: 5
• Cluster 1: (related to deaths)
  • Ebola kills fourth victim in Nigeria The death toll from the Ebola outbreak in Nigeria has risen to four whi
  • RT Dont be victim 827 Ebola death toll rises to 826
  • RT Ebola outbreak now believed to have infected 2127 people killed 1145 health officials say
  • RT Two people in have died after drinking salt water which was rumoured to be protective against
• Cluster 2: (related to doctors)
  • US doctor stricken with the deadly Ebola virus while in Liberia and brought to the US for treatment in a speci
  • RT doctor blogs from frontlines via
  • Moscow doctors suspect that a Nigerian man might have Ebola
  • RT Quelling Ebola outbreak will take six months doctors group says
  • Pray for Government dismissed 16000 doctors on strike despite Ebola pandemic
• Cluster 3: (related to politics)
  • RT For real Obama orders Ebola screening of Mahama other African Leaders meeting him at USAfrica Summit
  • Patrick Sawyer was sent by powerful people to spread Ebola to Nigeria Fani Kayode Fani Kayode has reacted
  • RT The Economist explains why Ebola wont become a pandemic View video via
  • Obama Calls Ellen Commits to fight amp WAfrica
• Cluster 4: (related to ebola virus itself)
  • How is this Ebola virus transmitted
  • RT Ebola symptoms can take 2 21 days to show It usually start in the form of malaria or cold followed by Fever Diarrhoea V
  • Ebola puts focus on drugs made in tobacco plants Using plants this way sometimes called pharming can pr
  • Ebola virus forces Sierra Leone and Liberia to withdraw from Nanjing Youth Olympics
• Cluster 5: (related to drugs)
  • Ebola FG Okays Experimental Drug Developed By Nigerian To Treat
  • RT Drugs manufactured in tobacco plants being tested against Ebola other diseases
  • Tobacco plants prove useful in Ebola drug production See All
  • EBOLA Western drugs firms have not tried to find vaccine because virus only affects Africans
• Q&A
• Thanks!!
Acknowledgment
• We would like to thank our sponsor, the US National Science Foundation, for funding this project through grant IIS-1319578.
• We would like to thank our classmates for their extensive evaluation of this report through peer reviews and for the invigorating class discussions, which helped us greatly in completing the project.
• We would especially like to thank our instructor, Prof. Edward A. Fox, and teaching assistants Mohammed Magdy and Sunshin Lee for their constant guidance and encouragement throughout the course of the project.