Marti Hearst, SIMS 247
SIMS 247 Lecture 19: Visualizing Text and Text Collections
March 31, 1998
Today and Next Time
• Purposes of Text Visualization
• Why Text is Tough
• Visualizing Concept Spaces
  – For Collection Overviews
• Visualizing Query Specifications
  – Selecting Term Subsets
  – Viewing Metadata
• Visualizing Retrieval Results
  – Term Hit Distribution
  – Grouping of Retrieved Documents
Why Visualize Text?
• To help with Information Access
  – give an overview of a collection
  – show users what aspects of their interests are present in a collection
  – help users understand why documents were retrieved as a result of a query
• Text Data Mining
  – not much has been done in this yet
• Software Engineering
  – not technically text, but has some similar properties
Why Text is Tough
• Text is not pre-attentive
• Text consists of abstract concepts
  – which are difficult to visualize
• Text represents similar concepts in many different ways
  – space ship, flying saucer, UFO, figment of imagination
• Text has very high dimensionality
  – Tens or hundreds of thousands of features
  – Many subsets can be combined together
Text Meaning is NOT pre-attentive
SUBJECT PUNCHED QUICKLY OXIDIZED TCEJBUS DEHCNUP YLKCIUQ DEZIDIXO
CERTAIN QUICKLY PUNCHED METHODS NIATREC YLKCIUQ DEHCNUP SDOHTEM
SCIENCE ENGLISH RECORDS COLUMNS ECNEICS HSILGNE SDROCER SNMULOC
GOVERNS PRECISE EXAMPLE MERCURY SNREVOG ESICERP ELPMAXE YRUCREM
CERTAIN QUICKLY PUNCHED METHODS NIATREC YLKCIUQ DEHCNUP SDOHTEM
GOVERNS PRECISE EXAMPLE MERCURY SNREVOG ESICERP ELPMAXE YRUCREM
SCIENCE ENGLISH RECORDS COLUMNS ECNEICS HSILGNE SDROCER SNMULOC
SUBJECT PUNCHED QUICKLY OXIDIZED TCEJBUS DEHCNUP YLKCIUQ DEZIDIXO
CERTAIN QUICKLY PUNCHED METHODS NIATREC YLKCIUQ DEHCNUP SDOHTEM
SCIENCE ENGLISH RECORDS COLUMNS ECNEICS HSILGNE SDROCER SNMULOC
Why Text is Tough
• Abstract concepts are difficult to visualize
• Combinations of abstract concepts are even more difficult to visualize
  – time
  – shades of meaning
  – social and psychological concepts
  – causal relationships
Why Text is Tough
The man walks the cavorting dog.
So far, we can sort of show this in pictures.
Why Text is Tough
As the man walks the cavorting dog, thoughts arrive unbidden of the previous spring, so unlike this one, in which walking was marching and dogs were baleful sentinels outside unjust halls.
How do we visualize this?
Why Text is Tough
• Language only hints at meaning
• Most meaning of text lies within our minds and common understanding
  – “How much is that doggy in the window?”
    • “how much”: social system of barter and trade (not the size of the dog)
    • “doggy” implies childlike, plaintive, probably cannot do the purchasing on their own
    • “in the window” implies behind a store window, not really inside a window, requires notion of window shopping
Why Text is Tough
• General categories have no standard ordering (nominal data)
• Categorization of documents by single topics misses important distinctions
• Consider an article about
  – NAFTA
  – The effects of NAFTA on truck manufacture
  – The effects of NAFTA on productivity of truck manufacture in the neighboring cities of El Paso and Juarez
Why Text is Tough
• Other issues about language
  – ambiguous (many different meanings for the same words and phrases)
  – different combinations imply different meanings
Why Text is Tough
• I saw Pathfinder on Mars with a telescope.
• Pathfinder photographed Mars.
• The Pathfinder photograph mars our perception of a lifeless planet.
• The Pathfinder photograph from Ford has arrived.
• The Pathfinder forded the river without marring its paint job.
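One way to see the ambiguity in these five sentences is to run a naive string matcher over them. The sketch below is only an illustration: the helper name `prefix_hits` and the choice of prefixes are assumptions, not part of the lecture. It shows that simple prefix matching groups sentences whose matching words mean quite different things.

```python
import re

sentences = [
    "I saw Pathfinder on Mars with a telescope.",
    "Pathfinder photographed Mars.",
    "The Pathfinder photograph mars our perception of a lifeless planet.",
    "The Pathfinder photograph from Ford has arrived.",
    "The Pathfinder forded the river without marring its paint job.",
]

def prefix_hits(prefix, texts):
    """Return indices of texts containing a word that starts with prefix.

    Matching is case-insensitive, so "Mars" (planet) and "mars" (verb)
    are treated as the same string. Hypothetical helper for illustration.
    """
    prefix = prefix.lower()
    return [
        i for i, text in enumerate(texts)
        if any(word.startswith(prefix)
               for word in re.findall(r"[a-z]+", text.lower()))
    ]

for prefix in ("pathfinder", "photograph", "mar", "ford"):
    print(prefix, prefix_hits(prefix, sentences))
```

Every sentence matches "pathfinder", although the word refers to a spacecraft in some sentences and a vehicle in others; the matcher has no way to tell them apart.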
Why Text is Easy
• Text is easier when you have a lot of it
  – Highly redundant
  – Because people are good at finding associations, just about any simple algorithm can get “good” results for coarse tasks
    • Pull out “important” phrases
    • Find “meaningfully” related words
    • Create “summary” from document
  – Major problem: Evaluation
• People usually search on relatively coarse meanings
Why Text is Easy
• Pretty much any simple technique can pull out phrases that seem to characterize a document
• Most frequent words from a lecture last fall:
109 slide 69 to 37 view 37 version 37 graphic 37 first
37 back 36 previous 36 next 32 of 31 the
30 recall 28 relevant 27 precision 25 retrieved 25 documents
21 and 18 evaluate 15 a 13 what 13 vs 13 how
12 trec 12 is 12 high 12 for 10 relevance
10 queries 10 on 9 information 8 x 8 why
8 as 8 answer 7 search 7 maron 7 document
7 blair 6 top 6 results 6 measure
6 length 6 in 6 evaluation 6 curves
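A word-frequency list like the one above can be produced with very little code. The following is a minimal sketch, assuming tokens are runs of lowercase letters; the function name `top_words` and the sample text are made up for illustration and are not the lecture's actual data or tooling.

```python
from collections import Counter
import re

def top_words(text, n=10):
    """Return the n most frequent lowercase word tokens with their counts."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens).most_common(n)

# Hypothetical sample text, not the lecture that was actually counted.
sample = ("recall and precision: recall measures coverage, "
          "precision measures accuracy")
print(top_words(sample, 3))
```

No linguistic knowledge is involved: the counts alone surface the words the text keeps returning to.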
Why Text is Easy
• Same text, removing most frequent words in language and most frequent in this text:
30 recall 28 relevant 27 precision 25 retrieved 25 documents
18 evaluate 13 vs 12 trec 12 high 10 relevance
10 queries 9 information 8 x 8 answer 7 search
7 maron 7 document 7 blair 6 top 6 results
6 measure 6 length 6 evaluation 6 curves
• These words can act as a simple summary of the document
  – people are good at inferring the relations
  – redundancy in the word meanings
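The filtering step can be sketched as two stopword sets: one for words frequent in the language and one for words frequent in this particular text (for example, slide navigation labels). The stopword list below is a tiny illustrative assumption, as are the names `LANGUAGE_STOPWORDS` and `summary_words`; real stopword lists are much longer.

```python
from collections import Counter
import re

# Tiny illustrative stopword list (an assumption, not a standard list).
LANGUAGE_STOPWORDS = {
    "the", "of", "to", "and", "a", "is", "for", "on", "in", "as",
    "what", "how", "why",
}

def summary_words(text, extra_stopwords=(), n=10):
    """Count words after dropping language-wide and text-specific stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    skip = LANGUAGE_STOPWORDS | set(extra_stopwords)
    return Counter(t for t in tokens if t not in skip).most_common(n)

# Hypothetical text with navigation words mixed into the content.
text = ("next slide: the recall of the system; "
        "previous slide: the precision of the system and recall")
print(summary_words(text, extra_stopwords={"slide", "next", "previous"}, n=3))
```

What remains after both filters is a short list of content words, which is the kind of list a reader can treat as a rough summary.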
Text Collection Overviews
• How can we show an overview of the contents of a text collection?
  – show info external to the docs
    • e.g., date, author, source, number of inlinks
    • does not show what they are about
  – show the meanings or topics in the docs
    • show a list of titles
    • show results of clustering words or documents
    • organize according to categories
  – how to show arbitrary subsets?
Showing Collection Overviews
• Showing the Documents
  – External Metadata
    • e.g., author, date, hyperlink connectivity
    • Does not show what the documents are about
  – Visualizations of Document Clusters
    • Mapping document clusters into nearby points
    • Networks with Force-Directed Placement
    • Kohonen Feature Maps
  – Zoomable “Landscapes”
Showing Collection Overviews
• Distinguish between
  – showing the documents
  – showing the words/concepts
• Distinguish between
  – a general overview
  – a query-centered view
Clustering for Collection Overviews
• Two main steps
  – cluster the documents according to the words they have in common
  – map the cluster representation onto an (interactive) 2D or 3D representation
• Since text has tens of thousands of features
  – the mapping to 2D loses a tremendous amount of information
  – only very coarse themes are detected
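The first step, clustering documents by the words they share, can be sketched with bag-of-words vectors and cosine similarity. This is a deliberately simple stand-in, not the method used by any system cited in this lecture: the greedy single-pass grouping, the threshold of 0.3, and all function names are assumptions chosen for illustration.

```python
from collections import Counter
import math
import re

def bag_of_words(text):
    """Map a document to its word counts (one dimension per distinct word)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def greedy_clusters(docs, threshold=0.3):
    """Assign each doc to the first cluster whose first member is similar enough."""
    vectors = [bag_of_words(d) for d in docs]
    clusters = []  # each cluster is a list of document indices
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# Hypothetical four-document collection.
docs = [
    "recall precision evaluation",
    "precision recall curves",
    "truck manufacture productivity",
    "truck manufacture cities",
]
print(greedy_clusters(docs))
```

Each distinct word is one feature, so even this toy collection has eight dimensions; a real collection has tens of thousands, which is why the later mapping to 2D discards so much.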
Clustering for Collection Overviews
  – Scatter/Gather
    • show main themes as groups of text summaries
  – Scatter Plots
    • show docs as points; closeness indicates nearness in cluster space
    • show main themes of docs as visual clumps or mountains
  – Kohonen Feature maps
    • show main themes as adjacent polygons
  – BEAD
    • show main themes as links within a force-directed placement network
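To make the idea of force-directed placement concrete, here is a minimal spring-layout sketch. It is not BEAD's algorithm: the target distance of 1 minus similarity, the step size, the iteration count, and the function names are all assumptions for illustration. Points start evenly spaced on a circle, and each pair is repeatedly pulled together or pushed apart until its distance approaches the target.

```python
import math

def force_directed_layout(similarity, iterations=200, step=0.05):
    """Place n items in 2D so that similar items end up close together.

    similarity[i][j] is assumed to lie in [0, 1]; the target distance
    between items i and j is 1 - similarity[i][j].
    """
    n = len(similarity)
    # Deterministic start: spread the points evenly on the unit circle.
    pos = [[math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n)]
           for i in range(n)]
    for _ in range(iterations):
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                dx = pos[j][0] - pos[i][0]
                dy = pos[j][1] - pos[i][1]
                dist = math.hypot(dx, dy) or 1e-9
                target = 1.0 - similarity[i][j]
                # Spring force: pull i toward j if they are too far apart,
                # push it away if they are too close.
                pull = step * (dist - target) / dist
                pos[i][0] += pull * dx
                pos[i][1] += pull * dy
    return pos

def distance(p, q):
    """Euclidean distance between two 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Hypothetical similarities: items 0 and 1 are alike, item 2 is unrelated.
similarity = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
layout = force_directed_layout(similarity)
print(distance(layout[0], layout[1]) < distance(layout[0], layout[2]))
```

Three items can always be placed exactly in the plane; with many documents the target distances usually cannot all be satisfied at once, so the layout only approximates them.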
[Figure: Scatter Plot of Clusters (Chen et al. 97)]
[Figure: Kohonen Feature Maps (Lin 92, Chen et al. 97), 594 docs]
Visualizing Concept Overviews
• Huge 2D maps may be an inappropriate focus for information retrieval
  – cannot see what the documents are about
  – documents are forced into one position in semantic space
  – space is difficult to browse for IR purposes
• Perhaps more suited for pattern discovery
  – problem: often only one view on the space
How Useful are Graphical Clusters?
• A study (Kleiboemer et al. 96) compared
  – a system with 2D graphical clusters
  – a system with 3D graphical clusters
  – a system that shows textual clusters
• Novice users
• Only textual clusters were helpful (and they were difficult to use well)