Where does this new information belong?
From developing mining algorithms
to supporting knowledge discovery
Bettina Berendt – thanks for joint work with and support from
Ilija Subašić, Mathias Verbeke,
Siegfried Nijssen, Luc De Raedt
K.U. Leuven
The solution? Automatic topic detection
Period 1 Period 2 Period 3 Period 4
Healthcare agenda
Green energy plan
Opposition to healthcare reform
Another healthcare vote
Climate agenda
A healthcare vote
Peace Nobel Prize
Copenhagen climate summit
Health 0.017, Care 0.015, Insurance 0.013, American 0.013, Uninsured 0.009, Families 0.008, Working 0.005
Same event/document; different interpretations & categorisations
Visionary president
Party-politics (right and left)
Obama's overall agenda
Damp-rag president
Rhetorics
Similar problems in science and learning
Topic detection in time-indexed corpora of news texts
Conference programme!
Text mining
Stream mining Media studies
Music collections, multimedia collections: see Andreas Nürnberger's talk at SML 2010
Similar problems in other areas
The solution? Context-aware systems / personalisation
Female
Has problems with anger management
You probably do / should think about it this way:...
Political activist
is (nearly) green
What users want
left right
squares / circles
green / not green
... to structure the world how they see it
... to re-use their categories (that they worked so hard to find)
... to be able to see through their eyes
interactivity
... to acknowledge that others see the world differently
semantics
Social similarity / diversity
perspective-taking
... to provide data mining methods to do all that!
Research agenda
interactivity
semantics
Social similarity / diversity
perspective-taking
... to provide data mining methods to do all that!
automatic topic detection
support sense-making
= provide methods / tools for Knowledge Discovery (in the full sense)
The problem
Our solution approach
STORIES: mining basics (1): Graphical summarisation of multiple text documents
Document / text pre-processing
• Template recognition
• Multi-document named entities
• Stopword removal, lemmatization
• “Fact (assertion) recognition”
Document summarization strategy
• no topics, but salient concepts & relations
• time window; word-span window
Selection approach for concepts
• concepts = words or named entities
• salient concept = high TF & involved in a salient relation, time-indexed
Similarity measure to determine salient relations
• bursty co-occurrence
Burstiness measure
• time relevance, a “temporal co-occurrence lift”
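The slides name the burstiness measure but do not spell out its formula. As a minimal sketch, one plausible reading of a "temporal co-occurrence lift" is the share of documents in a period that contain both concepts, divided by the same share over the whole corpus. The function names, the document-level co-occurrence (instead of a word-span window) and the exact ratio are assumptions made for illustration, not the STORIES implementation.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs):
    """Count, for each unordered concept pair, the documents containing both."""
    counts = Counter()
    for tokens in docs:
        for pair in combinations(sorted(set(tokens)), 2):
            counts[pair] += 1
    return counts


def time_relevance(period_docs, all_docs):
    """Assumed lift: (pair rate within the period) / (pair rate in the corpus).

    all_docs must include period_docs, so every pair seen in the period
    has a non-zero corpus count.
    """
    period = cooccurrence_counts(period_docs)
    overall = cooccurrence_counts(all_docs)
    scores = {}
    for pair, n_period in period.items():
        rate_period = n_period / len(period_docs)
        rate_overall = overall[pair] / len(all_docs)
        scores[pair] = rate_period / rate_overall
    return scores


# Toy corpus: two periods of pre-processed documents (lists of concepts).
period_1 = [["obama", "president", "health", "care"],
            ["obama", "president", "health", "vote"]]
period_2 = [["obama", "president", "climate", "summit"],
            ["obama", "president", "climate", "summit"]]
scores = time_relevance(period_2, period_1 + period_2)
```

In this toy corpus a pair that only appears in the second period (such as "climate" and "summit") scores above 1, while a pair that appears everywhere (such as "obama" and "president") scores exactly 1, so a threshold on the score would keep only the bursty relations.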
STORIES: mining basics (2): Graph analysis for query recommendation
Aim: highlight subgraphs that represent an event
Topological properties
Change: Subgraph new in this period
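The "subgraph new in this period" idea can be illustrated with a set difference over edges. This is only a sketch under the assumption that a story graph is represented as a collection of undirected concept pairs; the function name `new_edges` is hypothetical.

```python
def new_edges(current_edges, previous_edges):
    """Return the undirected edges present now but absent in the previous period."""
    def normalise(edges):
        # frozenset makes ("a", "b") and ("b", "a") the same edge
        return {frozenset(edge) for edge in edges}
    return normalise(current_edges) - normalise(previous_edges)


previous = [("healthcare", "obama")]
current = [("obama", "healthcare"), ("obama", "nobel")]
changed = new_edges(current, previous)
```

The nodes of the edges in `changed` are candidates for highlighting, and they could also serve as the terms of a recommended query.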
STORIES: evaluation
1. Information retrieval quality
• Edges – events: up to 80% recall, ca. 30% precision
2. Search quality
• Subgraphs index coherent document clusters
3. Learning effectiveness
• Document search with story graphs leads to averages of 67-75% accuracy on judgments of story fact truth
• On average, 1.3-4.7 queries with 3.4-5.2 nodes/words per query
4. Comparison with other temporal text mining methods
• New (and only) framework for cross-method comparison
• Recall- & precision-style metrics
• Different method rankings
Damilicious: functionality basics
Apply my grouping rfid (Security/privacy, Group 2, ...) to the following new search result:
* Show users and how similarly they group
* Apply U4's grouping to my new search result:
Damilicious: mining basics (1): Methods and process
1. Query
2. Automatic clustering
3. Manual regrouping
4. Re-use
   1. Learn classifier & present way(s) of grouping
   2. Transfer the constructed concepts
Features/methods for the conceptual/predictive clustering: Lingo phrases, Lingo clustering, Ripper, co-citation, bibliometric coupling, word or LSA similarity, combinations; k-means, hierarchical
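The re-use step (learn a classifier from a user's grouping, then transfer it to a new search result) can be sketched as follows. Note the swap: the slides list Lingo, Ripper, LSA and other methods, whereas this sketch uses a plain nearest-centroid classifier over word counts as a stand-in. All function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lower-cased whitespace tokens as a term-count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def learn_grouping(grouped_docs):
    """Step 4.1: one centroid (summed term counts) per user-defined group."""
    model = {}
    for label, texts in grouped_docs.items():
        centroid = Counter()
        for text in texts:
            centroid.update(bag_of_words(text))
        model[label] = centroid
    return model


def apply_grouping(model, new_docs):
    """Step 4.2: assign each new document to the most similar group."""
    return {
        text: max(model, key=lambda label: cosine(bag_of_words(text), model[label]))
        for text in new_docs
    }


# A user's manual regrouping of an "rfid" search result (toy titles).
my_grouping = {
    "Security/privacy": ["rfid privacy threats", "rfid tag security protocol"],
    "Group 2": ["rfid supply chain logistics", "rfid inventory tracking retail"],
}
new_result = ["privacy preserving rfid authentication protocol",
              "retail inventory management with rfid"]
assigned = apply_grouping(learn_grouping(my_grouping), new_result)
```

Passing another user's grouping to `learn_grouping` instead of one's own would give the "apply U4's grouping to my new search result" behaviour, under the same assumptions.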
Damilicious: mining basics (2): Measures of grouping and user diversity
• “How similarly do two users group documents?”
• For each query q, consider their groupings gr:
Diversity = 1 – similarity = 1 – Normalized mutual information (entropy-based measure)
NMI = 0
• For several queries: aggregate
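A minimal sketch of the diversity measure, assuming that two users' groupings of the same result list are given as label sequences in the same document order. The slides do not state which NMI normalisation or which aggregation over queries is used; the geometric-mean normalisation and the plain average below are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a labelling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two labellings of the same documents."""
    n = len(a)
    joint, count_a, count_b = Counter(zip(a, b)), Counter(a), Counter(b)
    return sum((c / n) * math.log(c * n / (count_a[x] * count_b[y]))
               for (x, y), c in joint.items())


def nmi(a, b):
    """NMI with geometric-mean normalisation (an assumed variant)."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0 or h_b == 0:
        # Degenerate case: a grouping that puts everything in one group.
        return 1.0 if h_a == h_b else 0.0
    return mutual_information(a, b) / math.sqrt(h_a * h_b)


def diversity(groupings_u, groupings_v):
    """Mean of 1 - NMI over the queries both users have grouped."""
    shared = groupings_u.keys() & groupings_v.keys()
    if not shared:
        raise ValueError("the two users share no query")
    return sum(1 - nmi(groupings_u[q], groupings_v[q]) for q in shared) / len(shared)


# Same partition under different names, and two unrelated partitions.
same = nmi([0, 0, 1, 1], ["x", "x", "y", "y"])
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

Group names do not matter for this measure: two users who split the documents identically but label the groups differently still get similarity 1, that is, diversity 0.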
Damilicious: evaluation
• Clustering: Does it generate meaningful document groups?
– yes (tradition in bibliometrics)
– but: data?
– Small expert evaluation of CiteseerCluster
• Choosing the clustering and classification methods for conceptual clustering
– Experiments: different features, clustering methods, classification methods; quality of reconstruction and extension-over-time (NMI)
• Technology acceptance
– End-user experiment (clustering & regrouping)
– 5-person formative user study (transfer of own results)
Conclusions and (some) questions
• Sense-making involves
– Extracting information from texts
– Extracting structural information between entities
– Creating, using and modifying categories
– Interacting with external representations
– Acknowledging diversity and perspective-taking
– ...
KD approach
Text mining
Graph mining
Semantics
Interactivity
Usage mining and “model-processing” (conceptual / predictive clustering)
• Appropriate mining methods, measures, ...?
• More/better evaluation methods and frameworks?
• Use cases?
To Read
• Subašić, I. & Berendt, B. (2009). Discovery of interactive graphs for understanding and searching time-indexed corpora. Knowledge and Information Systems. DOI 10.1007/s10115-009-0227-x (PDF)
• Berendt, B. & Subašić, I. (2009). STORIES in time: a graph-based interface for news tracking and discovery. In N. Cristianini & M. Turchi (Eds.), Proceedings of Intelligent Analysis and Processing of Web News Content (IAPWNC) at The 2009 IEEE / WIC / ACM International Conferences Web Intelligence (WI'09) / Intelligent Agent Technology (IAT'09). 15 September 2009, Milan, Italy. (Proceedings of WI-IAT.2009, DOI 10.1109/WI-IAT.2009.342, pp. 531-534) (PDF)
• Verbeke, M., Berendt, B., & Nijssen, S. (2009). Data mining, interactive semantic structuring, and collaboration: A diversity-aware method for sense-making in search. In G. Boato & C. Niederee (Eds.), Proceedings of First International Workshop on Living Web, collocated with the 8th International Semantic Web Conference (ISWC-2009), Washington D.C., USA, October 26, 2009. CEUR Workshop Proceedings Vol-515. (PDF)
• Berendt, B. (2010). Diversity in search: what, how, and what for? Talk at Barcelona Media / Yahoo! Research and UPF, 4 March 2010. (PPT)
• Berendt, B., Krause, B., & Kolbe-Nusser, S. (2010). Intelligent scientific authoring tools: Interactive data mining for constructive uses of citation networks. Information Processing & Management, 46(1), 1-10. (PDF)