Concept and Theme Discovery through Probabilistic Models and Clustering
Qiaozhu Mei
Oct. 12, 2005
Concepts and Themes
Language units in biology literature mining:
  Terms
  Phrases
  Entities
  Concepts: tight groups of terms/entities representing semantics (e.g. gene synonyms)
  Themes: loose groups of terms representing topics/subtopics
Theme Discovery
What we’ve got now:
  A generative model to extract k themes from a collection
  Each theme is a language model, represented by the top-probability words in that model
  KL divergence to model the distance/similarity between themes; retrieve the themes most similar to a term group
$$P(w : d) = b \cdot P(w \mid B) + (1 - b) \cdot \sum_{i=1}^{k} \pi_{d,i} \, P(w \mid \theta_i)$$
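To make the mixture model and the KL-based theme comparison concrete, here is a minimal sketch. It assumes that B is a background word distribution with weight b, that each theme is a word distribution, and that each document has one weight per theme; the function names, the smoothing constant `eps`, and the toy numbers are all invented for illustration and are not taken from the slides.

```python
import math

def word_prob(word, background, themes, weights, b):
    """P(w : d) = b * P(w|B) + (1 - b) * sum_i weights[i] * P(w|theme_i)."""
    theme_part = sum(pi * theme.get(word, 0.0) for pi, theme in zip(weights, themes))
    return b * background.get(word, 0.0) + (1.0 - b) * theme_part

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) over the union vocabulary; eps smoothing is an assumption."""
    vocab = set(p) | set(q)
    return sum(
        (p.get(w, 0.0) + eps) * math.log((p.get(w, 0.0) + eps) / (q.get(w, 0.0) + eps))
        for w in vocab
    )

# Toy distributions, invented for illustration only.
background = {"the": 0.5, "bee": 0.25, "mite": 0.25}
themes = [{"bee": 0.75, "queen": 0.25}, {"mite": 0.5, "varroa": 0.5}]
weights = [0.5, 0.5]  # per-document theme weights, summing to 1

p_bee = word_prob("bee", background, themes, weights, b=0.5)
same = kl_divergence(themes[0], themes[0])
different = kl_divergence(themes[0], themes[1])
```

A term group can be treated as a small word distribution of its own, so ranking themes by `kl_divergence` against it is one way to retrieve the most similar themes.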
Theme Discovery (cont.)
What we’ve got now (cont.):
  Use an HMM to segment the whole collection with the extracted themes
  Use MMR to find the most representative and least redundant phrases to represent a theme (currently using n-gram probability as relevance and edit distance as similarity; performance still to be tuned)
Results: http://ucair.cs.uiuc.edu/qmei2/ThemeNavigation.html
Some justifications, fly collection:
  Cluster 0: circadian
  Cluster 1: adh, evolution
  Cluster 2: a mixture of two topics, apoptosis and promoters
  Cluster 6: brain development
  Cluster 8: cell division
  Cluster 12: drosophila immunity
  Cluster 13: nervous systems
  Cluster 14: hedgehog, segment polarity gene
  Cluster 16: Histone, Polycomb
  Cluster 17: visual system
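The MMR phrase selection mentioned on this slide can be sketched as a greedy loop. This is only an illustration under assumptions: the relevance scores stand in for n-gram probabilities, the edit distance is normalized to a similarity in [0, 1], and the trade-off parameter `lam` and the toy phrases are invented here.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def similarity(a, b):
    """Map edit distance to [0, 1]; 1 means identical strings."""
    longest = max(len(a), len(b)) or 1
    return 1.0 - edit_distance(a, b) / longest

def mmr_select(scores, n, lam=0.7):
    """Greedy MMR: balance relevance against similarity to phrases already chosen."""
    chosen = []
    candidates = dict(scores)
    while candidates and len(chosen) < n:
        def mmr(phrase):
            redundancy = max((similarity(phrase, c) for c in chosen), default=0.0)
            return lam * candidates[phrase] - (1.0 - lam) * redundancy
        best = max(candidates, key=mmr)
        chosen.append(best)
        del candidates[best]
    return chosen

# Hypothetical relevance scores (stand-ins for n-gram probabilities).
scores = {"juvenile hormone": 0.9, "juvenile hormones": 0.85, "vibration signal": 0.6}
picked = mmr_select(scores, n=2, lam=0.5)
```

With these toy scores the near-duplicate "juvenile hormones" is skipped in favor of the less relevant but non-redundant "vibration signal", which is the behavior the tuning of `lam` controls.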
Theme Discovery (cont.)
Problems:
  How to select k? (How many themes do we believe are in the collection? The bee collection should have a smaller k than the fly collection.)
  Can we find themes in a hierarchical manner? This could solve the former problem; however, when to cut off?
  How to represent a theme?
    Top words: sometimes it is difficult to tell the semantics
    Phrases? Sentences?
  Other possible approaches to extract themes? (LDA, clustering methods)
Hierarchical Theme Discovery
A straightforward approach (top-down splitting):
  Discover k themes from the initial collection
  Segment the collection by the k themes
  For each theme, build a sub-collection from the segments of the previous step
  For each sub-collection, extract k’ themes
  Repeat these steps iteratively
Problem: when to stop the splitting iteration?
[Figure: tree with Collection at the root, split into Theme1, Theme2 and Theme3; Theme2 is split further into Theme2.1, Theme2.2, Theme2.3, ...]
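The top-down splitting procedure can be sketched as a short recursion. This is a sketch under several assumptions: `extract_themes` and `assign` are hypothetical callables standing in for the generative model and the HMM segmentation, whole documents are assigned to a single theme instead of being segmented, the same k is reused at every level, and the stopping rule (`max_depth`, `min_docs`) is invented here because the slides leave that question open.

```python
from collections import Counter

def split_top_down(docs, extract_themes, assign, k, depth=0, max_depth=2, min_docs=4):
    """Recursively split a collection into a tree of themes.

    extract_themes(docs, k) -> list of themes          (hypothetical callable)
    assign(doc, themes)     -> index of the best theme (hypothetical callable)
    """
    node = {"size": len(docs), "children": []}
    # Assumed stopping rule: depth limit or too few documents left.
    if depth >= max_depth or len(docs) < min_docs:
        return node
    themes = extract_themes(docs, k)
    buckets = [[] for _ in themes]
    for doc in docs:
        buckets[assign(doc, themes)].append(doc)
    for theme, bucket in zip(themes, buckets):
        child = split_top_down(bucket, extract_themes, assign, k,
                               depth + 1, max_depth, min_docs)
        child["theme"] = theme
        node["children"].append(child)
    return node

# Toy stand-ins: a "theme" is just one of the k most frequent words.
def toy_extract(docs, k):
    counts = Counter(w for d in docs for w in set(d.split()))
    return [w for w, _ in counts.most_common(k)]

def toy_assign(doc, themes):
    words = doc.split()
    for i, theme in enumerate(themes):
        if theme in words:
            return i
    return 0

docs = ["queen worker", "queen egg", "mite varroa",
        "mite brood", "queen pheromone", "mite host"]
tree = split_top_down(docs, toy_extract, toy_assign, k=2, max_depth=1, min_docs=2)
```

Swapping the toy callables for a real theme extractor and segmenter keeps the recursion unchanged; only the stopping rule needs a principled replacement.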
Hierarchical Theme Discovery (results)
A bee collection with 929 documents
Level 1: 5 themes
Level 2: 3 sub-themes for each higher-level theme
Hierarchical Theme Discovery (results)
Level 1 themes (top words):
  Theme 1: african, jelly, royal, european, venom, population, africanized, sting, kda, feral, m, reward, subspecies, proteins, patients, discrimination, naja, cue, characters, areas
  Theme 2: queen, workers, worker, signal, jh, vibration, pheromone, gland, eggs, signals, hormone, juvenile, anarchistic, queens, egg, iridaceae, policing, ixia, behavioral, age
  Theme 3: pollinator, plants, pollination, flowers, plantae, spermatophyta, angiospermae, dicotyledones, pollen, seed, fruit, angiosperms, spermatophytes, vascular, dicots, crop, plant, flower, pollinators, species
  Theme 4: learning, brain, conditioning, olfactory, neural, neurons, mushroom, memory, sucrose, nervous, coordination, dopamine, extension, antennal, odor, system, proboscis, bodies, lobe, kenyon
  Theme 5: varroa, mite, mites, jacobsoni, acarina, brood, parasite, colonies, host, control, chelicerata, chelicerates, hygienic, viruses, infestation, destructor, pest, infested, parasitology, mortality
Hierarchical Theme Discovery (results)
Level 2, sub-themes of Theme 1 (african, jelly, royal, ...):
  Theme 1.1: african, european, population, populations, patterns, pattern, genetic, discrimination, mitochondrial, studies, information, are, contrast, green, two, bees, have, derived, africa, subspecies
  Theme 1.2: larvae, microorganisms, gram, bacteria, 0, colonies, royal, queen, jelly, eubacteria, non, workers, queens, production, 2, nest, italian, 5, fraction, nestmates
  Theme 1.3: venom, reward, patients, naja, kda, proteins, wasp, protein, diptera, pla2, vespula, primates, hominidae, chordata, vertebrata, mug, sting, sperm, dose, quality
Hierarchical Theme Discovery (results)
Level 2, sub-themes of Theme 2 (queen, workers, worker, ...):
  Theme 2.1: queen, worker, workers, colonies, pollen, vibration, eggs, foraging, development, brood, signal, queens, bees, anarchistic, behavioral, iridaceae, larvae, egg, pheromone, may
  Theme 2.2: food, foragers, dance, transfer, enzyme, biosynthesis, receivers, contrast, nectar, flight, source, flow, water, information, rates, ddt, rj, caucasian, visual, green
  Theme 2.3: mammals, vertebrates, venom, nonhuman, l, ml, models, model, chordates, beeswax, mug, omega, embryo, mammalia, vertebrata, has, chordata, nurse, coloured, vg
Hierarchical Theme Discovery (results)
Level 2, sub-themes of Theme 3 (pollinator, plants, pollination, ...):
  Theme 3.1: seed, per, crop, sunflower, number, cruciferae, fruit, hybrid, agriculture, seeds, quality, cultivar, weight, helianthus, oilseed, compositae, annuus, yield, pollination, set
  Theme 3.2: ecology, is, species, environmental, sciences, flowering, floral, terrestrial, pollinator, visiting, reproduction, plants, c, cashew, self, animalia, food, insects, faba, size
  Theme 3.3: pollen, eep, honeybees, mating, bumblebees, sp, hive, bacteria, scent, mimosa, brazil, undertakers, chromatography, marks, recently, gram, eubacteria, caraway, microorganisms, propolis
Hierarchical Theme Discovery (results)
Level 2, sub-themes of Theme 4 (learning, brain, conditioning, ...):
  Theme 4.1: dopamine, levels, development, age, binding, pupal, brain, octopamine, division, adult, colonies, labor, glass, treated, colony, ryr, pigmentation, chromosomes, arolium, da
  Theme 4.2: bees, sucrose, conditioning, response, learning, extension, proboscis, pollen, foragers, performance, between, thresholds, honeybees, solution, discrimination, strain, rate, foraging, concentration, low
  Theme 4.3: imidacloprid, current, memory, mushroom, neurons, 1, expressed, 4, cells, antennal, mb, bodies, currents, nervous, brain, mv, kinase, receptors, term, protein
Hierarchical Theme Discovery (results)
Level 2, sub-themes of Theme 5 (varroa, mite, mites, ...):
  Theme 5.1: mite, varroa, mites, brood, jacobsoni, acarina, colonies, parasite, for, worker, control, a, drone, formic, population, acid, host, 0, cells, treatment
  Theme 5.2: pollen, bees, foragers, their, or, ta, heat, at, hygienic, foraging, protein, activity, behaviour, increased, response, blood, flight, strips, metabolic, removal
  Theme 5.3: viruses, larvae, microorganisms, virus, bacteria, animal, paenibacillus, infection, molecular, pathogen, eubacteria, gram, forming, endospore, positive, sp, apv, entomopathogen
Phrase Representations:
biochemistry and molecular biophysics endocrine system chemical coordination and homeostasis molecular genetics biochemistry and molecular biophysics sense organs sensory reception animals arthropods chordates insects invertebrates mammals system chemical coordination and homeostasis vertebrata chordata animalia honey bee behavior terrestrial ecology mammalia vertebrata chordata animalia juvenile hormone queen rodentia mammalia vertebrata chordata animalia worker laid eggs vibration signal genetics biochemistry and molecular biophysics dufour s gland mammals nonhuman mammals workers egg laying queen mandibular gland pheromone nonhuman vertebrates iridaceae ixia arthropoda invertebrata animalia muridae aves vertebrata chordata animalia mug ml
Hierarchical Theme Discovery (cont.)
A bottom-up agglomerative approach:
  Find many micro-themes
  Group similar micro-themes into larger ones
  Borrow a strategy from data mining:
    BIRCH: incrementally form many micro-clusters, organized in a tree structure
    Macro-clustering based on the micro-clusters
Problem: again, when to stop?
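The grouping of micro-themes into larger ones can be sketched as a plain agglomerative loop. Note that this is not BIRCH: it is a simple quadratic merge loop swapped in for illustration. The use of Jensen-Shannon divergence as the distance, the size-weighted averaging on merge, and the `max_divergence` stopping threshold are all assumptions made here, since the slides leave the stopping question open.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two word distributions (dicts)."""
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}
    def kl_to_mean(a):
        return sum(a[w] * math.log(a[w] / m[w]) for w in a if a[w] > 0)
    return 0.5 * kl_to_mean(p) + 0.5 * kl_to_mean(q)

def merge(p, q, wp, wq):
    """Size-weighted average of two distributions."""
    total = wp + wq
    return {w: (wp * p.get(w, 0.0) + wq * q.get(w, 0.0)) / total
            for w in set(p) | set(q)}

def agglomerate(micro_themes, max_divergence=0.4):
    """Merge the closest pair repeatedly until no pair is close enough."""
    clusters = [(dict(t), 1) for t in micro_themes]  # (distribution, size)
    while len(clusters) > 1:
        pairs = [(js_divergence(clusters[i][0], clusters[j][0]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        d, i, j = min(pairs)
        if d > max_divergence:  # assumed stopping criterion
            break
        (p, wp), (q, wq) = clusters[i], clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append((merge(p, q, wp, wq), wp + wq))
    return clusters

# Toy micro-themes, invented for illustration.
micro = [{"queen": 0.5, "worker": 0.5},
         {"queen": 0.5, "egg": 0.5},
         {"mite": 0.5, "varroa": 0.5}]
clusters = agglomerate(micro, max_divergence=0.4)
```

On the toy input the two queen-related micro-themes merge and the mite theme stays separate; a different threshold changes where the hierarchy is cut.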
Hierarchical Theme Discovery (cont.)
Model-based approach:
  Hofmann, IJCAI 99
  Assume we know the collection is generated from a hierarchical structure, and use a generative model to learn the themes (e.g. make use of GO hierarchies)
Problem: in most cases we don’t know the hierarchies.
Other Research Problems
Represent a theme:
  Using top words: where to cut?
  Using phrases: have to tune the MMR (many possible strategies and parameters)
  Using sentences? Like summarization
Themes are interesting, but how to make use of them?
How to evaluate themes?
Concept Extraction
What we have now:
  N-gram algorithm (actually 2-gram): iteratively group the pair of terms that are most likely to be replaceable, considering the context of one term before/after each of them.
  Time complexity: O(N^3). Space complexity: currently O(N^2). The Beespace server can handle <= 9000 terms now (2.4 GB memory). (Performance has not been evaluated, because only small data sizes can be handled.)
  Problem: the method is based on mutual information, which prefers low-frequency 2-grams, and it does not make use of more distant context.
  Will removing stop words help or hurt the performance?
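One merge step of the "most replaceable pair" idea can be illustrated as follows. This is a simplified stand-in and not the mutual-information criterion described above: it represents each term by counts of the terms seen directly before and after it, and picks the pair with the highest cosine similarity. The function names and the toy sentences are invented for illustration.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations

def context_vectors(sentences):
    """Count, for each term, the terms seen directly before and after it."""
    ctx = defaultdict(Counter)
    for sent in sentences:
        toks = ["<s>"] + sent.split() + ["</s>"]
        for prev, tok, nxt in zip(toks, toks[1:], toks[2:]):
            ctx[tok][("L", prev)] += 1
            ctx[tok][("R", nxt)] += 1
    return ctx

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[f] * b[f] for f in a if f in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def most_replaceable_pair(sentences):
    """Return the pair of terms whose left/right contexts are most alike."""
    ctx = context_vectors(sentences)
    return max(combinations(sorted(ctx), 2),
               key=lambda pair: cosine(ctx[pair[0]], ctx[pair[1]]))

# Toy sentences, invented for illustration.
sentences = ["the Dsrc gene is expressed",
             "the Dabl gene is expressed",
             "mites infest the colony"]
pair = most_replaceable_pair(sentences)
```

Comparing all pairs is quadratic in the vocabulary size, and repeating the search for each merge gives the cubic cost noted above; an iterative version would replace the chosen pair with a single class symbol and recount.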
Some findings:
A small dataset (200+ abstracts containing gene synonyms)
Only 600 iterations (600 merges)
Most of the merges are reasonable, but not really useful
  E.g. head-to-head tail-to-tail
  E.g. within-locus between-locus
  FBgn0000017: Dsrc Dabl
  FBgn0000078: amylase-null AMY-null
Problem: the document set is too small and the n-grams too sparse to find useful concepts.
Concept Extraction (cont.)
Other possible strategy:
  Lin et al., KDD 02: use a feature vector to represent each term, where the weights are the mutual information between the term and each context feature. This is more flexible than n-grams. (If only 2-grams are considered as context features, this is similar to what we have.)
  Use a committee to represent a cluster, which ensures that the clusters are tight and robust.
Problem: not sure how to select features
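The feature-vector representation can be sketched with pointwise mutual information as the weight. This is a minimal sketch under assumptions: the input is a list of (term, context feature) co-occurrence pairs, the weights are plain PMI estimated from counts, and terms are compared with cosine similarity. The committee step is not shown, and the toy pairs are invented for illustration.

```python
import math
from collections import Counter

def pmi_vectors(pairs):
    """Build one PMI-weighted feature vector per term.

    pairs: iterable of (term, context_feature) co-occurrences.
    """
    pairs = list(pairs)
    n = len(pairs)
    joint = Counter(pairs)
    term_count = Counter(t for t, _ in pairs)
    feat_count = Counter(f for _, f in pairs)
    vectors = {}
    for (t, f), c in joint.items():
        # PMI = log( P(t, f) / (P(t) * P(f)) ), estimated from counts.
        pmi = math.log((c * n) / (term_count[t] * feat_count[f]))
        vectors.setdefault(t, {})[f] = pmi
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(a[f] * b[f] for f in a if f in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Toy co-occurrences, invented for illustration.
pairs = [("Dsrc", "gene"), ("Dsrc", "kinase"),
         ("Dabl", "gene"), ("Dabl", "kinase"),
         ("mite", "brood"), ("mite", "colony")]
vectors = pmi_vectors(pairs)
sim_synonyms = cosine(vectors["Dsrc"], vectors["Dabl"])
sim_unrelated = cosine(vectors["Dsrc"], vectors["mite"])
```

Because the features are arbitrary labels, the same code accepts syntactic relations, window words, or 2-gram neighbors; which of these to use is exactly the feature selection question raised above.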
Summary
Theme extraction:
  Generally performs well, if we can find a good k.
  Hierarchical clustering can solve this problem, but we still need a reasonable stopping criterion.
  Representation is an interesting problem: MMR phrase extraction should be tuned further.
  Difficult to evaluate other than by expert judgment.
Concept extraction:
  N-gram has space constraints: haven’t really tested the performance.
  Generally, the performance should be better on large data sets.
  Other clustering algorithms can be explored.