Topic Models
"Words are how you use them" L. Wittgenstein"You shall know a word by the company it keeps" J.R. Firth
The Measurement of Meaning (1957) Osgood, Suci & Tannenbaum
Classification Space (1964) P.G. Ossorio
Latent Dirichlet Allocation (2000) J. Pritchard, (2003) D. Blei
Word2Vec (2013) T. Mikolov
Statistical Factor Space (1990) M.J. Kurtz
Finding Your Literature Match - A Recommender System (2011) E.A. Henneken
Looking for the Unknown
Swanson (1986) Fish oil cures Raynaud's disease, the ABC Method
"Fish oil, Raynaud's syndrome, and undiscovered public knowledge"
https://www.ncbi.nlm.nih.gov/pubmed/3797213
Vetle Torvik, Arrowsmith
https://indigo.uic.edu/handle/10027/31
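Swanson's ABC method links two literatures that never cite each other through intermediate "B" terms discussed in both. A minimal sketch, with hypothetical term sets (illustrative only, not drawn from the actual fish-oil or Raynaud's corpora):

```python
# Literature A (fish oil) and literature C (Raynaud's disease) do not cite
# each other, but both discuss intermediate "B" concepts. Shared B terms
# suggest a hidden A-C connection worth testing.
# These term sets are hypothetical examples.
a_terms = {"blood viscosity", "platelet aggregation", "vascular reactivity", "triglycerides"}
c_terms = {"blood viscosity", "platelet aggregation", "vasoconstriction", "cold exposure"}

b_links = a_terms & c_terms  # candidate bridges for the A-C hypothesis
print(sorted(b_links))
```

In Swanson's actual study, intermediate concepts of this kind led to the (later confirmed) hypothesis that fish oil could benefit Raynaud's patients.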
From Data to Decision
Machine Learning Using ADS Records
Golnaz Shapurian, Michael J. Kurtz, Alberto Accomazzi
ADS Users Group Meeting - 11/29/2018
Motivation
• We have over 13 million records in the ADS, and we are adding approximately 100 thousand per month
• Need a tool to automatically organize, summarize and understand these records
• Text Clustering
• Topic Modeling – Latent Dirichlet Allocation (LDA), developed by David Blei et al. at Berkeley in 2003
• Word Embedding – Word2Vec, developed by Tomas Mikolov et al. at Google in 2013
Use Cases
• Shrink a large collection of text data down to a few keywords (or sequences of keywords)
1. Reducing the number of resources required for searching and retrieving information
2. Clustering or searching a huge number of documents (including very large documents) becomes clustering or searching the keywords (topics)
3. Recommend similar articles using the topics
• Automatically tag new incoming text data using the topics learned
1. Classifying a multi-disciplinary article into multiple collections
2. Distinguishing similar acronyms from multiple disciplines
• Discover relationships and concept similarities that exist in the collection
Topic Modeling – LDA
• The intuition behind the LDA topic model is that words belonging to a topic appear together in documents
• Given a collection of documents:
• LDA discovers the hidden structure (topics and probabilities)
• LDA represents each document as a mixture of topics (words and probabilities)
• Computes:
• Per-collection: topics (words and their distribution)
• Per-document: topic proportions
• Per-word: topic assignment
LDA – Generative Model
[Figure: topics as word distributions, documents, and per-document topic proportions. Example topics:
NASA astrophysics data system 0.02, information providers 0.02, bibliographic records 0.01, …
community-driven effort 0.04, conversations 0.02, mission statement 0.01, participation in conferences 0.01, …
literature software data products 0.02, publication venues 0.01, digital platforms 0.01, …]
Each topic is a distribution over words
Each document is a mixture of corpus-wide topics
Each word is drawn from one of those topics
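The three statements above are exactly LDA's generative story, and can be sketched directly. The two topics, the tiny vocabulary, and the symmetric Dirichlet prior below are toy assumptions, not anything learned from ADS data:

```python
import random

random.seed(0)

# Two hypothetical topics, each a distribution over a tiny vocabulary.
topics = {
    "instrumentation": {"telescope": 0.5, "detector": 0.3, "galaxy": 0.2},
    "galaxies":        {"galaxy": 0.6, "redshift": 0.3, "telescope": 0.1},
}

def dirichlet(alpha, k):
    """Draw from a symmetric Dirichlet by normalizing Gamma draws."""
    draws = [random.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, alpha=0.5):
    names = list(topics)
    theta = dirichlet(alpha, len(names))  # per-document topic proportions
    words = []
    for _ in range(n_words):
        # per-word topic assignment, then a word drawn from that topic
        topic = random.choices(names, weights=theta)[0]
        dist = topics[topic]
        words.append(random.choices(list(dist), weights=dist.values())[0])
    return theta, words

theta, doc = generate_document(8)
```

Fitting LDA is the inverse problem: recover the topics and the per-document proportions from a corpus of documents generated this way.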
LDA – Topic Modeling (Accomazzi et al. 2018)
Most probable words in five of the topics (x-axis: probability):
[Figure: five bar charts of per-topic word probabilities. Recoverable terms include: papers, eprints, literature, digital platforms, literature software data products, publication venues, collection records, references; services, years, system, participation conferences, conversations, community-driven effort, publications, mission statement; transition, challenges, infrastructure, records collection, application, sheets, access, services, users; community, activities, NASA astrophysics data system, librarians, discovery platform, mission statement; information providers, scientist, bibliographic records, users, claims, libraries, interface]
LDA – Learn and Predict
• Fitting: using training documents, the system learns both the document-topic distribution and the topic-word distribution
• Inference: for new documents, the topic-word distribution is fixed (already learned), so only the document-topic distribution needs to be inferred
Galaxies Topic: stellar, evolution, haloes, active, clusters, redshift, ISM, dwarf, formation, spiral
Stars Topic: oscillations, pulsations, massive, formation, magnetic field, magnetars, mass-loss, neutron, oscillations
X-rays Topic: ISM, diffuse, stars, galaxies, binaries, accretion, pulsars, radiation, thermal, clusters
Word Embedding – Word2Vec
• John Rupert Firth (1957, "A synopsis of linguistic theory"):
• "You shall know a word by the company it keeps"
• "The complete meaning of a word is always contextual, and no study of meaning apart from context can be taken seriously"
• Word2Vec learns a vector-space representation for each word, based on the local word collocation patterns that are observed in a text corpus
• Maps words/sentences/paragraphs to vectors in a high-dimensional space (100-1000 dimensions)
• Semantically related words are close together in the vector space, and semantically unrelated words are far apart
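Closeness in the vector space is usually measured with cosine similarity. A minimal sketch, with hypothetical 3-dimensional vectors (real word2vec embeddings have 100-1000 dimensions):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity: 1.0 for parallel vectors, near 0 for unrelated ones."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Hypothetical embeddings: "galaxy" and "redshift" point in similar
# directions, "sunspot" does not.
vec = {
    "galaxy":   [0.9, 0.1, 0.1],
    "redshift": [0.8, 0.2, 0.1],
    "sunspot":  [0.1, 0.9, 0.2],
}
```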
[Figure: snippets of source text all containing the shared target word "galaxies", e.g.:
"small sizes for faint z~2-8 galaxies"
"constraints on the sizes of the faint galaxies"
"low surface luminosity dwarf galaxies"
"present a sample of 80 candidate strong lens galaxies"
"results of a deep image survey of virgo clusters of galaxies"
"kinematics of dispersion supported galaxies"]
Word2Vec – Distributional Semantics
Source text (considering nouns only) → co-occurrence pairs, using a context window (=2) around each target word:
(galaxies, sizes) (galaxies, faint) (galaxies, hubble) (galaxies, frontier) …
(galaxies, nature) (galaxies, luminosity) (galaxies, hubble) (galaxies, space) …
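Generating the (target, context) pairs with a window of 2 can be sketched as follows; the token list is a noun sequence like those in the snippets above:

```python
def training_pairs(tokens, window=2):
    """Emit a (target, context) pair for every word within the window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

# Nouns only, window = 2, as on the slide.
tokens = ["sizes", "faint", "galaxies", "hubble", "frontier"]
pairs = training_pairs(tokens)
```

Every such pair becomes one training example; words that keep the same company therefore end up with similar vectors.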
Word2Vec – Bag-of-Words Architecture
[Figure: network diagram. Input layer: V-dim one-hot vectors (a '1' in the position corresponding to the word, e.g. "galaxies"); hidden layer: N nodes with input weight matrix W_I (V×N); output layer: V nodes with output weight matrix W_O (V×N), giving the probability that the word at a randomly chosen, nearby position is e.g. "hubble", "quasar", "sizes", or "frontier"]
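The architecture can be sketched as a single forward pass. The vocabulary, layer sizes, and random (untrained) weights below are toy assumptions:

```python
import math
import random

random.seed(1)

# Toy vocabulary and layer sizes; real models use V in the tens of
# thousands and N of 100-1000.
vocab = ["galaxies", "hubble", "quasar", "sizes", "frontier"]
V, N = len(vocab), 3

# Weight matrices as they would be before training: random.
W_in = [[random.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]   # V x N
W_out = [[random.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(N)]  # N x V

def forward(word):
    """One-hot input -> hidden layer -> softmax over the vocabulary."""
    hidden = W_in[vocab.index(word)]  # a one-hot input just selects one row of W_in
    scores = [sum(hidden[n] * W_out[n][v] for n in range(N)) for v in range(V)]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]  # P(word at a nearby position | input word)

probs = forward("galaxies")
```

After training, the rows of W_in are exactly the word embeddings; the softmax output layer exists only to drive the learning.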
Word2Vec – Word Embedding
[Figure: 2D representation of the word2vec vocabulary]
2,048 words; 9,224 abstracts from the 2017 volumes of ApJ, A&A, and MNRAS
Word2Vec – Similarity
[Figure: for each query word (germany, october, supersolar, astronomy, database, nasa), a bar chart of its most similar words by similarity score. Recoverable neighbor lists include: november, march, september, december, january, august, june, april, july (october); insu-cnrs, france, spain, italy, tenerife, switzerland, canary, inaf, roque (germany); depletion, mg, fe, ne, refractory, oxygen, li, abundance, cemp (supersolar); cooperative, university, foundation, aura, council, research, nasa, canada, cnrs; training, online, atlas, public, resources, machine, apogee, vipers, lamost]
Word2Vec – Similarity (Astronomy Terms)
[Figure: most similar words for each query term, by similarity score:
jupiter → extrasolar, neptune, habitat, grazing, exoplanets, climate, rocks, saturn, mercury, greenhouse
sunspot → plage, umbra, cycle, penumbra, latitude, polarity, noaa, ribbon, butterfly, successive
maser → methanol, iras, baseline, vlbi, class, carma, protostars, bure, karl
solar → storm, helioseismology, weather, soho, ulysses, coronal, stereo, plage, cme, hinode
exoplanets → transit, extrasolar, jupiter, habitat, transmission, tess, jwst, neptune, climate, haze
spectrograph → echelle, gemini, hermes, tololo, campanas, sweden, africa
redshift → lbgs, ultravista, vimos, boss, cmass, redmapper, candels, bin]
Word2Vec – Analogy
september is to april as france is to spain
Word2Vec – Analogy (Astronomy Terms)
spiral is to arm as galaxies is to lenticular
cepheid is to luminosity as scuti is to doradus
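Analogies like these are typically resolved by vector arithmetic: the answer is the word nearest to b - a + c. A sketch with hypothetical 2-d vectors, constructed so that the month and country offsets line up as on the previous slide:

```python
# Hypothetical 2-d embeddings; real analogies use the trained
# high-dimensional vectors.
vec = {
    "september": [1.0, 0.0], "april": [1.0, 1.0],
    "france":    [5.0, 0.1], "spain": [5.0, 1.1],
    "germany":   [5.2, 0.2],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def analogy(a, b, c):
    """a is to b as c is to ?  ->  word nearest to (b - a + c)."""
    target = [vb - va + vc for va, vb, vc in zip(vec[a], vec[b], vec[c])]
    candidates = [w for w in vec if w not in (a, b, c)]  # exclude the inputs
    return max(candidates, key=lambda w: cosine(vec[w], target))
```

Here the september→april offset added to france points closer to spain than to germany, reproducing the slide's analogy.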
Word2Vec – Vector Distance
Source: https://www.kdnuggets.com/2017/04/cartoon-word2vec-espresso-cappuccino.html
LDA and Word2Vec K-means Cluster Comparison
• K-means clustering of LDA's document-topic probability vectors
• K-means clustering of Word2Vec vector-space representations
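Both comparisons rest on plain K-means. A minimal sketch of Lloyd's algorithm, applied to hypothetical document-topic probability vectors (the same code would cluster word2vec document vectors):

```python
import random

random.seed(0)

def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm: assign points to nearest centroid, recompute."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        centroids = [
            [sum(dim) / len(c) for dim in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical document-topic vectors: two documents dominated by topic 1,
# two dominated by topic 2.
docs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
centroids, clusters = kmeans(docs, 2)
```

Comparing the two resulting partitions (LDA-based vs word2vec-based) shows how much the two representations agree about document similarity.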