Topic Models
"Words are how you use them" L. Wittgenstein"You shall know a word by the company it keeps" J.R. Firth
The Measurement of Meaning (1957) Osgood, Suci & Tannenbaum
Classification Space (1964) P.G. Ossorio
Latent Dirichlet Allocation (2000) J. Pritchard, (2003) D. Blei
Word2Vec (2013) T. Mikolov
Statistical Factor Space (1990) M.J. Kurtz
Finding Your Literature Match - A Recommender System (2011) E.A. Henneken
Looking for the Unknown
Swanson (1986) Fish oil cures Raynaud's disease, the ABC Method
"Fish oil, Raynaud's syndrome, and undiscovered public knowledge"
https://www.ncbi.nlm.nih.gov/pubmed/3797213
Vetle Torvik, Arrowsmith
https://indigo.uic.edu/handle/10027/31
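Swanson's ABC method links two literatures that never cite each other through intermediate "B" terms discussed in both. A minimal sketch, with hypothetical term sets (illustrative only, not drawn from the actual fish-oil or Raynaud's corpora):

```python
# Literature A (fish oil) and literature C (Raynaud's disease) do not cite
# each other, but both discuss intermediate "B" concepts. Shared B terms
# suggest a hidden A-C connection worth testing.
# These term sets are hypothetical examples.
a_terms = {"blood viscosity", "platelet aggregation", "vascular reactivity", "triglycerides"}
c_terms = {"blood viscosity", "platelet aggregation", "vasoconstriction", "cold exposure"}

b_links = a_terms & c_terms  # candidate bridges for the A-C hypothesis
print(sorted(b_links))
```

In Swanson's actual study, intermediate concepts of this kind led to the (later confirmed) hypothesis that fish oil could benefit Raynaud's patients.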
From Data to Decision
Machine Learning Using ADS Records
Golnaz Shapurian, Michael J. Kurtz, Alberto Accomazzi
ADS Users Group Meeting - 11/29/2018
Motivation
• We have over 13 million records in the ADS, and we are adding approximately 100 thousand per month
• Need a tool to automatically organize, summarize and understand these records
• Text Clustering
• Topic Modeling – Latent Dirichlet Allocation (LDA), developed by David Blei et al. at Berkeley in 2003
• Word Embedding – Word2Vec, developed by Tomas Mikolov et al. at Google in 2013
Use Cases
• Shrink a large collection of text data down to a few keywords (or sequences of keywords)
1. Reducing the number of resources required for searching and retrieving information
2. Clustering or searching a huge number of documents (including very large documents) becomes clustering or searching the keywords (topics)
3. Recommend similar articles using the topics
• Automatically tag new incoming text data using the topics learned
1. Classifying a multi-disciplinary article into multiple collections
2. Distinguishing similar acronyms from multiple disciplines
• Discover relationships and concept similarities that exist in the collection
Topic Modeling – LDA
• The intuition behind the LDA topic model is that words belonging to a topic appear together in documents
• Given a collection of documents:
• LDA discovers the hidden structure (topics and probabilities)
• LDA represents each document as a mixture of topics (words and probabilities)
• Computes:
• Per-collection: topics (words and their distribution)
• Per-document: topic proportions
• Per-word: topic assignment
LDA – Generative Model
[Figure: topics as word distributions, documents, and per-document topic proportions. Example topics:
NASA astrophysics data system 0.02, information providers 0.02, bibliographic records 0.01, …
community-driven effort 0.04, conversations 0.02, mission statement 0.01, participation in conferences 0.01, …
literature software data products 0.02, publication venues 0.01, digital platforms 0.01, …]
Each topic is a distribution over words
Each document is a mixture of corpus-wide topics
Each word is drawn from one of those topics
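The three statements above are exactly LDA's generative story, and can be sketched directly. The two topics, the tiny vocabulary, and the symmetric Dirichlet prior below are toy assumptions, not anything learned from ADS data:

```python
import random

random.seed(0)

# Two hypothetical topics, each a distribution over a tiny vocabulary.
topics = {
    "instrumentation": {"telescope": 0.5, "detector": 0.3, "galaxy": 0.2},
    "galaxies":        {"galaxy": 0.6, "redshift": 0.3, "telescope": 0.1},
}

def dirichlet(alpha, k):
    """Draw from a symmetric Dirichlet by normalizing Gamma draws."""
    draws = [random.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, alpha=0.5):
    names = list(topics)
    theta = dirichlet(alpha, len(names))  # per-document topic proportions
    words = []
    for _ in range(n_words):
        # per-word topic assignment, then a word drawn from that topic
        topic = random.choices(names, weights=theta)[0]
        dist = topics[topic]
        words.append(random.choices(list(dist), weights=dist.values())[0])
    return theta, words

theta, doc = generate_document(8)
```

Fitting LDA is the inverse problem: recover the topics and the per-document proportions from a corpus of documents generated this way.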
LDA – Topic Modeling (Accomazzi et al. 2018)
Most probable words in five of the topics (x-axis: probability):
[Figure: five bar charts of per-topic word probabilities. Recoverable terms include: papers, eprints, literature, digital platforms, literature software data products, publication venues, collection records, references; services, years, system, participation conferences, conversations, community-driven effort, publications, mission statement; transition, challenges, infrastructure, records collection, application, sheets, access, services, users; community, activities, NASA astrophysics data system, librarians, discovery platform, mission statement; information providers, scientist, bibliographic records, users, claims, libraries, interface]
LDA – Learn and Predict
• Fitting: using training documents, the system learns both the document-topic distribution and the topic-word distribution
• Inference: for new documents, the topic-word distribution is fixed (already learned), so only the document-topic distribution needs to be inferred
Galaxies Topic: stellar, evolution, haloes, active, clusters, redshift, ISM, dwarf, formation, spiral
Stars Topic: oscillations, pulsations, massive, formation, magnetic field, magnetars, mass-loss, neutron, oscillations
X-rays Topic: ISM, diffuse, stars, galaxies, binaries, accretion, pulsars, radiation, thermal, clusters
Word Embedding – Word2Vec
• John Rupert Firth (1957, "A synopsis of linguistic theory"):
• "You shall know a word by the company it keeps"
• "The complete meaning of a word is always contextual, and no study of meaning apart from context can be taken seriously"
• Word2Vec learns a vector-space representation for each word, based on the local word collocation patterns that are observed in a text corpus
• Maps words/sentences/paragraphs to vectors in a high-dimensional space (100-1000 dimensions)
• Semantically related words are close together in the vector space, and semantically unrelated words are far apart
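Closeness in the vector space is usually measured with cosine similarity. A minimal sketch, with hypothetical 3-dimensional vectors (real word2vec embeddings have 100-1000 dimensions):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity: 1.0 for parallel vectors, near 0 for unrelated ones."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Hypothetical embeddings: "galaxy" and "redshift" point in similar
# directions, "sunspot" does not.
vec = {
    "galaxy":   [0.9, 0.1, 0.1],
    "redshift": [0.8, 0.2, 0.1],
    "sunspot":  [0.1, 0.9, 0.2],
}
```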
[Figure: snippets of source text all containing the shared target word "galaxies", e.g.:
"small sizes for faint z~2-8 galaxies"
"constraints on the sizes of the faint galaxies"
"low surface luminosity dwarf galaxies"
"present a sample of 80 candidate strong lens galaxies"
"results of a deep image survey of virgo clusters of galaxies"
"kinematics of dispersion supported galaxies"]
Word2Vec – Distributional Semantics
Source text (considering nouns only) → co-occurrence pairs, using a context window (=2) around each target word:
(galaxies, sizes) (galaxies, faint) (galaxies, hubble) (galaxies, frontier) …
(galaxies, nature) (galaxies, luminosity) (galaxies, hubble) (galaxies, space) …
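Generating the (target, context) pairs with a window of 2 can be sketched as follows; the token list is a noun sequence like those in the snippets above:

```python
def training_pairs(tokens, window=2):
    """Emit a (target, context) pair for every word within the window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

# Nouns only, window = 2, as on the slide.
tokens = ["sizes", "faint", "galaxies", "hubble", "frontier"]
pairs = training_pairs(tokens)
```

Every such pair becomes one training example; words that keep the same company therefore end up with similar vectors.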
Word2Vec – Bag-of-Words Architecture
[Figure: network diagram. Input layer: V-dim one-hot vectors (a '1' in the position corresponding to the word, e.g. "galaxies"); hidden layer: N nodes with input weight matrix W_I (V×N); output layer: V nodes with output weight matrix W_O (V×N), giving the probability that the word at a randomly chosen, nearby position is e.g. "hubble", "quasar", "sizes", or "frontier"]
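The architecture can be sketched as a single forward pass. The vocabulary, layer sizes, and random (untrained) weights below are toy assumptions:

```python
import math
import random

random.seed(1)

# Toy vocabulary and layer sizes; real models use V in the tens of
# thousands and N of 100-1000.
vocab = ["galaxies", "hubble", "quasar", "sizes", "frontier"]
V, N = len(vocab), 3

# Weight matrices as they would be before training: random.
W_in = [[random.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]   # V x N
W_out = [[random.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(N)]  # N x V

def forward(word):
    """One-hot input -> hidden layer -> softmax over the vocabulary."""
    hidden = W_in[vocab.index(word)]  # a one-hot input just selects one row of W_in
    scores = [sum(hidden[n] * W_out[n][v] for n in range(N)) for v in range(V)]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]  # P(word at a nearby position | input word)

probs = forward("galaxies")
```

After training, the rows of W_in are exactly the word embeddings; the softmax output layer exists only to drive the learning.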
Word2Vec – Word Embedding
[Figure: 2D representation of the word2vec vocabulary]
2,048 words; 9,224 abstracts from the 2017 volumes of ApJ, A&A, and MNRAS
Word2Vec – Similarity
[Figure: for each query word (germany, october, supersolar, astronomy, database, nasa), a bar chart of its most similar words by similarity score. Recoverable neighbor lists include: november, march, september, december, january, august, june, april, july (october); insu-cnrs, france, spain, italy, tenerife, switzerland, canary, inaf, roque (germany); depletion, mg, fe, ne, refractory, oxygen, li, abundance, cemp (supersolar); cooperative, university, foundation, aura, council, research, nasa, canada, cnrs; training, online, atlas, public, resources, machine, apogee, vipers, lamost]
Word2Vec – Similarity (Astronomy Terms)
[Figure: most similar words for each query term, by similarity score:
jupiter → extrasolar, neptune, habitat, grazing, exoplanets, climate, rocks, saturn, mercury, greenhouse
sunspot → plage, umbra, cycle, penumbra, latitude, polarity, noaa, ribbon, butterfly, successive
maser → methanol, iras, baseline, vlbi, class, carma, protostars, bure, karl
solar → storm, helioseismology, weather, soho, ulysses, coronal, stereo, plage, cme, hinode
exoplanets → transit, extrasolar, jupiter, habitat, transmission, tess, jwst, neptune, climate, haze
spectrograph → echelle, gemini, hermes, tololo, campanas, sweden, africa
redshift → lbgs, ultravista, vimos, boss, cmass, redmapper, candels, bin]
Word2Vec – Analogy
september is to april as france is to spain
Word2Vec – Analogy (Astronomy Terms)
spiral is to arm as galaxies is to lenticular
cepheid is to luminosity as scuti is to doradus
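Analogies like these are typically resolved by vector arithmetic: the answer is the word nearest to b - a + c. A sketch with hypothetical 2-d vectors, constructed so that the month and country offsets line up as on the previous slide:

```python
# Hypothetical 2-d embeddings; real analogies use the trained
# high-dimensional vectors.
vec = {
    "september": [1.0, 0.0], "april": [1.0, 1.0],
    "france":    [5.0, 0.1], "spain": [5.0, 1.1],
    "germany":   [5.2, 0.2],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def analogy(a, b, c):
    """a is to b as c is to ?  ->  word nearest to (b - a + c)."""
    target = [vb - va + vc for va, vb, vc in zip(vec[a], vec[b], vec[c])]
    candidates = [w for w in vec if w not in (a, b, c)]  # exclude the inputs
    return max(candidates, key=lambda w: cosine(vec[w], target))
```

Here the september→april offset added to france points closer to spain than to germany, reproducing the slide's analogy.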
Word2Vec – Vector Distance
Source: https://www.kdnuggets.com/2017/04/cartoon-word2vec-espresso-cappuccino.html
LDA and Word2Vec K-means Cluster Comparison
• K-means clustering of LDA's document-topic probability vectors
• K-means clustering of Word2Vec vector-space representations
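Both comparisons rest on plain K-means. A minimal sketch of Lloyd's algorithm, applied to hypothetical document-topic probability vectors (the same code would cluster word2vec document vectors):

```python
import random

random.seed(0)

def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm: assign points to nearest centroid, recompute."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        centroids = [
            [sum(dim) / len(c) for dim in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical document-topic vectors: two documents dominated by topic 1,
# two dominated by topic 2.
docs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
centroids, clusters = kmeans(docs, 2)
```

Comparing the two resulting partitions (LDA-based vs word2vec-based) shows how much the two representations agree about document similarity.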