Modern Variational Inference
David M. Blei
Departments of Computer Science and Statistics
Data Science Institute
Columbia University
Figure 5: Topics found in a corpus of 1.8 million articles from the New York Times. Modified from Hoffman et al. (2013).
Topic 1: Game, Second, Season, Team, Play, Games, Players, Points, Coach, Giants
Topic 2: Street, School, House, Life, Children, Family, Says, Night, Man, Know
Topic 3: Life, Says, Show, Man, Director, Television, Film, Story, Movie, Films
Topic 4: House, Life, Children, Man, War, Book, Story, Books, Author, Novel
Topic 5: Street, House, Night, Place, Park, Room, Hotel, Restaurant, Garden, Wine
Topic 6: House, Bush, Political, Party, Clinton, Campaign, Republican, Democratic, Senator, Democrats
Topic 7: Percent, Street, House, Building, Real, Space, Development, Square, Housing, Buildings
Topic 8: Game, Second, Team, Play, Won, Open, Race, Win, Round, Cup
Topic 9: Game, Season, Team, Run, League, Games, Hit, Baseball, Yankees, Mets
Topic 10: Government, Officials, War, Military, Iraq, Army, Forces, Troops, Iraqi, Soldiers
Topic 11: School, Life, Children, Family, Says, Women, Help, Mother, Parents, Child
Topic 12: Percent, Business, Market, Companies, Stock, Bank, Financial, Fund, Investors, Funds
Topic 13: Government, Life, War, Women, Political, Black, Church, Jewish, Catholic, Pope
Topic 14: Street, Show, Art, Museum, Works, Artists, Artist, Gallery, Exhibition, Paintings
Topic 15: Street, Yesterday, Police, Man, Case, Found, Officer, Shot, Officers, Charged
a particular movie), our prediction of the rating depends on a linear combination of the user’s embedding and the movie’s embedding. We can also use these inferred representations to find groups of users that have similar tastes and groups of movies that are enjoyed by the same kinds of users.
Figure 4c illustrates the graphical model. This model is closely related to a linear factor model, except that each cell’s distribution is determined by hidden variables that depend on the cell’s row and column. The overlapping plates show how the observations at the nth row share its embedding $w_n$ but use different variables $\gamma_m$ for each column. Similarly, the observations in the mth column share its embedding $\gamma_m$ but use different variables $w_n$ for each row. Casting matrix factorization
Blei, Annual Review of Statistics and Its Application 2014. 1:203-232, p. 214
Topics found in 1.8M articles from the New York Times
[Hoffman, Blei, Wang, Paisley, JMLR 2013]
[Figure: per-individual population proportions (vertical axis: prob) over 7 inferred populations, grouped by sampled population: Adygei, Balochi, BantuKenya, BantuSouthAfrica, Basque, Bedouin, BiakaPygmy, Brahui, Burusho, Cambodian, Colombian, Dai, Daur, Druze, French, Han, Han-NChina, Hazara, Hezhen, Italian, Japanese, Kalash, Karitiana, Lahu, Makrani, Mandenka, Maya, MbutiPygmy, Melanesian, Miao, Mongola, Mozabite, Naxi, Orcadian, Oroqen, Palestinian, Papuan, Pathan, Pima, Russian, San, Sardinian, She, Sindhi, Surui, Tu, Tujia, Tuscan, Uygur, Xibo, Yakut, Yi, Yoruba.]
Population analysis of 2 billion genetic measurements
[Gopalan, Hao, Blei, Storey, submitted]
The probabilistic pipeline
Make assumptions Discover patterns
DATA
Predict & Explore
KNOWLEDGE
[Figure panels: inferred population proportions for the TGP populations LWK, YRI, ACB, ASW, CDX, CHB, CHS, JPT, KHV, CEU, FIN, GBR, IBS, TSI, MXL, PUR, CLM, PEL, GIH at K = 7, K = 8, and K = 9.]
Figure S2: Population structure inferred from the TGP data set using the TeraStructure algorithm at three values for the number of populations K. The visualization of the $\theta$'s in the figure shows patterns consistent with the major geographical regions. Some of the clusters identify a specific region (e.g., red for Africa) while others represent admixture between regions (e.g., green for Europeans and Central/South Americans). The presence of clusters that are shared between different regions demonstrates the more continuous nature of the structure. The new cluster from K = 7 to K = 8 matches structure differentiating between American groups. For K = 9, the new cluster is unpopulated.
• Customized data analysis is important to many fields.
• Pipeline separates assumptions, computation, application
• Eases collaborative solutions to statistics problems
• Inference is the key algorithmic problem.
• Answers the question: What does this model say about this data?
• Our goal: General and scalable approaches to inference
• Variational methods: inference as optimization [Jordan et al., 1999]
• Scale up with stochastic variational inference (SVI)
• Generalize with black box variational inference (BBVI)
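As a reminder of what "inference as optimization" usually means, the standard variational objective (the evidence lower bound) can be written as below. The variational family $q(\beta, z; \lambda)$ and its parameters $\lambda$ are notation assumed here for illustration; $\beta$, $z$, and $x$ follow the $p(\beta, z \mid x)$ notation that appears on the black box variational inference slide.

```latex
\begin{align*}
\mathcal{L}(\lambda)
  &= \mathbb{E}_{q(\beta, z; \lambda)}\big[\log p(x, \beta, z) - \log q(\beta, z; \lambda)\big], \\
\log p(x)
  &= \mathcal{L}(\lambda)
   + \mathrm{KL}\big(q(\beta, z; \lambda)\,\|\,p(\beta, z \mid x)\big)
  \;\ge\; \mathcal{L}(\lambda).
\end{align*}
```

Because $\log p(x)$ does not depend on $\lambda$, maximizing $\mathcal{L}(\lambda)$ is equivalent to minimizing the KL divergence from $q$ to the exact posterior.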
GLOBAL HIDDEN STRUCTURE
Subsample data
Infer local structure
Update global structure
MASSIVE DATA
“Stochastic variational inference” [Hoffman et al., 2013, JMLR]
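The subsample / local / global loop above can be sketched on a toy problem. This is a minimal illustration, not the authors' code: it assumes a mixture of K unit-variance Gaussians with equal, known mixing weights and a Normal(0, prior_var) prior on each mean, and the function name, step-size schedule parameters (tau, kappa), and batch size are all hypothetical choices.

```python
import numpy as np

def svi_gmm_means(x, K=2, prior_var=100.0, batch_size=50, n_iters=500,
                  tau=1.0, kappa=0.7, seed=0):
    """Toy SVI for the means of a K-component, unit-variance Gaussian mixture.

    Assumed model (illustrative only): mu_k ~ N(0, prior_var),
    z_i ~ Uniform{1..K}, x_i ~ N(mu_{z_i}, 1).
    Global variational factors: q(mu_k) = N(m_k, s2_k).
    """
    rng = np.random.default_rng(seed)
    N = x.shape[0]
    # Natural parameters of q(mu_k): eta1 = m / s2, eta2 = 1 / s2.
    eta2 = np.ones(K)
    eta1 = rng.normal(size=K) * eta2
    for t in range(1, n_iters + 1):
        m, s2 = eta1 / eta2, 1.0 / eta2
        # 1. Subsample data.
        xb = x[rng.integers(0, N, size=batch_size)]
        # 2. Infer local structure: responsibilities phi under the current q(mu).
        log_phi = xb[:, None] * m[None, :] - 0.5 * (m**2 + s2)[None, :]
        log_phi -= log_phi.max(axis=1, keepdims=True)  # numerical stability
        phi = np.exp(log_phi)
        phi /= phi.sum(axis=1, keepdims=True)
        # 3. Update global structure: form the parameters we would get if the
        #    minibatch were replicated to the full data size, then take a
        #    weighted average with the current parameters.
        scale = N / batch_size
        eta1_hat = scale * (phi * xb[:, None]).sum(axis=0)   # prior mean is 0
        eta2_hat = 1.0 / prior_var + scale * phi.sum(axis=0)
        rho = (t + tau) ** (-kappa)                          # decreasing step size
        eta1 = (1.0 - rho) * eta1 + rho * eta1_hat
        eta2 = (1.0 - rho) * eta2 + rho * eta2_hat
    return eta1 / eta2, 1.0 / eta2  # posterior means and variances of mu_k

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    data = np.concatenate([rng.normal(-3, 1, 1000), rng.normal(3, 1, 1000)])
    means, variances = svi_gmm_means(data)
    print(np.sort(means))
```

Each iteration touches only `batch_size` points, which is the property that lets this style of algorithm scale to data sets that cannot be swept repeatedly.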
ANY MODEL
REUSABLE VARIATIONAL FAMILIES
MASSIVE DATA
BLACK BOX VARIATIONAL INFERENCE
$p(\beta, z \mid x)$
“Black box variational inference” [Ranganath et al., 2014, AISTATS]
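The core idea of black box variational inference, a Monte Carlo gradient of the variational objective that needs only evaluations of the log joint, can be sketched as follows. This is a minimal illustration under stated assumptions: the toy model (z ~ N(0, 1), x_i ~ N(z, 1)), the Gaussian variational family, the mean-subtraction baseline, and the function names are all chosen here for illustration and are not taken from the slides.

```python
import numpy as np

def log_joint(z, x):
    """log p(x, z) up to a constant for the assumed toy model
    z ~ N(0, 1), x_i ~ N(z, 1). z has shape (S,), x has shape (n,)."""
    return -0.5 * z**2 - 0.5 * ((x[None, :] - z[:, None]) ** 2).sum(axis=1)

def bbvi_gradient(mu, log_sigma, x, n_samples, rng):
    """Score-function estimate of the ELBO gradient with respect to
    (mu, log_sigma) for q(z) = N(mu, exp(log_sigma)^2)."""
    sigma = np.exp(log_sigma)
    eps = rng.standard_normal(n_samples)
    z = mu + sigma * eps                      # samples from q
    log_q = -log_sigma - 0.5 * eps**2         # log q(z) up to a constant
    f = log_joint(z, x) - log_q               # instantaneous ELBO terms
    f = f - f.mean()                          # simple baseline to reduce variance
    score_mu = eps / sigma                    # d log q / d mu
    score_log_sigma = eps**2 - 1.0            # d log q / d log_sigma
    return np.mean(score_mu * f), np.mean(score_log_sigma * f)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.array([1.0, 2.0, 0.5])
    g_mu, g_ls = bbvi_gradient(0.5, -0.5, x, 200_000, rng)
    # For this conjugate toy model the exact gradients are available:
    n = len(x)
    print(g_mu, x.sum() - (n + 1) * 0.5)
    print(g_ls, 1.0 - (n + 1) * np.exp(2 * -0.5))
```

Nothing in `bbvi_gradient` depends on the form of the model beyond calling `log_joint`, which is what makes the estimator "black box"; in practice it would be fed to a stochastic optimizer, and lower-variance estimators are usually needed.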
Edward: A library for probabilistic modeling, inference, and criticism
github.com/blei-lab/edward
(led by Dustin Tran)
Example: The admixture model of Pritchard et al. (2000)
Copyright 2000 by the Genetics Society of America
Inference of Population Structure Using Multilocus Genotype Data
Jonathan K. Pritchard, Matthew Stephens and Peter Donnelly
Department of Statistics, University of Oxford, Oxford OX1 3TG, United Kingdom
Manuscript received September 23, 1999
Accepted for publication February 18, 2000
ABSTRACT
We describe a model-based clustering method for using multilocus genotype data to infer population structure and assign individuals to populations. We assume a model in which there are K populations (where K may be unknown), each of which is characterized by a set of allele frequencies at each locus. Individuals in the sample are assigned (probabilistically) to populations, or jointly to two or more populations if their genotypes indicate that they are admixed. Our model does not assume a particular mutation process, and it can be applied to most of the commonly used genetic markers, provided that they are not closely linked. Applications of our method include demonstrating the presence of population structure, assigning individuals to populations, studying hybrid zones, and identifying migrants and admixed individuals. We show that the method can produce highly accurate assignments using modest numbers of loci, e.g., seven microsatellite loci in an example using genotype data from an endangered bird species. The software used for this article is available from http://www.stats.ox.ac.uk/~pritch/home.html.
IN applications of population genetics, it is often useful to classify individuals in a sample into populations. In one scenario, the investigator begins with a sample of individuals and wants to say something about the properties of populations. For example, in studies of human evolution, the population is often considered to be the unit of interest, and a great deal of work has focused on learning about the evolutionary relationships of modern populations (e.g., Cavalli et al. 1994). In a second scenario, the investigator begins with a set of predefined populations and wishes to classify individuals of unknown origin. This type of problem arises in many contexts (reviewed by Davies et al. 1999). A standard approach involves sampling DNA from members of a number of potential source populations and using these samples to estimate allele frequencies in each population at a series of unlinked loci. Using the estimated allele frequencies, it is then possible to compute the likelihood that a given genotype originated in each population. Individuals of unknown origin can be assigned to populations according to these likelihoods (Paetkau et al. 1995; Rannala and Mountain 1997).
In both situations described above, a crucial first step is to define a set of populations. The definition of populations is typically subjective, based, for example, on linguistic, cultural, or physical characters, as well as the geographic location of sampled individuals. This subjective approach is usually a sensible way of incorporating diverse types of information. However, it may be difficult to know whether a given assignment of individuals to populations based on these subjective criteria represents a natural assignment in genetic terms, and it would be useful to be able to confirm that subjective classifications are consistent with genetic information and hence appropriate for studying the questions of interest. Further, there are situations where one is interested in “cryptic” population structure, i.e., population structure that is difficult to detect using visible characters, but may be significant in genetic terms. For example, when association mapping is used to find disease genes, the presence of undetected population structure can lead to spurious associations and thus invalidate standard tests (Ewens and Spielman 1995). The problem of cryptic population structure also arises in the context of DNA fingerprinting for forensics, where it is important to assess the degree of population structure to estimate the probability of false matches (Balding and Nichols 1994, 1995; Foreman et al. 1997; Roeder et al. 1998).
Pritchard and Rosenberg (1999) considered how genetic information might be used to detect the presence of cryptic population structure in the association mapping context. More generally, one would like to be able to identify the actual subpopulations and assign individuals (probabilistically) to these populations. In this article we use a Bayesian clustering approach to tackle this problem. We assume a model in which there are K populations (where K may be unknown), each of which is characterized by a set of allele frequencies at each locus. Our method attempts to assign individuals to populations on the basis of their genotypes, while simultaneously estimating population allele frequencies. The method can be applied to various types of markers [e.g., microsatellites, restriction fragment length polymorphisms (RFLPs), or single nucleotide polymorphisms (SNPs)], but it assumes that the marker
Corresponding author: Jonathan Pritchard, Department of Statistics, University of Oxford, 1 S. Parks Rd., Oxford OX1 3TG, United Kingdom. E-mail: [email protected]
Genetics 155: 945–959 (June 2000)
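The model described in the abstract above (K populations, each with its own allele frequencies at every locus, and individuals that may draw from several populations) can be sketched generatively for biallelic SNPs. This is a minimal sketch under assumptions that are not in the slides: the Dirichlet and Beta priors, their hyperparameters, the 0/1/2 genotype coding, and the function name are illustrative choices.

```python
import numpy as np

def simulate_admixture(n_individuals, n_loci, K, alpha=0.5, seed=0):
    """Simulate genotypes from a simple admixture model.

    Assumed for illustration:
      theta_i ~ Dirichlet(alpha)        per-individual population proportions
      beta_kl ~ Beta(1, 1)              allele frequency of population k at locus l
      x_il    ~ Binomial(2, sum_k theta_ik * beta_kl)   genotype in {0, 1, 2}
    """
    rng = np.random.default_rng(seed)
    theta = rng.dirichlet(alpha * np.ones(K), size=n_individuals)  # (N, K)
    beta = rng.beta(1.0, 1.0, size=(K, n_loci))                    # (K, L)
    # Each individual's allele probability at a locus is a mixture of the
    # population frequencies, weighted by that individual's proportions.
    p = np.clip(theta @ beta, 0.0, 1.0)                            # (N, L)
    x = rng.binomial(2, p)                                         # (N, L)
    return x, theta, beta

if __name__ == "__main__":
    x, theta, beta = simulate_admixture(n_individuals=5, n_loci=8, K=3)
    print(x)
    print(theta.round(2))
```

Inference runs this story in reverse: given only the genotype matrix `x`, estimate `theta` (the quantity plotted in the population-structure figures) and `beta`.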
TeraStructure: Fast inference for the PSD model.
[Figure panels: Oracle, TeraStructure, ADMIXTURE, and fastSTRUCTURE at 10,000 individuals; Oracle and TeraStructure at 100,000 individuals; Oracle and TeraStructure at 1,000,000 individuals.]
Figure 2: TeraStructure recovers the underlying per-individual population proportions on the simulated data sets generated via Scenario A. Each panel shows a visualization of the simulation parameters $\theta_i^*$ and the inferred $\mathbb{E}[\theta_i \mid \hat{\theta}_i]$ for all individuals in a data set. The current state-of-the-art algorithms cannot complete their analyses of 100,000 and 1,000,000 individuals. TeraStructure is able to analyze data of this size and gives highly accurate estimates.
TeraStructure is fast.
Data set | N | L | S | TeraStructure time (hours) | ADMIXTURE time (hours) | fastSTRUCTURE time (hours)
HGDP | 940 | 644,258 | 0.9 | < 1 | < 1 | 12
TGP | 1718 | 1,854,622 | 0.5 | 3 | 3 | 21
Scenario A (10K) | 10,000 | 1,000,000 | 1.0 | 9 | 28 | 216
Scenario A (100K) | 100,000 | 1,000,000 | 0.7 | 158 | – | –
Scenario A (1M) | 1,000,000 | 1,000,000 | 0.5 | 509 | – | –
Scenario B | 10,000 | 100,000 | 1.0 | 6.9 | 31 | 140
Scenario B | 10,000 | 1,000,000 | 0.5 | 9.3 | – | –
Table 1: The running time of all algorithms on both real and simulated data. TeraStructure is the only algorithm that can scale beyond N = 10,000 individuals to the simulated data sets with N = 100,000 individuals and N = 1,000,000 individuals. S is the fraction of SNP locations subsampled, with repetition, during training; L is the number of SNP locations. The TeraStructure and ADMIXTURE algorithms were run with ten parallel threads, while fastSTRUCTURE, which does not have a threading option, was run with a single thread. Even under the best-case assumption of ten times speedup due to parallel computation, the TeraStructure algorithm is twice as fast as both ADMIXTURE and fastSTRUCTURE algorithms on the data set with N = 10,000 individuals. On the real data sets, TeraStructure is as fast as the other algorithms. In contrast to other methods that iterated multiple times over the entire data set, TeraStructure iterated over the SNP locations at most once on all data sets.
TeraStructure is accurate.
Median per-individual KL divergence:
Data set | Replication | N | L | TeraStructure | ADMIXTURE | fastSTRUCTURE
Scenario A (10K) | 1 | 10,000 | 1,000,000 | 0.016 | 0.020 | 6.68
Scenario A (10K) | 2 | 10,000 | 1,000,000 | 0.009 | 0.019 | 5.15
Scenario A (10K) | 3 | 10,000 | 1,000,000 | 0.020 | 0.022 | 4.49
Scenario A (100K) | 1 | 100,000 | 1,000,000 | 0.006 | – | –
Scenario A (100K) | 2 | 100,000 | 1,000,000 | 0.013 | – | –
Scenario A (100K) | 3 | 100,000 | 1,000,000 | 0.009 | – | –
Scenario A (1M) | 1 | 1,000,000 | 1,000,000 | 0.015 | – | –
Scenario B | 1 | 10,000 | 100,000 | 0.21 | 0.42 | 7.21
Scenario B | 2 | 10,000 | 100,000 | 0.27 | 0.42 | 7.97
Scenario B | 3 | 10,000 | 100,000 | 0.26 | 0.42 | 7.68
Scenario B | 1 | 10,000 | 1,000,000 | 0.16 | – | –
Scenario B | 2 | 10,000 | 1,000,000 | 0.23 | – | –
Scenario B | 3 | 10,000 | 1,000,000 | 0.25 | – | –
Table S2: The accuracy of the algorithms on simulated data generated via Scenario A. TeraStructure is the only algorithm that was able to complete its analysis on the simulated data sets with N = 100,000 individuals and N = 1,000,000 individuals. On these massive data sets, TeraStructure found a highly accurate fit to the data (see also Figure 2). On smaller simulated data, TeraStructure finds a fit to the data that is closer to the simulation model than either of the other methods. The number of ancestral populations is set to the number of ancestral populations used in the simulation: K=6.
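The accuracy metric in the table above, a median per-individual KL divergence, could be computed along the following lines. This is a sketch under assumptions: the slides do not state the direction of the divergence or how zeros are handled, so the choice of KL(true || inferred), the clipping constant, and the function name are hypothetical.

```python
import numpy as np

def median_per_individual_kl(theta_true, theta_hat, eps=1e-12):
    """Median over individuals of KL(theta_true_i || theta_hat_i).

    theta_true, theta_hat: arrays of shape (N, K) whose rows are
    population proportions. The KL direction and the eps clipping are
    assumptions made for this illustration.
    """
    # Clip to avoid log(0), then renormalize so each row is a distribution.
    p = np.clip(np.asarray(theta_true, dtype=float), eps, None)
    q = np.clip(np.asarray(theta_hat, dtype=float), eps, None)
    p /= p.sum(axis=1, keepdims=True)
    q /= q.sum(axis=1, keepdims=True)
    kl = np.sum(p * (np.log(p) - np.log(q)), axis=1)  # one value per individual
    return float(np.median(kl))

if __name__ == "__main__":
    truth = np.array([[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]])
    estimate = np.array([[0.6, 0.4], [0.85, 0.15], [0.2, 0.8]])
    print(median_per_individual_kl(truth, estimate))
```

Using the median rather than the mean keeps a few badly fit individuals from dominating the summary.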
Figures and Tables
[Figure: schematic of TeraStructure. Input: massive genotype data (individuals × SNPs, with entries such as 0 1 2 1 0 1 1 0). Steps shown: randomly initialize population proportions; subsample a SNP; infer ancestral population frequencies for the SNP; update population proportions; check for convergence. Output: fitted population proportions (inset: Oracle vs. TeraStructure, 1M individuals).]
Figure 1: A schematic diagram of TeraStructure, stochastic variational inference for the Pritchard-Stephens-Donnelly (PSD) model. The algorithm maintains an estimate of the latent population proportions for each individual. At each iteration it samples SNP measurements from the large database, infers the per-population frequencies for that SNP, and updates its idea of the population proportions. This is much more efficient than algorithms that must iterate across all SNPs at each iteration.