On low-dimensional embeddings of high-dimensional data


Page 1: On low-dimensional embeddings of high-dimensional data

[Figure: heat map of as.matrix(dist(rw)) and two-dimensional embeddings (axes P1, P2) of the same data by cmdscale, isoMDS, t-SNE and UMAP]

Wolfgang Huber and Susan Holmes

Page 2: On low-dimensional embeddings of high-dimensional data

include detecting coverage peaks or concentrations in chromatin immunoprecipitation–sequencing, counting the number of cDNA fragments that match each transcript or exon (RNA-seq) and calling DNA sequence variants (DNA-seq). Such summaries can be stored in an instance of the class GenomicRanges.

Coordinated analysis of multiple samples. To facilitate the analysis of experiments and studies with multiple samples, Bioconductor defines the SummarizedExperiment class. The computed summaries for the ranges are compiled into a rectangular array whose rows correspond to the ranges and whose columns correspond to the different samples (Fig. 2). For a typical experiment, there can be tens of thousands to millions of ranges and from a handful to hundreds of samples. The array elements do not need to be single numbers: the summaries can be multivariate.

The SummarizedExperiment class also stores metadata on the rows and columns. Metadata on the samples usually include experimental or observational covariates as well as technical information such as processing dates or batches, file paths, etc. Row metadata comprise the start and end coordinates of each feature and the identifier of the containing polymer, for example, the chromosome name. Further information can be inserted, such as gene or exon identifiers, references to external databases, reagents, functional classifications of the region (e.g., from efforts such as the Encyclopedia of DNA Elements (ENCODE)⁵) or genetic associations (e.g., from genome-wide association studies, the study of rare diseases, or cancer genetics). The row metadata aid integrative analysis, for example, when matching two experiments according to overlap of genomic regions of interest. Tight coupling of metadata with the data reduces opportunities for clerical errors during reordering or subsetting operations.

Annotation packages and resources. Reference genomes, annotations of genomic regions and associated gene products (transcripts or proteins), and mappings between molecule identifiers are essential for placing statistical and bioinformatic results into biological perspective. These needs are partly addressed by the Bioconductor annotation data repository, which provides 894 prebuilt standardized annotation packages for use with common model organisms as well as other organisms. Each of the packages presents its data through a standard interface using defined Bioconductor classes, including classes for whole-genome sequences (BSgenome), gene model or transcript databases (TxDb) derived from UCSC (University of California, Santa Cruz) tracks or BioMart annotations, and identifier cross-references from the US National Center for Biotechnology Information, or NCBI (org). There are also facilities for users to create their own annotation packages.

The AnnotationHub resource provides ready access to more than 10,000 genome-scale assay and annotation data sets obtained from Ensembl, ENCODE, dbSNP, UCSC and other sources and delivered in an easy-to-access format (e.g., Ranges-compatible, where appropriate). Bioconductor also supports direct access to underlying file formats such as GTF, 2bit or indexed FASTA.

Bioconductor also offers facilities for directly accessing online resources through their application programming interfaces. This can be valuable when a resource is not represented in an annotation package or when the very latest version of the data is required. The rtracklayer package accesses tables and tracks underlying the UCSC Genome Browser, and the biomaRt package supports fine-grained on-line harvesting of Ensembl, UniProt, COSMIC (Catalogue Of Somatic Mutations In Cancer) and allied resources. Many additional packages access web resources, for example, KEGGREST, PSICQUIC and Uniprot.ws.

Figure 2 | The integrative data container SummarizedExperiment. Its assays component is one or several rectangular arrays of equivalent row and column dimensions. Rows correspond to features, and columns to samples. The component rowData stores metadata about the features, including their genomic ranges. The colData component keeps track of sample-level covariate data. The exptData component carries experiment-level information, including MIAME (minimum information about a microarray experiment)-structured metadata²¹. The R expressions exemplify how to access components. For instance, provided that these metadata were recorded, rowData(se)$entrezId returns the NCBI Entrez Gene identifiers of the features, and se$tissue returns the tissue descriptions for the samples. Range-based operations, such as %in%, act on the rowData to return a logical vector that selects the features lying within the regions specified by the data object CNVs. Together with the bracket operator, such expressions can be used to subset a SummarizedExperiment to a focused set of genes and tissues for downstream analysis.

BOX 1 GETTING STARTED

Install R and Bioconductor following the directions at http://www.bioconductor.org/install. Optionally, choose an Integrated Development Environment (IDE), for example, RStudio (http://www.rstudio.com). Learn the basics of the R language, for example, with http://tryr.codeschool.com.

Explore the Bioconductor help, http://www.bioconductor.org/help—which includes material from training courses, sample workflows, vignettes and manual pages—and the online support forum (https://support.bioconductor.org).

Identify and install Bioconductor packages using hierarchically organized “BiocViews” and text search (http://www.bioconductor.org/packages/release/BiocViews.html) and by exploring ‘landing pages’ for package descriptions and links to vignettes, manual pages and usage statistics.

Get to work exploring sample data sets and adapting established workflows for your own analysis!

[Figure 2 diagram: assays(se) is a features (genes) × samples array, flanked by rowData(se) for the features, colData(se) for the samples and exptData(se) for the experiment. Example expressions shown: se <- SummarizedExperiment(assays, rowData, colData, exptData); assays(se)$count; rowData(se)$entrezId; colData(se)$tissue or se$tissue; exptData(se)$projectId; se %in% CNVs]

NATURE METHODS | VOL.12 NO.2 | FEBRUARY 2015 | 117

PERSPECTIVE

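Dimension reduction / embedding

An $n \times m$ data matrix is approximated by the product $U \Lambda V^t$, where $U$ is $n \times p$, $\Lambda$ is $p \times p$ and $V^t$ is $p \times m$.

Example with $n = 20000$, $m = 1000$, $p = 3$: the full matrix has $n \times m = 2 \times 10^7$ entries, whereas the three factors need only $(n + 1 + m) \times p = 63003$ numbers.

Applications: Principal component analysis (PCA), Non-negative matrix factorization (NMF), ...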
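The storage argument can be tried out with a small R sketch. It assumes simulated data of exact rank p and scaled-down sizes (n = 200, m = 50) so that it runs instantly; all variable names are illustrative and not taken from the slides.

```r
set.seed(1)
n <- 200; m <- 50; p <- 3          # hypothetical, scaled-down sizes

# Simulated data matrix of exact rank p (illustration only)
X <- matrix(rnorm(n * p), n, p) %*% matrix(rnorm(p * m), p, m)

s <- svd(X)                        # X = U diag(d) V^t
U <- s$u[, 1:p]                    # n x p
L <- diag(s$d[1:p])                # p x p, diagonal
V <- s$v[, 1:p]                    # m x p, so t(V) is p x m

X_hat   <- U %*% L %*% t(V)        # rank-p reconstruction
max_err <- max(abs(X - X_hat))     # essentially zero for rank-p data

full_size    <- n * m              # numbers stored in the full matrix
reduced_size <- (n + 1 + m) * p    # U, the p singular values, and V
```

For real data the reconstruction is only approximate, and how small p can be chosen depends on how quickly the singular values decay.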

Page 3: On low-dimensional embeddings of high-dimensional data

Multi-dimensional scaling (MDS)

Classical MDS is achieved by singular value decomposition of the (double centred) $D^2$. In R: cmdscale

Non-linear extensions: t-SNE, UMAP, ...
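A minimal sketch of classical MDS in base R, assuming simulated Euclidean data. It uses eigen() on the double-centred matrix of squared distances (for a symmetric matrix this plays the role of the singular value decomposition named above) and compares the result with cmdscale. The data and variable names are made up for illustration.

```r
set.seed(2)
X <- matrix(rnorm(30 * 5), 30, 5)    # 30 hypothetical points in 5 dimensions
D <- as.matrix(dist(X))              # Euclidean distance matrix

n <- nrow(D)
J <- diag(n) - matrix(1 / n, n, n)   # centering matrix
B <- -0.5 * J %*% D^2 %*% J          # double-centred squared distances

e <- eigen(B, symmetric = TRUE)
k <- 2
Y_manual <- e$vectors[, 1:k] %*% diag(sqrt(e$values[1:k]))

Y_cmd <- cmdscale(D, k = k)

# Coordinates agree only up to the sign of each axis, so compare
# the inter-point distances that the two configurations imply
diff_max <- max(abs(as.vector(dist(Y_manual)) - as.vector(dist(Y_cmd))))
```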


Question 9.1: Make a barplot of all the eigenvalues output by the cmdscale function: what do you notice?

Solution 9.1: If you execute:

    plotbar(MDSEuro, m = length(MDSEuro$eig))

you will note that, unlike in PCA, there are some negative eigenvalues. These are due to the fact that the data do not come from a Euclidean space.
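The appearance of negative eigenvalues can be reproduced on a tiny example. The sketch below does not use the book's MDSEuro object; it assumes a hypothetical four-object distance matrix (path lengths on a star graph with centre "c" and three leaves), which cannot be realised by points in any Euclidean space because the centre would have to be the midpoint of every pair of leaves.

```r
# Path-length distances on a star graph: centre "c", leaves "a", "b", "d"
nm <- c("c", "a", "b", "d")
D <- matrix(2, 4, 4, dimnames = list(nm, nm))   # leaf to leaf: 2
D["c", ] <- 1; D[, "c"] <- 1                    # centre to leaf: 1
diag(D) <- 0

n <- nrow(D)
J <- diag(n) - matrix(1 / n, n, n)              # centering matrix
B <- -0.5 * J %*% D^2 %*% J                     # double-centred squared distances

ev <- eigen(B, symmetric = TRUE)$values         # 2, 2, 0, -0.25
```

Calling barplot(ev) displays the same qualitative picture as in the solution: most eigenvalues are non-negative, but at least one falls below zero.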

Figure 9.3: MDS map of European cities based on their distances. [Scatter plot of labelled cities, from Barcelona to Warsaw; axes PCo1 and PCo2.]

To position the points on the map we have projected them on the new coordinates created from the distances (we will discuss how the algorithm works in the next section). Note that while relative positions in Figure 9.3 are correct, the orientation of the map is unconventional: e.g., Istanbul, which is in the South-East of Europe, is at the top left.

    library("tibble")   # provides tibble()
    library("ggplot2")  # provides ggplot() and the geoms
    MDSeur = tibble(
      PCo1 = MDSEuro$points[, 1],
      PCo2 = MDSEuro$points[, 2],
      labs = rownames(MDSEuro$points))
    g = ggplot(MDSeur, aes(x = PCo1, y = PCo2, label = labs)) +
      geom_point(color = "red") + xlim(-1950, 2000) + ylim(-1150, 1150) +
      coord_fixed() + geom_text(size = 4, hjust = 0.3, vjust = -0.5)
    g

We reverse the signs of the principal coordinates and redraw the map. We also read in the cities’ true longitudes and latitudes and plot these alongside for comparison (Figure 9.4).

    library("dplyr")  # provides mutate()
    g %+% mutate(MDSeur, PCo1 = -PCo1, PCo2 = -PCo2)
    Eurodf = readRDS("../data/Eurodf.rds")
    ggplot(Eurodf, aes(x = Long, y = Lat, label = rownames(Eurodf))) +
      geom_point(color = "blue") + geom_text(hjust = 0.5, vjust = -0.5)

Figure 9.4: Left: same as Figure 9.3, but with axes flipped. Right: true latitudes and longitudes. [Two scatter plots of labelled cities; left axes PCo1 and PCo2, right axes Long and Lat.]

Question 9.2: Which cities seem to have the worst representation on the PCoA map in the left panel of Figure 9.4?

Solution 9.2: It seems that the cities at the extreme West: Dublin, Madrid and Barcelona have worse projections than the central cities. This is likely because the data are more sparse in these areas and it is harder for the method to ‘triangulate’ the

Two-dimensional layout of European cities based on matrix D of their pairwise distances

Page 4: On low-dimensional embeddings of high-dimensional data

What does this have to do with RNA-seq?

ARTICLE RESEARCH

(Extended Data Fig. 6h) and performed flow cytometry analysis using the markers CD16/32, and CSF1R. Rare CD16/32+CSF1R+ cells were found in all dissected regions (Extended Data Fig. 6i), indicating that by E8.5 this population has already started to migrate out of the yolk sac.

A platform to dissect genetic mutations

Previous work has emphasized the critical role of the basic helix–loop–helix (bHLH) transcription factor TAL1 (also known as SCL) in haematopoiesis; in these experiments, Tal1−/− mouse embryos died of severe anaemia at around E9.5³¹. Dissecting the temporal and mechanistic roles of such major regulatory genes in vivo is challenging using knockout mice: breeding mice and genotyping embryos is time-consuming, and furthermore, the direct effects of a mutation are often masked by gross developmental malformations or embryo lethality. To circumvent these difficulties, we generated chimeric mouse embryos in which Tal1−/− tdTomato+ mouse embryonic stem (ES) cells were injected into wild-type blastocysts. In the resulting chimaeras, wild-type cells still produce blood cells, and this allows the specific effects of TAL1 depletion to be studied in an otherwise healthy embryo³².

To determine whether Tal1 mutant cells were associated with abnormalities in specific lineages, we sorted tdTomato− (wild type) and tdTomato+ (Tal1−/−) cells from chimeric embryos at E8.5, and then performed scRNA-seq (Fig. 4a; Extended Data Fig. 7a, b). Each cell was annotated by computationally mapping its transcriptome onto our wild-type atlas (Methods; Fig. 4b; Extended Data Fig. 7c–e). Consistent with the pivotal role of Tal1 in haematopoiesis, tdTomato+ cells did not contribute to blood lineages (Fig. 4b; Extended Data Fig. 7e–g). Notably, we confirmed that wild-type control tdTomato+ Tal1+/+ ES cells, when injected into wild-type embryos, make a similar contribution to haematopoiesis as the tdTomato− host cells (Extended Data Fig. 7h, i).

Comparisons between wild-type and Tal1−/− chimeric cells mapped to the landscape defined in Fig. 3a illustrated that TAL1 depletion

[Figure 3, panels a to g: force-directed graph layout and graph abstraction of the blood-lineage sub-clusters (Mes, EC, Haem, Ery, My, Mk, BP), collection time points E6.5 to E8.5, expression overlays for Kdr, Spi1 and Itga2b, location of endothelium (yolk sac, embryo proper, allantois) and marker-gene heat maps; see the caption that follows.]

Fig. 3 | Temporal analysis of blood emergence reveals early myeloid cells. a, Force-directed graph layout of cells associated with the blood lineage, coloured by sub-cluster (15,875 cells). The inset box shows a zoomed-in section that focuses on myeloid, megakaryocytic, and haemogenic endothelial cells. BP, blood progenitor; EC, endothelial cell; Ery, erythrocyte; Haem, haemato-endothelial progenitor; Mes, mesodermal cell; Mk, megakaryocyte; My, myeloid cell. b, Graph abstraction summarizing the relationships between the sub-clusters as in a, coloured by sub-cluster (left) and collection time point (right). Two samples of mixed-time point embryos were excluded. c, Expression levels of Kdr, overlaid on the force-directed layout from a. d, Expression levels of Spi1 and Itga2b, overlaid on the inset of the force-directed layout from a.

e, Fraction of endothelial cells that mapped to yolk sac, allantois, and embryo proper. f, Heat map illustrating row-normalized expression of genes that were significantly upregulated in cells of the EC7 (n = 197), Haem4 (n = 102), My (n = 56), BP4 (n = 54), and Mk (n = 32) sub-clusters when performing pairwise differential expression analyses between a specific sub-cluster and the rest of the cells in a. Significance was considered if log2(mean expression of specific cluster/mean expression of the rest of cells) > 2.5 and Benjamini–Hochberg-adjusted P < 0.05. g, Heat map illustrating the log-count expression (log2(normalized count + 1), ranging from 0 (blue) to 3.5 (red)) of previously described microglial (Micr.) and erythro-myeloid progenitor (EMP) markers.

28 FEBRUARY 2019 | VOL 566 | NATURE | 493

Fig. 3a from Pijuan-Sala et al. (Nature 2019)

15,875 cells from blood lineage (out of 116,312) from mouse embryos at 9 time points from 6.5-8.5 days post-fertilization

Branch points 👉 Lineages

Trajectories 👉 Differentiation

Clusters 👉 Cell types

Page 5: On low-dimensional embeddings of high-dimensional data

But the geometry of high-dimensional spaces is weird

Every pair of random points has nearly the same distance: $d(x, y)^2 = \sum_{i=1}^{n} (x_i - y_i)^2$ and central limit theorem
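A small simulation can make this concrete. The sketch below assumes points drawn uniformly from the unit cube and measures the relative spread (standard deviation divided by mean) of all pairwise distances; the function name, sample size and seed are arbitrary choices for illustration.

```r
set.seed(3)

# Relative spread (sd / mean) of the pairwise Euclidean distances between
# n_points points drawn uniformly from the unit cube [0, 1]^n
rel_spread <- function(n, n_points = 200) {
  x <- matrix(runif(n_points * n), n_points, n)
  d <- as.vector(dist(x))
  sd(d) / mean(d)
}

dims   <- c(2, 10, 100, 1000)
spread <- sapply(dims, rel_spread)   # shrinks roughly like 1 / sqrt(n)
```

Because the squared distance is a sum of n independent terms, its relative fluctuation decays as n grows, which is the central limit theorem argument referred to above.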

[Figure: $dV/dr$ plotted against $r \in [0, 1]$, one curve per dimension $d$]

Almost all volume of any shape (e.g. hypercube, sphere) is close to its surface

$V(r) = r^d$

$\frac{dV(r)}{dr} = d \, r^{d-1}$
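Using $V(r) = r^d$, the share of the volume lying within a thin shell below the surface can be computed directly. The following sketch assumes a shape of radius 1 and a shell thickness of 0.05; the function name is made up for illustration.

```r
# Fraction of the volume of a d-dimensional shape of radius 1 that lies
# within distance eps of its surface, using V(r) = r^d
shell_fraction <- function(d, eps = 0.05) 1 - (1 - eps)^d

dims <- c(2, 3, 5, 10, 100, 1000)
frac <- shell_fraction(dims)   # grows from about 0.1 towards 1
```

In two dimensions the shell holds just under 10% of the volume; in 100 dimensions it holds more than 99%.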

Page 6: On low-dimensional embeddings of high-dimensional data

MDS of 100 objects

[Figure: heat map of the distance matrix D and its two-dimensional embeddings (axes P1, P2) by cmdscale, isoMDS, t-SNE and UMAP]

Idealized model for such data: $D_{ij} = 1 - e^{-\lambda |i-j|}$
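The idealized model can be fed straight into cmdscale. The sketch below assumes that the indices are rescaled to the interval (0, 1] before applying the formula (the slides do not state the scaling) and uses λ = 2 as an example value.

```r
n <- 100
lambda <- 2
pos <- seq_len(n) / n                     # assumption: indices rescaled to (0, 1]
D <- 1 - exp(-lambda * abs(outer(pos, pos, "-")))

Y  <- cmdscale(D, k = 2)                  # classical MDS into two dimensions
p1 <- Y[, 1]
p2 <- Y[, 2]

# plot(p1, p2) shows an arch. Numerically, P1 is antisymmetric and P2 is
# symmetric about the middle of the sequence, so P2 is not an independent
# second "direction" but a function of the position along the sequence
antisym_err <- max(abs(p1 + rev(p1)))
sym_err     <- max(abs(p2 - rev(p2)))
```

The objects are ordered along a single axis, yet the two-dimensional picture bends that axis into a horseshoe.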

Page 7: On low-dimensional embeddings of high-dimensional data

MDS of a 100-dimensional dataset

[Figure: heat map of the distance matrix D and its two-dimensional embeddings (axes P1, P2) by cmdscale, isoMDS, t-SNE and UMAP]

$D_{ij} = 1 - e^{-\lambda |i-j|}$, $\lambda = 2$

Page 8: On low-dimensional embeddings of high-dimensional data

MDS of a 100-dimensional dataset

$D_{ij} = 1 - e^{-\lambda |i-j|}$, $\lambda = 6$

[Figure: heat map of the distance matrix (labelled D2) and its two-dimensional embeddings (axes P1, P2) by cmdscale, isoMDS, t-SNE and UMAP]

Page 9: On low-dimensional embeddings of high-dimensional data

MDS of a 100-dimensional dataset

$D_{ij} = 1 - e^{-\lambda |i-j|}$, $\lambda = 20$

[Figure: heat map of the distance matrix (labelled D3) and its two-dimensional embeddings (axes P1, P2) by cmdscale, isoMDS, t-SNE and UMAP]

Page 10: On low-dimensional embeddings of high-dimensional data

[Figure: two panels titled "Eigenvectors", showing EV1 to EV5 plotted against the index Var1]

What is going on?

[Figure: scatter-plot matrix (pairs plot) of the eigenvectors EV1 to EV5]

    # double.center() is a helper that is not defined on the slide
    eigD <- eigen(double.center(D))
    pairs(eigD$vectors[, 1:5])
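Since the helper used on this slide is not shown, here is one plausible definition, offered as an assumption rather than as the authors' code: it subtracts row and column means and adds back the grand mean. Whether the original helper also squares the distances and applies the factor of -1/2 is unknown, so this sketch does that explicitly before centering. The distance matrix follows the idealized model with rescaled indices (also an assumption).

```r
# Hypothetical helper: subtract row and column means, add back the grand mean
double_center <- function(M) {
  sweep(sweep(M, 1, rowMeans(M)), 2, colMeans(M)) + mean(M)
}

n <- 100
pos <- seq_len(n) / n                          # assumption: rescaled indices
D <- 1 - exp(-6 * abs(outer(pos, pos, "-")))   # idealized model, lambda = 6

B    <- double_center(-0.5 * D^2)              # matrix used by classical MDS
eigB <- eigen(B, symmetric = TRUE)

# pairs(eigB$vectors[, 1:5]) gives the kind of plot shown on this slide
row_sum_max <- max(abs(rowSums(B)))            # rows sum to zero after centering
col_sum_max <- max(abs(colSums(B)))            # columns do too
```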

Page 11: On low-dimensional embeddings of high-dimensional data

For the mathematically inclined:

[Figure: heat map of the double centered D]

$\frac{d^2}{dx^2} f(x) = \lim_{h \to 0} \frac{f(x + h) - 2 f(x) + f(x - h)}{h^2}$

$\frac{d^2}{dx^2} f(x) = -k^2 f(x) \Leftrightarrow f(x) \propto e^{ikx} = \cos kx + i \sin kx$
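The relation between the second difference and the sinusoidal eigenfunctions can be checked numerically. The sketch below assumes the real part $f(x) = \cos kx$ with k = 3 and a small step h; both values are arbitrary.

```r
k <- 3
h <- 1e-3
x <- seq(0, 2 * pi, length.out = 200)
f <- function(x) cos(k * x)

# Second difference quotient from the first formula
second_diff <- (f(x + h) - 2 * f(x) + f(x - h)) / h^2

# For this f the quotient equals -(2 * (1 - cos(k * h)) / h^2) * f(x),
# which tends to -k^2 * f(x) as h -> 0
max_dev <- max(abs(second_diff - (-k^2 * f(x))))
```

A matrix whose rows act like this second difference therefore has sampled sines and cosines as approximate eigenvectors, which is the connection to the wave-like eigenvectors of the double centered matrix.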

Persi Diaconis, Sharad Goel and Susan Holmes: Horseshoes in Multidimensional Scaling and Local Kernel Methods. The Annals of Applied Statistics, 2008

Page 12: On low-dimensional embeddings of high-dimensional data

A straight line in $\mathbb{R}^{100}$

$x = a t + b$ for $t \in [t_1, t_2]$; $a, b \in \mathbb{R}^{100}$

[Figure: heat map of the distance matrix D (labelled dm) and its two-dimensional embeddings (axes P1, P2) by cmdscale, isoMDS, t-SNE and UMAP]
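For the straight line, classical MDS recovers the one-dimensional structure exactly. The sketch below assumes a random direction a, a random offset b and 40 equally spaced parameter values; none of these choices come from the slides.

```r
set.seed(4)
p  <- 100
a  <- rnorm(p)                             # hypothetical direction
b  <- rnorm(p)                             # hypothetical offset
tt <- seq(0, 1, length.out = 40)           # hypothetical parameter values

# Each row is one point x = a * t + b in 100 dimensions
X <- outer(tt, a) + matrix(b, length(tt), p, byrow = TRUE)
D <- dist(X)

fit <- cmdscale(D, k = 1, eig = TRUE)
ev  <- fit$eig                             # all eigenvalues of the centred matrix
ratio <- abs(ev[2]) / ev[1]                # second axis carries essentially nothing

line_cor <- abs(cor(fit$points[, 1], tt))  # first axis reproduces t
```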

Page 13: On low-dimensional embeddings of high-dimensional data

A straight line in $\mathbb{R}^{100}$ - with saturation of larger distances

$x = a t + b$ for $t \in [t_1, t_2]$; $a, b \in \mathbb{R}^{100}$

[Figure: the saturation function sat(d) plotted against d, a heat map of the saturated distance matrix sat(D), and its two-dimensional embeddings (axes P1, P2) by cmdscale, isoMDS, t-SNE and UMAP]
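The effect of saturation can be sketched as follows. The exact form of sat() is not given on the slide, so the exponential form and the cap of 60 used here are assumptions chosen to resemble the plotted curve. Distances along a line in 100 dimensions equal distances along a line in one dimension, so positions on the real axis are used directly.

```r
# Hypothetical saturating transform: close to d for small d,
# levelling off near `cap` for large d
sat <- function(d, cap = 60) cap * (1 - exp(-d / cap))

tt <- seq(0, 250, length.out = 40)       # positions along the line
D  <- abs(outer(tt, tt, "-"))            # exact distances between the points

ev_line <- cmdscale(D, k = 1, eig = TRUE)$eig
ev_sat  <- cmdscale(sat(D), k = 1, eig = TRUE)$eig

ratio_line <- abs(ev_line[2]) / ev_line[1]   # about 0: one dimension suffices
ratio_sat  <- ev_sat[2] / ev_sat[1]          # clearly above 0: a second axis appears
```

With saturated distances the embedding needs extra dimensions to accommodate far-apart points that all appear roughly equally distant, which is what bends the line.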

Page 14: On low-dimensional embeddings of high-dimensional data

2D random field with spatial correlation

[Figure: heat maps of the data matrix x and of cov(x)]

Page 15: On low-dimensional embeddings of high-dimensional data

2D random field with spatial correlation

[Figure: heat map of as.matrix(dist(x)) and its two-dimensional embeddings (axes P1, P2) by cmdscale, isoMDS, t-SNE and UMAP]

Page 16: On low-dimensional embeddings of high-dimensional data

Matrix filled with random numbers with sequential correlation

[Figure: heat map of the matrix rw, and a scatter plot of its first two principal components, PC1 (17.39%) and PC2 (14.38%)]

Page 17: On low-dimensional embeddings of high-dimensional data

Matrix filled with random numbers with sequential correlation

[Figure: heat map of as.matrix(dist(rw)) and its two-dimensional embeddings (axes P1, P2) by cmdscale, isoMDS, t-SNE and UMAP]
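A sequentially correlated random matrix is easy to simulate. The sketch below assumes independent Gaussian random walks in each column, which is one possible construction and not necessarily the one behind the rw matrix on the slides; sizes and seed are arbitrary.

```r
set.seed(5)
n <- 100; p <- 150                       # hypothetical sizes: n steps, p variables

# Each column is an independent random walk, so neighbouring rows are similar
rw <- apply(matrix(rnorm(n * p), n, p), 2, cumsum)

D   <- as.matrix(dist(rw))
lag <- abs(outer(seq_len(n), seq_len(n), "-"))
cor_dist_lag <- cor(as.vector(D), as.vector(lag))   # distance grows with separation

Y <- cmdscale(D, k = 2)
cor_p1_index <- abs(cor(Y[, 1], seq_len(n)))        # first axis tracks row order

# plot(Y, type = "b") typically shows a smooth arch,
# although every increment of the walk is pure noise
```

The apparent trajectory in the embedding reflects only the sequential correlation of the rows, not any directed process in the data.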

Page 18: On low-dimensional embeddings of high-dimensional data

2D t-SNE on 'impossible' shapes

[Figure: five panels (axes x, y) showing the t-SNE embedding of a 2D grid, a 3D grid, a 2D torus, a 3D torus and a 2D sphere surface]

Page 19: On low-dimensional embeddings of high-dimensional data

2D cmdscale on 'impossible' shapes

[Figure: five panels (axes V1, V2) showing the cmdscale embedding of a 2D grid, a 3D grid, a 2D torus, a 3D torus and a 2D sphere surface]

Page 20: On low-dimensional embeddings of high-dimensional data

Take home messages

Embeddings of high-dimensional data into lower-dimensional space are useful

But they can create one-dimensional (“time-like”) patterns that have little to do with the data-generating process

Sometimes, a faithful embedding is mathematically impossible

High-dimensional geometry is weird

Be aware
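The smallest example of an impossible faithful embedding is four mutually equidistant objects, the corners of a regular tetrahedron. The sketch below, with made-up variable names, shows that classical MDS needs three equally important dimensions for them, so any two-dimensional picture must distort some distances.

```r
# Four mutually equidistant objects: all pairwise distances equal 1
D <- 1 - diag(4)

fit <- cmdscale(D, k = 2, eig = TRUE)
ev  <- fit$eig                               # 0.5, 0.5, 0.5 and 0

d_embedded <- as.vector(dist(fit$points))    # the six distances in the 2D picture
max_distortion <- max(abs(d_embedded - 1))   # cannot be zero in the plane
```

Three equal positive eigenvalues mean that no pair of axes is better than any other: dropping one of them discards a third of the structure, whichever one is dropped.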