17
1 - Lectures.GersteinLab.org 2 - Lectures.GersteinLab.org Vision for Mining Large-scale Structured Literature [Rzhetsky et al, Cell ('08), PLOS CB ('09); Bourne et al. PLOS CB '08] %"&&" ' 5' -*&'6 %#! %7& #%'&

Vision for Large-scale Structured Literature...Krauthammer et al. Molecular triangulation: bridging linkage and molecular-network information for identifying candidate genes in Alzheimer's

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Vision for Large-scale Structured Literature...Krauthammer et al. Molecular triangulation: bridging linkage and molecular-network information for identifying candidate genes in Alzheimer's

1 - L

ectu

res.

Ger

stei

nLab

.org

2

- Lec

ture

s.G

erst

einL

ab.o

rg

Vision for Mining

Large-scale Structured Literature

[Rzhetsky et al, Cell ('08), PLOS CB ('09); Bourne et al. PLOS CB '08]

Page 2: Vision for Large-scale Structured Literature...Krauthammer et al. Molecular triangulation: bridging linkage and molecular-network information for identifying candidate genes in Alzheimer's

3 - L

ectu

res.

Ger

stei

nLab

.org

Vision for Mining

Large-scale Structured Literature

Doing better science: Finding new protein relationships (e.g. protein interactions), looking for inconsist-encies in arguments, assembling consen-sus definitions automatically Krauthammer et al. Molecular triangulation: bridging linkage and molecular-network information for identifying candidate genes in Alzheimer's disease. PNAS ('04); Iossifov et al. Probabilistic inference of molecular networks from noisy data sources. Bioinformatics ('04)

[Rzhetsky et al, Cell ('08), PLOS CB ('09); Bourne et al. PLOS CB '08]

4 - L

ectu

res.

Ger

stei

nLab

.org

Vision for Mining

Large-scale Structured Literature

Making it understand-able (through mashup )

SciVee, podcasts

[Rzhetsky et al, Cell ('08), PLOS CB ('09); Bourne et al. PLOS CB '08]

Page 3: Vision for Large-scale Structured Literature...Krauthammer et al. Molecular triangulation: bridging linkage and molecular-network information for identifying candidate genes in Alzheimer's

5 - L

ectu

res.

Ger

stei

nLab

.org

Vision for Mining

Large-scale Structured Literature

Mapping Science + Studying its Dynamics & Evolution

[Rzhetsky et al, Cell ('08), PLOS CB ('09); Bourne et al. PLOS CB '08]

6 - L

ectu

res.

Ger

stei

nLab

.org

[Rzhetsky et al, Cell ('08), PLOS CB ('09); Bourne et al. PLOS CB '08]

Vision for Mining

Large-scale Structured Literature

•  Revealing patterns of collaboration •  Understanding basis of terms & nomenclature • Tracking the evolution of ideas •  Models for the evolution of science; •  Helping set policy & research directions

Page 4: Vision for Large-scale Structured Literature...Krauthammer et al. Molecular triangulation: bridging linkage and molecular-network information for identifying candidate genes in Alzheimer's

7 - L

ectu

res.

Ger

stei

nLab

.org

Analyzing the Structure of Genomic Science: Mapping the diffusion of ideas & data across disciplines

•  Intro: Using the Data Exhaust from Consortia

• PubNet Tool for analyzing co-publication patterns - Different patterns for Str.

Genomics centers • Analysis of Evolution of

the ENCODE Consortium - Differences in “modularity”

for members & users - Key role for brokers

• Other Examples of Idea & Data Diffusion - RNAi & SRA bases

• Quant. Model of Information Diffusion - Based on PLOS ALM - Random multiplicative

process - Moderated by parm.

w/ 2 time regimes

8 - L

ectu

res.

Ger

stei

nLab

.org

Analyzing the Structure of Genomic Science: Mapping the diffusion of ideas & data across disciplines

•  Intro: Using the Data Exhaust from Consortia

• PubNet Tool for analyzing co-publication patterns - Different patterns for Str.

Genomics centers • Analysis of Evolution of

the ENCODE Consortium - Differences in “modularity”

for members & users - Key role for brokers

• Other Examples of Idea & Data Diffusion - RNAi & SRA bases

• Quant. Model of Information Diffusion - Based on PLOS ALM - Random multiplicative

process - Moderated by parm.

w/ 2 time regimes

Page 5: Vision for Large-scale Structured Literature...Krauthammer et al. Molecular triangulation: bridging linkage and molecular-network information for identifying candidate genes in Alzheimer's

9 - L

ectu

res.

Ger

stei

nLab

.org

Increase in Consortium Science

10 -

Lect

ures

.Ger

stei

nLab

.org

Examples Illuminating Current State of Affairs: Using Network Representations to Make Maps of Science -- Studying

the Publication Patterns of Genomics Consortia

Page 6: Vision for Large-scale Structured Literature...Krauthammer et al. Molecular triangulation: bridging linkage and molecular-network information for identifying candidate genes in Alzheimer's

11 -

Lect

ures

.Ger

stei

nLab

.org

Different Representations of the Publication Network of a Structural Genomics Center (NESG)

12 -

Lect

ures

.Ger

stei

nLab

.org

Co-authorship Networks

comparing the 9 NIH

Structural Genomics

Centers

(45)

(19)

(7)

Average Degree

Page 7: Vision for Large-scale Structured Literature...Krauthammer et al. Molecular triangulation: bridging linkage and molecular-network information for identifying candidate genes in Alzheimer's

13 -

Lect

ures

.Ger

stei

nLab

.org

Analyzing the Structure of Genomic Science: Mapping the diffusion of ideas & data across disciplines

•  Intro: Using the Data Exhaust from Consortia

• PubNet Tool for analyzing co-publication patterns - Different patterns for Str.

Genomics centers • Analysis of Evolution of

the ENCODE Consortium - Differences in “modularity”

for members & users - Key role for brokers

• Other Examples of Idea & Data Diffusion - RNAi & SRA bases

• Quant. Model of Information Diffusion - Based on PLOS ALM - Random multiplicative

process - Moderated by parm.

w/ 2 time regimes

14 -

Lect

ures

.Ger

stei

nLab

.org

HumanGenome Project

WormGenome

ENCODEPilot

ComparativeENCODE EpigenomeRoadmap

1000 GenomesPilot

GTEx

ENCODEProduction

1000 GenomesPhase 3

modENCODE

Page 8: Vision for Large-scale Structured Literature...Krauthammer et al. Molecular triangulation: bridging linkage and molecular-network information for identifying candidate genes in Alzheimer's

15 -

Lect

ures

.Ger

stei

nLab

.org

2004

2005

2006

2007

2008

2009

2010

2011

2012

2013

2014

2015

Year

Num

ber

of p

aper

s

0

100

200

300

(0,1

]

(1,2

]

(2,3

]

(3,4

]

(4,5

]

(5,6

]

(6,7

]

(7,8

]

(8,9

]

(9,1

0]

(10,

15]

(15,

20]

(20,

50]

(50,

100]

(100

,400

]

Range of author numbers

0

20

40

60

80

100

With help of NHGRI, identified: 1,786 ENCODE members & 8,263 non-members from 558 consortium papers supported by ENCODE funding & 702 community papers that used ENCODE data but were not supported by ENCODE funding

non−ENCODE (papers used ENCODE data) ENCODE

16 -

Lect

ures

.Ger

stei

nLab

.org

co-authorship

Page 9: Vision for Large-scale Structured Literature...Krauthammer et al. Molecular triangulation: bridging linkage and molecular-network information for identifying candidate genes in Alzheimer's

17 -

Lect

ures

.Ger

stei

nLab

.org

7

--- 7

Lec

LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLtu

res

Coalesced into single module due to ENCODE consortium papers in 2007

Some separation but retention of a unified modular structure

Mod

ular

ity

Num

ber o

f clu

ster

s

Network statistics highlight change in modularity with consortium rollouts (L)

& importance of broker role (R)

18 -

Lect

ures

.Ger

stei

nLab

.org

Number of non-member neighbors

Num

ber o

f mem

ber n

eigh

bors

modENCODE

consortium membernon−membermember brokernon−member brokerconsortium networknon−consortiumnetworkrandom networkco−authorship

0.0

0.2

0.4

0.6

0.8

1.0

Mod

ular

ity

0

50

100

150

Year

Num

ber

of c

o−au

thor

ship

clu

ster

2007

2008

2009

2010

2011

2012

2013

2014

modENCODEnon−modENCODE

Similar Findings in terms of modularity & broker scientists in the modENCODE consortium as for

ENCODE

Page 10: Vision for Large-scale Structured Literature...Krauthammer et al. Molecular triangulation: bridging linkage and molecular-network information for identifying candidate genes in Alzheimer's

19 -

Lect

ures

.Ger

stei

nLab

.org

Analyzing the Structure of Genomic Science: Mapping the diffusion of ideas & data across disciplines

•  Intro: Using the Data Exhaust from Consortia

• PubNet Tool for analyzing co-publication patterns - Different patterns for Str.

Genomics centers • Analysis of Evolution of

the ENCODE Consortium - Differences in “modularity”

for members & users - Key role for brokers

• Other Examples of Idea & Data Diffusion - RNAi & SRA bases

• Quant. Model of Information Diffusion - Based on PLOS ALM - Random multiplicative

process - Moderated by parm.

w/ 2 time regimes

20 -

Lect

ures

.Ger

stei

nLab

.org

Cumulative bases deposited in the Sequence Read Archive by journal

Nu

mb

er

of

ba

ses

8

7

6

3.0

2.5

2.0

1.5

1.0

0.5

05

4

3

2

1

02009

20

12

20

13

20

14

20

15

2010 2011 2012

Date

2013 2014 2015

NatureScienceCellNucleic Acids Res.Genome BiologyNat. Biotech.ISME J.Proc Natl Acad Sci USANat Chem Biol.Molecular ecology

1e13 1e12Diffusion of a data type

(sequenced bases)

measured by occurrence in

specialty journals

Page 11: Vision for Large-scale Structured Literature...Krauthammer et al. Molecular triangulation: bridging linkage and molecular-network information for identifying candidate genes in Alzheimer's

21 -

Lect

ures

.Ger

stei

nLab

.org

RNAi: Birth of a Field in

the Literature

Culmin-ating in the 2006

Nobel

Source: Gerstein & Douglas. PLoS Comp. Bio. 3:e80 (2007) PubNet.GersteinLab.org

22 -

Lect

ures

.Ger

stei

nLab

.org

Analyzing the Structure of Genomic Science: Mapping the diffusion of ideas & data across disciplines

•  Intro: Using the Data Exhaust from Consortia

• PubNet Tool for analyzing co-publication patterns - Different patterns for Str.

Genomics centers • Analysis of Evolution of

the ENCODE Consortium - Differences in “modularity”

for members & users - Key role for brokers

• Other Examples of Idea & Data Diffusion - RNAi & SRA bases

• Quant. Model of Information Diffusion - Based on PLOS ALM - Random multiplicative

process - Moderated by parm.

w/ 2 time regimes

Page 12: Vision for Large-scale Structured Literature...Krauthammer et al. Molecular triangulation: bridging linkage and molecular-network information for identifying candidate genes in Alzheimer's

23 -

Lect

ures

.Ger

stei

nLab

.org

Spread of information as a diffusion process

24 -

Lect

ures

.Ger

stei

nLab

.org

Spread of information as a diffusion process

Page 13: Vision for Large-scale Structured Literature...Krauthammer et al. Molecular triangulation: bridging linkage and molecular-network information for identifying candidate genes in Alzheimer's

25 -

Lect

ures

.Ger

stei

nLab

.org

Modeling Information diffusion

26 -

Lect

ures

.Ger

stei

nLab

.org

Modeling Information diffusion

Modeling Information diffusion

Page 14: Vision for Large-scale Structured Literature...Krauthammer et al. Molecular triangulation: bridging linkage and molecular-network information for identifying candidate genes in Alzheimer's

27 -

Lect

ures

.Ger

stei

nLab

.org

Analyzing the Structure of Genomic Science: Mapping the diffusion of ideas & data across disciplines

•  Intro: Using the Data Exhaust from Consortia

• PubNet Tool for analyzing co-publication patterns - Different patterns for Str.

Genomics centers • Analysis of Evolution of

the ENCODE Consortium - Differences in “modularity”

for members & users - Key role for brokers

• Other Examples of Idea & Data Diffusion - RNAi & SRA bases

• Quant. Model of Information Diffusion - Based on PLOS ALM - Random multiplicative

process - Moderated by parm.

w/ 2 time regimes

28 -

Lect

ures

.Ger

stei

nLab

.org

Analyzing the Structure of Genomic Science: Mapping the diffusion of ideas & data across disciplines

•  Intro: Using the Data Exhaust from Consortia

• PubNet Tool for analyzing co-publication patterns - Different patterns for Str.

Genomics centers • Analysis of Evolution of

the ENCODE Consortium - Differences in “modularity”

for members & users - Key role for brokers

• Other Examples of Idea & Data Diffusion - RNAi & SRA bases

• Quant. Model of Information Diffusion - Based on PLOS ALM - Random multiplicative

process - Moderated by parm.

w/ 2 time regimes

Page 15: Vision for Large-scale Structured Literature...Krauthammer et al. Molecular triangulation: bridging linkage and molecular-network information for identifying candidate genes in Alzheimer's

29 -

Lect

ures

.Ger

stei

nLab

.org

“Encode authors”

D Wang, KK Yan, J Rozowsky, E Pan “Info Flow”

KK Yan

Hiring Postdocs. See gersteinlab.org/jobs

Cost of Seq. & SRA bases P Muir, S Li, S Lou, D Wang, DJ Spakowicz, L Salichos, J Zhang, GM Weinstock, F Isaacs, J Rozowsky PubNet.gersteinlab.org & RNAi -- SM Douglas Vision for “Text Mining” -- A Rzhetsky, M Seringhaus

30 -

Lect

ures

.Ger

stei

nLab

.org

Extra

Page 16: Vision for Large-scale Structured Literature...Krauthammer et al. Molecular triangulation: bridging linkage and molecular-network information for identifying candidate genes in Alzheimer's

31 -

Lect

ures

.Ger

stei

nLab

.org

Clustering structures determined

by struc. genomics consortia

according to functional similarity: Is there a functional

bias in consortia

structures?

31

(c) M

ark

Ger

stei

n, 2

002,

Yal

e, b

ioin

fo.m

bb.y

ale.

edu

Avg. Degree

Avg. Path Clust. Coeff.

Diameter

PSI 24 2.6 37% 7 PDB 6 3.9 31% 9

32 -

Lect

ures

.Ger

stei

nLab

.org

Over-representation of crystallography among the Nobel Prizes, highlighted by the 2006 Nobels

Page 17: Vision for Large-scale Structured Literature...Krauthammer et al. Molecular triangulation: bridging linkage and molecular-network information for identifying candidate genes in Alzheimer's

33 -

Lect

ures

.Ger

stei

nLab

.org

Info about content in this slide pack •  General PERMISSIONS - This Presentation is copyright Mark Gerstein,

Yale University, 2015. - Please read permissions statement at

www.gersteinlab.org/misc/permissions.html . -  Feel free to use slides & images in the talk with PROPER acknowledgement

(via citation to relevant papers or link to gersteinlab.org). -  Paper references in the talk were mostly from Papers.GersteinLab.org.

•  PHOTOS & IMAGES. For thoughts on the source and permissions of many of the photos and

clipped images in this presentation see http://streams.gerstein.info . -  In particular, many of the images have particular EXIF tags, such as kwpotppt , that can be

easily queried from flickr, viz: http://www.flickr.com/photos/mbgmbg/tags/kwpotppt