Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
1 - L
ectu
res.
Ger
stei
nLab
.org
2
- Lec
ture
s.G
erst
einL
ab.o
rg
Vision for Mining
Large-scale Structured Literature
[Rzhetsky et al, Cell ('08), PLOS CB ('09); Bourne et al. PLOS CB '08]
3 - L
ectu
res.
Ger
stei
nLab
.org
Vision for Mining
Large-scale Structured Literature
Doing better science: Finding new protein relationships (e.g. protein interactions), looking for inconsist-encies in arguments, assembling consen-sus definitions automatically Krauthammer et al. Molecular triangulation: bridging linkage and molecular-network information for identifying candidate genes in Alzheimer's disease. PNAS ('04); Iossifov et al. Probabilistic inference of molecular networks from noisy data sources. Bioinformatics ('04)
[Rzhetsky et al, Cell ('08), PLOS CB ('09); Bourne et al. PLOS CB '08]
4 - L
ectu
res.
Ger
stei
nLab
.org
Vision for Mining
Large-scale Structured Literature
Making it understand-able (through mashup )
SciVee, podcasts
[Rzhetsky et al, Cell ('08), PLOS CB ('09); Bourne et al. PLOS CB '08]
5 - L
ectu
res.
Ger
stei
nLab
.org
Vision for Mining
Large-scale Structured Literature
Mapping Science + Studying its Dynamics & Evolution
[Rzhetsky et al, Cell ('08), PLOS CB ('09); Bourne et al. PLOS CB '08]
6 - L
ectu
res.
Ger
stei
nLab
.org
[Rzhetsky et al, Cell ('08), PLOS CB ('09); Bourne et al. PLOS CB '08]
Vision for Mining
Large-scale Structured Literature
• Revealing patterns of collaboration • Understanding basis of terms & nomenclature • Tracking the evolution of ideas • Models for the evolution of science; • Helping set policy & research directions
7 - L
ectu
res.
Ger
stei
nLab
.org
Analyzing the Structure of Genomic Science: Mapping the diffusion of ideas & data across disciplines
• Intro: Using the Data Exhaust from Consortia
• PubNet Tool for analyzing co-publication patterns - Different patterns for Str.
Genomics centers • Analysis of Evolution of
the ENCODE Consortium - Differences in “modularity”
for members & users - Key role for brokers
• Other Examples of Idea & Data Diffusion - RNAi & SRA bases
• Quant. Model of Information Diffusion - Based on PLOS ALM - Random multiplicative
process - Moderated by parm.
w/ 2 time regimes
8 - L
ectu
res.
Ger
stei
nLab
.org
Analyzing the Structure of Genomic Science: Mapping the diffusion of ideas & data across disciplines
• Intro: Using the Data Exhaust from Consortia
• PubNet Tool for analyzing co-publication patterns - Different patterns for Str.
Genomics centers • Analysis of Evolution of
the ENCODE Consortium - Differences in “modularity”
for members & users - Key role for brokers
• Other Examples of Idea & Data Diffusion - RNAi & SRA bases
• Quant. Model of Information Diffusion - Based on PLOS ALM - Random multiplicative
process - Moderated by parm.
w/ 2 time regimes
9 - L
ectu
res.
Ger
stei
nLab
.org
Increase in Consortium Science
10 -
Lect
ures
.Ger
stei
nLab
.org
Examples Illuminating Current State of Affairs: Using Network Representations to Make Maps of Science -- Studying
the Publication Patterns of Genomics Consortia
11 -
Lect
ures
.Ger
stei
nLab
.org
Different Representations of the Publication Network of a Structural Genomics Center (NESG)
12 -
Lect
ures
.Ger
stei
nLab
.org
Co-authorship Networks
comparing the 9 NIH
Structural Genomics
Centers
(45)
(19)
(7)
Average Degree
13 -
Lect
ures
.Ger
stei
nLab
.org
Analyzing the Structure of Genomic Science: Mapping the diffusion of ideas & data across disciplines
• Intro: Using the Data Exhaust from Consortia
• PubNet Tool for analyzing co-publication patterns - Different patterns for Str.
Genomics centers • Analysis of Evolution of
the ENCODE Consortium - Differences in “modularity”
for members & users - Key role for brokers
• Other Examples of Idea & Data Diffusion - RNAi & SRA bases
• Quant. Model of Information Diffusion - Based on PLOS ALM - Random multiplicative
process - Moderated by parm.
w/ 2 time regimes
14 -
Lect
ures
.Ger
stei
nLab
.org
HumanGenome Project
WormGenome
ENCODEPilot
ComparativeENCODE EpigenomeRoadmap
1000 GenomesPilot
GTEx
ENCODEProduction
1000 GenomesPhase 3
modENCODE
15 -
Lect
ures
.Ger
stei
nLab
.org
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
Year
Num
ber
of p
aper
s
0
100
200
300
(0,1
]
(1,2
]
(2,3
]
(3,4
]
(4,5
]
(5,6
]
(6,7
]
(7,8
]
(8,9
]
(9,1
0]
(10,
15]
(15,
20]
(20,
50]
(50,
100]
(100
,400
]
Range of author numbers
0
20
40
60
80
100
With help of NHGRI, identified: 1,786 ENCODE members & 8,263 non-members from 558 consortium papers supported by ENCODE funding & 702 community papers that used ENCODE data but were not supported by ENCODE funding
non−ENCODE (papers used ENCODE data) ENCODE
16 -
Lect
ures
.Ger
stei
nLab
.org
co-authorship
17 -
Lect
ures
.Ger
stei
nLab
.org
7
--- 7
Lec
LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLtu
res
Coalesced into single module due to ENCODE consortium papers in 2007
Some separation but retention of a unified modular structure
Mod
ular
ity
Num
ber o
f clu
ster
s
Network statistics highlight change in modularity with consortium rollouts (L)
& importance of broker role (R)
18 -
Lect
ures
.Ger
stei
nLab
.org
Number of non-member neighbors
Num
ber o
f mem
ber n
eigh
bors
modENCODE
consortium membernon−membermember brokernon−member brokerconsortium networknon−consortiumnetworkrandom networkco−authorship
0.0
0.2
0.4
0.6
0.8
1.0
Mod
ular
ity
0
50
100
150
Year
Num
ber
of c
o−au
thor
ship
clu
ster
2007
2008
2009
2010
2011
2012
2013
2014
modENCODEnon−modENCODE
Similar Findings in terms of modularity & broker scientists in the modENCODE consortium as for
ENCODE
19 -
Lect
ures
.Ger
stei
nLab
.org
Analyzing the Structure of Genomic Science: Mapping the diffusion of ideas & data across disciplines
• Intro: Using the Data Exhaust from Consortia
• PubNet Tool for analyzing co-publication patterns - Different patterns for Str.
Genomics centers • Analysis of Evolution of
the ENCODE Consortium - Differences in “modularity”
for members & users - Key role for brokers
• Other Examples of Idea & Data Diffusion - RNAi & SRA bases
• Quant. Model of Information Diffusion - Based on PLOS ALM - Random multiplicative
process - Moderated by parm.
w/ 2 time regimes
20 -
Lect
ures
.Ger
stei
nLab
.org
Cumulative bases deposited in the Sequence Read Archive by journal
Nu
mb
er
of
ba
ses
8
7
6
3.0
2.5
2.0
1.5
1.0
0.5
05
4
3
2
1
02009
20
12
20
13
20
14
20
15
2010 2011 2012
Date
2013 2014 2015
NatureScienceCellNucleic Acids Res.Genome BiologyNat. Biotech.ISME J.Proc Natl Acad Sci USANat Chem Biol.Molecular ecology
1e13 1e12Diffusion of a data type
(sequenced bases)
measured by occurrence in
specialty journals
21 -
Lect
ures
.Ger
stei
nLab
.org
RNAi: Birth of a Field in
the Literature
Culmin-ating in the 2006
Nobel
Source: Gerstein & Douglas. PLoS Comp. Bio. 3:e80 (2007) PubNet.GersteinLab.org
22 -
Lect
ures
.Ger
stei
nLab
.org
Analyzing the Structure of Genomic Science: Mapping the diffusion of ideas & data across disciplines
• Intro: Using the Data Exhaust from Consortia
• PubNet Tool for analyzing co-publication patterns - Different patterns for Str.
Genomics centers • Analysis of Evolution of
the ENCODE Consortium - Differences in “modularity”
for members & users - Key role for brokers
• Other Examples of Idea & Data Diffusion - RNAi & SRA bases
• Quant. Model of Information Diffusion - Based on PLOS ALM - Random multiplicative
process - Moderated by parm.
w/ 2 time regimes
23 -
Lect
ures
.Ger
stei
nLab
.org
Spread of information as a diffusion process
24 -
Lect
ures
.Ger
stei
nLab
.org
Spread of information as a diffusion process
25 -
Lect
ures
.Ger
stei
nLab
.org
Modeling Information diffusion
26 -
Lect
ures
.Ger
stei
nLab
.org
Modeling Information diffusion
Modeling Information diffusion
27 -
Lect
ures
.Ger
stei
nLab
.org
Analyzing the Structure of Genomic Science: Mapping the diffusion of ideas & data across disciplines
• Intro: Using the Data Exhaust from Consortia
• PubNet Tool for analyzing co-publication patterns - Different patterns for Str.
Genomics centers • Analysis of Evolution of
the ENCODE Consortium - Differences in “modularity”
for members & users - Key role for brokers
• Other Examples of Idea & Data Diffusion - RNAi & SRA bases
• Quant. Model of Information Diffusion - Based on PLOS ALM - Random multiplicative
process - Moderated by parm.
w/ 2 time regimes
28 -
Lect
ures
.Ger
stei
nLab
.org
Analyzing the Structure of Genomic Science: Mapping the diffusion of ideas & data across disciplines
• Intro: Using the Data Exhaust from Consortia
• PubNet Tool for analyzing co-publication patterns - Different patterns for Str.
Genomics centers • Analysis of Evolution of
the ENCODE Consortium - Differences in “modularity”
for members & users - Key role for brokers
• Other Examples of Idea & Data Diffusion - RNAi & SRA bases
• Quant. Model of Information Diffusion - Based on PLOS ALM - Random multiplicative
process - Moderated by parm.
w/ 2 time regimes
29 -
Lect
ures
.Ger
stei
nLab
.org
“Encode authors”
D Wang, KK Yan, J Rozowsky, E Pan “Info Flow”
KK Yan
Hiring Postdocs. See gersteinlab.org/jobs
Cost of Seq. & SRA bases P Muir, S Li, S Lou, D Wang, DJ Spakowicz, L Salichos, J Zhang, GM Weinstock, F Isaacs, J Rozowsky PubNet.gersteinlab.org & RNAi -- SM Douglas Vision for “Text Mining” -- A Rzhetsky, M Seringhaus
30 -
Lect
ures
.Ger
stei
nLab
.org
Extra
31 -
Lect
ures
.Ger
stei
nLab
.org
Clustering structures determined
by struc. genomics consortia
according to functional similarity: Is there a functional
bias in consortia
structures?
31
(c) M
ark
Ger
stei
n, 2
002,
Yal
e, b
ioin
fo.m
bb.y
ale.
edu
Avg. Degree
Avg. Path Clust. Coeff.
Diameter
PSI 24 2.6 37% 7 PDB 6 3.9 31% 9
32 -
Lect
ures
.Ger
stei
nLab
.org
Over-representation of crystallography among the Nobel Prizes, highlighted by the 2006 Nobels
33 -
Lect
ures
.Ger
stei
nLab
.org
Info about content in this slide pack • General PERMISSIONS - This Presentation is copyright Mark Gerstein,
Yale University, 2015. - Please read permissions statement at
www.gersteinlab.org/misc/permissions.html . - Feel free to use slides & images in the talk with PROPER acknowledgement
(via citation to relevant papers or link to gersteinlab.org). - Paper references in the talk were mostly from Papers.GersteinLab.org.
• PHOTOS & IMAGES. For thoughts on the source and permissions of many of the photos and
clipped images in this presentation see http://streams.gerstein.info . - In particular, many of the images have particular EXIF tags, such as kwpotppt , that can be
easily queried from flickr, viz: http://www.flickr.com/photos/mbgmbg/tags/kwpotppt