Issues in Managing and Disseminating Changing Information in Biology Sue Rhee [email protected] Carnegie Institution Department of Plant Biology Stanford, CA

Issues in Managing and Disseminating Changing Information in Biology

Sue Rhee

[email protected]

Carnegie Institution

Department of Plant Biology

Stanford, CA

Information Dissemination Media in Biology

Journals
• ~150 years
• peer-reviewed
• highly referenced
• limited size
• static

Public Repositories
• ~20 years
• minimum review
• minimum reference
• unlimited size
• static

Community Databases
• ~5 years
• curator-reviewed
• moderately referenced
• unlimited size
• dynamic

TAIR: the Arabidopsis Information Resource

• A community database about Arabidopsis information
• Researchers can search, download, and analyze data via commonly used web browsers and FTP
• NSF-funded project (1999-2004)
• Collaboration between Carnegie (Stanford, CA), NCGR (Santa Fe, NM) and ABRC (Columbus, OH)
• http://www.arabidopsis.org

Who are the users?

Total: 12,300 individuals and 4,700 labs working on plant research

People
Unspecified                   8863
Graduate Student               792
Post-Doctoral Researcher       748
Professor                      361
Research Scientist             348
Assistant Professor            330
Associate Professor            246
Research Associate             154
Group Leader                   135
Research Assistant             121
Other                          110
Unknown                        100
Research Fellow                 82
Project Leader                  71
Undergraduate Student           70
Director                        42
Lecturer                        36
Senior Research Officer         25
Curator                         20
Programmer                      16
Teacher                         10
Coordinator                     10
Senior Lecturer                 10
High School Teacher              9
High School Student              8
President                        7
Advisory Board Member            2
Secretary                        1
Middle School Teacher            1

Groups
Lab                           4724
Institute                       74
Project                         41
Education_outreach_program      21
Facility                        15
Company                         14
University                      14
Collaboration                   13
Database                         8
4_year_college                   7
Center                           4
Stock_center                     3
Committee                        2
Foundation                       1
Organization                     1
Community_college                1

Organism of Interest
Arabidopsis                   3211
Rice                           777
Maize                          535
Wheat                          390
Tomato                         351
Legumes                        331
Bacteria                       274
Fungi                          227
Potato                         209
Animals                        177
Other Crops                    150
Microorganisms                 124
Legume                          93
tobacco                         28
barley                          17
cotton                          12
Tobacco                         11
tomato                          10
Brassica                         9
petunia                          8
Barley                           7
Chlamydomonas                    7
poplar                           6

Usage Statistics

Monthly:
• ~5 million files served
• ~900,000 page views
• ~29,000 IP addresses
• ~30 Gb served

What do we do?

1. Capture data generated by large genome projects and individual researchers
– Read and extract information from the literature, establish contact with large-scale project groups

2. Curate and analyze the information
– Error checking, making associations, synthesizing summaries, adding quality-control filters through a series of standard operating procedures and analysis pipelines

3. Make information accessible to users in an intuitive form
– In-house biologists and user feedback from surveys & workshops

4. Develop data query, analysis, curation, and visualization tools
– Collaboration between software developers and biologists; an iterative process

5. Communicate with the users
– Data submission, suggestions, error and other problem reports

What is PubSearch?

• A web application and database for literature curation
• Stores complete literature information
– References, abstracts, full-text articles (PDF)
• Stores biological information
– Genes, proteins, descriptions
• Stores ontologies (GO terms)
• Links literature, GO terms, and biological information
• Assists manual curation with fast, automatic matching (using a suffix-tree indexer)
• Is password-protected and easy to set up and use
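The automatic matching step can be illustrated with a small sketch. The slides do not show PubSearch's own indexer, so this is not its implementation: it swaps in a plain suffix array (a simpler relative of the suffix tree) built over one lowercased abstract. The function names `build_suffix_array` and `find_term` and the sample abstract are hypothetical.

```python
def build_suffix_array(text):
    """Return the start offsets of all suffixes of text, in sorted order."""
    # Naive construction; acceptable for a single abstract, not a whole corpus.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_term(text, suffix_array, term):
    """Return the sorted start offsets at which term occurs in text."""
    # Binary search for the first suffix whose leading characters are >= term.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + len(term)] < term:
            lo = mid + 1
        else:
            hi = mid
    # Suffixes that begin with term are contiguous in the sorted order.
    hits = []
    while lo < len(suffix_array) and text.startswith(term, suffix_array[lo]):
        hits.append(suffix_array[lo])
        lo += 1
    return sorted(hits)


# Hypothetical abstract and term list, for illustration only.
abstract = "the ap2 gene regulates flower development; ap2 mutants lack petals".lower()
index = build_suffix_array(abstract)
matches = {
    term: find_term(abstract, index, term)
    for term in ["ap2", "flower development", "root"]
}
```

The index is built once per text, and each term lookup is then a binary search rather than a full scan. This sketch matches raw substrings; a curation tool would likely also need word-boundary checks and synonym handling, and a curator would still validate each hit.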

PubSearch System Architecture

TAIR Installation Statistics (9/12/03)

• 20,272 literature references
• 14,920 research papers with abstracts
• 8,642 full-text papers (58%)
• 16,956 controlled vocabulary terms
• 105,671 hits between terms and articles (2,359 terms)
• 38,010 gene names
• 29,841 hits between genes and articles (4,268 genes)
• 14,943 hits validated
– (70% valid, 29% not valid, 0.5% maybe)
• 11,497 manual annotations to 5,981 genes from 2,113 articles
• 38 relationship types for gene2term and gene2gene
• 103 evidence types

Pub* Tools Website: http://pubsearch.org

TAIR Data Size

Type           Info Stored                                                Size in 1999   Size in 2003
Website        General information, help, external sites                  0.7 Gb         25 Gb
Database       Data, external links, definition of database fields        3 Gb           20 Gb
FTP directory  Large datasets generated from database or external sites   N.D.           13 Gb
DVD Archive    Microarray raw data                                        0              1.6 Gb

Current Issues in Community Databases

1. How to maximize connection with public repositories and journals?

2. How to ensure information is up-to-date?

3. How to cross-reference all the information in independent sites?

4. What happens after the funding?

Overlap and Interconnection Between Existing Media

Journals

Public Repositories

Community Databases

Making Connections with Public Repositories

1. Utilizing existing standards
   A. LinkOut
      a. Data capture includes GenBank accession (e.g. seed stock containing an insertion and the insert-site sequence with GenBank accession)
      b. Data downloaded from GenBank by accession using E-utilities
      c. Data curation/analysis generates additional associations (e.g. the insertion site is used to identify the associated gene and a polymorphism for that gene)
      d. Sequence-associated information sent back to GenBank using the LinkOut XML format

2. Collaborating to make new standards
   A. Plant microarray submission standards with ArrayExpress
   B. MIAME standards for microarrays
      a. Researchers submit microarray data in prefilled Excel sheets
      b. Excel sheets converted into XML and loaded into the TAIR database
      c. Data curation/analysis generates additional associations (e.g. usage of controlled vocabularies)
      d. Data exported into XML and sent to ArrayExpress
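Two steps in the pipelines above lend themselves to a short sketch: downloading a GenBank record by accession through NCBI E-utilities, and turning spreadsheet rows into XML. This is a minimal illustration, not TAIR's code. The EFetch endpoint and its `db`, `id`, `rettype`, and `retmode` parameters are NCBI's; the accession is a placeholder, and the spreadsheet is assumed to have been exported as CSV. The XML element names (`submission`, `sample`) are hypothetical and do not represent the LinkOut or ArrayExpress schemas.

```python
import csv
import io
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_fetch_url(accession, db="nuccore"):
    """Build an EFetch URL that returns one record as a GenBank flat file."""
    params = {"db": db, "id": accession, "rettype": "gb", "retmode": "text"}
    return EFETCH_URL + "?" + urlencode(params)


def rows_to_xml(csv_text):
    """Convert spreadsheet rows (exported as CSV) into a simple XML document."""
    root = ET.Element("submission")  # hypothetical element name
    for row in csv.DictReader(io.StringIO(csv_text)):
        sample = ET.SubElement(root, "sample")  # hypothetical element name
        for column, value in row.items():
            # Assumes each column header is a valid XML element name.
            ET.SubElement(sample, column).text = value
    return ET.tostring(root, encoding="unicode")


url = genbank_fetch_url("EXAMPLE000001")  # placeholder, not a real accession
xml_text = rows_to_xml("name,tissue\nslide1,leaf\n")
```

The URL can be passed to `urllib.request.urlopen` to retrieve the record. A real exchange would validate the XML against the receiving repository's published format before sending it.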

Making Connections with Journals

1. Publication requirement to adhere to existing standards
   A. Stock accessions
   B. Gene symbol registry (currently under discussion)

2. Data sharing
   A. Image data for gene expression
   B. Supplementary data (e.g. microarray results)

3. Resource sharing
   A. Publication through community databases?

Keeping Information Up-To-Date

1. In-house curation
– pro: experience and standard operating procedures can ensure consistency
– con: becoming difficult to keep up as the amount and complexity of information increase

2. Community involvement
– pro: expertise and sheer numbers of the community
– con: has not worked successfully (no incentive in the current academic reward structure; not considered a typical role of a scientist)

3. Others?

Impact Factor of Top Journals

Journal                                Impact Factor   Total Citations   Total Articles   Citations per Article
Nature                                 30.4            326546            889              367
Science                                29.0            296080            987              300
Cell                                   27.3            139765            350              399
Genes & Development                    19.7            45227             268              169
Current Opinion in Cell Biology        19.0            12818             90               142
Molecular Cell                         16.5            16125             271              60
Journal of Cell Biology                12.5            68928             412              167
Trends in Plant Sciences               12.4            4283              60               71
Developmental Cell                     11.5            1196              139              9
The Plant Cell                         10.8            17373             241              72
PNAS                                   10.7            315820            2911             108
EMBO Journal                           10.7            77524             677              115
Current Opinion in Plant Biology       9.5             2510              74               34
Molecular Biology of the Cell          7.6             14170             347              41
Current Biology                        7.0             20020             341              59
Journal of Cell Science                7.0             20840             460              45
Journal of Biological Chemistry        6.7             370056            6444             57
The Plant Journal                      5.9             12721             287              44
Molecular Microbiology                 5.8             23553             521              45
Plant Physiology                       5.8             33690             531              63
Traffic                                5.4             1182              87               14
Plant Molecular Biology                4.5             10522             194              54
Molecular Plant Microbe Interactions   3.8             4449              140              32
Journal of Computational Biology       3.5             711               44               16
Fungal Genetics and Biology            3.2             1044              72               15
Planta                                 3.0             10641             245              43
Phytopathology                         2.2             9913              167              59
Current Genetics                       1.9             2788              77               36

Impact Factor of Top Databases?

Impact of TAIR

                     2000     2001     2002     2003
TAIR mentioned         44       60      110      143
Total full-text      1031     1036     1228     1258
Percent mentioned   4.27%    5.79%    8.96%   11.37%
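The "percent mentioned" figures follow directly from the two count rows: the number of full-text papers mentioning TAIR divided by the total number of full-text papers for that year. A short check, using only the counts from the slide (the variable and function names are illustrative):

```python
# Counts copied from the slide: full-text papers mentioning TAIR, and all
# full-text papers, per year.
tair_mentioned = {2000: 44, 2001: 60, 2002: 110, 2003: 143}
total_full_text = {2000: 1031, 2001: 1036, 2002: 1228, 2003: 1258}


def percent_mentioned(mentioned, total):
    """Share of full-text papers that mention TAIR, as a percentage."""
    return round(100 * mentioned / total, 2)


percents = {
    year: percent_mentioned(tair_mentioned[year], total_full_text[year])
    for year in tair_mentioned
}
# percents == {2000: 4.27, 2001: 5.79, 2002: 8.96, 2003: 11.37}
```

The share rises each year while the total number of full-text papers grows only modestly, so the increase reflects more frequent mentions rather than a larger corpus.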

Current Issues in Community Databases

1. How to maximize connection with public repositories and journals?

2. How to ensure information is up-to-date?

3. How to cross-reference all the information in independent sites?

4. What happens after the funding?

The End

People Involved

TAIR-Carnegie
Tanya Berardini
Marga Garcia-Hernandez
Eva Huala
Suparna Mundodi
Leonore Reiser
Julie Tacklind
Iris Xu
Danny Yoo
Peifen Zhang
Nick Moseyko
Brandon Zoekler
Jessie Zhang

TAIR-NCGR
Dan Weems
Neil Miller
Mary Montoya

ABRC
Randy Scholl
Debbie Crist
Emma Knee
Luz Rivero

Information Dissemination Media in Biology

1. Scientific Journals
• Traditional medium of knowledge dissemination
• Long history of publishing
• Recently have moved to electronic publishing

2. Public Repositories
• Permanent operations for electronic storage and dissemination of basic data
• Shorter history than journals, about 20 years
• A good example is NCBI’s GenBank

3. Community Databases
• Information resources that are created, maintained, and improved by the research community
• Funded by governments, not permanent
• A few large databases share a similar history with public repositories
• Recently there has been a radiation of community databases

What is the infrastructure?

• Web browser applications
• TAIR DB
• Data object layer
• Application Program Interface
• Analysis cluster
• FTP directory
• DVD archive
• Software development, curation, testing, and staging environments