Introduction to Bioinformatics Part 1 of 2 Jonathan Pevsner, Ph.D. [email protected] M.E:440.714 September 8, 2003



Copyright notice

Many of the images in this PowerPoint presentation are from Bioinformatics and Functional Genomics by Jonathan Pevsner (ISBN 0-471-21004-8). Copyright © 2003 by John Wiley & Sons, Inc.

These images and materials may not be used without permission from the publisher. We welcome instructors to use these PowerPoint slides for educational purposes, but please acknowledge the source.

The book has a homepage at http://www.bioinfbook.org, including hyperlinks to the book chapters.

Teaching assistants

Hugh Cahill

Mayra Garcia

Gek Ming Sia

Who is taking this course?

• People with very diverse backgrounds in biology

• People with diverse backgrounds in computer science and biostatistics

• Most people have a favorite gene, protein, or disease

What are the goals of the course?

• To provide an introduction to bioinformatics with a focus on the National Center for Biotechnology Information (NCBI) and EBI

• To focus on the analysis of DNA, RNA and proteins

• To introduce you to the analysis of genomes

• To combine theory and practice to help you solve research problems

Themes throughout the course

Textbooks

Web sites

Literature references

Gene/protein families

Computer labs

Themes throughout the course: textbooks

Several textbooks are available on reserve:

• Baxevanis and Ouellette

• David Mount

• Durbin et al.

I have written a textbook that will appear Oct. 1, Bioinformatics and Functional Genomics. The chapters contain content, lab exercises, and quizzes that were developed in this course. We will provide chapters as handouts.

Once the book becomes available, we will put copies on reserve. The book is recommended (not required).

Themes throughout the course: web sites

The course website is: http://pevsnerlab.kennedykrieger.org/bioinfo_course.htm

The textbook website is: http://www.bioinfbook.org

This has 1000 URLs, organized by chapter. The site offers a 15% discount on book purchases (although the book is not required).

The principal website we will explore is NCBI: http://www.ncbi.nlm.nih.gov

Themes throughout the course: Literature references

You are encouraged to read original source articles. Although articles are not required, they will enhance your understanding of the material.

You can obtain articles through PubMed and through the WelDoc service at Welch. Some articles will be available on reserve.

Themes throughout the course: gene/protein families

We will use retinol-binding protein 4 (RBP4) as a model gene/protein throughout the course. RBP4 is a member of the lipocalin family. It is a small, abundant carrier protein. We will study it in a variety of contexts including

-- sequence alignment

-- gene expression

-- protein structure

-- phylogeny

-- homologs in various species

We will also use the Pol protein of HIV-1 as an example.

The HIV-1 pol gene encodes three proteins

Aspartyl protease (PR)

Reverse transcriptase (RT)

Integrase (IN)

Themes throughout the course: computer labs

There is a computer lab each Friday. This is a chance to gain practical experience using a variety of web resources.

You can do the lab on your own if you wish. However, during the lab you can get help on problems, and in some cases the computers will have specialized software.

Grading

30% weekly quizzes (open book)

30% final exam November 13

40% discovery of a novel gene (by Oct. 9) and phylogenetic tree (by Nov. 13)

extra credit: find a mistake in a database

What is bioinformatics?

• Interface of biology and computers

• Analysis of proteins, genes and genomes using computer algorithms and computer databases

• Genomics is the analysis of genomes. The tools of bioinformatics are used to make sense of the billions of base pairs of DNA that are sequenced by genomics projects.

Top ten challenges for bioinformatics

[1] Precise models of where and when transcription will occur in a genome (initiation and termination)

[2] Precise, predictive models of alternative RNA splicing

[3] Precise models of signal transduction pathways; ability to predict cellular responses to external stimuli

[4] Determining protein:DNA, protein:RNA, protein:protein recognition codes

[5] Accurate ab initio protein structure prediction


[6] Rational design of small molecule inhibitors of proteins

[7] Mechanistic understanding of protein evolution

[8] Mechanistic understanding of speciation

[9] Development of effective gene ontologies: systematic ways to describe gene and protein function

[10] Education: development of bioinformatics curricula

Source: Ewan Birney, Chris Burge, Jim Fickett

Three perspectives on bioinformatics

The tree of life

The organism

The cell

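[Figure: the organism perspective spans time of development, body region, physiology, pharmacology, and pathology. The cell perspective follows DNA → RNA → protein → phenotype, with matching databases: genomic DNA databases (DNA); cDNA, ESTs, and UniGene (RNA); protein sequence databases (protein).]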

There are three major public DNA databases

• GenBank: housed at NCBI (National Center for Biotechnology Information)

• EMBL: housed at EBI (European Bioinformatics Institute)

• DDBJ: housed in Japan

The underlying raw DNA sequences are identical

>100,000 species are represented in GenBank

all species 128,941

viruses 6,137

bacteria 31,262

archaea 2,100

eukaryota 87,147

The most sequenced organisms in GenBank

Homo sapiens (6.9 million entries)

Mus musculus (5.0 million)

Zea mays (896,000)

Rattus norvegicus (819,000)

Gallus gallus (567,000)

Arabidopsis thaliana (519,000)

Danio rerio (492,000)

Drosophila melanogaster (350,000)

Oryza sativa (221,000)

National Center for Biotechnology Information (NCBI)

www.ncbi.nlm.nih.gov


PubMed is…

• National Library of Medicine's search service

• 11 million citations in MEDLINE

• links to participating online journals

• PubMed tutorial (via “Education” on side bar)

Entrez integrates…

• the scientific literature

• DNA and protein sequence databases

• 3D protein structure data

• population study data sets

• assemblies of complete genomes

Entrez is a search and retrieval system that integrates NCBI databases
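One way to see what "integrates NCBI databases" means in practice is that the same query term can be sent to several databases through one interface. The sketch below only builds search URLs and does not contact NCBI. It assumes NCBI's E-utilities ESearch endpoint and the database names `pubmed`, `protein`, and `structure`, none of which appear in the slides; `build_esearch_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Base URL of NCBI's E-utilities (an assumption of this sketch, not from the slides).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL for one Entrez database without sending a request."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# The same term pointed at three different Entrez databases.
for database in ("pubmed", "protein", "structure"):
    print(build_esearch_url(database, "retinol-binding protein 4"))
```

Pasting one of the printed URLs into a browser should return a list of matching record identifiers, although the exact response format is up to NCBI.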

BLAST is…

• Basic Local Alignment Search Tool

• NCBI's sequence similarity search tool

• supports analysis of DNA and protein databases

• 80,000 searches per day

OMIM is…

• Online Mendelian Inheritance in Man

• catalog of human genes and genetic disorders

• edited by Dr. Victor McKusick, others at JHU

Books is…

• searchable resource of on-line books

TaxBrowser is…

• browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses)

• taxonomy information such as genetic codes

• molecular data on extinct organisms

Structure site includes…

• Molecular Modelling Database (MMDB)

• biopolymer structures obtained from the Protein Data Bank (PDB)

• Cn3D (a 3D-structure viewer)

• vector alignment search tool (VAST)

Four questions we can answer at NCBI (and elsewhere):

[1] How can I do a literature search using PubMed?

[2] How can WelchWeb help?

[3] How can I use Entrez to find information about a particular gene or protein?

(What is an accession number?)

[4] How can I find information about a particular disease?

Question #1: How can I use PubMed at NCBI to find literature information?

PubMed is the NCBI gateway to MEDLINE.

MEDLINE contains bibliographic citations and author abstracts from over 4,000 journals published in the United States and in 70 foreign countries.

It has 12 million records dating back to 1966.

MeSH is the acronym for "Medical Subject Headings."

MeSH is the list of the vocabulary terms used for subject analysis of biomedical literature at NLM. MeSH vocabulary is used for indexing journal articles for MEDLINE.

The MeSH controlled vocabulary imposes uniformity and consistency to the indexing of biomedical literature.
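The effect of a controlled vocabulary can be illustrated with a toy lookup: different free-text phrasings are mapped to one heading, so articles on the same topic end up indexed under the same term. The mapping below is hand-written for illustration and is not the MeSH data; `TOY_HEADINGS` and `index_term` are hypothetical names.

```python
# Hand-written toy mapping from free-text terms to a single controlled heading.
# The real MeSH vocabulary is far larger; these entries are illustrative only.
TOY_HEADINGS = {
    "heart attack": "Myocardial Infarction",
    "myocardial infarct": "Myocardial Infarction",
    "cancer": "Neoplasms",
    "tumor": "Neoplasms",
    "tumour": "Neoplasms",
}


def index_term(free_text: str) -> str:
    """Return the controlled heading for a term, or the cleaned term if unmapped."""
    key = free_text.strip().lower()
    return TOY_HEADINGS.get(key, key)


# Three hypothetical articles whose authors used different wording.
articles = {"paper A": "Heart attack", "paper B": "myocardial infarct", "paper C": "Tumour"}
indexed = {title: index_term(term) for title, term in articles.items()}
print(indexed)
```

Papers A and B receive the same heading even though their authors chose different words, which is the uniformity the slide describes.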

PubMed search strategies

Try the tutorial (“education” on the left sidebar)

Use Boolean queries, e.g. lipocalin AND disease

Try using “limits”

Try “LinkOut” to find external resources

Obtain articles on-line via Welch Medical Library(and download pdf files):

http://www.welch.jhu.edu/

lipocalin AND disease (35 results): 1 AND 2

lipocalin OR disease (1,300,000 results): 1 OR 2

lipocalin NOT disease (350 results): 1 NOT 2

[Figure: Venn diagrams of sets 1 (lipocalin) and 2 (disease) for each query]
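The three Boolean operators behave like set intersection, union, and difference. A minimal sketch using Python sets, with made-up record IDs standing in for PubMed results:

```python
# Hypothetical record IDs; real PubMed searches return far larger sets.
lipocalin = {101, 102, 103, 104}       # set 1: records matching "lipocalin"
disease = {103, 104, 105, 106, 107}    # set 2: records matching "disease"

both = lipocalin & disease        # 1 AND 2: records in both sets
either = lipocalin | disease      # 1 OR 2: records in at least one set
only_first = lipocalin - disease  # 1 NOT 2: records in set 1 but not in set 2

print(sorted(both), sorted(either), sorted(only_first))
```

Because AND and NOT split set 1 into two non-overlapping parts, their counts add up to the size of set 1. By the same logic, the 35 and 350 results quoted for the lipocalin queries would imply about 385 records matching lipocalin on its own.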

Question #2: How can I use WelchWeb (from the Welch Medical Library) to do literature (and other) searches?

WelchWeb is available at http://www.welch.jhu.edu


E-mail gateway

PubMed gateway

Library catalog

Remote access to Welch services

Request literature

Browse journals

Browse databases

WelchWeb URLs of interest

Basic Sciences Subject Guide: http://www.welch.jhu.edu/internet/bsci.html

RAUL (remote access): http://proxy.hcf.jhu.edu/

Weldoc (interlibrary loan and electronic delivery of articles): http://weldoc.welch.jhmi.edu/weldoc/logon.html

MyWelch (personal library portal): https://mywelch.welch.jhmi.edu

Welch E-Learning page (online tutorials and handouts): http://www.welch.jhu.edu/classes/elearning/index.html

Johns Hopkins Author Publishing Tool: http://openaccess.jhmi.edu/authors_resource.cfm

Browse Welch E-Resources by Subject: http://www.welch.jhu.edu/eresources/edatabases_subject.cfm

Liaison Librarian Program (every department has a liaison librarian): http://www.welch.jhu.edu/liaison/index.html

Thanks to Brian Brown ([email protected]), the Welch Medical Library liaison to the basic sciences

Visit the Basic Sciences Subject Guide for a long list of bioinformatics-related sites...