CS 177 Introduction to Bioinformatics, Fall 2005

• Instructor: Anna Panchenko (hcnap2003@yahoo.com)

• Instructor: Tom Madej (tom_ncbi@yahoo.com)

Lecture 1: Introduction

• Instructors

• Course goals

• Grading policy

• Motivating problem

• Course overview

• Molecular basis of cellular processes

• Historical timeline

Course Goals

• The student will be introduced to the fundamental problems and methods of bioinformatics.

• The student will become thoroughly familiar with on-line public bioinformatics databases and their available software tools.

• The student will acquire a background knowledge of biological systems so as to be able to interpret the results of database searches, etc.

• The student will also acquire a general understanding of how important bioinformatics algorithms/software tools work, and how the databases are organized.

Grading Policy

• Homework: 50%, weekly assignments

• Mid-term exam: 20%

• Final exam: 30%

“All examinations, papers, and other graded work products and assignments are to be completed in conformance with: The George Washington University Code of Academic Integrity”.
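
As a worked example of the weights above, a course score could be combined as follows. This is only a sketch: the function name `course_score` and the 0-100 scale for each component are assumptions made for illustration, not part of the stated policy.

```python
# Weights taken from the grading policy above (percent of the course grade).
WEIGHTS = {"homework": 50, "midterm": 20, "final": 30}


def course_score(homework, midterm, final):
    """Weighted course score; each input is assumed to be on a 0-100 scale."""
    total = (WEIGHTS["homework"] * homework
             + WEIGHTS["midterm"] * midterm
             + WEIGHTS["final"] * final)
    return total / 100


# Hypothetical component scores, chosen only to show the arithmetic.
print(course_score(90, 80, 70))
```

Integer percent weights are used so that the sum is exact before the final division.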

Important!

• Please get computer accounts for Tompkins 405 by filling out a form in the TA room on the 4th floor of Tompkins.

• “Office hours”: AP available before class, TM available after class. If you want to see AP or TM before class, please ask in advance.

• We will also accept questions by email, although we may not be able to reply immediately.

Homework

• Homework assignments are due by the start of the next class period (3:30 pm Monday).

• For an assignment turned in up to one week late: 20% penalty.

• Homework more than one week late: No credit!

• Assignments/exams are to be done individually; copying of assignments is not allowed!

NCBI Books

• NCBI home page: http://www.ncbi.nlm.nih.gov

• Follow the “Books” link.

• 45 books available (currently).

• Many specialty topics.

• Also useful general references.

• Searchable!

• Exercise: search the books with “phylogenetic tree”.

What is Bioinformatics?

• A merger of biology, computer science, and information technology.

• Enables the discovery of new biological insights and unifying principles.

• Born from necessity, because of the massive amount of information required to describe biological organisms and processes.

Severe Acute Respiratory Syndrome (SARS)

• SARS is a respiratory illness caused by a previously unrecognized coronavirus; first appeared in Southern China in Nov. 2002.

• Between Nov. 2002 and July 2003, there were 8,098 cases worldwide and 774 fatalities (WHO).

• The global outbreak was over by late July 2003. A few new cases have arisen sporadically since then in China.

• There is currently no vaccine or cure available.
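
The WHO figures above imply an overall case fatality ratio of a little under 10%. A minimal sketch of the arithmetic follows; the variable names are illustrative, and the result is a crude ratio of reported deaths to reported cases, not an adjusted epidemiological estimate.

```python
# Figures quoted above (WHO, Nov. 2002 to July 2003).
cases = 8098
deaths = 774

# Crude case fatality ratio: reported deaths divided by reported cases.
case_fatality = deaths / cases

print(f"{case_fatality:.1%}")
```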

Fig. 2 from Rota et al.

Phylogenetic analysis of coronavirus proteins

Fig. 2 from Rota et al.

Conserved motifs in coronavirus S proteins.

Fig. 2 from Rota et al.

Exercise!

Look up the SARS genome on the NCBI website: www.ncbi.nlm.nih.gov

Notice that you get 2 hits on the Genome database!

The (ever expanding) Entrez System

Entrez links the following databases: PopSet, Structure, PubMed, Books, 3D Domains, Taxonomy, GEO/GDS, UniGene, Nucleotide, Protein, Genome, OMIM, CDD/CDART, Journals, SNP, UniSTS, PubMed Central.

NCBI Databases

• Databases are indexed for quick and efficient searching.

• Databases are cross-linked to each other.

Exercise!

• Search the Entrez “Protein” database with the keyword “interleukin”.

• Follow the link, then look at the different report formats.

• Also try a search of “Protein” with “interleukin AND human [orgn]”.
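
The searches above are meant to be run through the Entrez web pages. For readers who want to script the same queries, the sketch below builds request URLs for NCBI's E-utilities `esearch` endpoint. E-utilities are not mentioned in the slides; the helper name `esearch_url` and the `retmax` default are assumptions, and the code only constructs the URL without contacting NCBI.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db, term, retmax=20):
    """Return an ESearch URL for an Entrez database and query string.

    urlencode escapes spaces and the square brackets used by Entrez
    field tags such as [orgn].
    """
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


# The two Protein queries from the exercise.
print(esearch_url("protein", "interleukin"))
print(esearch_url("protein", "interleukin AND human [orgn]"))
```

Passing a different `db` value (for example `"genome"`) targets another Entrez database in the same way.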

Course Overview

Lecture 1: Introduction

• Instructors

• Grading policy

• Motivating problem

• Course overview

• Molecular basis of cellular processes

• Historical timeline

Lecture 2: General principles of DNA/RNA structure and stability

• Physico-chemical properties of nucleic acids

• RNA folding and structure prediction

• Gene identification

• Genome analysis

Lecture 3: General principles of protein structure and stability

• Physico-chemical properties of proteins

• Prediction of protein secondary structure

• Protein domains and prediction of domain boundaries

• Protein structure-function relationships

Lecture 4: Sequence alignment algorithms

• The alignment problem

• Pairwise sequence alignment algorithms

• Multiple sequence alignment algorithms

• Sequence profiles and profile alignment methods

• Alignment statistics

Lecture 5: Computational aspects of protein structure, part I

• Protein folding problem

• Problem of protein structure prediction

• Homology modeling

• Protein design

• Prediction of functionally important sites

Lecture 6: Computational aspects of protein structure, part II

• Structure-structure alignment algorithms

• Significance of structure-structure similarity

• Protein structure classification

Lecture 7: Bioinformatics databases

• Sequence and sequence alignment formats, data exchange

• Public sequence databases

• Sequence retrieval and examples

• Public protein structure databases

• Lab exercises

Lecture 8: Bioinformatics database search tools

• Sequence database search tools

• Structure database search tools

• Assessment of results, ROC analysis

• Lab exercises

Lecture 9: Phylogenetic analysis, part I

• Molecular basis of evolution

• Taxonomy and phylogenetics

• Phylogenetic trees and phylogenetic inference

• Software tools for phylogenetic analysis

Lecture 10: Phylogenetic analysis, part II

• Accuracies and statistical tests of phylogenetic trees

• Genome comparisons

• Protein structure evolution

Lecture 11: Experimental techniques for macromolecular analysis

• Sequencing, PCR

• Protein crystallography

• Mass spectroscopy

• Microarrays

• RNA interference

Lecture 12: Systems biology

• Genomic circuits

• Modeling complex integrated circuits

• Protein-protein interaction

• Metabolic networks

Lecture 13: Review

Molecular Biology Background

• Cells – general structure/organization

• Molecules – that make up cells

• Cellular processes – what makes the cell alive

Two Cell Organizations

• Prokaryotes – lack a nucleus, have a simpler internal structure, and are generally much smaller

• Eukaryotes – with nucleus (containing DNA) and various organelles

Selected organelles…

• Nucleus – contains chromosomes/DNA

• Mitochondria – generate energy for the cell, contain mitochondrial DNA

• Ribosomes – where translation from mRNA to proteins takes place (protein synthesis machinery)

• Lysosomes – where protein degradation takes place

Cells can become specialized…

Three domains of life

• Prokarya: Bacteria, Archaea

• Eukarya: Eukaryotes

Universal phylogenetic tree.

Fig. 1 from: N.R. Pace, Science 276 (1997) 734-740.

Molecules in the cell

• Proteins – catalyze reactions, form structures, control membrane permeability, cell signaling, recognize/bind other molecules, control gene function

• Nucleic acids – DNA and RNA; encode information about proteins

• Lipids – make up biomembranes

• Carbohydrates – energy sources, energy storage, constituents of nucleic acids and surface membranes

• Other small molecules – e.g. ATP, water, ions, etc.

Exercise!

Retrieve a protein structure from the SARS coronavirus from the NCBI website; you can use: www.ncbi.nlm.nih.gov/Structure/

Look at the structure for the SARS protease using Cn3D.

The Central Dogma of Molecular Biology
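
The central dogma describes information flowing from DNA to RNA (transcription) and from RNA to protein (translation). The sketch below illustrates both steps on a short, made-up coding-strand sequence. The function names and the example sequence are illustrative assumptions; the codon table is the standard genetic code, and real genes involve details (strand choice, splicing, alternative codes) that are ignored here.

```python
from itertools import product

# Standard genetic code, listed with bases in the order T, C, A, G.
# '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna):
    """Return the mRNA matching a DNA coding strand (T replaced by U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].replace("U", "T")  # table is keyed by DNA letters
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCCTTTTAA"  # hypothetical coding-strand fragment
mrna = transcribe(dna)
print(mrna)
print(translate(mrna))
```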

Timeline

1859 Darwin publishes On the Origin of Species…

1865 Mendel’s experiments with peas show that hereditary traits are passed on to offspring in discrete units.

1869 Miescher isolates DNA.

1895 Röntgen discovers X-rays.

1902 Sutton proposes the chromosome theory of heredity.

Timeline (cont.)

1911 Morgan and co-workers establish the chromosome theory of heredity, working with fruit flies.

1943 Astbury observes the first X-ray pattern of DNA.

1944 Avery, MacLeod, and McCarty show that DNA transmits heritable traits (not proteins!).

1951 Pauling and Corey predict the structure of the alpha-helix and beta-sheet.

Timeline (cont.)

1953 Watson and Crick propose the double helix model for DNA based on X-ray data from Franklin and Wilkins.

1955 Sanger announces the sequence of the first protein to be analyzed, bovine insulin.

1955 Kornberg and co-workers isolate the enzyme DNA polymerase (used for copying DNA, e.g. in PCR).

1958 The first integrated circuit is constructed by Kilby at Texas Instruments.

Timeline (cont.)

1960 Perutz and Kendrew obtain the first X-ray structures of proteins (hemoglobin and myoglobin).

1961 Brenner, Jacob, and Meselson discover that mRNA transmits the information from the DNA in the nucleus to the cytoplasm.

1965 Dayhoff starts the Atlas of Protein Sequence and Structure.

1966 Nirenberg, Khorana, Ochoa and colleagues crack the genetic code!

1970 The Needleman-Wunsch algorithm for sequence comparison is published.

Timeline (cont.)

1972 Dayhoff develops the Protein Sequence Database (PSD).

1972 Berg and colleagues create the first recombinant DNA molecule.

1973 Cohen invents DNA cloning.

1975 Sanger and others (Maxam, Gilbert) invent rapid DNA sequencing methods.

Timeline (cont.)

1977 The first complete genome sequence for an organism (bacteriophage ΦX174) is published. The genome consists of 5,386 bases coding 9 proteins.

1981 The Smith-Waterman algorithm for sequence alignment is published.

1981 IBM introduces its Personal Computer to the market.

1982 The GenBank sequence database is created at Los Alamos National Laboratory.

Timeline (cont.)

1983 Mullis and co-workers describe the polymerase chain reaction (PCR).

1985 The FASTP algorithm is published by Lipman and Pearson.

1986 The SWISS-PROT database is created.

1986 The Human Genome Initiative is announced by DOE.

1988 The National Center for Biotechnology Information (NCBI) is established at the National Library of Medicine in Bethesda.

Timeline (cont.)

1992 Human Genome Sciences, in Gaithersburg, MD, is founded by Haseltine.

1992 The Institute for Genomic Research (TIGR) is established by Venter in Rockville, MD.

1995 The Haemophilus influenzae genome is sequenced (1.8 Mb).

1996 Affymetrix produces the first commercial DNA chips.

Timeline (cont.)

1988 The FASTA algorithm for sequence comparison is published by Pearson and Lipman.

1990 Official launch of the Human Genome Project.

1990 The BLAST program by Altschul et al. is published.

1991 The CERN research institute in Geneva announces the creation of the protocols which make up the World Wide Web.

Timeline (cont.)

1996 The yeast genome is sequenced; the first complete eukaryotic genome.

1996 Human DNA sequencing begins.

1997 The E. coli genome is sequenced (4.6 Mb, approx. 4k genes).

1998 The C. elegans genome is sequenced (97 Mb, approx. 20k genes); the first genome of a multicellular organism.

Timeline (cont.)

1998 Venter founds Celera in Rockville, MD.

1998 The Swiss Institute of Bioinformatics is established in Geneva.

1999 The HGP completes the first human chromosome (no. 22).

2000 The Drosophila genome is completed.

Timeline (cont.)

2000 Human chromosome no. 21 is completed.

2001 A draft of the entire human genome (3,000 Mb) is published.

2003 The Human Genome is “completed”! Approx. 30,000 genes (estimated).
