Bioinformatics: Medical genomics and bioinformatics · bio.biomedicine.gu.se/courses/ht09/genomics/db_lecture1... · 2011. 8. 8.

Bioinformatics

Handling and analysis of data obtained from current biomedical / gene technology methods

Interdisciplinary science
• Biology
• Mathematics
• Computer science

Medical genomics and bioinformatics

Biological sequences
DNA -> mRNA -> protein

Information resources in biomedicine

Sequence analysis
Sequence alignments
Database searches for sequence similarity
Finding genes in genomes

Finding disease genes
Linkage analysis

Medical genomics and bioinformatics

Microarray data analysis
gene expression - mRNA abundance

Molecular genetic and cytogenetic analysis in the clinic

RNA bioinformatics
microRNAs and prediction of target mRNAs

Medical genomics and bioinformatics

Proteomics
Large scale analysis of protein content

Molecular phylogeny
Sequences in virology and microbiology

Introduction to bioinformatics

• Information resources

Tore Samuelsson Nov 2009

Flow of genetic information

DNA

RNA transcript

splicing

mature mRNA

protein

protein structure -> biological function

56,000 protein structures

8,000,000 protein sequences

100 x 10^6 sequences corresponding to partial mRNAs

~ 250 x 10^9 nt

Nature Nov 6 2008

Archon X prize

$10 million to the first Team ... to sequence 100 human genomes within 10 days or less ...at a cost of no more than $10,000 per genome.

Margaret Dayhoff

The early days of sequence databases

Genome sequencing using a shotgun approach

DNA sequence databases

DDBJ (Japan)
GenBank (NCBI, NIH, US)
EMBL (EBI, UK)

GenBank (www.ncbi.nlm.nih.gov)
EMBL (European Molecular Biology Laboratory, www.ebi.ac.uk)

EMBL and Genbank formats

EMBL format

ID   LISOD      standard; DNA; PRO; 756 BP.
XX
AC   X64011; S78972;
XX
SV   X64011.1
XX
DT   28-APR-1992 (Rel. 31, Created)
DT   30-JUN-1993 (Rel. 36, Last updated, Version 6)
XX
DE   L.ivanovii sod gene for superoxide dismutase
XX
KW   sod gene; superoxide dismutase.
XX
OS   Listeria ivanovii
OC   Bacteria; Firmicutes; Bacillus/Clostridium group;
OC   Bacillus/Staphylococcus group; Listeria.
XX
RN   [1]
RX   MEDLINE; 92140371.
RA   Haas A., Goebel W.;
RT   "Cloning of a superoxide dismutase gene from Listeria ivanovii by
RT   functional complementation in Escherichia coli and characterization of the
RT   gene product.";
RL   Mol. Gen. Genet. 231:313-322(1992).
XX
RN   [2]
RP   1-756
RA   Kreft J.;
RT   ;
RL   Submitted (21-APR-1992) to the EMBL/GenBank/DDBJ databases.
RL   J. Kreft, Institut f. Mikrobiologie, Universitaet Wuerzburg, Biozentrum Am
RL   Hubland, 8700 Wuerzburg, FRG
XX
DR   SWISS-PROT; P28763; SODM_LISIV.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..756
FT                   /db_xref="taxon:1638"
FT                   /organism="Listeria ivanovii"
FT                   /strain="ATCC 19119"
FT   RBS             95..100
FT                   /gene="sod"
FT   terminator      723..746
FT                   /gene="sod"
FT   CDS             109..717
FT                   /db_xref="SWISS-PROT:P28763"
FT                   /transl_table=11
FT                   /gene="sod"
FT                   /EC_number="1.15.1.1"
FT                   /product="superoxide dismutase"
FT                   /protein_id="CAA45406.1"
FT                   /translation="MTYELPKLPYTYDALEPNFDKETMEIHYTKHHNIYVTKLNEAVSG
FT                   HAELASKPGEELVANLDSVPEEIRGAVRNHGGGHANHTLFWSSLSPNGGGAPTGNLKAA
FT                   IESEFGTFDEFKEKFNAAAAARFGSGWAWLVVNNGKLEIVSTANQDSPLSEGKTPVLGL
FT                   DVWEHAYYLKFQNRRPEYIDTFWNVINWDERNKRFDAAK"
XX
SQ   Sequence 756 BP; 247 A; 136 C; 151 G; 222 T; 0 other;
     cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat        60
     gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa       120
     ttaccaaaat taccttatac ttatgatgct ttggagccga attttgataa agaaacaatg       180
     gaaattcact atacaaagca ccacaatatt tatgtaacaa aactaaatga agcagtctca       240
     ggacacgcag aacttgcaag taaacctggg gaagaattag ttgctaatct agatagcgtt       300
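The feature table coordinates in the record above are 1-based and inclusive, which is easy to get wrong when slicing strings in a program. The following is a minimal Python sketch, not part of the original lecture, that pulls the RBS (95..100) and the first four codons of the CDS (starting at 109) out of the first 120 nt shown. The helper names `subseq` and `translate` are illustrative assumptions, and the codon table is the standard genetic code (for these codons it gives the same result as /transl_table=11).

```python
# First 120 nt of the X64011 sequence as printed above (spaces removed below).
SEQ = (
    "cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat"
    "gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa"
).replace(" ", "")


def subseq(seq, start, end):
    """Return seq[start..end] for 1-based, inclusive feature coordinates."""
    return seq[start - 1:end]


# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate a lowercase DNA string codon by codon; a trailing partial codon is ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


rbs = subseq(SEQ, 95, 100)          # ribosome binding site feature
cds_start = subseq(SEQ, 109, 120)   # first 12 nt of the CDS feature

print(rbs)                   # aggagg
print(translate(cds_start))  # MTYE
```

The translated codons agree with the start of the /translation qualifier in the record (MTYE...).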

Examples of typical feature table elements

* to represent a coding sequence that is constructed from a range of exons:
CDS join(1886..1922,2272..2319,3563..3675,4750..4878)

* to represent a coding sequence on the complementary strand of DNA:
CDS complement(1159..2577)
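As an illustration of how such location strings can be read by a program, here is a small Python sketch. It handles only the two forms shown on this slide, join(...) and complement(...); real feature tables allow more operators, and the function names `parse_location` and `total_length` are hypothetical.

```python
import re


def parse_location(location):
    """Parse a simple feature location into (strand, [(start, end), ...]).

    Only plain ranges, join(...) and complement(...) are handled in this sketch.
    """
    strand = "+"
    text = location.strip()
    if text.startswith("complement(") and text.endswith(")"):
        strand = "-"
        text = text[len("complement("):-1]
    if text.startswith("join(") and text.endswith(")"):
        text = text[len("join("):-1]
    segments = [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", text)]
    return strand, segments


def total_length(segments):
    """Sum of segment lengths; coordinates are 1-based and inclusive."""
    return sum(end - start + 1 for start, end in segments)


strand, exons = parse_location("join(1886..1922,2272..2319,3563..3675,4750..4878)")
print(strand, len(exons), total_length(exons))   # + 4 327

strand2, segs2 = parse_location("complement(1159..2577)")
print(strand2, total_length(segs2))              # - 1419
```

Both totals are multiples of three (109 and 473 codons), as expected for coding sequences.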

Common sequence formats

1. EMBL release format
2. Genbank (ASN.1)
3. FASTA format :

>X12345 Y098TR gene
CGTATCTTACGAGCTACTACGAGGTCTTATCGGACGAGCGACT...
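A FASTA file is simple enough to parse by hand: a line starting with ">" opens a record, and all following lines up to the next ">" are sequence. The Python sketch below is one minimal way to do it; `parse_fasta` is an illustrative name, the first record reuses only the visible part of the truncated example above, and the second record is invented for the demonstration.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


EXAMPLE = """\
>X12345 Y098TR gene
CGTATCTTACGAGCTACTACGAGG
TCTTATCGGACGAGCGACT
>demo2 hypothetical second record
ACGT
"""

records = parse_fasta(EXAMPLE)
for header, seq in records:
    print(header, len(seq))
```

Sequence lines are concatenated, so the line width used in the file does not matter.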

Two major types of DNA / nucleotide / base sequences found in databases such as GenBank and EMBL

* Genomic, arising from sequencing of DNA material isolated from cells

* ESTs, arising from projects to determine which mRNAs are produced in a certain organism or in a certain type of cell within a multicellular organism.

DNA

mRNA

EST (Expressed Sequence Tag)

Expressed Sequence Tags (ESTs) correspond to partial mRNA sequences; they are cDNA sequences that have been reverse-transcribed from mRNA

Short sequences (~500-1000 bases); each is the result of a single sequencing experiment -> high frequency of errors

Applications:

1) Used to answer questions like: Which genes are expressed in a specific cell or tissue?

2) Identification of coding regions in genomic sequences

3) Discovery of new genes

Redundancy at GenBank => RefSeq

Many sequences are represented more than once in GenBank

2003 RefSeq collection: curated secondary database
non-redundant
selected organisms

• Genome DNA (assemblies)
• Transcripts (RNA)
• Protein

RefSeq vs GenBank

RefSeq                                                | GenBank
Curated                                               | Not curated
NCBI creates from existing data                       | Author submits
NCBI revises as new data emerge                       | Only author can revise
Single records for each molecule of major organisms   | Multiple records from same loci common
                                                      | Records can contradict each other
Limited to model organisms                            | No limit to species included
Exclusive NCBI database                               | Data exchange among INSDC members
Akin to review articles                               | Akin to primary literature
Proteins and transcripts identified and linked        | Proteins identified and linked
Access via Nucleotide and Protein db                  | Access via NCBI Nucleotide db

http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook

Trace Archive
2001 NCBI and EMBL/ENSEMBL

Purpose: collect raw data at sequencing centers worldwide
PERMANENT repository of single-pass reads

Traces: pieces of a puzzle
between 300 and 1,000 nucleotides

Vital in the hunt for polymorphisms in gene sequences
linked to disease (human DNA)
linked to virulence (viral DNA)

dbSNP : detailed info > 25 million SNPs

Insights into the impact of genetic variation on health

http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?

2009: 2 x 10^9 single-pass reads

First genomes to be sequenced

1995, TIGR (www.tigr.org)

Haemophilus influenzae 1.83 MB
Mycoplasma genitalium 0.58 MB

Genome projects

Sequenced eukaryotic genomes

Organism                          MB          Genes
Bacteria                          0.6 - 7.5   500-7,000
S. cerevisiae                     12          6,000
S. pombe                          13          6,000
Worm, Caenorhabditis elegans      97          20,000
Fly, Drosophila melanogaster      120         14,000
Plant, Arabidopsis thaliana       110         26,000
Fish, Fugu rubripes               365         22,000
Mus musculus                      3000        24,000
H. sapiens                        3200        23,000

Why are genome sequences and comparative genomics useful?

• Many non-human organisms are important model systems

• Comparative genomics useful in gene identification, identification of regulatory elements etc.

• Evolution of genes, proteins and organisms

Variation between individuals

2007
Craig Venter

2008
James Watson
Cancer patient, normal and cancer tissue
Yoruba, Ibadan, Nigeria
Han Chinese

SNPs ~3 x 10^6

Insertion/deletion polymorphisms 10^5-10^6

Structural variants/copy number variation 10^3-10^4

Variation between individuals

Flow of genetic information

DNA

RNA transcript

splicing

mature mRNA

protein

protein structure -> biological function

56,000 protein structures

8,000,000 protein sequences

100 x 10^6 sequences corresponding to partial mRNAs

~ 250 x 10^9 nt

The SWISS-PROT Protein Sequence Data Bank (www.ebi.ac.uk) is a database of protein sequences produced collaboratively by Amos Bairoch (University of Geneva) and the EBI. It contains high-quality annotation, is non-redundant, and is cross-referenced to many other databases.

SWISS-PROT is accompanied by TrEMBL, a computer-annotated supplement to SWISS-PROT. TrEMBL contains the translations of all coding sequences (CDS) present in the EMBL Nucleotide Sequence Database not yet integrated into SWISS-PROT.

Uniprot : Swissprot + TrEMBL

Sequence entries in Feb 2009
Uniprot 7,568,118
Swissprot 410,518
TrEMBL 7,157,600

Genbank NCBI protein db 24,133,189

Protein sequence databases

ID   PRIO_HUMAN     STANDARD;      PRT;   253 AA.
AC   P04156;
DT   01-NOV-1986 (REL. 03, CREATED)
DT   01-NOV-1986 (REL. 03, LAST SEQUENCE UPDATE)
DT   01-NOV-1997 (REL. 35, LAST ANNOTATION UPDATE)
DE   MAJOR PRION PROTEIN PRECURSOR (PRP) (PRP27-30) (PRP33-35C) (ASCR).
GN   PRNP.
OS   HOMO SAPIENS (HUMAN).
OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA; MAMMALIA;
OC   EUTHERIA; PRIMATES.
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 86300093.
RA   KRETZSCHMAR H.A., STOWRING L.E., WESTAWAY D., STUBBLEBINE W.H.,
RA   PRUSINER S.B., DEARMOND S.J.;
RL   DNA 5:315-324(1986).
RN   [2]
RP   SEQUENCE OF 8-253 FROM N.A.
RX   MEDLINE; 86261778.
RA   LIAO Y.-C.J., LEBO R.V., CLAWSON G.A., SMUCKLER E.A.;
RL   SCIENCE 233:364-367(1986).
RN   [3]
RP   VARIANT AMYLOID GSS, SEQUENCE OF 58-85 AND 111-150.
RX   MEDLINE; 91160504.
RA   TAGLIAVINI F., PRELLI F., GHISO J., BUGIANI O., SERBAN D.,
RA   PRUSINER S.B., FARLOW M.R., GHETTI B., FRANGIONE B.;
RL   EMBO J. 10:513-519(1991).
RN   [4]
RP   REVIEW ON VARIANTS.
RX   MEDLINE; 93372867.
RA   PALMER M.S., COLLINGE J.;
RL   HUM. MUTAT. 2:168-173(1993).
CC   -!- FUNCTION: THE FUNCTION OF PRP IS NOT KNOWN. PRP IS ENCODED IN THE
CC       HOST GENOME AND IS EXPRESSED BOTH IN NORMAL AND INFECTED CELLS.
CC   -!- SUBUNIT: PRP HAS A TENDENCY TO AGGREGATE YIELDING POLYMERS CALLED
CC       "RODS".
CC   -!- SUBCELLULAR LOCATION: ATTACHED TO THE MEMBRANE BY A GPI-ANCHOR.
CC   -!- DISEASE: PRP IS FOUND IN HIGH QUANTITY IN THE BRAIN OF HUMANS AND
CC       ANIMALS INFECTED WITH NEURODEGENERATIVE DISEASES KNOWN AS
CC       TRANSMISSIBLE SPONGIFORM ENCEPHALOPATHIES OR PRION DISEASES, LIKE:
CC       CREUTZFELDT-JACOB DISEASE (CJD), GERSTMANN-STRAUSSLER SYNDROME
CC       (GSS), FATAL FAMILIAL INSOMNIA (FFI) AND KURU IN HUMANS; SCRAPIE
CC       IN SHEEP AND GOAT; BOVINE SPONGIFORM ENCEPHALOPATHY (BSE) IN
CC       CATTLE; TRANSMISSIBLE MINK ENCEPHALOPATHY (TME); CHRONIC WASTING
CC       DISEASE (CWD) OF MULE DEER AND ELK; FELINE SPONGIFORM
CC       ENCEPHALOPATHY (FSE) IN CATS AND EXOTIC UNGULATE ENCEPHALOPATHY
CC       (EUE) IN NYALA AND GREATER KUDU. THE PRION DISEASES ILLUSTRATE
CC       THREE MANIFESTATIONS OF CNS DEGENERATION: (1) INFECTIOUS (2)
CC       SPORADIC AND (3) DOMINANTLY INHERITED FORMS. TME, CWD, BSE, FSE,
CC       EUE ARE ALL THOUGHT TO OCCUR AFTER CONSUMPTION OF PRION-INFECTED
CC       FOODSTUFFS.
CC   -!- DISEASE: CJD OCCURS PRIMARILY AS A SPORADIC DISORDER (1 PER
CC       MILLION), WHILE 10-15% ARE FAMILIAL. ACCIDENTAL TRANSMISSION OF
CC       CJD TO HUMANS APPEARS TO BE IATROGENIC (CONTAMINATED HUMAN GROWTH
CC       HORMONE (HGH), CORNEAL TRANSPLANTATION, ELECTROENCEPHALOGRAPHIC
CC       ELECTRODE IMPLANTATION. . .). EPIDEMIOLOGIC STUDIES HAVE FAILED TO
CC       IMPLICATE THE INGESTION OF INFECTED ANNIMAL MEAT IN THE
CC       PATHOGENESIS OF CJD IN HUMAN. THE TRIAD OF MICROSCOPIC FEATURES
CC       THAT CHARACTERIZE THE PRION DISEASES CONSISTS OF (1) SPONGIFORM
CC       DEGENERATION OF NEURONS, (2) SEVERE ASTROCYTIC GLIOSIS THAT OFTEN
CC       APPEARS TO BE OUT OF PROPORTION TO THE DEGREE OF NERF CELL LOSS,
CC       AND (3) AMYLOID PLAQUE FORMATION. CJD IS CHARACTERIZED BY
CC       PROGRESSIVE DEMENTIA AND MYOCLONIC SEIZURES, AFFECTING ADULTS IN
CC       MID-LIFE. SOME PATIENTS PRESENT SLEEP DISORDERS, ABNORMALITIES OF
CC       HIGH CORTICAL FUNCTION, CEREBELLAR AND CORTICOSPINAL DISTURBANCES.
CC       THE DISEASE ENDS IN DEATH AFTER A 3-12 MONTHS ILLNESS.
CC   -!- DISEASE: GSS IS A HETEROGENEOUS DISORDER AND WAS DEFINED AS A
CC       "SPINOCEREBELLAR ATAXIA WITH DEMENTIA AND PLAQUELIKE DEPOSITS".
CC       GSS INCIDENCE IS LESS THAN 2 PER 100 MILLION.
CC   -!- DISEASE: KURU IS TRANSMITTED DURING RITUALISTIC CANNIBALISM, AMONG
CC       NATIVES OF THE NEW GUINEA HIGHLANDS. PATIENTS EXHIBIT VARIOUS
CC       MOVEMENT DISORDERS LIKE CEREBELLAR ABNORMALITIES, RIGIDITY OF THE
CC       LIMBS, AND CLONUS. EMOTIONNAL LABILITY IS PRESENT, AND DEMENTIA IS
CC       CONSPICUOUSLY ABSENT. DEATH USUALLY OCCURS FROM 3 TO 12 MONTH
CC       AFTER ONSET.
CC   -!- SIMILARITY: TO OTHER PRP.
CC   -!- DATABASE: NAME=HotMolecBase; NOTE=PrP entry;
CC       WWW="http://bioinformatics.weizmann.ac.il/hotmolecbase/entries/prp.htm".

Protein sequence databases

FT   SIGNAL        1     22
FT   CHAIN        23    230       MAJOR PRION PROTEIN.
FT   PROPEP      231    253       REMOVED IN MATURE FORM (BY SIMILARITY).
FT   LIPID       230    230       GPI-ANCHOR (BY SIMILARITY).
FT   CARBOHYD    181    181       PROBABLE.
FT   CARBOHYD    197    197       PROBABLE.
FT   DISULFID    179    214       BY SIMILARITY.
FT   DOMAIN       51     91       5 X 8 AA TANDEM REPEATS OF P-H-G-G-G-W-G-
FT                                Q.
FT   REPEAT       51     59       1.
FT   REPEAT       60     67       2.
FT   REPEAT       68     75       3.
FT   REPEAT       76     83       4.
FT   REPEAT       84     91       5.
FT   VARIANT     102    102       P -> L (IN GSS).
FT   VARIANT     105    105       P -> L (IN GSS).
FT   VARIANT     117    117       A -> V (LINKED TO DEVELOPMENT OF
FT                                DEMENTING GSS).
FT   VARIANT     129    129       M -> V (DETERMINES THE DISEASE PHENOTYPE
FT                                IN PATIENTS WHO HAVE A PRP MUTATION AT
FT                                CODON 178: PATIENTS WITH MET DEVELOP FFI,
FT                                THOSE WITH VAL DEVELOP CJD).
FT   VARIANT     178    178       D -> N (IN FFI AND CJD).
FT   VARIANT     180    180       V -> I (IN CJD).
FT   VARIANT     198    198       F -> S (IN A ATYPICAL FORM OF GSS WITH
FT                                NEUROFIBRILLARY TANGLES).
FT   VARIANT     200    200       E -> K (IN CJD).
FT   VARIANT     210    210       V -> I (IN CJD).
FT   VARIANT     217    217       Q -> R (IN GSS WITH NEUROFIBRILLARY
FT                                TANGLES).
FT   VARIANT     232    232       M -> R (IN CJD).
FT   CONFLICT    118    118       MISSING (IN REF. 2).
SQ   SEQUENCE   253 AA;  27661 MW;  FD5373AD CRC32;
     MANLGCWMLV LFVATWSDLG LCKKRPKPGG WNTGGSRYPG QGSPGGNRYP PQGGGGWGQP
     HGGGWGQPHG GGWGQPHGGG WGQPHGGGWG QGGGTHSQWN KPSKPKTNMK HMAGAAAAGA
     VVGGLGGYML GSAMSRPIIH FGSDYEDRYY RENMHRYPNQ VYYRPMDEYS NQNNFVHDCV
     NITIKQHTVT TTTKGENFTE TDVKMMERVV EQMCITQYER ESQAYYQRGS SMVLFSSPPV
     ILLISFLIFL IVG

//
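The VARIANT lines of the PRIO_HUMAN entry give a 1-based position together with the reference and variant amino acid, so they can be checked directly against the sequence. The Python sketch below is an illustration only: the FT lines are abbreviated (continuation lines and some descriptions dropped), and the names `VARIANT_RE`, `check_variants` and `apply_variant` are hypothetical.

```python
import re

# Sequence of the PRIO_HUMAN entry shown above (spaces removed below).
SEQ = (
    "MANLGCWMLV LFVATWSDLG LCKKRPKPGG WNTGGSRYPG QGSPGGNRYP PQGGGGWGQP"
    "HGGGWGQPHG GGWGQPHGGG WGQPHGGGWG QGGGTHSQWN KPSKPKTNMK HMAGAAAAGA"
    "VVGGLGGYML GSAMSRPIIH FGSDYEDRYY RENMHRYPNQ VYYRPMDEYS NQNNFVHDCV"
    "NITIKQHTVT TTTKGENFTE TDVKMMERVV EQMCITQYER ESQAYYQRGS SMVLFSSPPV"
    "ILLISFLIFL IVG"
).replace(" ", "")

# VARIANT lines from the feature table, abbreviated for this sketch.
FT_LINES = """\
FT   VARIANT     102    102       P -> L (IN GSS).
FT   VARIANT     105    105       P -> L (IN GSS).
FT   VARIANT     117    117       A -> V
FT   VARIANT     129    129       M -> V
FT   VARIANT     178    178       D -> N (IN FFI AND CJD).
FT   VARIANT     180    180       V -> I (IN CJD).
FT   VARIANT     198    198       F -> S
FT   VARIANT     200    200       E -> K (IN CJD).
FT   VARIANT     210    210       V -> I (IN CJD).
FT   VARIANT     217    217       Q -> R
FT   VARIANT     232    232       M -> R (IN CJD).
"""

VARIANT_RE = re.compile(r"^FT\s+VARIANT\s+(\d+)\s+(\d+)\s+(\w) -> (\w)", re.M)


def check_variants(seq, ft_text):
    """Return (position, ref, alt, ref_matches_sequence) for each VARIANT line."""
    results = []
    for start, _end, ref, alt in VARIANT_RE.findall(ft_text):
        pos = int(start)
        results.append((pos, ref, alt, seq[pos - 1] == ref))
    return results


def apply_variant(seq, pos, alt):
    """Return a copy of seq with the residue at 1-based position pos replaced."""
    return seq[:pos - 1] + alt + seq[pos:]


variants = check_variants(SEQ, FT_LINES)
for pos, ref, alt, ok in variants:
    print(f"{ref}{pos}{alt}", "ok" if ok else "MISMATCH")

d178n = apply_variant(SEQ, 178, "N")
```

Every reference residue matches the sequence, which is a quick sanity check that the coordinates are being read as 1-based.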

Protein sequence databases

Protein sequence databases can be accessed through:

• Uniprot (www.ebi.uniprot.org/)

• Entrez

UniProt - record

Three-dimensional structures of proteins, DNA and RNA are collected in the Protein Data Bank (PDB)

They are all the result of experimental work

* X-ray crystallography
* NMR

Example of PDB entry

HEADER    HORMONE                                 30-OCT-92   1BPH      1BPH   2
COMPND    INSULIN (CUBIC) IN 0.1M SODIUM SALT SOLUTION AT PH9           1BPH   3
SOURCE    BOVINE (BOS $TAURUS) PANCREAS                                 1BPH   4
AUTHOR    O.GURSKY,J.BADGER,Y.LI,D.L.D.CASPAR                           1BPH   5
REVDAT   2   31-OCT-93 1BPHA   1       REMARK HET    FORMUL             1BPHA  1
REVDAT   1   15-JAN-93 1BPH    0                                        1BPH   6
JRNL        AUTH   O.GURSKY,J.BADGER,Y.LI,D.L.D.CASPAR                  1BPH   7
JRNL        TITL   CONFORMATIONAL CHANGES IN CUBIC INSULIN CRYSTALS     1BPH   8
JRNL        TITL 2 IN THE PH RANGE 7-11                                 1BPH   9
JRNL        REF    BIOPHYS.J.                    V.  63  1210 1992      1BPH  10
JRNL        REFN   ASTM BIOJAU  US ISSN 0006-3495                  030  1BPH  11
REMARK   1                                                              1BPH  12
REMARK   1 REFERENCE 1

ATOM      1  N   GLY A   1      13.994  47.196  31.798  1.00 35.87      1BPH 129
ATOM      2  CA  GLY A   1      14.277  46.226  30.708  1.00 38.67      1BPH 130
ATOM      3  C   GLY A   1      15.574  45.507  31.085  1.00 31.18      1BPH 131
ATOM      4  O   GLY A   1      16.078  45.660  32.217  1.00 22.60      1BPH 132
ATOM      5  N   ILE A   2      16.088  44.766  30.126  1.00 28.39      1BPH 133
ATOM      6  CA  ILE A   2      17.342  44.034  30.404  1.00 23.76      1BPH 134
ATOM      7  C   ILE A   2      18.526  44.939  30.686  1.00 25.29      1BPH 135
ATOM      8  O   ILE A   2      19.425  44.457  31.392  1.00 18.74      1BPH 136
ATOM      9  CB  ILE A   2      17.571  43.072  29.158  1.00 27.36      1BPH 137
ATOM     10  CG1 ILE A   2      18.638  42.049  29.605  1.00 18.03      1BPH 138
ATOM     11  CG2 ILE A   2      17.859  43.936  27.903  1.00 25.54      1BPH 139
ATOM     12  CD1 ILE A   2      18.914  40.930  28.590  1.00 17.07      1BPH 140
ATOM     13  N   VAL A   3      18.619  46.195  30.192  1.00 24.42      1BPH 141
ATOM     14  CA  VAL A   3      19.774  47.080  30.436  1.00 30.26      1BPH 142
ATOM     15  C   VAL A   3      19.952  47.453  31.895  1.00 19.08      1BPH 143
ATOM     16  O   VAL A   3      21.018  47.421  32.561  1.00 28.15      1BPH 144
ATOM     17  CB  VAL A   3      19.719  48.274  29.462  1.00 33.87      1BPH 145
ATOM     18  CG1 VAL A   3      20.847  49.225  29.754  1.00 30.40      1BPH 146
ATOM     19  CG2 VAL A   3      19.868  47.724  28.044  1.00 24.51
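Each ATOM line carries the atom name, residue, chain, residue number and x, y, z coordinates in ångström, so simple geometry can be computed straight from the entry. The Python sketch below measures the distance between the alpha carbons (CA) of residues 1 and 2 using four of the lines above. Splitting on whitespace is a simplifying assumption that works for these lines; real PDB files are fixed-column and a production parser should slice by column. The function name `parse_atom` is illustrative.

```python
import math

# Four ATOM lines from the 1BPH entry shown above (trailing record IDs omitted).
ATOM_LINES = """\
ATOM      1  N   GLY A   1      13.994  47.196  31.798  1.00 35.87
ATOM      2  CA  GLY A   1      14.277  46.226  30.708  1.00 38.67
ATOM      5  N   ILE A   2      16.088  44.766  30.126  1.00 28.39
ATOM      6  CA  ILE A   2      17.342  44.034  30.404  1.00 23.76
"""


def parse_atom(line):
    """Parse one ATOM line by whitespace (assumption: no fused columns)."""
    f = line.split()
    return {
        "serial": int(f[1]),
        "name": f[2],
        "residue": f[3],
        "chain": f[4],
        "resnum": int(f[5]),
        "xyz": (float(f[6]), float(f[7]), float(f[8])),
    }


atoms = [parse_atom(line) for line in ATOM_LINES.splitlines() if line.startswith("ATOM")]

# Index the alpha carbons by residue number.
ca = {a["resnum"]: a["xyz"] for a in atoms if a["name"] == "CA"}

distance = math.dist(ca[1], ca[2])   # Euclidean distance, Python 3.8+
print(f"CA(1)-CA(2) distance: {distance:.2f} A")   # about 3.78
```

A value close to 3.8 Å is what is generally expected for consecutive alpha carbons in a polypeptide chain.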

3D viewers

Several free programs for viewing protein and nucleic acid 3D structures:

Cn3D www.ncbi.nlm.nih.gov/Entrez

UCSF Chimera www.cgl.ucsf.edu/chimera/

DS Visualizer www.accelrys.com/products/downloads/ds_visualizer/

Rasmol & Protein explorer www.umass.edu/microbio/rasmol/

Chime www.umass.edu/microbio/chime/getchime.htm

DS Visualizer

* Entrez

* Genome browsers-Santa Cruz

Accessing molecular biology information

NCBI is the most heavily used site in biomedicine.

NCBI Web Traffic – 1997-2006 [chart: vertical axis from 100,000 to 700,000; horizontal axis from January 1998 to January 2006]

722,000 Unique IPs a Day

91 Million Web Hits a Day

3200 Peak Web Hits a Second

1.5 Terabytes FTP a Day

1.8 Million Unique Users a Day

Title

Added title words "gene" and "complete"

26 exons

NCBI Cn3D viewer

OMIM - Online Mendelian Inheritance in Man
database of human genes and genetic disorders

NCBI - Taxonomy browser

NCBI - Taxonomy browser

Accessing molecular biology information

* Entrez

* Genome browsers-Santa Cruz

Santa Cruz browser - genome.ucsc.edu - chromosome 18

Zoom in on a particular gene/locus

Example : beta-globin

A subunit of hemoglobin. Hemoglobin is composed of 2 alpha- and 2 beta-subunits

Chromosome 11 - HBB locus (text search : “beta globin”)

Beta-globin gene

Zoomed in on HBB

Protein coding region

untranslated region

intron

Arrows show polarity

HBB beta
HBD delta (minor form of hemoglobin)
HBG1 A-gamma (fetal hemoglobin)
HBG2 G-gamma (fetal hemoglobin)
HBE1 epsilon (embryonic hemoglobin)

Configuration of ‘tracks’

Chromosome 11 - HBB locus (text search : “beta globin”)

LINEs
SINEs

Comparative analysis - similarity to other species

Beta-globin genes

Finding a region of interest in the genome

* Text search (“beta globin”)

* BLAT search (based on sequence similarity)

Jim Kent

BLAT with the HBB amino acid sequence

One of the BLAT hits seems to be a pseudogene

Podocin - a kidney specific protein

Examples of information available at UCSC browser

Protein genes
Pseudogenes
Repetitive elements - SINEs/LINEs
CpG islands
Variation between individuals - SNPs
Gene expression data

ENSEMBL www.ensembl.org

ENSEMBL www.ensembl.org
Chromosome 18

Sequencing methods

1977 Walter Gilbert – A. Maxam (chemical modification)

Sequencing by enzymatic synthesis
1975 F. Sanger (chain termination)
1984 Ligation based (SOLiD, Applied Biosystems)
1988 Pyrosequencing (454, Roche)
1994 Reversible dye terminators (Solexa, Illumina)

454 – pyrosequencing (Roche)
Detects the activity of DNA polymerase with a chemiluminescent enzyme while the complementary strand is synthesized.

Schematic representation of the progress of the enzyme reaction in solid-phase pyrosequencing

Ronaghi M Genome Res. 2001;11:3-11

©2001 by Cold Spring Harbor Laboratory Press

Pyrogram of the raw data obtained from liquid-phase pyrosequencing

Ronaghi M Genome Res. 2001;11:3-11

©2001 by Cold Spring Harbor Laboratory Press
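To make the idea of a pyrogram concrete, the following Python sketch simulates an idealized one. Nucleotides are dispensed one at a time in a fixed cyclic order, and the light signal for each dispensation is taken to be proportional to the number of nucleotides incorporated, so a run of identical bases gives one taller peak. The dispensation order "TACG", the noise-free signals and the function names are assumptions made for this illustration, not details taken from the slides.

```python
def flowgram(read, flow_order="TACG", max_flows=100):
    """Simulate ideal pyrosequencing signals for the strand being synthesized.

    Returns a list of (nucleotide dispensed, number of bases incorporated).
    """
    signals = []
    pos = 0
    flow = 0
    while pos < len(read) and flow < max_flows:
        nt = flow_order[flow % len(flow_order)]
        run = 0
        # The polymerase extends as long as the next base matches the dispensed nucleotide.
        while pos < len(read) and read[pos] == nt:
            run += 1
            pos += 1
        signals.append((nt, run))
        flow += 1
    return signals


def call_bases(signals):
    """Turn peak heights back into a base sequence."""
    return "".join(nt * height for nt, height in signals)


signals = flowgram("GGATTC")
print(signals)
print(call_bases(signals))   # GGATTC
```

In real data the peak height for long homopolymer runs becomes hard to tell apart, which is why such runs are a typical source of error for this method.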

Solexa (Illumina) reversible terminator sequencing

ss DNA
Enzymatically synthesize its complementary strand
Detect fluorescence of one nucleotide at a time
Remove the blocking group (reversible terminator)
Polymerization of another nucleotide

[Figure: template GCAGCTATTACGGCTATCTGACCGTCGATAAT extended one base at a time with terminator dNTPs (G, T, A, C)]

Sequencing by ligation (SOLiD - Applied Biosystems)

The method:

It is based on sequential ligation of dye-labeled oligonucleotide probes, whereby each probe queries two base positions at a time

DNA ligase rather than polymerase

The system uses 4 fluorescent dyes to encode the 16 possible two-base combinations

Multiple cycles of probe hybridization, ligation, imaging and analysis are performed

The resulting product is then removed

The process is repeated for 5 more extension rounds with primers hybridized to position n-1, n-2, etc. in the adaptor.

http://www3.appliedbiosystems.com/AB_Home/applicationstechnologies/SOLiDSystemSequencing/index.htm

2-base color encoding data

1 dye = 4 possible dinucleotides

2 bases are interrogated in each ligation reaction providing increased specificity

Primer round 1

Primer round 2

Total of 5 primer rounds

Each sequence is interrogated twice in different reactions; this improves the signal-to-noise ratio

Decoding

Color space

Possible dinucleotides

Base zero Decoded sequence

Base space sequence
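The 2-base encoding and its decoding can be sketched in a few lines of Python. The colour table used here is the commonly described SOLiD scheme (colour 0: AA, CC, GG, TT; 1: AC, CA, GT, TG; 2: AG, GA, CT, TC; 3: AT, TA, CG, GC), which is an assumption since the slides do not reproduce the table; it is equivalent to an XOR of 2-bit base codes. The function names are illustrative.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"


def encode(seq):
    """Return (base zero, colours): one colour per pair of adjacent bases."""
    colors = [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]
    return seq[0], colors


def decode(base_zero, colors):
    """Rebuild the base-space sequence from the known first base and the colours."""
    bases = [base_zero]
    for color in colors:
        bases.append(BASE[CODE[bases[-1]] ^ color])
    return "".join(bases)


base_zero, colors = encode("ATGGCA")
print(base_zero, colors)           # A [3, 1, 0, 3, 1]
print(decode(base_zero, colors))   # ATGGCA

# Each of the 4 colours stands for exactly 4 of the 16 dinucleotides.
per_color = {}
for a in BASE:
    for b in BASE:
        per_color.setdefault(CODE[a] ^ CODE[b], []).append(a + b)
print(per_color)
```

Decoding needs the known base zero because a colour alone is ambiguous: every colour is shared by four dinucleotides, and only the preceding base resolves which one is meant.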