Bioinformatics 2, 1st lecture - BME Szerves Kémia …

2019.10.01. Bioinformatics 2

Bioinformatics 2 − 1st lecture

Prof. László Poppe

BME Department of Organic Chemistry and Technology

Bioinformatics – proteomics

Lecture and computer room practice


Bioinformatics – What is it?

Bioinformatics:

In a broad sense: storage, analysis and interpretation of biological information with the aid of computational methods.

In a more restricted sense: handling, analyzing and interpreting biological sequence and structure data.



Bioinformatics

More detailed definition - Oxford English Dictionary:

(Molecular) bio – informatics: bioinformatics is conceptualising biology in terms of molecules (in the sense of physical chemistry) and applying "informatics techniques" (derived from disciplines such as applied maths, computer science and statistics) to understand and organise the information associated with these molecules, on a large scale. In short, bioinformatics is a management information system for molecular biology and has many practical applications.



Bioinformatics – Computational biology

More detailed definitions - Definition Committee, National Institute of Mental Health

Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

Computational Biology: The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.


Bioinformatics 2 – Summary of themes

Definition and history of bioinformatics

Sequence analysis – nucleotide and protein sequences and their relationships, pairwise and multiple alignment, phylogenetic analysis

Prediction of the secondary structure from sequence

Domain analysis, function prediction from sequence

Relationship between genetics and structure data and molecular function or role in metabolism

Aspects of protein structure data. Role of various interactions in maintaining structure. Structural classes of proteins.

Methods of protein structure determination (protein crystallography, NMR)

Methods for modeling 3D structure of proteins. Basic methods of homology modeling: template based and ab initio methods

Interaction of proteins with small molecules and biological macromolecules, dynamics of proteins

Databases related to proteomics (nucleotide and protein sequence databases, structure databases, function related databases)

Bioinformatics programs and program systems for proteomics applications (stand-alone and Web-based applications)

Practical applications of bioinformatics


Bioinformatics – Early beginnings

1951 – Pauling & Corey: structure of the alpha-helix and beta-sheet

1953 – Watson & Crick: double helix structure of DNA (based on the X-ray data of Franklin & Wilkins).

1954 – Perutz's group: heavy atom method for solving the phase problem in protein crystallography

1955 – F. Sanger: the first protein sequence (bovine insulin).

1958 – J. Kilby (Texas Instr.): First integrated circuit / ARPA (Advanced Research Projects Agency, USA) is established

1962 – Pauling’s theory on molecular evolution

1965 – Margaret Dayhoff: Atlas of Protein Sequence and Structure

1969 – ARPANET: connection of computers at Stanford, UCSB, UCLA and the University of Utah

1970 – Needleman-Wunsch algorithm: sequence comparison.

1971 – Ray Tomlinson (BBN): e-mail

1972 – Paul Berg’s group: the first recombinant DNA molecule

1973 – Brookhaven Protein DataBank (PDB) announcement / Robert Metcalfe (Harvard University) - Ethernet.

1974 – Vint Cerf & Robert Kahn: the "internet" and Transmission Control Protocol (TCP).

1975 – Microsoft Co. (Bill Gates & Paul Allen) / 2D electrophoresis (P. H. O'Farrell)

1976 – Unix-To-Unix Copy Protocol (UUCP) - Bell Labs / Southern Blot technique (E. M. Southern).

1977 – Brookhaven PDB full description / DNA sequencing (A. Maxam, W. Gilbert & F. Sanger) and software (Staden)

1978 – The Usenet connection (T. Truscott, J. Ellis & S. Bellovin).

http://www.dayhoff.cc/


Bioinformatics – Beginnings

1980 – Sequence of a full genome (ΦX174 - 5386 base pairs / 9 proteins) / Multidimensional NMR for protein structure determination (Kumar, A.; Ernst, R.R.; Wüthrich, K.).

1981 – Smith-Waterman algorithm (sequence alignment) / IBM - Personal Computer (PC) / Sequence motif concept (Doolittle)

1982 – Genetics Computer Group (GCG) - Wisconsin Suite molecular biology tools / GenBank Release 3 / Lambda phage genome sequence

1983 – Compact Disc (CD) / Sequence database searching algorithm (Wilbur-Lipman) / DNA clone (cosmid) libraries / PCR (Polymerase Chain Reaction): DNA analysis is enabled

1984 – Jon Postel: Domain Name System (DNS) / Macintosh (Apple Computer)

1985 – FASTP/FASTN algorithm / The idea of the Human Genome Initiative is born

1986 – Human Genome Initiative is established / "Genomics" term / SWISS-PROT database is established

1987 – Yeast artificial chromosome (YAC) / E. coli mapping / PERL (Practical Extraction and Report Language) / NIH NIGMS – beginnings of genome projects

1988 – National Center for Biotechnology Information (NCBI) is established / EMBnet network for database distribution / Human Genome Initiative starts / FASTA algorithm (Pearson and Lipman)

1989 – Oxford Molecular Group, Ltd. (OMG) is established (Anaconda, Asp, Cameleon – molecular modeling, drug design, protein design).


Bioinformatics – Near past

1990 – BLAST program (Altschul et al.) / (M. Levitt, C. Lee): Look & SegMod (molecular modeling and protein design) / HGP plan - US Congress (start of a 15 year long project)

1991 – Geneva (CERN) - World Wide Web / Expressed sequence tags (ESTs) / Human chromosome map database (GDB) is established

1992 – Human genome – Low resolution genetic map

1993 – IMAGE consortium – co-ordinated cDNA gene sequencing and mapping / LBNL - novel transposon-aided chromosome sequencing / GRAIL Internet based sequence-interpretation service (ORNL)

1994 – Netscape Co (Navigator) / PRINTS database: protein motifs (Attwood & Beck) / EMBL European Bioinformatics Institute / Second-generation DNA clone libraries of all human chromosomes

1995 – The first bacterial (Haemophilus influenzae) genome sequence (1.8 Mb) (Fleischmann et al.) / Sequencing the smallest bacterium (Mycoplasma genitalium) – the least number of genes for independent life

1996 – The genome of baker’s yeast (Saccharomyces cerevisiae, 12.1 Mb) / Prosite database (Bairoch et al.) / Affymetrix – the first commercial DNA chip

1997 – The E. coli (4.7 Mbp) genome / National Human Genome Research Institute (NHGRI)

1998 – Swiss Institute of Bioinformatics is established

1999 – The first full sequence of a human chromosome


Bioinformatics – Recent past

2000 – Bacterial (Pseudomonas aeruginosa, 6.3 Mbp), plant (A. thaliana, 100 Mb) and insect (Drosophila melanogaster, 180 Mbp) genome sequences / Sequencing of further human chromosomes

2001 – Human genome (3000 Mbp) is published / Full sequencing of several human chromosomes according to the high level standards of the Human Genome Project

2002 – Structural Bioinformatics and GeneFormatics are unified / Mouse Genome Sequencing Consortium – shotgun sequence of the mouse genome

2003 – The Human Genome Project is finished

2004 – Rat Genome Sequencing Consortium – genome of brown rat (Rattus norvegicus)

2008 – Start of the 1000 Genomes Project – The aim of the project is to sequence the genomes of a large number of people, to provide a comprehensive resource on human genetic variation. This marks the start of „Personalised Medicine”

2013 – The Nobel Prize in Chemistry 2013 (M. Karplus, M. Levitt, A. Warshel) „for the development of multiscale models for complex chemical systems”

2016 – 1000 Genomes Project: more than 30,000 x coverage of the human genome. There is currently no bioinformatics tool for searching the full mass of data.


Amino acid          3-letter   1-letter

Neutral-Apolar
Glycine             Gly        G
L-Alanine           Ala        A
L-Valine            Val        V
L-Isoleucine        Ile        I
L-Leucine           Leu        L
L-Phenylalanine     Phe        F
L-Proline           Pro        P
L-Methionine        Met        M

Neutral-Polar
L-Serine            Ser        S
L-Threonine         Thr        T
L-Tyrosine          Tyr        Y
L-Tryptophan        Trp        W
L-Asparagine        Asn        N
L-Glutamine         Gln        Q
L-Cysteine          Cys        C

Acidic
L-Aspartic acid     Asp        D
L-Glutamic acid     Glu        E

Basic
L-Lysine            Lys        K
L-Arginine          Arg        R
L-Histidine         His        H
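The three-letter and one-letter codes above are what sequence files and programs actually use. As a minimal sketch, the table can be held as a lookup dictionary. The names `THREE_TO_ONE` and `to_one_letter` are illustrative assumptions, not part of any library.

```python
# Three-letter to one-letter codes, copied from the table above.
THREE_TO_ONE = {
    # neutral-apolar
    "Gly": "G", "Ala": "A", "Val": "V", "Ile": "I",
    "Leu": "L", "Phe": "F", "Pro": "P", "Met": "M",
    # neutral-polar
    "Ser": "S", "Thr": "T", "Tyr": "Y", "Trp": "W",
    "Asn": "N", "Gln": "Q", "Cys": "C",
    # acidic
    "Asp": "D", "Glu": "E",
    # basic
    "Lys": "K", "Arg": "R", "His": "H",
}


def to_one_letter(three_letter_seq: str, sep: str = "-") -> str:
    """Convert e.g. 'Met-Asn-Lys' to 'MNK' (hypothetical helper)."""
    residues = three_letter_seq.split(sep)
    # capitalize() normalises 'MET' or 'met' to 'Met' before the lookup
    return "".join(THREE_TO_ONE[res.capitalize()] for res in residues)


print(to_one_letter("Met-Asn-Lys-Lys-Glu-Trp"))
```

An unknown residue name raises a `KeyError`, which is usually preferable to silently skipping it.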


Proteins: Structure - Folding

Protein structures – organization levels

Primary structure („folding-free” state - sequence)

Secondary structure (stable local conformations: α-helices, β-sheets)

Tertiary structure (global chain conformation: domains, subunits)

Quaternary structure (multiple chain conformations)

Intra- and intermolecular disulfide bonds

DNA → amino acid sequence → „folding”

Primary structure

MNKKEWEEKYVKPLLERSPERKKEFKTSSGIVVDRLYTPEDVEIDYENKL

GYPGVYPFTRGVYPTMYRGRLWTMRQYAGFGTAEETNRRYRYLLEQGQTG

LSVAFDLPTQIGYDSDHPMALGEVGKVGVAIDTIEDMEILFNGIPLGKVS

TSMTINSTCAQILSMYVAVAEKQGVERANLRGTVQNDMLKEYIARGTYIF

PPEPSLRLATDIIMFCAKEMPKWNSISISGYHMEEAGATPVQEVAFTLAD

GITYVEKVIERGMDVDSFAPRLSFFFAAGNNFLEEIAKFRAARRLWARIM

KERFNAKNPRSMMLRFHVQTAGCTLTAQQPENNIVRVALQALAAVLGGCQ

SLHTNSFDEALCLPTEKAVRIALRTQQIIAEESGVADVVDPLGGSYYIEW

LTDRIEEEAMKYIEKIDEMGGMIKAIESGYVQREIQKSAYEKQKAIDEGE

ITVVGVNKYQIEEEIQIELLRVDKAVVEKQIRRLQEFRKNRDAKKVEEAL

RLRKAAEKEDENLMPYVLDAVKARATLGEMTDALRDVFGEFRAPEIF

(i.e. the amino acid sequence)

Proteins: Primary and secondary structure

Secondary structure


Proteins: Tertiary and quaternary structure

Tertiary and quaternary structure: active conformation

Enzymes (catalytic proteins):

positive catalysis (fit of the substrate to the catalytic parts of the active site)

negative catalysis (protection of the substrate, biological ”protecting group”)

Tertiary structure Quaternary structure


Proteins: The active site of enzymes

The active site of carboxypeptidase A: (a) schematic representation of the active site; (b) the active site of the protein with the Cbz-Gly-Phe substrate (as it is assumed to occupy the active site).


Processes following protein expression: Protein folding and degradation


The genetic code: Structure of DNA


Gene expression: The central dogma


Gene expression: Transcription


The standard genetic code: Redundancy in the protein – DNA direction
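The redundancy runs in one direction only: a codon specifies one amino acid, but most amino acids are specified by several codons, so a protein sequence does not determine a unique DNA sequence. A minimal sketch that counts how many DNA sequences could encode a given peptide under the standard genetic code follows. The names `CODONS_PER_RESIDUE` and `count_coding_sequences` are illustrative, and stop codons are ignored.

```python
from math import prod

# Number of codons per amino acid in the standard genetic code
# (61 sense codons in total; the 3 stop codons are not counted here).
CODONS_PER_RESIDUE = {
    "L": 6, "S": 6, "R": 6,
    "A": 4, "G": 4, "P": 4, "T": 4, "V": 4,
    "I": 3,
    "C": 2, "D": 2, "E": 2, "F": 2, "H": 2,
    "K": 2, "N": 2, "Q": 2, "Y": 2,
    "M": 1, "W": 1,
}


def count_coding_sequences(peptide: str) -> int:
    """Number of distinct DNA sequences that translate to `peptide`."""
    return prod(CODONS_PER_RESIDUE[aa] for aa in peptide.upper())


# First four residues of the example sequence used in this lecture: M, N, K, K
print(count_coding_sequences("MNKK"))
```

The count grows multiplicatively with length, which is why back-translation from protein to DNA is ambiguous for anything longer than a few residues.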


Gene expression: The transcription – translation processes


Gene expression: The translation process


Gene expression: Reading frames – Importance of start-stop codons
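To illustrate why the reading frame and the start and stop codons matter, the sketch below scans the three forward frames of a DNA string for stretches that begin with ATG and end at the first in-frame stop codon (TAA, TAG or TGA in the standard code). It is a simplified illustration under stated assumptions: the function name `find_orfs` is hypothetical, and the reverse strand is not examined.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(dna: str, min_codons: int = 1):
    """Return (frame, start, end) tuples for ATG...stop stretches.

    Positions are 0-based, `end` is exclusive and includes the stop codon.
    Only the three forward reading frames are scanned.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # step through the sequence codon by codon in this frame
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((frame, start, i + 3))
                start = None
    return orfs


# The same bases read in a different frame give different codons:
print(find_orfs("CCATGAAATAGATGCCCTGA"))
```

In the example string both open reading frames lie in frame 2. Read in frame 0, the bases at positions 3 to 5 form a stop codon (TGA) instead of being part of the start codon.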


Gene multiplication: In vitro multiplication of DNA by PCR

The PCR cycle:

1. DNA melting (~90 °C) to single strand DNA

2. Replication of the complementary strand (synthesis at ~70 °C by a thermostable polymerase)

Required: primers (large excess) and dNTPs (deoxynucleotide triphosphates)

Result: exponential multiplication
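The exponential multiplication can be made concrete with a small calculation: if every cycle doubles each template, n cycles give 2^n copies per starting molecule. The sketch below adds an `efficiency` parameter (fraction of templates copied per cycle), which is an assumption introduced here for illustration and is not given in the lecture.

```python
def pcr_copies(initial_copies: int, cycles: int, efficiency: float = 1.0) -> float:
    """Idealised copy number after `cycles` PCR cycles.

    efficiency = 1.0 means perfect doubling in every cycle;
    lower values model incomplete replication (illustrative assumption).
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles


for n in (10, 20, 30):
    print(n, pcr_copies(1, n))
```

With perfect doubling, 30 cycles turn a single molecule into about 10^9 copies, which is why a large excess of primers and dNTPs is needed.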


Gene multiplication: Cloning and multiplication of genes

[Figure labels: desired gene; cleaved plasmid; plasmid with the desired gene; wild type host cell; recombinant cell]


Aims of bioinformatics

Creation and maintenance of databases. Organization of the data layout so that researchers can easily retrieve and extend the existing information.

Development of methods and procedures for analyzing data. The data are useless without analysis.

Application of the developed tools and methods for data analysis and biological interpretation of the results.


Types of biomolecular information and bioinformatics methods

Source of data: Crude DNA sequences
Size of data: ~201 million sequences, 235 billion bases (gene) [GenBank]; [+488 million sequences, 2,165 billion bases (WGS: whole genome shotgun)]; date: 09. 2017.
Bioinformatics topics:
· Coding and non-coding regions
· Introns and exons
· Gene product predictions
· Forensic analysis

Source of data: Protein sequences
Size of data: 89.9 million sequences (UniProtKB) (~300 amino acids in each) (0.56 million Swiss-Prot + 89.4 million TrEMBL); date: 09. 2017.
Bioinformatics topics:
· Sequence alignment
· Multiple alignments
· Conserved sequence motifs

Source of data: Macromolecular structures (DNA, RNA, protein)
Size of data: ~133 thousand structures (~1000 atomic coordinates in each) (RCSB PDB); date: 09. 2017.
Bioinformatics topics:
· 3D structure alignments
· Protein geometry
· Surface, volume, shape calculation
· Intermolecular interactions
· Molecular simulations (energy, molecular motions, docking)


Types of biomolecular information and bioinformatics methods (continued)

Source of data: Genomes
Size of data: ~25 300 full genomes (ca. 1.6×10^6 - 3×10^9 bases in each) (NCBI Genome); [~193 000 published raw genomes] (NCBI WGS); date: 09. 2017.
Bioinformatics topics:
· Repetitions
· Structure - gene relationships
· Phylogenetic analysis
· Genome sized projects (e.g. metabolic pathways)
· Disease - gene relationships

Source of data: Gene expression data
Size of data: ~88 000 gene expression datasets (NCBI GEO); date: 09. 2017. (one of the largest: ca. 20 time points for the ca. 6000 genes of yeast)
Bioinformatics topics:
· Expression pattern correlations
· Relationship of expression with structural and biochemical data

Source of data: Other (literature, metabolic pathways)
Size of data: ~27 million articles (Medline); ~45 million references (CAplus); 518 metabolic maps with ~533 000 references (KEGG); date: 09. 2017.
Bioinformatics topics:
· Electronic libraries / automatic literature surveys
· Knowledge bases
· Reaction pathway simulations


Data Grouping based on similarities

repetitive sequences in the genomes

genes can be grouped according to function (eg. enzyme activity or metabolic pathways)

different proteins often have similar sequences

the number of basic structures of proteins is limited (according to estimates: at most 10,000)

Based on real biological similarities, much of the information can be sorted into groups

Biological systems consist of a finite number of component parts


Pattern recognition and prediction

The two basic operations of bioinformatics are pattern recognition and prediction

Pattern recognition: finding similarities

Search for a conserved feature which is characteristic of a certain function / structure, based on proteins with already known similar function / structure

Use of the recognised feature to identify the function / structure of novel sequences

Condition: the novel sequence should belong to a protein with a certain degree of similarity to already known proteins

Prediction:

Prediction of function or structure: based on similarity or by ab initio methods

The basic wish of bioinformatics – structure predicted from sequence
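Pattern recognition in its simplest form is a search for a short conserved motif in a new sequence. The sketch below converts a simplified PROSITE-style pattern into a regular expression and reports every occurrence, overlapping ones included. The helper names are hypothetical, and only a subset of the PROSITE syntax is handled: single letters, `x`, `[...]`, `{...}` and `(n)` or `(n,m)` repeats. The example pattern `N-{P}-[ST]-{P}` is the N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simplified PROSITE-style pattern into a Python regex."""
    parts = []
    for element in pattern.split("-"):
        # {P} means "any residue except P"
        element = element.replace("{", "[^").replace("}", "]")
        # x means "any residue"
        element = element.replace("x", ".")
        # (n) or (n,m) are repeat counts
        element = element.replace("(", "{").replace(")", "}")
        parts.append(element)
    return "".join(parts)


def find_motif(sequence: str, pattern: str):
    """Return (1-based position, matched text) for every occurrence."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(find_motif("AANGSAANPSA", "N-{P}-[ST]-{P}"))
```

A match only shows that the sequence contains the pattern. Whether the site is functional still depends on the condition stated above, that the protein resembles ones in which the feature is known to matter.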


MNKKEWEEKYVKPLLERSPERKKEFKTSSGIVVDRLYTPEDVEIDYENKL

GYPGVYPFTRGVYPTMYRGRLWTMRQYAGFGTAEETNRRYRYLLEQGQTG

LSVAFDLPTQIGYDSDHPMALGEVGKVGVAIDTIEDMEILFNGIPLGKVS

TSMTINSTCAQILSMYVAVAEKQGVERANLRGTVQNDMLKEYIARGTYIF

PPEPSLRLATDIIMFCAKEMPKWNSISISGYHMEEAGATPVQEVAFTLAD

GITYVEKVIERGMDVDSFAPRLSFFFAAGNNFLEEIAKFRAARRLWARIM

KERFNAKNPRSMMLRFHVQTAGCTLTAQQPENNIVRVALQALAAVLGGCQ

SLHTNSFDEALCLPTEKAVRIALRTQQIIAEESGVADVVDPLGGSYYIEW

LTDRIEEEAMKYIEKIDEMGGMIKAIESGYVQREIQKSAYEKQKAIDEGE

ITVVGVNKYQIEEEIQIELLRVDKAVVEKQIRRLQEFRKNRDAKKVEEAL

RLRKAAEKEDENLMPYVLDAVKARATLGEMTDALRDVFGEFRAPEIF

Problems of structure prediction from sequence

Folding: the amino acid sequence determines the spatial structure, but we still do not understand how

Only the secondary structure can be predicted, and only with limited reliability

This is expected to remain so in the near future


Differences in 2D – 3D data

The gap between the number of known protein sequences and the number of known protein structures increases with time

Large information deficit – bioinformatics may play an important role

Ca. 2000 times more sequences than 3D structures

Ca. 2×10^7 known sequences but only ca. 100,000 unique 3D structures

The gap is continuously increasing (genome programs) [almost 1 novel sequence / second but only about 10 novel structures / day]
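The two rates quoted above use different time units, so the size of the mismatch is easier to see after converting both to a per-day basis. A short worked calculation follows, using only the approximate figures from the slide. The function name is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_growth(sequences_per_second: float, structures_per_day: float):
    """Return (new sequences per day, sequences per new structure)."""
    sequences_per_day = sequences_per_second * SECONDS_PER_DAY
    return sequences_per_day, sequences_per_day / structures_per_day


# "almost 1 novel sequence / second" versus "about 10 novel structures / day"
per_day, ratio = daily_growth(1, 10)
print(per_day, ratio)
```

At these rates several thousand new sequences arrive for each newly determined structure, so the gap widens every day.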


Genome projects

Genome sequencing: „BAC to BAC” sequencing

„whole genome shotgun” sequencing

Completed genomes (ca. 25 000 full genomes); running genome projects (~85 000):

• Yeast

• Caenorhabditis elegans (worm)

• Drosophila melanogaster (common fruit fly)

• Arabidopsis thaliana (mouse-ear cress)

• Human

Completed genome projects:

• ca. 14 600 prokaryotes

• ca. 2 500 eukaryotes

• ca. 200 archaea

• ca. 7 400 viruses and phages

GOLD: https://gold.jgi.doe.gov/


Sequence analysis

The most important bioinformatics method: search for new sequences belonging to

proteins of unknown structure / function which are similar to sequences of proteins with

known structure / function.

Sequence alignment

Sequence identity: percentage of identical amino acid pairs in the aligned sequences

As sequence identity decreases, the transferability of function / structure decreases
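Sequence identity as defined here can be computed directly from two already aligned sequences. The sketch below counts identical pairs over the aligned columns in which neither sequence has a gap. The choice of denominator is an assumption, since conventions differ (some tools divide by the full alignment length or by the shorter sequence), and the function name is illustrative.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of identical residue pairs in two aligned sequences.

    Columns containing a gap in either sequence are skipped
    (one of several conventions in use).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


print(percent_identity("MNKKEW", "MNRKEW"))   # one mismatch in six columns
print(percent_identity("MNK-EW", "MNKKEW"))   # the gap column is ignored
```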


Sequence alignment


Importance of the degree of sequence matching

Degree of sequence matching:

<30 %: inadequate models

30-60 %: adequate models, with uncertain regions

>60 %: high quality models, with less than 1 Å average deviation of Cα from experimental structures

Fibroblast growth factor model (based on the rat keratinocyte growth factor structure, 40% identity) compared to the experimental structure (X-ray data).

M. J. Forster: Micron 2002, 33, 365-384.
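The identity ranges listed above act as a rule of thumb for deciding whether a template is worth modeling on. They can be written as a small lookup, sketched below. The function name and the handling of the exact boundary values (30 % and 60 % are assigned to the middle class) are assumptions made for illustration.

```python
def expected_model_quality(identity_percent: float) -> str:
    """Map template/target sequence identity to the quality class above."""
    if not 0.0 <= identity_percent <= 100.0:
        raise ValueError("identity must be between 0 and 100")
    if identity_percent < 30.0:
        return "inadequate"
    if identity_percent <= 60.0:
        return "adequate, with uncertain regions"
    return "high quality"


# The fibroblast growth factor example was built at 40 % identity.
print(expected_model_quality(40))
```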


Two types of homology

Orthology:

Orthologs are genes that are related by vertical descent from a common ancestor and encode proteins with the same function in different species.

Example: carboxylesterases (i.e. their genes) in humans and pigs.

Paralogy:

Paralogs are homologous genes that have evolved by duplication and code for proteins with similar, but not identical functions.

Example: the enzymes of histidine biosynthesis (their genes) in humans (they are very similar in structure, but catalyze different reactions).



• Structure: conserved regions vs. variable regions

• Alignment methods

• Structure refinement – (over-refinement)

• Evaluation of the structure quality (PROCHECK / WhatIf ...)

The structural fit (threading) problem

A high degree of structural similarity can be observed at low sequence matching

Comparison: structures of C-phycocyanin (1CPC) and myoglobin of sea hare (2FAL)

M. J. Forster: Micron 2002, 33, 365-384.


Bioinformatics Websites

EMBL (EMBL-EBI, etc.): http://www.ebi.ac.uk/

NCBI (Medline, GenBank, etc.): http://www.ncbi.nlm.nih.gov/

Expasy (UniProtKB/Swiss-Prot, etc.): http://www.expasy.ch/


Bioinformatics databases

http://www.oxfordjournals.org/our_journals/nar/database/c/

2017 NAR Database Summary Paper Category List

Nucleotide Sequence Databases

RNA sequence databases

Protein sequence databases

Structure Databases

Genomics Databases (non-vertebrate)

Metabolic and Signaling Pathways

Human and other Vertebrate Genomes

Human Genes and Diseases

Microarray Data and other Gene Expression Databases

Proteomics Resources

Other Molecular Biology Databases

Organelle databases

Plant databases

Immunological databases

Cell biology


http://www.oxfordjournals.org/nar/database/cat/1














