
2019.10.01. Bioinformatics 2

Bioinformatics 2 − 1st lecture

Prof. László Poppe

BME Department of Organic Chemistry and Technology

Bioinformatics – proteomics

Lecture and computer room practice

Bioinformatics – What is it?

Bioinformatics:

In a broad sense: the storage, analysis and interpretation of biological information with the aid of computational methods.

In a more restricted sense: handling, analyzing and interpreting biological sequence and structure data.

Bioinformatics – What is it?

More detailed definition (Oxford English Dictionary):

(Molecular) bioinformatics: bioinformatics is conceptualising biology in terms of molecules (in the sense of physical chemistry) and applying "informatics techniques" (derived from disciplines such as applied maths, computer science and statistics) to understand and organise the information associated with these molecules, on a large scale. In short, bioinformatics is a management information system for molecular biology and has many practical applications.

Bioinformatics – What is it?

Bioinformatics – Computational biology

More detailed definitions (Definition Committee, National Institute of Mental Health):

Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

Computational Biology: The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.

Bioinformatics 2 – Summary of themes

• Definition and history of bioinformatics
• Sequence analysis – nucleotide and protein sequences and their relationships, pairwise and multiple alignment, phylogenetic analysis
• Prediction of secondary structure from sequence
• Domain analysis, function prediction from sequence
• Relationship between genetic and structural data and molecular function or role in metabolism
• Aspects of protein structure data. Role of various interactions in maintaining structure. Structural classes of proteins.
• Methods of protein structure determination (protein crystallography, NMR)
• Methods for modeling the 3D structure of proteins. Basic methods of homology modeling: template-based and ab initio methods
• Interaction of proteins with small molecules and biological macromolecules, dynamics of proteins
• Databases related to proteomics (nucleotide and protein sequence databases, structure databases, function-related databases)
• Bioinformatics programs and program systems for proteomics applications (standalone and Web-based applications)
• Practical applications of bioinformatics

Bioinformatics – Early beginnings

1951 – Pauling & Corey: structure of the α-helix and the β-sheet

1953 – Watson & Crick: double-helix structure of DNA (based on the X-ray data of Franklin & Wilkins)

1954 – Perutz's group: heavy atom method for solving the phase problem in protein crystallography

1955 – F. Sanger: the first protein sequence (primary structure of bovine insulin)

1958 – J. Kilby (Texas Instruments): first integrated circuit / ARPA (Advanced Research Projects Agency, USA) is established

1962 – Pauling's theory of molecular evolution

1965 – Margaret Dayhoff: Atlas of Protein Sequence and Structure

1969 – ARPANET: connection of the computers of UCLA, Stanford (SRI), UCSB and the University of Utah

1970 – Needleman-Wunsch algorithm: sequence comparison

1971 – Ray Tomlinson (BBN): e-mail

1972 – Paul Berg's group: the first recombinant DNA molecule

1973 – Brookhaven Protein Data Bank (PDB) announcement / Robert Metcalfe (Harvard University): Ethernet

1974 – Vint Cerf & Robert Kahn: the "internet" and the Transmission Control Protocol (TCP)

1975 – Microsoft Co. (Bill Gates & Paul Allen) / 2D electrophoresis (P. H. O'Farrell)

1976 – Unix-To-Unix Copy Protocol (UUCP), Bell Labs / Southern blot technique (E. M. Southern)

1977 – Brookhaven PDB full description / DNA sequencing (A. Maxam, W. Gilbert & F. Sanger) and software (Staden)

1978 – The Usenet connection (T. Truscott, J. Ellis & S. Bellovin)

Bioinformatics – Beginnings

1980 – Complete genome sequence (ΦX174: 5386 base pairs / 9 proteins) / Multidimensional NMR for protein structure determination (Kumar, A.; Ernst, R. R.; Wüthrich, K.)

1981 – Smith-Waterman algorithm (sequence alignment) / IBM Personal Computer (PC) / Sequence motif concept (Doolittle)

1982 – Genetics Computer Group (GCG): Wisconsin Suite of molecular biology tools / GenBank Release 3 / Lambda phage genome sequence

1983 – Compact Disc (CD) / Sequence database searching algorithm (Wilbur-Lipman) / DNA clone (cosmid) libraries / PCR (Polymerase Chain Reaction): DNA analysis is enabled

1984 – Jon Postel: Domain Name System (DNS) / Macintosh (Apple Computer)

1985 – FASTP/FASTN algorithm / The idea of the Human Genome Initiative is born

1986 – Human Genome Initiative is established / The term "genomics" / SWISS-PROT database is established

1987 – Yeast artificial chromosome (YAC) / E. coli mapping / PERL (Practical Extraction and Report Language) / NIH NIGMS: beginnings of genome projects

1988 – National Center for Biotechnology Information (NCBI) is established / EMBnet network for database distribution / Human Genome Initiative starts / FASTA algorithm (Pearson and Lipman)

1989 – Oxford Molecular Group, Ltd. (OMG) is established (Anaconda, Asp, Cameleon – molecular modeling, drug design, protein design)

Bioinformatics – Near past

1990 – BLAST program (Altschul et al.) / M. Levitt, C. Lee: Look & SegMod (molecular modeling and protein design) / HGP plan, US Congress (start of a 15-year project)

1991 – Geneva (CERN): World Wide Web / Expressed sequence tags (ESTs) / Human chromosome map database (GDB) is established

1992 – Human genome: low-resolution genetic map

1993 – IMAGE consortium: co-ordinated cDNA gene sequencing and mapping / LBNL: novel transposon-aided chromosome sequencing / GRAIL: Internet-based sequence interpretation service (ORNL)

1994 – Netscape Co. (Navigator) / PRINTS database: protein motifs (Attwood & Beck) / EMBL European Bioinformatics Institute / Second-generation DNA clone libraries of all human chromosomes

1995 – The first bacterial genome sequence (Haemophilus influenzae, 1.8 Mb; Fleischmann et al.) / Sequencing of the smallest bacterium (Mycoplasma genitalium): the least number of genes for independent life

1996 – The genome of baker's yeast (Saccharomyces cerevisiae, 12.1 Mb) / Prosite database (Bairoch et al.) / Affymetrix: the first commercial DNA chip

1997 – The E. coli genome (4.7 Mbp) / National Human Genome Research Institute (NHGRI)

1998 – Swiss Institute of Bioinformatics is established

1999 – The first full sequence of a human chromosome

Bioinformatics – Recent past

2000 – Bacterial (Pseudomonas aeruginosa, 6.3 Mbp), plant (A. thaliana, 100 Mb) and insect (Drosophila melanogaster, 180 Mbp) genome sequences / Sequencing of further human chromosomes

2001 – Human genome (3000 Mbp) is published / Full sequencing of several human chromosomes according to the high-level standards of the Human Genome Project

2002 – Structural Bioinformatics and GeneFormatics merge / Mouse Genome Sequencing Consortium: shotgun sequence of the mouse genome

2003 – The Human Genome Project is finished

2004 – Rat Genome Sequencing Consortium: genome of the brown rat (Rattus norvegicus)

2008 – Start of the 1000 Genomes Project. The aim of the project is to sequence the genomes of a large number of people, to provide a comprehensive resource on human genetic variation. This marks the start of "Personalised Medicine".

2013 – The Nobel Prize in Chemistry 2013 (M. Karplus, M. Levitt, A. Warshel) "for the development of multiscale models for complex chemical systems"

2016 – 1000 Genomes Project: more than 30,000× coverage of the human genome. There is currently no bioinformatics tool for searching the full body of data.

Amino acid          3-letter   1-letter

Neutral-Apolar
Glycine             Gly        G
L-Alanine           Ala        A
L-Valine            Val        V
L-Isoleucine        Ile        I
L-Leucine           Leu        L
L-Phenylalanine     Phe        F
L-Proline           Pro        P
L-Methionine        Met        M

Neutral-Polar
L-Serine            Ser        S
L-Threonine         Thr        T
L-Tyrosine          Tyr        Y
L-Tryptophan        Trp        W
L-Asparagine        Asn        N
L-Glutamine         Gln        Q
L-Cysteine          Cys        C

Acidic
L-Aspartic acid     Asp        D
L-Glutamic acid     Glu        E

Basic
L-Lysine            Lys        K
L-Arginine          Arg        R
L-Histidine         His        H
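
The three-letter and one-letter codes listed above can be kept in a small lookup table. The following is a minimal Python sketch; the names `THREE_TO_ONE` and `to_one_letter` are illustrative choices and do not come from the lecture.

```python
# Three-letter -> one-letter codes of the 20 amino acids listed above,
# grouped as on the slide.
THREE_TO_ONE = {
    # neutral-apolar
    "Gly": "G", "Ala": "A", "Val": "V", "Ile": "I",
    "Leu": "L", "Phe": "F", "Pro": "P", "Met": "M",
    # neutral-polar
    "Ser": "S", "Thr": "T", "Tyr": "Y", "Trp": "W",
    "Asn": "N", "Gln": "Q", "Cys": "C",
    # acidic
    "Asp": "D", "Glu": "E",
    # basic
    "Lys": "K", "Arg": "R", "His": "H",
}


def to_one_letter(residues):
    """Convert a list of three-letter codes into a one-letter sequence string."""
    return "".join(THREE_TO_ONE[r.capitalize()] for r in residues)


print(to_one_letter(["Met", "Asn", "Lys", "Lys", "Glu", "Trp"]))  # prints MNKKEW
```

An unknown code raises a `KeyError`, which is probably the desired behaviour for a strict converter.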

Proteins: Structure – Folding

Protein structures – organization levels

• Primary structure ("folding-free" state: the sequence)
• Secondary structure (stable local conformations: α-helices, β-sheets)
• Tertiary structure (global chain conformation: domains, subunits)
• Quaternary structure (multiple chain conformations)
• Intra- and intermolecular disulfide bonds

Proteins: Primary and secondary structure

DNA → amino acid sequence → "folding"

Primary structure (i.e. the amino acid sequence):

MNKKEWEEKYVKPLLERSPERKKEFKTSSGIVVDRLYTPEDVEIDYENKL
GYPGVYPFTRGVYPTMYRGRLWTMRQYAGFGTAEETNRRYRYLLEQGQTG
LSVAFDLPTQIGYDSDHPMALGEVGKVGVAIDTIEDMEILFNGIPLGKVS
TSMTINSTCAQILSMYVAVAEKQGVERANLRGTVQNDMLKEYIARGTYIF
PPEPSLRLATDIIMFCAKEMPKWNSISISGYHMEEAGATPVQEVAFTLAD
GITYVEKVIERGMDVDSFAPRLSFFFAAGNNFLEEIAKFRAARRLWARIM
KERFNAKNPRSMMLRFHVQTAGCTLTAQQPENNIVRVALQALAAVLGGCQ
SLHTNSFDEALCLPTEKAVRIALRTQQIIAEESGVADVVDPLGGSYYIEW
LTDRIEEEAMKYIEKIDEMGGMIKAIESGYVQREIQKSAYEKQKAIDEGE
ITVVGVNKYQIEEEIQIELLRVDKAVVEKQIRRLQEFRKNRDAKKVEEAL
RLRKAAEKEDENLMPYVLDAVKARATLGEMTDALRDVFGEFRAPEIF

Secondary structure

Proteins: Tertiary and quaternary structure

Tertiary and quaternary structure → active conformation

Enzymes (catalytic proteins):

• positive catalysis (fit of the substrate to the catalytic parts of the active site)
• negative catalysis (protection of the substrate, biological "protecting group")

[Figure panels: tertiary structure; quaternary structure]

Proteins: The active site of enzymes

The active site of carboxypeptidase A: (a) schematic representation of the active site; (b) the active site of the protein with the Cbz-Gly-Phe substrate (in the position it is assumed to occupy).

Processes following protein expression: Protein folding and degradation

The genetic code: Structure of DNA

Gene expression: The central dogma

Gene expression: Transcription

The standard genetic code: Redundancy in the protein – DNA direction
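
The redundancy named in this slide title can be made concrete by counting how many codons encode each amino acid. The sketch below builds the standard genetic code (NCBI translation table 1) from its usual compact string form; the variable names are illustrative and not taken from the lecture.

```python
from collections import Counter
from itertools import product

# Standard genetic code (NCBI translation table 1). Codons are enumerated
# with the bases in the order T, C, A, G at each position; '*' is a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of codons per amino acid (or per stop signal).
degeneracy = Counter(CODON_TABLE.values())

for aa, n in sorted(degeneracy.items(), key=lambda kv: (-kv[1], kv[0])):
    print(aa, n)
```

Because most amino acids have more than one codon, a protein sequence cannot be translated back into a unique DNA sequence, which is the redundancy in the protein to DNA direction.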

Gene expression: The transcription – translation processes

Gene expression: The translation process

Gene expression: Reading frames – Importance of start-stop codons
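
Why reading frames and start/stop codons matter can be shown by translating one DNA string in its three forward frames. This is a minimal sketch that assumes the standard genetic code; the example sequence and the helper names `translate` and `first_orf` are hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1); '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, frame=0):
    """Translate a DNA coding strand starting at offset `frame` (0, 1 or 2)."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


def first_orf(dna, frame=0):
    """Return the first peptide running from a start (M) to a stop codon, or None."""
    protein = translate(dna, frame)
    start = protein.find("M")
    if start == -1:
        return None
    stop = protein.find("*", start)
    return protein[start:stop] if stop != -1 else None


dna = "CCATGGCCAAGTGATT"  # hypothetical example sequence
for frame in range(3):
    print(frame, translate(dna, frame), first_orf(dna, frame))
```

Only one of the three frames of this example contains a start codon followed by an in-frame stop codon; shifting the frame by a single base yields an entirely different peptide.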

Gene multiplication: In vitro multiplication of DNA by PCR

The PCR cycle:

1. DNA melting (~90 °C)
2. Replication of the complementary strand (synthesis at ~70 °C by a thermostable polymerase)

Primers (large excess) and dNTPs (deoxynucleotide triphosphates).

[Figure label: single-stranded DNA]

Result: exponential multiplication
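
The exponential multiplication can be put into numbers: with perfect doubling, n cycles turn N0 template molecules into N0 · 2^n copies. The sketch below adds an `efficiency` parameter (the fraction of strands copied per cycle), which is an assumption for illustration and is not mentioned in the lecture.

```python
def pcr_copies(initial_copies, cycles, efficiency=1.0):
    """Expected number of copies after `cycles` PCR cycles.

    efficiency=1.0 corresponds to perfect doubling in every cycle.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles


for n in (10, 20, 30):
    print(n, f"{pcr_copies(1, n):.3g}")
```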

Gene multiplication: Cloning and multiplication of genes

[Figure labels: desired gene; cleaved plasmid; plasmid with the desired gene; wild-type host cell; recombinant cell]

Aims of bioinformatics

• Creation and maintenance of databases. Organization of the data layout so that researchers can easily retrieve and extend the existing information.
• Development of methods and procedures for analyzing data. The data are useless without analysis.
• Application of the developed tools and methods for data analysis and biological interpretation of the results.

Types of biomolecular information and bioinformatics methods

Source of data: Raw DNA sequences
Size of data: ~201 million sequences, 235 billion bases (gene) [GenBank]; [+488 million sequences, 2,165 billion bases (WGS: whole genome shotgun)]; date: 09. 2017.
Bioinformatics topics:
· Coding and non-coding regions
· Introns and exons
· Gene product predictions
· Forensic analysis

Source of data: Protein sequences
Size of data: 89.9 million sequences (UniProtKB), ~300 amino acids in each (0.56 million Swiss-Prot + 89.4 million TrEMBL); date: 09. 2017.
Bioinformatics topics:
· Sequence alignment
· Multiple alignments
· Conserved sequence motifs

Source of data: Macromolecular structures (DNA, RNA, protein)
Size of data: ~133 thousand structures, ~1000 atomic coordinates in each (RCSB PDB); date: 09. 2017.
Bioinformatics topics:
· 3D structure alignments
· Protein geometry
· Surface, volume, shape calculation
· Intermolecular interactions
· Molecular simulations (energy, molecular motions, docking)

Types of biomolecular information and bioinformatics methods (continued)

Source of data: Genomes
Size of data: ~25 300 full genomes, ca. 1.6×10^6 to 3×10^9 bases in each (NCBI Genome); [~193 000 published raw genomes] (NCBI WGS); date: 09. 2017.
Bioinformatics topics:
· Repetitions
· Structure - gene relationships
· Phylogenetic analysis
· Genome-sized projects (e.g. metabolic pathways)
· Disease - gene relationships

Source of data: Gene expression data
Size of data: ~88 000 gene expression datasets (NCBI GEO); date: 09. 2017 (one of the largest: ca. 20 time points for the ca. 6000 genes of yeast).
Bioinformatics topics:
· Expression pattern correlations
· Relationship of expression with structural and biochemical data

Source of data: Other (literature, metabolic pathways)
Size of data: ~27 million articles (Medline); ~45 million references (CAplus); 518 metabolic maps with ~533 000 references (KEGG); date: 09. 2017.
Bioinformatics topics:
· Electronic libraries / automatic literature surveys
· Knowledge bases
· Reaction pathway simulations

Data grouping based on similarities

Biological systems consist of a finite number of component parts.

Based on real biological similarities, much of the information can be sorted into groups:

• repetitive sequences in the genomes
• genes can be grouped according to function (e.g. enzyme activity or metabolic pathways)
• different proteins often have similar sequences
• the number of basic protein structures is limited (according to estimates: at most 10,000)

Pattern recognition and prediction

The two basic operations of bioinformatics are pattern recognition and prediction.

Pattern recognition: finding similarities

• Search for a conserved feature that is characteristic of a certain function / structure, based on proteins with already known similar function / structure
• Use of the recognised feature to identify the function / structure of novel sequences
• Condition: the novel sequence must belong to a protein with a certain degree of similarity to already known proteins

Prediction:

• Prediction of function or structure: based on similarity or by ab initio methods
• The basic wish of bioinformatics: structure predicted from sequence
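
Pattern recognition in its simplest form is a search for a conserved sequence motif. The sketch below uses the PROSITE-style pattern N-{P}-[ST]-{P} (an N-glycosylation site) written as a regular expression. The choice of motif and the test sequence are illustrative assumptions and are not taken from the lecture.

```python
import re

# PROSITE-style pattern N-{P}-[ST]-{P}: Asn, any residue except Pro,
# Ser or Thr, any residue except Pro. The lookahead keeps the match
# one residue long so that overlapping sites are all reported.
N_GLYCOSYLATION = re.compile(r"N(?=[^P][ST][^P])")


def find_motif(sequence, pattern=N_GLYCOSYLATION):
    """Return the 1-based start positions of every motif occurrence."""
    return [m.start() + 1 for m in pattern.finditer(sequence.upper())]


# Hypothetical test sequence with three sites (two of them overlapping).
print(find_motif("MKNGSAANPTLNNTSW"))  # prints [3, 12, 13]
```

A match only says that the pattern is present; whether the site is functional in a given protein has to be judged from additional evidence.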

Problems of structure prediction from sequence

MNKKEWEEKYVKPLLERSPERKKEFKTSSGIVVDRLYTPEDVEIDYENKL
GYPGVYPFTRGVYPTMYRGRLWTMRQYAGFGTAEETNRRYRYLLEQGQTG
LSVAFDLPTQIGYDSDHPMALGEVGKVGVAIDTIEDMEILFNGIPLGKVS
TSMTINSTCAQILSMYVAVAEKQGVERANLRGTVQNDMLKEYIARGTYIF
PPEPSLRLATDIIMFCAKEMPKWNSISISGYHMEEAGATPVQEVAFTLAD
GITYVEKVIERGMDVDSFAPRLSFFFAAGNNFLEEIAKFRAARRLWARIM
KERFNAKNPRSMMLRFHVQTAGCTLTAQQPENNIVRVALQALAAVLGGCQ
SLHTNSFDEALCLPTEKAVRIALRTQQIIAEESGVADVVDPLGGSYYIEW
LTDRIEEEAMKYIEKIDEMGGMIKAIESGYVQREIQKSAYEKQKAIDEGE
ITVVGVNKYQIEEEIQIELLRVDKAVVEKQIRRLQEFRKNRDAKKVEEAL
RLRKAAEKEDENLMPYVLDAVKARATLGEMTDALRDVFGEFRAPEIF

• Folding: the amino acid sequence determines the spatial structure, but we still do not understand how
• Only the secondary structure can be predicted, and only with limited reliability
• This will remain so in the near future

Differences in 2D – 3D data

The gap between the number of known protein sequences and the number of known protein structures increases with time.

Large information deficit – bioinformatics may play an important role.

• Ca. 2000 times more sequences than 3D structures
• Ca. 2×10^7 known sequences but only ca. 100,000 unique 3D structures
• The gap is continuously increasing (genome programs) [almost 1 novel sequence / second but only about 10 novel structures / day]
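
The two growth rates quoted above can be compared on a common time base. This is a small arithmetic sketch that uses only the approximate rates given on the slide.

```python
SECONDS_PER_DAY = 24 * 60 * 60

sequences_per_day = 1 * SECONDS_PER_DAY  # about 1 novel sequence per second
structures_per_day = 10                  # about 10 novel structures per day

ratio = sequences_per_day / structures_per_day
print(sequences_per_day, ratio)  # prints 86400 8640.0
```

At these rates several thousand new sequences appear for every new experimental structure, which is why the gap keeps widening.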

Genome projects

Genome sequencing:

• "BAC to BAC" sequencing
• "whole genome shotgun" sequencing

Completed genomes (ca. 25 000 full genomes); running genome projects (~85 000):

• Yeast
• Caenorhabditis elegans (worm)
• Drosophila melanogaster (common fruit fly)
• Arabidopsis thaliana (mouse-ear cress)
• Human

Completed genome projects:

• ca. 14 600 prokaryotes
• ca. 2 500 eukaryotes
• ca. 200 archaea
• ca. 7 400 viruses and phages

GOLD: https://gold.jgi.doe.gov/

Sequence analysis

The most important bioinformatics method: searching for new sequences, belonging to proteins of unknown structure / function, that are similar to sequences of proteins with known structure / function.

Sequence alignment

Sequence identity: percentage of identical amino acid pairs in the aligned sequences

As sequence identity decreases, the transferability of function / structure decreases.
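
Sequence identity as defined above can be computed directly from two already aligned sequences. In this sketch, columns containing a gap are excluded from the count; that is one common convention and an assumption here, since the lecture does not specify the denominator. The two aligned strings are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residue pairs among the gap-free aligned columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


# Hypothetical pair of aligned sequences: 7 identical pairs in 9 gap-free columns.
print(round(percent_identity("MNKKEW-EEK", "MNRKEWAEDK"), 1))  # prints 77.8
```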


Sequence alignment
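
A global pairwise alignment can be computed with the Needleman-Wunsch dynamic programming algorithm. The following is a minimal sketch with a simple match/mismatch/linear-gap scoring scheme; the scoring values are arbitrary assumptions, and practical protein alignments typically use substitution matrices and more refined gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))  # prints (1, 'ACGT', 'A-GT')
```

The run time and memory grow with the product of the two sequence lengths, so database searches rely on faster heuristic methods.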

Importance of the degree of sequence matching

Degree of sequence matching:

<30 %: inadequate models
30-60 %: adequate models, with uncertain regions
>60 %: high quality models, with less than 1 Å average deviation of Cα from experimental structures

Fibroblast growth factor model (based on the rat keratinocyte growth factor structure, 40% identity) compared to the experimental structure (X-ray data).

M. J. Forster: Micron 2002, 33, 365-384.
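
The thresholds above translate into a simple rule of thumb for the expected quality of a homology model. This is a minimal sketch; the function name is hypothetical, and the cut-offs are the approximate values from the slide, not sharp boundaries.

```python
def expected_model_quality(identity_percent):
    """Map template-target sequence identity (%) to the quality classes above."""
    if not 0 <= identity_percent <= 100:
        raise ValueError("identity must be a percentage between 0 and 100")
    if identity_percent < 30:
        return "inadequate model"
    if identity_percent <= 60:
        return "adequate model with uncertain regions"
    return "high quality model"


print(expected_model_quality(40))  # prints adequate model with uncertain regions
```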

Two types of homology

Orthology:

Orthologs are genes that are related by vertical descent from a common ancestor and encode proteins with the same function in different species.

Example: carboxyl esterases (i.e. their genes) in humans and pigs.

Paralogy:

Paralogs are homologous genes that have evolved by duplication and code for proteins with similar, but not identical functions.

Example: the enzymes of histidine biosynthesis (their genes) in humans (they are very similar in structure, but catalyze different reactions).

The structural fit (threading) problem

• Structure: conserved regions vs. variable regions
• Alignment methods
• Structure refinement – (over-refinement)
• Evaluation of the structure quality (PROCHECK / WhatIf ...)

A high degree of structural similarity can be observed even at a low degree of sequence matching.

Comparison: structures of C-phycocyanin (1CPC) and sea hare myoglobin (2FAL).

M. J. Forster: Micron 2002, 33, 365-384.

Bioinformatics websites

• EMBL (EMBL-EBI, etc.)
• NCBI (Medline, GenBank, etc.)
• ExPASy (UniProtKB/Swiss-Prot, etc.)

Bioinformatics databases

http://www.oxfordjournals.org/our_journals/nar/database/c/

2017 NAR Database Summary Paper Category List

• Nucleotide Sequence Databases
• RNA sequence databases
• Protein sequence databases
• Structure Databases
• Genomics Databases (non-vertebrate)
• Metabolic and Signaling Pathways
• Human and other Vertebrate Genomes
• Human Genes and Diseases
• Microarray Data and other Gene Expression Databases
• Proteomics Resources
• Other Molecular Biology Databases
• Organelle databases
• Plant databases
• Immunological databases
• Cell biology