BIOINFORMATICS
Clinical Chemistry Diagnostics (Klinisk kemisk diagnostik) - 2017
What is bioinformatics?
In general, bioinformatics is the sum of the computational approaches used to analyze, manage, and store biological data. It includes the use of statistical techniques, applied mathematics, and the development of algorithms. Bioinformatics is used to analyze genomes and proteomes (protein sequences), to build three-dimensional models of biomolecules and biological systems, etc.
Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.
(definition Committee, National Institute of Mental Health)
Milestones in Bioinformatics
1955 – F. Sanger: first protein sequence (insulin)
1965 – M. Dayhoff (“mother of bioinformatics”): Atlas of protein sequences
1970 – Needleman-Wunsch algorithm for sequence comparison
1977 – DNA sequencing and software to analyze it (Staden)
1981 – Smith-Waterman algorithm for sequence alignment
1982 – Phage lambda genome sequenced
1982 – GenBank release 3 (public)
1986 – SWISS-PROT
1988 – FASTA algorithm
1988 – National Center for Biotechnology Information (NCBI)
1990 – BLAST: fast sequence similarity search
1994 – EMBL - European Bioinformatics Institute
1995 – First bacterial genome
1996 – Yeast genome
1999 – First human chromosome sequenced
2001 – Publication of the human genome
2003 – Human Genome Project completed
[Figure: Number of bioinformatics-related publications in PubMed per year, 1984 to 2013. x-axis: year; y-axis: number of articles in PubMed (0 to 18,000).]
The aims of bioinformatics
The primary goal is to increase understanding of biological processes.
• Development of new algorithms, statistical measures and computer programs for the evaluation of large datasets (e.g. methods to locate genes within a sequence or to predict protein structure from sequence).
• Implementation of the developed algorithms and programs in data evaluation and in the interpretation of the results.
• Construction and improvement of publicly available databases.
Types of biological information and bioinformatics methods
Origin | Size | Bioinformatics areas
DNA sequences | 175 million sequences, 180 billion bases | sequence alignment, genome assembly, gene prediction, genome annotation
Protein sequences | 45 million sequences | sequence alignment, identification of conserved sequence motifs
Macromolecular structures | 100 000 structures | 3D structure alignment and prediction, molecule modeling, interaction prediction
Genomes | 9000 genomes (178 eukaryotic) | phylogenetic analysis, genome-wide association studies, oncogenomics
Gene expression data | different time points/treatments for a number of genes of different organisms | expression pattern recognition, clustering, disease relations, correlation between gene and protein expression
Classification and homology
Based on similarity, a large part of the available information can be sorted into groups. This is the basis for several bioinformatics methods. Examples:
• Repetitive sequences in the genome
• Gene classification based on function
• Sequence similarity of different proteins
• Only a limited number of protein structures exist
Homolog: general term for genes or proteins that are evolutionarily related (can be either orthologs or paralogs).
Ortholog (ortho = exact): the homology is the result of speciation, i.e. the same gene in different organisms.
Paralog (para = in parallel): the homology is the result of a gene duplication, i.e. similar proteins, potentially within the same organism.
Bioinformatics areas
• Genomics
  • shotgun sequencing, sequence assembly
  • gene prediction
  • phylogenetic analysis
  • genome-wide association studies
• Gene expression analysis
• Proteomics
  • structure prediction
• Biological networks
Genomics I. – shotgun sequencing, sequence assembly
Shotgun sequencing is used for sequencing long DNA strands. The DNA is broken up randomly into numerous small segments, which are then sequenced. After several rounds of fragmentation and sequencing, computer programs use the overlapping ends of the different reads to assemble them into a continuous sequence.
Genome assembly is a difficult computational problem. It works by aligning all the pieces to one another and detecting every place where two of the short sequences, or reads, overlap. Overlapping reads are merged, and the process continues. Repeats (large numbers of identical sequences) in the genome make assembly more difficult. Shotgun sequencing was one of the technologies that made full genome sequencing possible.
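The overlap-and-merge idea described above can be illustrated with a minimal greedy sketch in Python. This is a toy under stated assumptions, not how production assemblers work: the reads are error-free, come from one strand, and the function names (`overlap`, `greedy_assemble`) and the `min_len` cutoff are hypothetical choices made for this example.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two reads with the largest suffix/prefix overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        # Compare every ordered pair of contigs and remember the best overlap.
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Three overlapping reads taken from the sequence ATGGCGTACGTTAGC
reads = ["ATGGCGTA", "GCGTACGT", "TACGTTAGC"]
print(greedy_assemble(reads))
```

With repeated sequences, several reads can overlap equally well at different places, and a greedy choice like this one may merge the wrong pair. This is one way to see why repeats make assembly difficult.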
Genomics II. – gene prediction
Gene prediction is the process of identifying the regions of genomic DNA that encode genes. This includes protein-coding genes, RNA genes and other functional elements such as regulatory regions. Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced.
Important steps:
• filtering out non-coding regions and repeats
• detection of functional sites (pattern recognition) such as initiation and termination sites
• detection of open reading frames
Methods:
• empirical methods
• ab initio methods
• combined methods
Gene prediction software:
• GLIMMER (for prokaryotes) (https://ccb.jhu.edu/software/glimmer/)
• GeneMark (http://exon.gatech.edu/GeneMark/)
• GENSCAN (http://genes.mit.edu/GENSCAN.html)
• Augustus (http://bioinf.uni-greifswald.de/augustus/)
• mGene (http://mgene.org)
• StarORF (http://star.mit.edu/orf/)
Ab initio gene prediction is an intrinsic method based on gene content and signal detection.
Prokaryotes
• genes have specific and well-understood promoter sequences
• the sequence coding for a protein occurs as one contiguous open reading frame (ORF) with a length of many hundreds or thousands of base pairs
• protein-coding DNA has certain periodicities and other statistical properties
Eukaryotes
• promoter and other regulatory signals are more complex and less well-understood (two classic examples are CpG islands and binding sites for a poly(A) tail)
• a particular protein-coding sequence is divided into several parts (exons), separated by non-coding sequences (introns) (splicing)
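Because a prokaryotic protein-coding sequence forms one contiguous open reading frame, a naive ORF scan is easy to sketch. The Python example below is an illustration only, not the method used by any of the tools listed above. It assumes the standard start codon ATG and the stop codons TAA, TAG and TGA, scans only the three forward reading frames, and uses a hypothetical function name (`find_orfs`) and length cutoff (`min_codons`). The reverse strand could be handled by running the same scan on the reverse complement.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3) -> list[tuple[int, int, str]]:
    """Return (frame, start index, ORF sequence) for each ATG...stop stretch.

    Only the forward strand is scanned. `min_codons` counts the start and
    stop codons as part of the ORF length.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((frame, start, seq[start:pos + 3]))
                start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGAAATTTGGGTAACC"))
```

Real prokaryotic gene finders set a much larger minimum length and add statistical models of coding sequence. For eukaryotes this scan is not sufficient, because introns interrupt the coding sequence.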
Genomics III. – genome-wide association studies
Genome-wide association studies (GWAS) are a relatively new way to identify genes involved in human disease. GWAS typically focus on single nucleotide polymorphisms (SNPs) that occur more frequently in people with a particular disease than in people without it. It is a non-candidate-driven approach, since it investigates the entire genome.
• compares two large groups of individuals: a healthy control group and a case group affected by a disease
• all individuals are genotyped for the majority of common known SNPs
• the odds ratio is calculated (the ratio of the odds of disease for individuals having a specific allele to the odds of disease for individuals who do not have that allele)
• a p-value for the significance of the odds ratio is calculated (chi-squared test)
An odds ratio that differs significantly from 1 shows that a SNP is associated with the disease.
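The odds ratio and chi-squared test for a single SNP can be worked through on a 2x2 table of counts. The Python sketch below uses invented counts, and the function names are hypothetical. It computes the Pearson chi-squared statistic without continuity correction and obtains the p-value for one degree of freedom from the complementary error function.

```python
import math


def odds_ratio(case_with: int, case_without: int,
               ctrl_with: int, ctrl_without: int) -> float:
    """Odds of disease with the allele divided by odds of disease without it."""
    return (case_with / ctrl_with) / (case_without / ctrl_without)


def chi_squared_2x2(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Pearson chi-squared statistic and p-value for the table [[a, b], [c, d]].

    Rows are cases/controls, columns are allele present/absent.
    For 1 degree of freedom the p-value equals erfc(sqrt(chi2 / 2)).
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return chi2, math.erfc(math.sqrt(chi2 / 2))


# Invented example: 400 cases and 400 controls genotyped for one SNP
cases_with, cases_without = 240, 160
controls_with, controls_without = 200, 200

print(odds_ratio(cases_with, cases_without, controls_with, controls_without))
print(chi_squared_2x2(cases_with, cases_without, controls_with, controls_without))
```

In a real study this test is repeated for every genotyped SNP, so the significance threshold has to account for the very large number of tests.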
GWAS results are usually presented graphically as a Manhattan plot.
The plot shows the negative logarithm of the P-value as a function of genomic location.
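The transformation behind the plot is simple enough to sketch in Python. The SNP records below are invented, the function names are hypothetical, and the threshold of 5e-8 is a commonly used genome-wide significance convention that is assumed here rather than taken from these slides.

```python
import math

GENOME_WIDE_THRESHOLD = 5e-8  # assumed convention, not stated in the slides


def manhattan_points(snps):
    """Turn (chromosome, position, p_value) records into plot coordinates.

    Returns (chromosome, position, -log10 p) tuples sorted by genomic location.
    """
    return [(chrom, pos, -math.log10(p)) for chrom, pos, p in sorted(snps)]


def significant(snps, threshold=GENOME_WIDE_THRESHOLD):
    """Keep only the records whose p-value is below the threshold."""
    return [record for record in snps if record[2] < threshold]


# Invented records: (chromosome, position, p-value)
snps = [(2, 500, 1e-3), (1, 100, 1e-8), (1, 900, 0.2)]
print(manhattan_points(snps))
print(significant(snps))
```

Taking the negative logarithm means that the most strongly associated SNPs appear as the tallest points in the plot.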
GWA studies focus only on common genetic variants, since their assumption is that common genetic variation plays a large role in explaining the heritable variation of common disease.
GWA studies typically perform the first analysis in a discovery cohort, followed by validation of the most significant SNPs in an independent validation cohort.
Gene expression analysis
Gene expression profiling is the measurement of the expression of thousands of genes simultaneously, to create a global picture of cellular function.
The sequence tells us what the cell could possibly do, while the expression profile tells us what it is actually doing at a particular time point.
Techniques for gene expression measurement:
• DNA microarray - measures the relative activity of previously identified target genes
• serial analysis of gene expression (SAGE) - produces a snapshot of the mRNA population in the sample in the form of small tags that correspond to fragments of those transcripts
• RNA-seq (RNA sequencing) - uses the capabilities of next-generation sequencing to reveal a snapshot of RNA presence and quantity at a given time point
[Figure: Comparing gene expression of two samples. Labels: mRNA present only in the control sample; mRNA equally expressed in both samples; mRNA present only in the treated sample.]
DNA microarrays are used to measure the expression levels of large numbers of genes simultaneously. A DNA chip is a collection of microscopic DNA spots (short gene sections) attached to a solid surface.
• Each spot contains a specific DNA sequence (probe).
• The probes are used to hybridize with a labeled cDNA sample.
• Probe-target hybridization is detected and quantified by detection of the labeled targets.
[Figure: microarray image. Legend: genes transcribed in control cells; genes transcribed equally in both cells; low gene expression; genes transcribed in treated cells.]
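A common way to compare the control and treated signal of each spot is the log2 ratio of the two intensities. The Python sketch below uses invented intensities and gene names. The pseudocount, the two-fold threshold and the function names are assumptions made for illustration, not values from the slides.

```python
import math


def log2_ratios(control: dict, treated: dict, pseudocount: float = 1.0) -> dict:
    """log2(treated / control) per gene; the pseudocount avoids division by zero."""
    return {
        gene: math.log2((treated[gene] + pseudocount) / (control[gene] + pseudocount))
        for gene in control
    }


def classify(ratio: float, threshold: float = 1.0) -> str:
    """Label a spot; a threshold of 1.0 corresponds to a two-fold change."""
    if ratio >= threshold:
        return "higher in treated"
    if ratio <= -threshold:
        return "higher in control"
    return "similar in both"


# Invented spot intensities for three hypothetical genes
control = {"geneA": 100, "geneB": 400, "geneC": 50}
treated = {"geneA": 400, "geneB": 100, "geneC": 55}

for gene, ratio in log2_ratios(control, treated).items():
    print(gene, round(ratio, 2), classify(ratio))
```

This sketch does not separate spots with low expression in both samples. In practice these are usually filtered out by an intensity cutoff before ratios are interpreted.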
Proteomics – structure prediction
Each protein exists as an unfolded polypeptide or random coil when translated. It then folds into a characteristic and functional three-dimensional structure. The 3D structure is determined by the amino acid sequence (Anfinsen's dogma). The correct three-dimensional structure is essential to function, although some parts of functional proteins may remain unfolded. Failure to fold into the intended shape usually produces inactive proteins.
Several neurodegenerative diseases (e.g. Alzheimer's and Parkinson's disease) are believed to result from the accumulation of misfolded (incorrectly folded) proteins.
Many allergies are caused by incorrect folding of some proteins, because the immune system does not produce antibodies for certain protein structures.
Levels of protein structure
Level | Description | Stabilized by
Primary | amino acid sequence | peptide bonds
Secondary | formation of α-helices and β-sheets in a polypeptide | hydrogen bonds between groups along the peptide backbone
Tertiary | overall three-dimensional shape of a polypeptide | interactions between R-groups, and between R-groups and the peptide backbone
Quaternary | shape produced by combinations of polypeptides | interactions between R-groups and between peptide backbones of different polypeptides
Secondary structure prediction aims to predict the local secondary structures of proteins based only on their amino acid sequence. The prediction consists of assigning regions of the amino acid sequence as likely alpha helices, beta strands, or turns.
Specialized algorithms have been developed for the detection of specific well-defined patterns such as transmembrane helices and coiled coils in proteins. The best modern methods of secondary structure prediction in proteins reach about 80% accuracy.
Online prediction tools:
• PsiPred server
• CFSSP server
Tertiary structure prediction is even more challenging and remains extremely difficult.
1. Comparative protein modeling: uses previously solved structures as starting points, or templates.
a. Homology modeling – a method in which the known structure of a homologous protein is used to predict the structure of a new protein.
b. Protein threading (fold recognition) – a method for modeling proteins that have the same fold as proteins of known structure but have no homolog of known structure.
2. De novo physics-based modeling: an algorithmic process by which protein tertiary structure is predicted from the amino acid sequence (primary structure). Prediction is based on general principles that govern protein folding energetics and/or statistical tendencies of conformational features, without the use of explicit templates.
Software tools:
• SWISS-MODEL – homology modeling
• RAPTOR – protein threading software
• I-TASSER – fold recognition method
• ROBETTA – ab initio modeling
Database: Protein Data Bank (PDB)
Biological networks
Complex biological systems can be represented and analyzed as computable networks. For example, ecosystems can be modeled as networks of interacting species, and a protein can be modeled as a network of amino acids.
Modeling biological systems is a significant task of systems biology. Computational systems biology aims to develop and use efficient algorithms, data structures, visualization and communication tools for modeling biological systems.
Basic components of a network:
• nodes: units in the network
• edges: interactions between the units
Important properties of a network:
• degree (or connectivity): the number of edges that connect to a node
• betweenness: a measure of how central a node is in a network
[Figure: example network with one edge and one node labeled.]
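Degree and betweenness can be computed on a small toy network. The Python sketch below is one possible reading of these two properties: it takes betweenness to be the number of shortest paths between other node pairs that pass through a node, and it computes it with Brandes' algorithm for an undirected, unweighted graph. The node names P1 to P5 and the function names are hypothetical.

```python
from collections import deque


def build_graph(edges):
    """Adjacency sets for an undirected network given as (node, node) pairs."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def degree(graph):
    """Number of edges attached to each node."""
    return {node: len(neighbours) for node, neighbours in graph.items()}


def betweenness(graph):
    """Unnormalized shortest-path betweenness (Brandes' algorithm)."""
    score = dict.fromkeys(graph, 0.0)
    for source in graph:
        # Breadth-first search counting shortest paths from `source`.
        order, dist = [], {source: 0}
        preds = {node: [] for node in graph}
        n_paths = dict.fromkeys(graph, 0)
        n_paths[source] = 1
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path dependencies.
        dependency = dict.fromkeys(graph, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    # Each undirected pair was counted once from either end.
    return {node: s / 2 for node, s in score.items()}


network = build_graph([("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P2", "P5")])
print(degree(network))
print(betweenness(network))
```

In this toy network P2 has both the highest degree and the highest betweenness, which matches the intuition that it is the most central node.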
Interactome
Molecular interactions can occur between molecules belonging to different biochemical families (proteins, nucleic acids, carbohydrates, lipids, etc.) and also within a given family. Whenever such molecules are connected by physical interactions, they form molecular interaction networks.
• protein–protein interaction (PPI) network
• gene-regulatory network (protein–DNA interaction) - formed by transcription factors, chromatin regulatory proteins, and their target genes
• metabolic networks - metabolites, i.e. chemical compounds in a cell, are converted into each other by enzymes
• signaling networks
Interactome mapping
• Experimental methods – from experimental data such as affinity purification
• Predicting PPIs - the interactome of one organism is used to predict interactions among homologous proteins in another organism
• Text mining of PPIs – systematic extraction of interaction networks directly from the scientific literature
Network and pathway databases
• STRING - a database of known and predicted protein-protein interactions (EMBL)
• KEGG PATHWAY Database (Univ. of Kyoto)
• Reactome - human biological pathways, ranging from metabolic processes to hormonal signalling (Ontario Institute for Cancer Research (OICR), New York University Medical Centre (NYUMC), European Bioinformatics Institute (EBI))
Bioinformatics in practice
Databases
Databases are essential for bioinformatics research and applications. A huge number of databases are available, covering almost everything from DNA and protein sequences and molecular structures to phenotypes and biodiversity. Meta-databases incorporate data compiled from multiple other databases, while others are specialized, such as those specific to one organism. Interconnectivity between the different databases is essential.
Bioinformatics organizations
• NCBI – National Center for Biotechnology Information
• EMBL-EBI – European Molecular Biology Laboratory – European Bioinformatics Institute
• SIB – Swiss Institute of Bioinformatics
These centers host a number of publicly open, free-to-use life science resources, including biomedical databases and analysis tools.
Bibliographic database – MEDLINE
PubMed – a free search engine comprising more than 24 million citations for biomedical literature
GenBank
[Figure: Number of sequences in GenBank, December 1982 to August 2013. x-axis: release date; y-axis: number of sequences, logarithmic scale from 5.00E+02 to 5.00E+08.]
UniProtKB
The mission of UniProt is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information.
neXtProt
neXtProt is developed in collaboration between the SIB Swiss Institute of Bioinformatics and Geneva Bioinformatics (GeneBio) SA. neXtProt will be a comprehensive human-centric discovery platform, offering its users a perfect integration of protein-related data.
Thank You!