Bioinformatics Basics
Cyrus
Courtesy of LO Leung Yau’s original presentation
Outline
Biological Background: Cell, Protein, DNA & RNA, Central Dogma, Gene Expression
Bioinformatics: Sequence Analysis, Phylogenetic Trees, Data Mining
Biological Background – Cell
Basic unit of organisms: prokaryotic or eukaryotic
A bag of chemicals
Metabolism is controlled by various enzymes
Correct working needs suitable amounts of various proteins
Picture taken from http://en.wikipedia.org/wiki/Cell_(biology)
Biological Background – Protein
Polymer of 20 types of Amino Acids
Folds into 3D structure
Shape determines the function
Many types: Transcription Factors, Enzymes, Structural Proteins, …
Pictures taken from http://en.wikipedia.org/wiki/Protein and http://en.wikipedia.org/wiki/Amino_acid
Biological Background – DNA & RNA
DNA
  Double stranded
  Adenine, Cytosine, Guanine, Thymine
  A-T, G-C
  Those parts coding for proteins are called genes
RNA
  Single stranded
  Adenine, Cytosine, Guanine, Uracil
Picture taken from http://en.wikipedia.org/wiki/Gene
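The A-T / G-C pairing rule and the T-to-U difference between DNA and RNA can be illustrated with a few lines of Python. This is only a minimal sketch: the function names and the example sequence are made up for illustration, and only the four unambiguous bases are handled.

```python
# Watson-Crick pairing from the slide: A pairs with T, G pairs with C.
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(dna: str) -> str:
    """Return the opposite strand of a DNA sequence, read 5' to 3'."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(dna.upper()))


def to_rna(dna: str) -> str:
    """Rewrite a DNA sequence in the RNA alphabet (Thymine becomes Uracil)."""
    return dna.upper().replace("T", "U")


if __name__ == "__main__":
    strand = "ATGCGT"  # hypothetical example sequence
    print(reverse_complement(strand))  # ACGCAT
    print(to_rna(strand))              # AUGCGU
```

Sequences containing other symbols (for example N for an unknown base) would raise a KeyError in this sketch.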
Biological Background – Genes
Genes: protein coding regions
3 nucleotides code for one amino acid
There are also start and stop codons
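As a rough illustration of "3 nucleotides code for one amino acid", the sketch below translates a DNA string codon by codon using the standard genetic code. It assumes translation begins at the first ATG and ends at the first stop codon; the function name and test sequences are hypothetical, and real gene prediction is considerably more involved.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
# '*' marks a stop codon (TAA, TAG, TGA).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate from the first start codon (ATG) up to the first stop codon."""
    dna = dna.upper()
    start = dna.find("ATG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon: translation ends here
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGGCCTAA"))  # MA
```

Each letter of the result is the one-letter code of an amino acid (M for methionine, A for alanine).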
Biological Background – In a Nutshell
Abstractions
Functional Units: Proteins
Templates: RNAs
Blueprints: DNAs
Not only the information (data), but also the control signals about what and how much data is to be sent
Proteins (TFs) also help
Biological Background – Sequences
Abstractions
Sequences
acatggccgatcaggctgtttttgtgtgcctgtttttctattttacgtaaatcaccctgaacatgtTTGCATCAacctactggtgatgcacctttgatcaatacattttagacaaacgtggtttttgagtccaaagatcagggctgggttgacctgaatactggatacagggcatataaaacaggggcaaggcacagactc
FT   intron          <1..28
FT                   /gene="CREB"
FT                   /number=3
FT                   /experiment="experimental evidence…
FT                   recorded"
FT   exon            29..174
FT                   /gene="CREB"
FT                   /number=4
FT                   /experiment="experimental evidence…
FT                   recorded"
FT   intron          175..>189
FT                   /gene="CREB"
FT                   /number=4
Annotations
Visualizations
Biological Background – DNA → RNA → Protein
Picture taken from http://en.wikipedia.org/wiki/Gene
gene
Biological Background – DNA → RNA → Protein
A transcriptional regulatory network is the complex interaction among genes, transcription factors (TFs) and transcription factor binding sites (TFBSs).
[Figure: Complex Interactions between Genes, TFs and TFBSs. Labels: Transcription Factors, Binding sites, Genes, Promoter regions, Other functions]
Gene Expression Microarray Data
High throughput
Measures RNA level
Relies on A-T, G-C pairing
Can monitor expression of many genes
Picture taken from http://en.wikipedia.org/wiki/DNA_microarray_experiment
Gene Expression Microarray Data
Picture taken from http://en.wikipedia.org/wiki/DNA_microarray
Genes
Time points/Conditions
Colors: Expression (RNA) Levels
Bioinformatics – Sequence Analysis
Alignments
a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences
http://en.wikipedia.org/wiki/Sequence_alignment
Bioinformatics – Sequence Analysis
Pair-wise alignments
Method: dynamic programming!
No penalty for the consecutive ‘-’s before and after the sequence to be aligned
\\Pc91106\Old_FYP\Bioinformatics for FYPs\CSC3220 Lectures
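A minimal sketch of the dynamic programming idea, under one reading of the "no penalty" note: a short query is aligned inside a longer target, and the target positions left unaligned before and after the query cost nothing. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for illustration, not values from the lecture notes.

```python
def fit_score(query: str, target: str,
              match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Best score for aligning all of `query` against any stretch of `target`."""
    rows, cols = len(query) + 1, len(target) + 1
    # score[i][j]: best score of query[:i] aligned so that it ends at target[:j].
    score = [[0] * cols for _ in range(rows)]  # row 0 stays 0: free leading '-'s
    for i in range(1, rows):
        score[i][0] = i * gap  # query letters with nothing to pair are penalised
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if query[i - 1] == target[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two letters
                score[i - 1][j] + gap,       # query letter against '-'
                score[i][j - 1] + gap,       # target letter against '-'
            )
    # Taking the maximum over the last row makes the trailing '-'s free too.
    return max(score[-1])


if __name__ == "__main__":
    print(fit_score("GAT", "CCGATCC"))  # 3: GAT matches exactly inside the target
```

The table has len(query) x len(target) cells, so the running time grows with the product of the two lengths.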
Bioinformatics – Sequence Analysis
Multiple (global) sequence alignment
Also dynamic programming (but can’t scale up!)
Bioinformatics – Sequence Analysis
Multiple local sequence alignment
i.e. Motif (pattern) discovery
>seq1
acatggccgatcagctggtttttgtgtgcctgtttctgaatc
>seq2
ttctattttacgtaaatcagcttgaacatgtacctactggtg
>seq3
atgcacctttgatcaataccagctagacaaacgtgtgttg
>seq4
agtccaaagatcagggctggctgaatactggatcagct
>seq5
cagctacagggcatataaaggggcaaggcacagactc
Such overrepresented patterns are often important components (e.g. TFBSs if the sequences are promoters of similar genes).
TFBSs are the controlling key holes in gene regulation!
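One very simple way to look for overrepresented patterns is to count, for every substring of length k, how many of the sequences contain it. The sketch below applies this to the five example sequences with k = 5 (an arbitrary choice). It only handles exact matches, whereas real motif discovery has to tolerate mismatches, so it should be read as an illustration rather than a practical method; the function name is hypothetical.

```python
from collections import Counter


def kmer_support(sequences, k):
    """Count, for each k-mer, the number of sequences that contain it."""
    support = Counter()
    for seq in sequences:
        # A set, so a k-mer repeated within one sequence is counted once.
        kmers_in_seq = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        support.update(kmers_in_seq)
    return support


SEQS = [
    "acatggccgatcagctggtttttgtgtgcctgtttctgaatc",  # seq1
    "ttctattttacgtaaatcagcttgaacatgtacctactggtg",  # seq2
    "atgcacctttgatcaataccagctagacaaacgtgtgttg",    # seq3
    "agtccaaagatcagggctggctgaatactggatcagct",      # seq4
    "cagctacagggcatataaaggggcaaggcacagactc",       # seq5
]

support = kmer_support(SEQS, k=5)
# k-mers present in every sequence are the candidate motifs.
shared = sorted(kmer for kmer, n in support.items() if n == len(SEQS))

if __name__ == "__main__":
    print(shared)
```

The pattern "cagct" occurs in all five sequences, so it appears among the shared k-mers.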
DNA motifs
Similar DNA fragments across individuals and/or species
TFBS Motifs: DNA fragments similar to “TATAA” are common in order to make genes function
Expensive and time-consuming to try a large set of candidates in biological experiments
[Figure labels: DNA, TF (Transcription Factor), TFBS (controlling), TATAA, Gene (functioning), Transcription, RNA, Translation, Protein]
Motif discovery
TFBS Motif Discovery
[Figure: motif CGATTGA; f; similar controlled functions, e.g. cancer gene activities; maximized]
SNP (single nucleotide polymorphism) Motif Discovery
[Figure: DNA from different people (Normal, Disease!); f; distinguish Normal from Disease!; maximized]
…
Bioinformatics – Data mining
Classification: to predict!
Pre-processing: tidy up your materials!
Feature selection: the key points to go over
Classifier: the thinking style/manner of how to combine the key points and get some answer
Training: your practice of your thinking manner with answers known
Validation: mock quiz to evaluate what you’ve learnt from the training
Testing: your examination!
\\Pc91106\Old_FYP\Bioinformatics for FYPs\CSC5180 Data Mining Notes\c3class1.pdf
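The training / validation / testing steps presuppose that the labelled data has been divided into three disjoint parts. A minimal sketch of such a split is shown here; the 60/20/20 proportions, the fixed seed and the function name are assumptions for illustration only.

```python
import random


def split_dataset(samples, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle the samples and cut them into train, validation and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]                    # practice with answers known
    validation = shuffled[n_train:n_train + n_val]  # the mock quiz
    test = shuffled[n_train + n_val:]             # the examination, used once
    return train, validation, test


if __name__ == "__main__":
    train, validation, test = split_dataset(range(10))
    print(len(train), len(validation), len(test))  # 6 2 2
```

Keeping the test part untouched until the very end is what makes the final score an honest estimate.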
Underfitting & Overfitting
TRANSFAC Project
TF: Transcription Factors, important regulators
TFBS: Transcription Factor Binding Site, major regulatory elements
TRANSFAC: The most representative DB for TFs and TFBSs
1. Modeling: statistical models, representations, Markov chains; Discovery: stochastic searching, indexing (suffix trees)
2. Relationship: TF-TFBS; TFBS-Gene… (understanding, prediction); Mining: text mining, approximate matching
3. Annotations: accurate wet-lab candidates (reduced labor and costs); Computation: large scale data processing; parallel computing
Representative Publications
[1] Gang Li, Tak-Ming Chan, Kwong-Sak Leung and Kin-Hong Lee, A Cluster Refinement Algorithm for Motif Discovery, IEEE/ACM Transactions on Computational Biology and Bioinformatics (accepted)
[2] Tak-Ming Chan, Kwong-Sak Leung, Kin-Hong Lee, TFBS identification based on genetic algorithm with combined representations and adaptive post-processing. Bioinformatics, 2008, 24(3), pp. 341-349
Bioinformatics – Data mining
Evaluation (scores!)
Confusion Matrix (Binary Classification)
Performance Evaluation Metrics
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Sensitivity / Recall / TP Rate = TP / (TP + FN)
Specificity / TN Rate = TN / (TN + FP)
Precision / PPV = TP / (TP + FP)
…
\\Pc91106\Old_FYP\Bioinformatics for FYPs\CSC5180 Data Mining Notes\c3class3.pdf
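The metrics named on this slide all follow from the four confusion-matrix counts (TP, TN, FP, FN). The sketch below counts them for a binary problem with labels 1 (positive) and 0 (negative) and then computes the four scores. The labels and predictions are made up, and the code assumes every denominator is non-zero.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels where 1 means positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def evaluation_metrics(tp, tn, fp, fn):
    """Compute the four scores; assumes no denominator is zero."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # recall, TP rate
        "specificity": tn / (tn + fp),  # TN rate
        "precision": tp / (tp + fp),    # PPV
    }


if __name__ == "__main__":
    y_true = [1, 1, 1, 0, 0, 0, 0, 1]  # hypothetical known answers
    y_pred = [1, 0, 1, 0, 1, 0, 0, 1]  # hypothetical classifier output
    print(confusion_counts(y_true, y_pred))  # (3, 3, 1, 1)
    print(evaluation_metrics(*confusion_counts(y_true, y_pred)))
```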
Bioinformatics – Data mining
Evaluation
ROC (Receiver Operating Characteristics)
Trade-off between positive hits (TP) and false alarms (FP)
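To make the trade-off concrete, a ROC curve can be traced by lowering the decision threshold one sample at a time and recording the TP rate against the FP rate. The sketch below does this for made-up scores and also computes the area under the curve with the trapezoid rule. It assumes all scores are distinct and that both classes are present; the function names are hypothetical.

```python
def roc_points(y_true, scores):
    """Return (FP rate, TP rate) points, starting at (0, 0) and ending at (1, 1)."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    tp = fp = 0
    points = [(0.0, 0.0)]
    # Visit samples from highest to lowest score, i.e. lower the threshold.
    for _, label in sorted(zip(scores, y_true), reverse=True):
        if label == 1:
            tp += 1  # one more positive hit
        else:
            fp += 1  # one more false alarm
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoid-rule area under a list of (x, y) points sorted by x."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


if __name__ == "__main__":
    y_true = [1, 1, 0, 1, 0, 0]               # hypothetical labels
    scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]   # hypothetical classifier scores
    curve = roc_points(y_true, scores)
    print(curve)
    print(area_under_curve(curve))
```

An area close to 1 means positives are mostly ranked above negatives; an area near 0.5 is what random scores would give.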
Not The End
Your corresponding tutor will have more project-specific stuff to tell you
Thanks Q & A