
CZ5225: Modeling and Simulation in Biology Lecture 2: Gene Expression Profiles and Microarray Data Analysis Prof. Chen Yu Zong Tel: 6874-6877 Email: [email protected]


  • CZ5225: Modeling and Simulation in Biology

    Lecture 2: Gene Expression Profiles and Microarray Data Analysis

    Prof. Chen Yu Zong

    Tel: 6874-6877

    Email: [email protected]

    http://xin.cz3.nus.edu.sg

    Room 07-24, level 7, SOC1, NUS

  • Biology and Cells

    All living organisms consist of cells. Humans have trillions of cells; yeast has one.

    Cells are of many different types (blood, skin, nerve), but all arose from a single cell (the fertilized egg).

    Each* cell contains a complete copy of the genome (the program for making the organism), encoded in DNA.

  • DNA

    DNA molecules are long double-stranded chains; 4 types of bases are attached to the backbone: adenine (A), guanine (G), cytosine (C), and thymine (T). A pairs with T, C with G.

    A gene is a segment of DNA that specifies how to make a protein.

    Human DNA has about 25-35K genes; rice has about 50-60K, but shorter, genes.

  • Exons and Introns

    Exons are coding DNA (translated into a protein); they make up only about 2% of the human genome.

    Introns are non-coding DNA, which provide structural integrity and regulatory (control) functions.

    Exons can be thought of as program data, while introns provide the program logic.

    Humans have much more control structure than rice.

  • Gene Expression

    Cells are different because of differential gene expression. About 40% of human genes are expressed at one time.

    A gene is expressed by transcribing DNA into single-stranded mRNA.

    mRNA is later translated into a protein.

    Microarrays measure the level of mRNA expression.

  • Molecular Biology Overview

    [Figure labels: Cell, Nucleus, Chromosome, Protein, Gene (DNA), Gene (mRNA, single strand), cDNA]

  • Gene Expression

    Genes control cell behavior by controlling which proteins are made by a cell.

    Housekeeping genes vs. cell/tissue-specific genes

    Regulation:

    Transcriptional (promoters and enhancers)

    Post-transcriptional (RNA splicing, stability, localization; small non-coding RNAs)

  • Gene Expression

    Regulation:

    Translational (3' UTR repressors, poly-A tail)

    Post-transcriptional (RNA splicing, stability, localization; small non-coding RNAs)

    Post-translational (protein modification: carbohydrates, lipids, phosphorylation, hydroxylation, methylation, precursor protein)

  • Gene Expression Measurement

    mRNA expression represents dynamic aspects of the cell.

    mRNA expression can be measured with the latest technology.

    mRNA is isolated and labeled with a fluorescent dye.

    mRNA is hybridized to the target; the level of hybridization corresponds to light emission, which is measured with a laser.

  • Traditional Methods

    Northern blotting: a single RNA is isolated and probed with labeled cDNA.

    RT-PCR: primers amplify specific cDNA transcripts.

  • Microarray Technology

    Microarray: new technology (first paper: 1995). Allows study of thousands of genes at the same time.

    Glass slide of DNA molecules.

    Molecule: a string of bases (25 bp to 500 bp) that uniquely identifies the gene or unit to be studied.

  • Gene Expression Microarrays

    The main types of gene expression microarrays:

    Short oligonucleotide arrays (Affymetrix)

    cDNA or spotted arrays (Brown/Botstein)

    Long oligonucleotide arrays (Agilent Inkjet)

    Fiber-optic arrays

    ...

  • Fabrication of Microarrays

    Size of a microscope slide

    Images: http://www.affymetrix.com/

  • Differing Conditions

    Ultimate goal: understand the expression level of genes under different conditions.

    Helps to:

    Determine genes involved in a disease

    Determine pathways to a disease

    Serve as a screening tool

  • Gene Conditions

    Cell types (brain vs. liver)

    Developmental (fetal vs. adult)

    Response to stimulus

    Gene activity (wild vs. mutant)

    Disease states (healthy vs. diseased)

  • Expressed Genes

    Genes under a given condition:

    mRNA is extracted from cells.

    mRNA is labeled.

    Labeled mRNA is the mRNA present in a given condition.

    Labeled mRNA will hybridize (base pair) with the corresponding sequence on the slide.

  • Two Different Types of Microarrays

    Custom spotted arrays (up to 20,000 sequences): cDNA, oligonucleotide

    High-density (up to 100,000 sequences) synthetic oligonucleotide arrays: Affymetrix (25 bases)

    [Figure: Affymetrix layout]

  • Custom Arrays

    Mostly cDNA arrays

    2-dye (2-channel)

    RNA from two sources (cDNA created):

    Source 1: labeled with red dye

    Source 2: labeled with green dye

  • Two Channel Microarrays

    Microarrays measure gene expression.

    Two different samples: control (green label) and sample (red label).

    Both are washed over the microarray.

    Hybridization occurs.

    Each spot is one of 4 colors.

  • Microarray Technology

  • Microarray Image Analysis

    Microarrays detect gene interactions through 4 colors:

    Green: high in control

    Red: high in sample

    Yellow: equal

    Black: none

    The problem is to quantify the image signals.
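The four-color reading above can be sketched as a simple rule on the two channel intensities. This is only an illustration: the function name `spot_color` and both thresholds (`BACKGROUND`, `FOLD`) are hypothetical values chosen for the example, not part of the lecture or of any image-analysis package.

```python
# Hypothetical thresholds, chosen only for illustration.
BACKGROUND = 50.0  # below this in both channels, treat the spot as unlit
FOLD = 2.0         # one channel must be at least FOLD times the other to dominate


def spot_color(red, green, background=BACKGROUND, fold=FOLD):
    """Map a (red, green) intensity pair to one of the four spot colors."""
    if red < background and green < background:
        return "black"   # no signal in either channel
    if red >= fold * green:
        return "red"     # high in sample
    if green >= fold * red:
        return "green"   # high in control
    return "yellow"      # roughly equal in both channels


# Made-up intensity pairs: (red, green)
spots = [(1000.0, 100.0), (100.0, 1000.0), (500.0, 520.0), (10.0, 20.0)]
colors = [spot_color(r, g) for r, g in spots]
print(colors)
```

In practice the color classes are only a visual summary; the quantified intensities themselves are what later analysis uses.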

  • Single Color Microarrays

    Prefabricated: Affymetrix (25-mers)

    Custom: cDNA (500 bases or so), spotted oligos (70-80 bases)

  • Microarray Animations

    Davidson University: http://www.bio.davidson.edu/courses/genomics/chip/chip.html

    Imagecyte: http://www.imagecyte.com/array2.html

  • Basic idea of Microarray

    Construction:

    Place an array of probes on a microchip.

    A probe is (for example) an oligonucleotide ~25 bases long that characterizes a gene or genome.

    Each probe has many, many clones.

    The chip is about 2 cm by 2 cm.

    Application principle:

    Put a (liquid) sample containing genes on the microarray, allow probe and gene sequences to hybridize, and wash away the rest.

    Analyze the hybridization pattern.

  • Microarray analysis

    Operation principle:

    Samples are tagged with fluorescent material to show the pattern of sample-probe interaction (hybridization).

    A microarray may have 60K probes.

  • Gene Expression Data

    Gene expression data on p genes for n samples (rows: genes; columns: mRNA samples).

    Gene expression level of gene i in mRNA sample j
      = Log(Red intensity / Green intensity)   (two-channel arrays)
      = Log(Avg. PM - Avg. MM)                 (Affymetrix arrays; PM = perfect match, MM = mismatch)

    Gene   sample1   sample2   sample3   sample4   sample5   ...
    1        0.46      0.30      0.80      1.51      0.90    ...
    2       -0.10      0.49      0.24      0.06      0.46    ...
    3        0.15      0.74      0.04      0.10      0.20    ...
    4       -0.45     -1.03     -0.79     -0.56     -0.32    ...
    5       -0.06      1.06      1.35      1.09     -1.09    ...
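As a rough sketch of the two expression measures on this slide, the snippet below computes a log ratio for a two-channel spot and a log of the averaged PM minus MM difference for a probe set. The slide does not state the logarithm base; base 2 is assumed here, and the function names and intensity values are made up for illustration.

```python
import math


def log_ratio(red, green):
    """Two-channel measure: log2(red / green). Base 2 is an assumption."""
    return math.log2(red / green)


def pm_mm_signal(pm, mm):
    """Single-channel measure: log2(mean(PM) - mean(MM)) for one probe set."""
    diff = sum(pm) / len(pm) - sum(mm) / len(mm)
    if diff <= 0:
        # The logarithm is undefined when mismatch signal exceeds perfect match.
        raise ValueError("average PM must exceed average MM")
    return math.log2(diff)


# Made-up intensities for illustration only.
two_channel = log_ratio(200.0, 100.0)                    # sample twice as bright as control
affy = pm_mm_signal([300.0, 500.0], [100.0, 188.0])      # mean PM 400, mean MM 144
print(two_channel, affy)
```

With a log ratio, equal intensities give 0, and a doubling or halving of the sample relative to the control gives values of the same size with opposite sign, which is why the table mixes positive and negative entries.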

  • Some possible applications

    Sample from a specific organ to show which genes are expressed

    Compare samples from healthy and sick host to find gene-disease connection

    Probes are sets of human pathogens for disease detection

  • Huge amount of data from a single microarray

    With just two colors, the amount of data on an array with N probes is 2N.

    Cannot analyze pixel by pixel

    Analyze by pattern cluster analysis

  • Major Data Mining Techniques

    Link Analysis: Associations Discovery, Sequential Pattern Discovery, Similar Time Series Discovery

    Predictive Modeling: Classification, Clustering

  • Cluster Analysis: Grouping Similarly Expressed Genes, Cell Samples, or Both

    Strengthens signal when averages are taken within clusters of genes (Eisen)

    Useful (essential?) when seeking new subclasses of cells, tumours, etc.

    Leads to readily interpreted figures

  • Some clustering methods and software

    Partitioning: K-Means, K-Medoids, PAM, CLARA

    Hierarchical: Cluster, HAC, BIRCH, CURE, ROCK

    Density-based: CAST, DBSCAN, OPTICS, CLIQUE

    Grid-based: STING, CLIQUE, WaveCluster

    Model-based: SOM (self-organizing map), COBWEB, CLASSIT, AutoClass

    Two-way clustering

    Block clustering

  • Assessment of various methods

    Algorithmic Approaches to Clustering Gene Expression Data, Ron Shamir, School of Computer Science, Tel-Aviv University, Tel-Aviv

    http://citeseer.nj.nec.com/shamir01algorithmic.html

    Conclusion: hierarchical clustering exceptional

  • Partitioning

  • Density-based clustering

  • Hierarchical (used most often)

  • Hierarchical Clustering: grouping similarly expressed genes

    Gene Expression Profile Analysis

    Sample \ gene    1     2     3     4    ..   ..   1000
    A               0.6   0.2   0     0.7   ..   ..   0.3
    B               0.4   0.9   0     0.5   ..   ..   0.8
    C               0.2   0.8   0.3   0.2   ..   ..   0.7
    ..

  • After Clustering

    Gene Expression Profile Analysis

    Sample \ gene   ..    3     1     4    ..    2    1000
    A               ..   0     0.6   0.7   ..   0.2   0.3
    B               ..   0     0.4   0.5   ..   0.9   0.8
    C               ..   0.3   0.2   0.2   ..   0.8   0.7
    ..
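To see why the columns end up reordered this way, one can compare the gene profiles from the table pairwise. The sketch below uses only the five genes and three samples (A, B, C) whose values are shown on the slide, and assumes Euclidean distance as the dissimilarity; the slides do not say which measure produced the ordering, so this is an illustration rather than the method actually used.

```python
import math

# Expression of each gene in samples (A, B, C), copied from the slide table.
profiles = {
    1: (0.6, 0.4, 0.2),
    2: (0.2, 0.9, 0.8),
    3: (0.0, 0.0, 0.3),
    4: (0.7, 0.5, 0.2),
    1000: (0.3, 0.8, 0.7),
}


def nearest_gene(gene):
    """Return the other gene whose profile is closest in Euclidean distance."""
    others = (g for g in profiles if g != gene)
    return min(others, key=lambda g: math.dist(profiles[gene], profiles[g]))


nearest = {g: nearest_gene(g) for g in profiles}
for gene, partner in nearest.items():
    print(f"gene {gene}: closest profile is gene {partner}")
```

Genes 1 and 4 pick each other, as do genes 2 and 1000, which matches their adjacent placement after clustering.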

  • Eisen et al., Proc. Natl. Acad. Sci. USA 95 (1998)

    [Figure: clustered data compared with data randomized by row, by column, and by both; axis label: time]

  • Types of Similarity Measurements

    Distance measurements

    Correlation coefficients

    Association coefficients

    Probabilistic similarity coefficients

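  • Correlation Coefficients

    The most popular correlation coefficient is the Pearson correlation coefficient (1892).

    The correlation between X = {X1, X2, ..., Xn} and Y = {Y1, Y2, ..., Yn} is

    $$ s_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \, \sum_{i=1}^{n} (Y_i - \bar{Y})^2}} $$

    where $\bar{X}$ and $\bar{Y}$ are the means of X and Y, and $s_{XY}$ is the similarity between X and Y.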

  • Use of Similarity for Tree Construction

    Normalize the similarity so that s_XX = 1.

    Then we have an n x n similarity matrix S whose diagonal elements are 1.

    Define a distance matrix by (for example) D = 1 - S.

    The diagonal elements of D are 0.

    Now use the distance matrix to build a tree (using some tree-building software; recall the lecture on phylogeny).
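A minimal sketch of the construction on this slide, assuming the Pearson correlation is the similarity: build S for a few genes, then take D = 1 - S. The profiles are the first three rows of the expression table shown earlier in the lecture; the helper name `pearson` is introduced here for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Genes 1-3 across samples 1-5, from the expression table earlier in the lecture.
profiles = [
    [0.46, 0.30, 0.80, 1.51, 0.90],
    [-0.10, 0.49, 0.24, 0.06, 0.46],
    [0.15, 0.74, 0.04, 0.10, 0.20],
]

# Similarity matrix S (diagonal is 1) and distance matrix D = 1 - S (diagonal is 0).
S = [[pearson(a, b) for b in profiles] for a in profiles]
D = [[1.0 - s for s in row] for row in S]

for row in D:
    print(["%.3f" % d for d in row])
```

With this choice D ranges from 0 (perfectly correlated) to 2 (perfectly anti-correlated), so anti-correlated genes are treated as far apart. Whether that is desirable depends on the biological question.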

  • A dendrogram (tree) for clustered genes

    Let p = number of genes.

    1. Calculate within-class correlation.

    2. Perform hierarchical clustering, which will produce (2p-1) clusters of genes.

    3. Average within clusters of genes.

    4. Perform testing on averages of clusters of genes as if they were single genes.

    E.g. p = 5: genes 1, 2, 3, 4, 5; Cluster 6 = (1,2); Cluster 7 = (1,2,3); Cluster 8 = (4,5); Cluster 9 = (1,2,3,4,5).
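The count of 2p-1 clusters comes from the p single genes plus the p-1 merges an agglomerative procedure performs. The sketch below illustrates this with average linkage and Euclidean distance, both of which are assumptions, on made-up toy profiles chosen so that the merges reproduce the p = 5 example. It is not the software or data used in the lecture.

```python
import math
from itertools import combinations


def average_distance(members_a, members_b, profiles):
    """Mean Euclidean distance over all cross-cluster gene pairs (average linkage)."""
    pairs = [(i, j) for i in members_a for j in members_b]
    return sum(math.dist(profiles[i], profiles[j]) for i, j in pairs) / len(pairs)


def agglomerate(profiles):
    """Return {cluster id: member genes}; ids 1..p are genes, p+1..2p-1 are merges."""
    p = len(profiles)
    clusters = {g: (g,) for g in profiles}
    active = set(clusters)
    next_id = p + 1
    while len(active) > 1:
        # Merge the two active clusters with the smallest average distance.
        a, b = min(
            combinations(sorted(active), 2),
            key=lambda ab: average_distance(clusters[ab[0]], clusters[ab[1]], profiles),
        )
        clusters[next_id] = tuple(sorted(clusters[a] + clusters[b]))
        active -= {a, b}
        active.add(next_id)
        next_id += 1
    return clusters


# Toy profiles (hypothetical values), keyed by gene number 1..5.
toy = {1: (1.0, 1.0), 2: (1.1, 1.0), 3: (1.3, 1.0), 4: (4.0, 1.0), 5: (4.4, 1.0)}
clusters = agglomerate(toy)
for cid in sorted(clusters):
    print(cid, clusters[cid])
```

Each of the nine clusters can then be summarized by the average profile of its members and tested as if it were a single gene, as in steps 3 and 4.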

  • A real case

    Nature, Feb 2000: paper by Alizadeh, A. A. et al.

    Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling

  • Validation Techniques: Hubert's Γ Statistic

    X = [X(i, j)] and Y = [Y(i, j)] are two n × n matrices.

    X(i, j) is the similarity of gene i and gene j.

    Y(i, j) = 1 if genes i and j are in the same cluster, 0 otherwise.

    Hubert's Γ statistic represents the point serial correlation between X and Y, where M = n (n - 1) / 2 is the number of gene pairs.

    A higher value of Γ represents better clustering quality.
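The formula itself did not survive in this transcript, so the sketch below assumes the commonly used normalized form of Hubert's Γ: the correlation, taken over the M = n(n-1)/2 gene pairs, between the similarity X(i, j) and an indicator that is 1 when genes i and j share a cluster and 0 otherwise. The function name and the small similarity matrix are made up for illustration.

```python
import math


def hubert_gamma(X, labels):
    """Normalized Hubert's Gamma (assumed form) for similarity matrix X and cluster labels."""
    n = len(labels)
    pairs = [(i, j) for i in range(n - 1) for j in range(i + 1, n)]
    M = len(pairs)  # n (n - 1) / 2 gene pairs
    x = [X[i][j] for i, j in pairs]
    y = [1.0 if labels[i] == labels[j] else 0.0 for i, j in pairs]
    mean_x, mean_y = sum(x) / M, sum(y) / M
    sd_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / M)
    sd_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / M)
    if sd_x == 0 or sd_y == 0:
        # Happens, for example, when every gene is placed in one cluster.
        raise ValueError("Gamma is undefined when X or Y is constant over all pairs")
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / M
    return cov / (sd_x * sd_y)


# Made-up similarities: genes 0 and 1 are alike, genes 2 and 3 are alike.
X = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.9, 1.0],
]
gamma_good = hubert_gamma(X, [0, 0, 1, 1])  # clustering that follows the similarities
gamma_bad = hubert_gamma(X, [0, 1, 0, 1])   # clustering that splits the similar pairs
print(gamma_good, gamma_bad)
```

The clustering that groups the similar genes together scores higher, consistent with reading a larger Γ as better clustering quality.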

  • Discovering sub-groups

  • Time Course DataGene Expression is Time-Dependent

  • Sample of time course of clustered genes

  • Limitations

    Cluster analyses:

    Usually outside the normal framework of statistical inference

    Less appropriate when only a few genes are likely to change

    Need lots of experiments

    Single gene tests:

    May be too noisy in general to show much

    May not reveal coordinated effects of positively correlated genes

    Hard to relate to pathways

  • Useful Links

    Affymetrix: www.affymetrix.com

    Michael Eisen Lab at LBL (hierarchical clustering software Cluster and Tree View (Windows)): rana.lbl.gov/

    Review of Currently Available Microarray Software: www.the-scientist.com/yr2001/apr/profile1_010430.html

    ArrayExpress at the EBI: http://www.ebi.ac.uk/arrayexpress/

    Stanford MicroArray Database: http://genome-www5.stanford.edu/

    Yale Microarray Database: http://info.med.yale.edu/microarray/

    Microarray DB: www.biologie.ens.fr/en/genetiqu/puces/bddeng.html

    Actually, a number of clustering methods have been proposed. I'll go through some representative types in the following slides. The partitioning method divides the data items into a user-specified number of clusters according to their spatial relationships. The most well-known method is k-means.
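As a rough sketch of the partitioning idea, here is a bare-bones k-means in plain Python. It assumes Euclidean distance and, to stay deterministic, uses the first k points as initial centroids; real implementations usually initialize more carefully. The data are the five gene profiles (samples A, B, C) from the hierarchical clustering table, and k = 2 is chosen by hand.

```python
import math


def kmeans(points, k, iterations=100):
    """Plain k-means; initial centroids are the first k points (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(v) / len(members) for v in zip(*members)]
    return labels, centroids


genes = [1, 2, 3, 4, 1000]
points = [
    (0.6, 0.4, 0.2),  # gene 1
    (0.2, 0.9, 0.8),  # gene 2
    (0.0, 0.0, 0.3),  # gene 3
    (0.7, 0.5, 0.2),  # gene 4
    (0.3, 0.8, 0.7),  # gene 1000
]
labels, centroids = kmeans(points, k=2)
print(dict(zip(genes, labels)))
```

The user-specified k is the main design choice: the algorithm always returns exactly k groups, whether or not the data naturally contain that many.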

    The hierarchical method works by grouping similar objects level by level, resulting in a hierarchical structure. The main feature of the density-based method is that it considers the density of objects in a cluster in addition to the spatial relationship between objects, so this kind of method performs well at screening out outliers. One representative method is CAST.

    The grid-based method partitions the sample space of data objects into cubes for clustering. This kind of method performs very efficiently in high-dimensional data space.

    For example, we may find that gene #1 and gene #4 behave more similarly, while gene #2 and gene #1000 have more similar expression. So gene #1 and gene #4 should be classified into the same group, while gene #2 and gene #1000 go into another group.

    So let's see more clearly the goal of microarray mining. Suppose we design a microarray chip with 1000 genes, numbered gene #1 to gene #1000. Then we conduct experiments, say A, B, C and so on, on these genes, which results in an expression profile like this.

    The goal of microarray mining is to identify which genes have similar expression patterns over the conducted experiments or, conversely, which experiments gave similar results over the tested genes. So we may call this multi-condition expression analysis.

    Basically, this mining task can be achieved by using clustering techniques. As shown in these sample clustering results, each curve represents the expression of a gene over the conducted experiments, and all genes are classified into six clusters. As you can see, the genes in the same group have very similar patterns, while each group differs considerably from the others.