Where do we start?
• We will start with the Biological Problem
• Translate that to what the data looks like
• Think about issues with pre-processing data
• Ways to Analyze data
– Using existing methods
– Adapting existing methods
– Using newer ideas
Some Basic Biology
Let's get familiar with some terms:
• Cell
• Nucleus
• DNA
• Genome
• Gene
• Central Dogma of Molecular Biology
• DISCLAIMER: My Biology is VERY rudimentary, so don’t count on it TOO much.
Biological Hierarchies
Molecule
Cell
Tissue
Organ
Organism
For Example
• Organism: human
• Organ: say liver
• Cell
• Organelles: say nucleus, ribosome
• Macromolecules
– DNA: 22+XX(Y) chromosomes: 3x10^9 bp
– RNA: ~2000 molecules
– Proteins: ~30,000-50,000
• Building blocks: Nucleotides (ATGC) and amino acids
What is a Cell?
Cell Nucleus
DNA
• Genetic material (DNA) is present in the nucleus, as a DNA-protein complex called chromatin.
• The DNA is present as a number of discrete units known as chromosomes.
• Each DNA strand wraps around groups of small protein molecules called histones, forming a series of bead-like structures, called nucleosomes, connected by the DNA strand.
DNA
Genome
The sum of all information contained in the DNA for any living thing. The sequence of all the nucleotides in all the chromosomes of an organism.
Gene
• A hereditary unit consisting of a sequence of DNA that occupies a specific location on a chromosome and determines a particular characteristic in an organism.
• The nucleus of each eukaryotic (nucleated) cell has a complete set of genes.
• Each gene provides a blueprint for the synthesis (via RNA) of enzymes and other proteins and specifies when these substances are to be made.
• Genes undergo mutation when their DNA sequence changes.
Gene: More Facts
• Genes govern both the structure and metabolic functions of the cells, and thus of the entire organism and, when located in reproductive cells, they pass their information to the next generation.
• Chemically, each gene consists of a specific sequence of DNA building blocks called nucleotides. Each nucleotide is composed of three subunits: a nitrogen-containing compound, a sugar, and phosphoric acid. Genes may vary in their precise makeup from person to person, including, for example, one nucleotide in a certain location in some people but another nucleotide in that location in others.
Genes: More Facts
• Geometrically, the gene is a double helix formed by the nucleotides.
• Gene loci are often interspersed with segments of DNA that do not code for proteins; these segments are termed “junk DNA.”
• When junk DNA occurs within a gene, the coding portions are called exons and the noncoding (junk) portions are called introns.
• Junk DNA makes up 97% of the DNA in the human genome, and, despite its name, is necessary for the proper functioning of the genes.
Some more facts about genes:
• Almost every cell of the body of any organism contains identical genes.
• Only a fraction of these genes are "expressed" (turned on) and these confer unique properties to each cell type.
• Scientists study the kinds and amounts of expressed genes in a cell, which in turn provides insights into how the cell responds to its changing needs.
• Gene expression is a highly complex and tightly regulated process that allows a cell to respond dynamically both to environmental stimuli and to its own changing needs.
Central dogma of molecular biology
• Each gene is transcribed (at the appropriate time) from DNA into mRNA, which then leaves the nucleus and is translated into the required protein.
• Any gene which is active in this way at any particular time is said to be expressed.
• THIS IS CRUCIAL TO REMEMBER FOR MICROARRAYS
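The two steps of the central dogma can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the input is taken to be the coding strand of the DNA (so the mRNA is the same sequence with U in place of T), the codon table is deliberately partial, and the toy gene `dna` is hypothetical.

```python
# Minimal sketch of the central dogma: DNA -> mRNA -> protein.
# Partial codon table: only the codons used by the toy gene below.
CODON_TABLE = {
    "AUG": "M",  # methionine, also the start codon
    "GCU": "A",  # alanine
    "UGG": "W",  # tryptophan
    "AAA": "K",  # lysine
    "UAA": "*",  # stop codon
}


def transcribe(coding_strand: str) -> str:
    """Return the mRNA for a DNA coding strand (T is replaced by U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA codon by codon until a stop codon is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCTTGGAAATAA"  # hypothetical toy gene, not a real sequence
mrna = transcribe(dna)
protein = translate(mrna)
print(mrna)     # AUGGCUUGGAAAUAA
print(protein)  # MAWK
```

A gene counts as expressed when its mRNA is present in the cell, and both microarrays and RNA-seq measure that mRNA level.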
Breakthrough: Sequencing
• Sequencing: DNA sequencing is the process of determining the precise order of nucleotides within a DNA molecule. It includes any method or technology that is used to determine the order of the four bases—adenine, guanine, cytosine, and thymine—in a strand of DNA.
• So when the genome was sequenced there was a flurry of research in this area.
Some Questions that are being asked:
• What genes contribute to cancer?
• Are these genes similar in mice, rats, humans and other species?
• What genes are involved in depression?
• What genes respond to cocaine?
• What genes are present in a particular cancer cell type and not in others?
• How does human thinking differ from that of monkeys (given 99.2% genome homology)?
The National Center for Biotechnology Information states that:
• The proper and harmonious expression of a large number of genes is a critical component of normal growth and development and the maintenance of proper health.
• Disruptions or changes in gene expression are responsible for many diseases.
How DO we answer the questions asked?
• What is the BEST way to study Genes?
• How can we effectively answer questions related to genes?
• Should we focus on a FEW genes and follow them through time or across conditions in a focused study?
• Should we look at many genes at once (sometimes the whole genome) and compare them all across conditions?
Forward and Reverse genetics approaches in biology
[Diagram: Biological System (Organism) linked to Building Blocks (Genes/Molecules)]
• Forward Genetics Approach (Experiments), e.g. the ras oncogene
– Hypothesis: Specific alterations in genes lead to cancer. What are these genes? (t = 10 years/lab)
• Reverse Genetics Approach (Bioinformatics)
– Discover all genes that are different in cancer cells as compared to control. (n = 300) (t = 1 month)
Reverse Genetics Approach
• Requires almost complete information of the genome: Sequences annotated and stored in a database
• THIS WILL BE OUR FOCUS FOR THIS CLASS.
• Hence we will have many genes (few conditions) and often few replicates.
• This falls under the general heading of GENOMICS.
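A minimal sketch of what "many genes, few replicates" looks like as data, and of the simplest possible comparison between conditions. All gene names and expression values are hypothetical, and the two-fold cutoff is an arbitrary assumption chosen for illustration, not a recommended analysis.

```python
import math
from statistics import mean

# Hypothetical expression values: gene -> (control replicates, cancer replicates).
# A real study has thousands of genes but often only a few replicates per condition.
expression = {
    "geneA": ([100.0, 110.0, 90.0], [400.0, 380.0, 420.0]),
    "geneB": ([50.0, 55.0, 45.0], [52.0, 48.0, 50.0]),
    "geneC": ([200.0, 210.0, 190.0], [50.0, 45.0, 55.0]),
}


def log2_fold_change(control, treated):
    """log2 of the ratio of mean expression (treated over control)."""
    return math.log2(mean(treated) / mean(control))


fold_changes = {
    gene: log2_fold_change(control, cancer)
    for gene, (control, cancer) in expression.items()
}

# Flag genes whose mean expression changes at least two-fold in either direction.
changed = sorted(g for g, lfc in fold_changes.items() if abs(lfc) >= 1.0)
print(fold_changes)
print(changed)  # ['geneA', 'geneC']
```

A fold change on its own ignores the variability between replicates, which is one reason a small number of replicates makes the statistical analysis harder.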
Biological Information Processing
• DATA: Genomics
• Storage and Retrieval: Database
• Summary, Analysis and Visualization: Statistics
Outline for this class
• What types of data are we interested in?
– Microarrays
– RNA-seq
– GWAS
• What types of experiments do they come from?
• What are the similarities and differences?
• What statistical models and methods are used to understand the structure of the data?
• What statistical techniques are used to analyze this data?
• Assumptions about distributions
• Pitfalls of the three different data types
Types of Data
• To decide whether to run a Microarray or an RNA-seq experiment, the following has to be taken into account:
• Potential Deciding Factors:
– What genome info do I have?
– How much money do I have?
– What statistical methods are we familiar with?
• Potential Goals are also important in the decision:
• Goal is?
– Differential Expression
– Absolute Quantification
– Discovering Novel Genes
– Isoform Expressions
– Low Level Expressions
– Alternative Splicing
Common to both platforms
• Data are more or less reproducible
• Both are subject to high background noise
• Both supposedly have high correlation to gene content
• Similar Statistical methods
• Both subject to biases
Pros and Cons
• Microarray: Pros
– Reliable, robust, around for a while
– Easily automated
– Some consensus on statistical analysis
– Quick turnaround
– Lower Cost
• Microarray: Cons
– Dependent on prior knowledge
– Cannot detect structural forms
– No isoforms or low level expressions detected
– Relative expression, NOT absolute quantification
• RNA-seq: Cons
– NOT reliable or robust, new
– HARD to automate
– Little consensus on statistical analysis
– LONG turnaround
– Higher Cost
• RNA-seq: Pros
– NOT dependent on prior knowledge
– Can detect structural forms, alternate splicing
– Isoforms or low level expressions detected
– Absolute quantification
– Increased dynamic range
How they are different
• Microarray measures PROBE intensity
• NEED to know the sequence prior to experiment
• RNA-seq measures read COUNT: the number of reads for a particular sequence
• SO new sequences can be found
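The counting idea behind RNA-seq can be sketched as follows. Everything here is a toy assumption: the sequences, gene names, and reads are hypothetical, and reads are matched exactly to known sequences, whereas real pipelines align reads to a reference.

```python
from collections import Counter

# Hypothetical known transcript sequences. A microarray can only carry probes
# for sequences like these, which must be known before the experiment.
known_transcripts = {"ACGTACGT": "gene1", "TTGGCCAA": "gene2"}

# Hypothetical sequencing reads; each read is simply a sequence that was observed.
reads = ["ACGTACGT", "TTGGCCAA", "ACGTACGT", "GGGGCCCC", "ACGTACGT", "GGGGCCCC"]

# RNA-seq style measurement: number of reads per distinct sequence.
read_counts = Counter(reads)

# Reads that match a known transcript give a count for that gene.
known_counts = {
    known_transcripts[seq]: n
    for seq, n in read_counts.items()
    if seq in known_transcripts
}

# Reads that match nothing known are still counted, so new sequences show up.
novel_counts = {
    seq: n for seq, n in read_counts.items() if seq not in known_transcripts
}

print(known_counts)  # {'gene1': 3, 'gene2': 1}
print(novel_counts)  # {'GGGGCCCC': 2}
```

The data type differs as well: microarray intensities are continuous measurements, while read counts are non-negative integers, which matters for the distributional assumptions discussed later.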