Special Topics in Computer Science: Bioinformatics & Computational Biology

A special lecture on molecular biology for computer scientists

Ping Gong, Ph.D., Senior Scientist
Badger Technical Services, LLC
Environmental Laboratory
US Army Engineer Research and Development Center

orca.st.usm.edu/~nwang/teaching/Bioinformatics...



Outline

Introduction: what is the goal?
Human genomic projects
Molecular biology for computational scientists
  Life from a taxonomy viewpoint
  Cell biology 101
  Genetics, genomics and epigenomics
  Gene expression regulation
  Biochemistry
Molecular biology technologies
  Traditional molecular biology techniques
  Genomics
  Proteomics
  Metabolomics


Introduction: what are our expectations?

General goal: fundamental knowledge for the computer scientist to lay a foundation of mutual communication with biologists.

Biologist: identify/propose a specific biological problem and define the expected solution/answer.

Entering the post-genomics era, biologists are more than ever dependent on computational scientists.

Computational scientist: understand the biological problem and abstract/conceptualize it (convert it into a computational problem).

Computational Biology & Bioinformatics is an interdisciplinary field that requires a computer scientist to get oriented and understand some of the commonly used terms in biology (in particular, genetics, cell biology, molecular biology, and modern omics biotechnologies).

Keep in mind: in biology there are no rules without exception, and biologists are wary of generalization, while computational scientists favor generality and abstraction. So, be aware of the boundaries of generalization.


Human Genome Project (1990-2003)

The Human Genome Project (HGP) is an international scientific research project with a primary goal of determining the sequence of chemical base pairs which make up human DNA, and of identifying and mapping the total genes of the human genome from both a physical and functional standpoint. It remains the largest collaborative biological project.

The first official funding for the Project originated with the US Department of Energy’s Office of Health and Environmental Research, headed by Charles DeLisi, and was in the Reagan Administration’s 1987 budget submission to the Congress. It subsequently passed both Houses. The Project was planned for 15 years (1990-2005).

In 1990, the two major funding agencies, DOE and NIH, developed a memorandum of understanding in order to coordinate plans, and set the clock for initiation of the Project to 1990. At that time David Galas was Director of the renamed “Office of Biological and Environmental Research” in the U.S. Department of Energy’s Office of Science, and James Watson headed the NIH Genome Program. In 1993 Aristides Patrinos succeeded Galas, and Francis Collins succeeded James Watson and assumed the role of overall Project Head as Director of the U.S. National Institutes of Health (NIH) National Human Genome Research Institute. A working draft of the genome was announced in 2000 and a complete one in 2003, with further, more detailed analysis still being published.

A parallel project was conducted outside of government by the Celera Corporation, or Celera Genomics, which was formally launched in 1998. Most of the government-sponsored sequencing was performed in universities and research centers from the United States, the United Kingdom, Japan, France, Germany and Spain. Researchers continue to identify protein-coding genes and their functions; the objective is to find disease-causing genes and possibly use the information to develop more specific treatments. It also may be possible to locate patterns in gene expression, which could help physicians glean insight into the body's emergent properties.

The HGP originally aimed to map the nucleotides contained in a human haploid reference genome (more than three billion). Several groups have announced efforts to extend this to diploid human genomes including the International HapMap Project, Applied Biosystems, Perlegen, Illumina, J. Craig Venter Institute, Personal Genome Project, and Roche-454.

The "genome" of any given individual is unique; mapping "the human genome" involves sequencing multiple variations of each gene. The project did not study the entire DNA found in human cells; some heterochromatic areas (about 8% of the total genome) remain un-sequenced.

More info at http://en.wikipedia.org/wiki/Human_Genome_Project


Human Genome Project

Chromatin is found in two varieties: euchromatin and heterochromatin. Originally, the two forms were distinguished cytologically by how intensely they stained - the euchromatin is less intense, while heterochromatin stains intensely, indicating tighter packing. Heterochromatin is usually localized to the periphery of the nucleus.

Heterochromatin is a tightly packed form of DNA, which comes in different varieties. These varieties lie on a continuum between the two extremes of constitutive and facultative heterochromatin, both playing a role in gene expression. Constitutive heterochromatin can affect the genes near it (position-effect variegation), whereas facultative heterochromatin is the result of genes that are silenced through a mechanism such as histone methylation or siRNA through RNAi. Constitutive heterochromatin is usually repetitive and serves structural functions such as centromeres or telomeres, in addition to acting as an attractor for other gene-expression or repression signals. Facultative heterochromatin is not repetitive and, although it shares the compact structure of constitutive heterochromatin, it can, under specific developmental or environmental signaling cues, lose its condensed structure and become transcriptionally active.

Heterochromatin mainly consists of genetically inactive satellite sequences, and many genes are repressed to various extents, although some cannot be expressed in euchromatin at all. Both centromeres and telomeres are heterochromatic, as is the Barr body of the second, inactivated X-chromosome in a female.

Heterochromatin is often associated with the di- and tri-methylation of H3K9. Despite this early dichotomy, recent evidence in both animals and plants has suggested that there are more than two distinct heterochromatin states, and it may in fact exist in four or five 'states', each marked by different combinations of epigenetic marks.

Figure: The nucleus of a human cell showing the location of heterochromatin.

More info at http://en.wikipedia.org/wiki/Heterochromatin


Human Genomics

The Human Genome Project (1990-2003)
  A 13-year effort coordinated by the U.S. Department of Energy and the National Institutes of Health.
  Determined the sequences of the 3 billion base pairs that make up the human genome
  Identified all the approximately 20,000-25,000 genes

The International HapMap (Haplotype Map) Project (2002-2005)
  A haplotype in genetics is a combination of alleles (DNA sequences) at adjacent locations (loci) on the chromosome that are transmitted together.
  The HapMap is a catalog of common genetic variants that occur in human beings. It describes what these variants are, where they occur in our DNA, and how they are distributed among people within populations and among populations in different parts of the world.
  The Project is designed to provide information that other researchers can use to link genetic variants to the risk for specific illnesses, which will lead to new methods of preventing, diagnosing, and treating disease.
  Sampled 270 people from six countries.
  One SNP in every 1,200 bases on average (10 million common SNPs in the human genome)

The 1000 Genomes Project (2008-present)
  An international collaboration to produce an extensive public catalog of human genetic variation, including SNPs and structural variants, and their haplotype contexts. This resource will support genome-wide association studies and other medical research studies.
  The genomes of about 2500 unidentified people from about 25 populations around the world will be sequenced using next-generation sequencing technologies so as to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype, e.g., between genetic variation and disease.
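
For a computer scientist, the haplotype definition above maps naturally onto strings: one string per chromosome copy, one character per SNP locus. The following is a minimal sketch under that assumption. The sample data and the function names `haplotype_frequencies` and `allele_frequencies` are hypothetical and serve only as illustration; they are not HapMap data or a HapMap tool.

```python
from collections import Counter

# Hypothetical toy data: each string is one phased chromosome copy, and each
# character is the allele observed at one of three adjacent SNP loci.
chromosomes = ["ACG", "ACG", "ATG", "GCA", "ACG", "GCA", "ATG", "ACG"]


def haplotype_frequencies(haplotypes):
    """Return the relative frequency of each distinct haplotype string."""
    counts = Counter(haplotypes)
    total = sum(counts.values())
    return {hap: n / total for hap, n in counts.items()}


def allele_frequencies(haplotypes, locus):
    """Return the relative frequency of each allele at one locus (0-based)."""
    counts = Counter(hap[locus] for hap in haplotypes)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


if __name__ == "__main__":
    print(haplotype_frequencies(chromosomes))
    print(allele_frequencies(chromosomes, 0))
```

In this toy sample only three of the many possible allele combinations occur, which is the sense in which alleles at neighbouring loci are "transmitted together".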


Human Cancer Genomics

Cancer is the most common genetic disease. One in three people in the Western world develop cancer and one in five die of the disease. Cancer affects people at all ages, with the risk for most types increasing with age. There are at least 200 forms of cancer, and many more subtypes. All cancers occur due to abnormalities or errors in DNA sequence that cause cells to grow uncontrolled. Identifying the changes in each cancer’s complete set of DNA – its genome – and understanding how such changes interact to drive the disease will lay the foundation for improving cancer prevention, early detection and treatment.

Cancer caused by somatic mutations: Throughout life, the genome within cells of the human body is exposed to mutagens and suffers mistakes in replication. These corrosive influences result in progressive, subtle divergence of the DNA sequence in each cell from that originally constituted in the fertilized egg.

Occasionally, one of these somatic mutations alters the function of a critical gene, providing a growth advantage to the cell in which it has occurred and resulting in the emergence of an expanded clone derived from this cell. Acquisition of additional mutations, and consequent waves of clonal expansion, result in the evolution of the mutinous cells that invade surrounding tissues and metastasise.

The Cancer Genome Atlas (TCGA) (2006-present) (http://cancergenome.nih.gov/)
  Mission: TCGA is a comprehensive and coordinated effort to accelerate the understanding of the molecular basis of cancer through the application of genome analysis technologies, including large-scale genome sequencing.
  Overarching goal: improve our ability to diagnose, treat and prevent cancer.

The Cancer Genome Project (http://www.sanger.ac.uk/research/projects/cancergenome/)
  The Cancer Genome Project is using the human genome sequence and high-throughput mutation detection techniques to identify somatically acquired sequence variants/mutations and hence identify genes critical to the development of human cancers. This initiative will ultimately provide the paradigm for the detection of germline mutations in non-neoplastic human genetic diseases through genome-wide mutation detection approaches.


Summary: Human Genome Projects

HGPs marked the beginning & extension of an omics era that is dominated by the revolutionary advancement of genomics technologies, especially genome-wide sequencing technologies.

The work on interpretation of genome data is still in its initial stages. It is anticipated that detailed knowledge of the human genome will provide new avenues for advances in medicine and biotechnology. The etiologies for cancers, Alzheimer's disease and other areas of clinical interest are considered likely to benefit from genome information and possibly may lead in the long term to significant advances in their diagnosis and cure.

The analysis of similarities between DNA sequences from different organisms is also opening new avenues in the study of evolution.
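
One of the simplest ways to quantify similarity between two DNA sequences is percent identity. The sketch below assumes the two sequences are already aligned, have equal length, and contain no gaps. Real comparative analyses rely on alignment algorithms that the slide does not describe, and the function name `percent_identity` is hypothetical.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of positions at which two pre-aligned sequences agree.

    Assumes equal-length, gap-free sequences; comparison ignores case.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(a == b for a, b in zip(seq_a.upper(), seq_b.upper()))
    return 100.0 * matches / len(seq_a)


if __name__ == "__main__":
    # Two hypothetical 8-base fragments differing at a single position.
    print(percent_identity("ACGTACGT", "ACGTTCGT"))
```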


Life from A Taxonomy Viewpoint

Estimates of the number of currently extant species range from 5 million to 50 million (May, 1988).

Classification is traditionally based on morphology or phenotype. In recent years, these traditional taxonomies have been shaken by information gained from analyzing genes directly, as well as by the discovery of an entirely new class of organisms that live in hot, sulphurous environments in the deep sea.

Some facts about forms of life:
  There are at least 300,000 different kinds of beetles alone, and probably 50,000 species of tropical trees.
  Vertebrates (animals with backbones) make up only about 3% of the species in the world.

Basic divisions of organisms: unicellular – the Archaea, the Bacteria and some of the Eucarya (e.g., yeast); multicellular – most of the Eucarya.


Cell biology 101

The cell is the basic structural and functional unit of all known living organisms.

Organisms can be classified as unicellular (consisting of a single cell; including most bacteria) or multicellular (including plants and animals).

There are two types of cells: eukaryotic and prokaryotic. Prokaryotic cells are usually independent, while eukaryotic cells are often found in multicellular organisms.

Viruses consist of just a small amount of genetic material surrounded by a protein coat, and rely on the biochemical machinery of their host cell to survive and reproduce.


Table 1: Comparison of features of prokaryotic and eukaryotic cells

Feature | Prokaryotes | Eukaryotes
Typical organisms | bacteria, archaea | protists, fungi, plants, animals
Typical size | ~ 1–10 µm | ~ 10–100 µm (sperm cells, apart from the tail, are smaller)
Type of nucleus | nucleoid region; no real nucleus | real nucleus with double membrane
DNA | circular (usually) | linear molecules (chromosomes) with histone proteins
RNA-/protein-synthesis | coupled in cytoplasm | RNA synthesis inside the nucleus; protein synthesis in cytoplasm
Ribosomes | 50S+30S | 60S+40S
Cytoplasmic structure | very few structures | highly structured by endomembranes and a cytoskeleton
Cell movement | flagella made of flagellin | flagella and cilia containing microtubules; lamellipodia and filopodia containing actin
Mitochondria | none | one to several thousand (though some lack mitochondria)
Chloroplasts | none | in algae and plants
Organization | usually single cells | single cells, colonies, higher multicellular organisms with specialized cells
Cell division | binary fission (simple division) | mitosis (fission or budding); meiosis


Cell division

Cell division is the process by which a parent cell divides into two or more daughter cells. Cell division is usually a small segment of a larger cell cycle. This type of cell division in eukaryotes is known as mitosis, and leaves the daughter cell capable of dividing again. The corresponding sort of cell division in prokaryotes is known as binary fission. In another type of cell division present only in eukaryotes, called meiosis, a cell is permanently transformed into a gamete and may not divide again until fertilization. Right before the parent cell splits, it undergoes DNA replication.

For simple unicellular organisms such as the amoeba, one cell division is equivalent to reproduction: an entire new organism is created. On a larger scale, mitotic cell division can create progeny from multicellular organisms, such as plants that grow from cuttings. Cell division also enables sexually reproducing organisms to develop from the one-celled zygote, which itself was produced by cell division from gametes. And after growth, cell division allows for continual construction and repair of the organism. A human being's body experiences about 10,000 trillion cell divisions in a lifetime.

The primary concern of cell division is the maintenance of the original cell's genome. Before division can occur, the genomic information which is stored in chromosomes must be replicated, and the duplicated genome separated cleanly between cells. A great deal of cellular infrastructure is involved in keeping genomic information consistent between "generations".


Cell growth & division

During the cell cycle, major bulk parameters such as volume, dry mass, total protein, and total RNA double, and such growth is a fundamental property of the cell cycle.

The patterns of growth in volume and total protein or RNA are not well understood. There have been two popular models, based on an exponential increase and linear increase.
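
The two growth models can be compared with a short sketch. It assumes, as a simplification not stated on the slide, that the bulk parameter exactly doubles over one cycle and that growth is either strictly linear or strictly exponential throughout. The function names `linear_mass` and `exponential_mass` are hypothetical.

```python
def linear_mass(m0, t, cycle_time):
    """Linear model: mass increases at a constant rate and doubles at t = cycle_time."""
    return m0 * (1 + t / cycle_time)


def exponential_mass(m0, t, cycle_time):
    """Exponential model: growth rate is proportional to current mass; doubles at t = cycle_time."""
    return m0 * 2 ** (t / cycle_time)


if __name__ == "__main__":
    m0, cycle_time = 1.0, 1.0  # arbitrary units
    for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
        lin = linear_mass(m0, frac * cycle_time, cycle_time)
        exp = exponential_mass(m0, frac * cycle_time, cycle_time)
        print(f"t = {frac:.2f} cycle: linear = {lin:.3f}, exponential = {exp:.3f}")
```

Both curves start at the same value and end at twice that value, and in between they differ by less than 10%. That small gap is one plausible reason the two patterns are hard to tell apart from measurements.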

The attainment of a critical size triggers the periodic events of the cycle such as the S period and mitosis. This control acts as a homeostatic effector that maintains a constant "average" cell size at division through successive cycles in a growing culture. It is a vital link coordinating cell growth with periodic events of the cycle. A size control is present in all the systems and appears to operate near the start of S or of mitosis when the cell has reached a critical size, but the molecular mechanism by which size is measured remains both obscure and a challenge. A simple version might be for the cell to detect a critical concentration of a gene.

Figure: The relationship between cell division and cell size


Level of Organization

LEVEL 1 – Cells
  Are the basic unit of structure and function in living things.
  May serve a specific function within the organism
  Examples – blood cells, nerve cells, bone cells, etc.

LEVEL 2 – Tissues
  Made up of cells that are similar in structure and function and which work together to perform a specific activity
  Examples – blood, nervous, bone, etc.
  Humans have 4 basic tissues: connective, epithelial, muscle, and nerve.

LEVEL 3 – Organs
  Made up of tissues that work together to perform a specific activity
  Examples – heart, brain, skin, etc.

LEVEL 4 – Organ Systems
  Groups of two or more organs that work together to perform a specific function for the organism.
  Examples – circulatory system, nervous system, skeletal system, etc.
  The human body has 11 organ systems: circulatory, digestive, endocrine, excretory (urinary), immune (lymphatic), integumentary, muscular, nervous, reproductive, respiratory, and skeletal.

LEVEL 5 – Organisms
  Entire living things that can carry out all basic life processes, meaning they can take in materials, release energy from food, release wastes, grow, respond to the environment, and reproduce.
  Usually made up of organ systems, but an organism may be made up of only one cell, such as a bacterium or a protist.
  Examples – bacteria, amoeba, mushroom, sunflower, human.

LEVELS 6/7/8 – Population/Community/Ecosystem


Genetics 101

The central dogma of molecular biology was first stated by Francis Crick in 1958 and re-stated in a Nature paper published in 1970.

Dogma is the established belief or doctrine held by a religion, or a particular group or organization. It is authoritative and not to be disputed, doubted, or diverged from, by the practitioners or believers. Although it generally refers to religious beliefs that are accepted without reason or evidence, it can refer to acceptable opinions of philosophers or philosophical schools, public decrees, or issued decisions of political authorities.

In his autobiography, What Mad Pursuit, Crick wrote about his choice of the word dogma and some of the problems it caused him: "I called this idea the central dogma, for two reasons, I suspect. I had already used the obvious word hypothesis in the sequence hypothesis, and in addition I wanted to suggest that this new assumption was more central and more powerful. ... As it turned out, the use of the word dogma caused almost more trouble than it was worth.... Many years later Jacques Monod pointed out to me that I did not appear to understand the correct use of the word dogma, which is a belief that cannot be doubted. I did apprehend this in a vague sort of way but since I thought that all religious beliefs were without foundation, I used the word the way I myself thought about it, not as most of the world does, and simply applied it to a grand hypothesis that, however plausible, had little direct experimental support."


Genetics 101

Central dogma of molecular biology: the flow of genetic information (inheritance)

General transfers of biological information

Special transfers of biological information

Direct translation from DNA to protein has been demonstrated in a cell-free system (i.e. in a test tube), using extracts from E. coli that contained ribosomes, but not intact cells. These cell fragments could express proteins from foreign DNA templates, and neomycin was found to enhance this effect.
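
The two most familiar general transfers, DNA to RNA (transcription) and RNA to protein (translation), can be sketched as string operations. This is an illustrative simplification: it uses the standard genetic code, takes the coding strand as input, reads from the first base in a single reading frame, and stops at the first stop codon. The function names `transcribe` and `translate` are hypothetical, and real genes involve details (promoters, start-codon selection, splicing) that are ignored here.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand):
    """DNA -> mRNA: the mRNA matches the coding strand with U in place of T."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """mRNA -> protein: read codons from position 0 until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].upper().replace("U", "T")
        amino_acid = CODON_TABLE[codon]  # raises KeyError for non-ACGU symbols
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    dna = "ATGGCCTAA"  # hypothetical example: Met, Ala, stop
    mrna = transcribe(dna)
    print(mrna, translate(mrna))
```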


DNA, Gene and Genome

Deoxyribonucleic acid (DNA) is a nucleic acid containing the genetic instructions used in the development and functioning of all known living organisms (with the exception of RNA viruses). The DNA segments carrying this genetic information are called genes.

DNA consists of two long polymers of simple units called nucleotides, with backbones made of sugars and phosphate groups joined by ester bonds. These two strands run in opposite directions to each other and are therefore anti-parallel.
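
Because the two strands are complementary and anti-parallel, the sequence of one strand determines the other: complement each base (A with T, C with G) and reverse the order so both are read in the same 5'-to-3' direction. A minimal sketch, with `reverse_complement` as a hypothetical helper name:

```python
# Translation table mapping each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the opposite strand, written 5'-to-3'."""
    return seq.translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("ATGC"))
```

Applying the function twice returns the original sequence, which is a convenient sanity check.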


DNA, Gene and Genome

A gene is a molecular unit of heredity of a living organism. It is a name given to some stretches of DNA and RNA that code for a polypeptide or for an RNA chain that has a function in the organism. Living beings depend on genes, as they specify all proteins and functional RNA chains. Genes hold the information to build and maintain an organism's cells and pass genetic traits to offspring, although some organelles (e.g. mitochondria) are self-replicating and are not coded for by the organism's DNA.


DNA, Gene and Genome

Eukaryotic organisms (animals, plants, fungi, and protists) store most of their DNA inside the cell nucleus and some of their DNA in organelles, such as mitochondria or chloroplasts. In contrast, prokaryotes (bacteria and archaea) store their DNA only in the cytoplasm.

The DNA segments carrying this genetic information are called genes. Within cells DNA is organized by supercoiling/twisting into long structures called chromosomes. During cell division these chromosomes are duplicated in the process of DNA replication, providing each cell its own complete set of chromosomes.

Each of us has enough DNA to reach from here to the sun and back, more than 300 times. How is all of that DNA packaged so tightly into chromosomes and squeezed into a tiny nucleus?

Annunziato, A. (2008) DNA packaging: Nucleosomes and chromatin. Nature Education 1(1).


Genomics & Epigenomics

Genomics is a discipline in genetics concerned with the study of the genomes of organisms. The field includes efforts to determine the entire DNA sequence of organisms and fine-scale genetic mapping.

The first genomes to be sequenced were those of a virus and a mitochondrion, and were done by Fred Sanger. His group established techniques of sequencing, genome mapping, data storage, and bioinformatic analyses in the 1970-1980s.

Epigenomics is the study of the complete set of epigenetic modifications on the genetic material of a cell, known as the epigenome. The field is analogous to genomics and proteomics, which are the study of the genome and proteome of a cell.

Epigenetic modifications are reversible modifications on a cell’s DNA or histones that affect gene expression without altering the DNA sequence. Two of the most characterized epigenetic modifications are DNA methylation and histone modification. Epigenetic modifications play an important role in gene expression and regulation, and are involved in numerous cellular processes such as differentiation/development and tumorigenesis.


Outline

Introduction: what is the goal?
Human genomic projects
Molecular biology for computational scientists
    Life from a taxonomy viewpoint
    Cell biology 101
    Genetics, genomics and epigenomics
    Gene expression regulation
    Biochemistry
Molecular biology technologies
    Traditional molecular biology techniques
    Genomics
    Proteomics
    Metabolomics


Gene Expression Regulation

Virtually every cell in your body contains a complete set of genes

But they are not all turned on in every tissue

Each cell in your body expresses only a small subset of genes at any time

During development different cells express different sets of genes in a precisely regulated fashion

Gene regulation occurs at levels ranging from transcription to translation


Gene Expression Regulation

Prokaryotic gene structure (operon)

Eukaryotic gene structure


RNA vs. DNA

Ribonucleic acid, or RNA, is a nucleic acid, one of the four major classes of macromolecules (along with lipids, carbohydrates and proteins) essential for all known forms of life. Like DNA, RNA is made up of a long chain of components called nucleotides.

RNA contains the sugar ribose, while DNA contains the slightly different sugar deoxyribose (a type of ribose that lacks one oxygen atom).

RNA has the nucleobase uracil, while DNA contains thymine.

Unlike DNA, most RNA molecules are single-stranded and can adopt very complex three-dimensional structures.
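For a computer scientist, the uracil/thymine difference is easiest to see as a string substitution. The following is a minimal sketch, assuming the input is the DNA coding (sense) strand written 5' to 3'; the function name `transcribe` is illustrative and not taken from the lecture.

```python
def transcribe(coding_strand: str) -> str:
    """Return the RNA that has the same sequence as a DNA coding strand.

    The RNA matches the coding strand base for base, except that
    thymine (T) is replaced by uracil (U).
    """
    dna = coding_strand.upper()
    unexpected = set(dna) - set("ACGT")
    if unexpected:
        raise ValueError(f"not a DNA sequence, found: {sorted(unexpected)}")
    return dna.replace("T", "U")


rna = transcribe("ATGGCTTAA")
print(rna)
```

This models only the alphabet change; it says nothing about which regions of a genome are actually transcribed.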

Synthesis of RNA is usually catalyzed by an enzyme, RNA polymerase (I, II or III in eukaryotes), using DNA as a template. (Plants also use RNA polymerases IV and V to synthesize siRNAs.)

Messenger RNA (mRNA) is the RNA that carries information from DNA to the ribosome, the site of protein synthesis (translation) in the cell. Transcription of mRNA is catalyzed by RNA polymerase II.

Transfer RNA (tRNA) is a small RNA chain of about 80 nucleotides that transfers a specific amino acid to a growing polypeptide chain at the ribosomal site of protein synthesis during translation. It has sites for amino acid attachment and an anticodon region for codon recognition that binds to a specific sequence on the messenger RNA chain through hydrogen bonding. It is synthesized by RNA polymerase III.

Ribosomal RNA (rRNA) is the catalytic component of the ribosomes. Eukaryotic ribosomes contain four different rRNA molecules: 18S, 5.8S, 28S and 5S rRNA. Three of the rRNA molecules are synthesized in the nucleolus by RNA polymerase I, and one (5S) is synthesized elsewhere by RNA polymerase III.

MicroRNAs (miRNAs) are a group of small non-protein-coding RNAs of ~20 nt that regulate mRNA expression.
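Codon recognition by the tRNA anticodon can be sketched as a reverse-complement operation on a three-letter RNA string. This is a simplified illustration that assumes strict Watson-Crick pairing (A with U, G with C) and ignores wobble pairing at the third codon position; the helper name `anticodon` is hypothetical.

```python
# Strict Watson-Crick pairing for RNA; wobble pairing is deliberately ignored.
RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def anticodon(codon: str) -> str:
    """Return the anticodon (written 5' to 3') that pairs with an mRNA codon.

    The two strands pair antiparallel, so the anticodon is the
    reverse complement of the codon.
    """
    codon = codon.upper()
    if len(codon) != 3 or set(codon) - set("ACGU"):
        raise ValueError("a codon must be three RNA bases (A, C, G, U)")
    return codon.translate(RNA_COMPLEMENT)[::-1]


print(anticodon("AUG"))  # the start codon pairs with anticodon CAU
```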


Gene Expression Regulation: Epigenetic DNA Modification


Gene Expression Regulation: DNA Mutation

Mutations are changes in a genomic sequence: the DNA sequence of a cell's genome or the DNA or RNA sequence of a virus. They can be defined as sudden and spontaneous changes in the cell. Mutations are caused by radiation, viruses, transposons and mutagenic chemicals, as well as errors that occur during meiosis or DNA replication. They can also be induced by the organism itself, through cellular processes such as hypermutation.

Mutagenic chemicals: hydroxylamine (NH2OH), base analogs (e.g. BrdU), alkylating agents (e.g. N-ethyl-N-nitrosourea), agents that form DNA adducts (e.g. ochratoxin A metabolites), DNA intercalating agents (e.g. ethidium bromide), DNA crosslinkers, oxidative damage, and nitrous acid
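Viewed computationally, the simplest kind of mutation is a single-base substitution, which can be found by comparing a reference sequence with a variant position by position. The sketch below assumes both sequences have the same length, so it does not handle insertions or deletions; the function name and the example sequences are made up for illustration.

```python
def point_substitutions(reference: str, variant: str) -> list[tuple[int, str, str]]:
    """List (position, reference_base, variant_base) for every mismatch.

    Positions are 0-based. Only substitutions are detected; sequences of
    different length (insertions or deletions) are rejected.
    """
    if len(reference) != len(variant):
        raise ValueError("sequences must have equal length")
    return [
        (i, ref_base, var_base)
        for i, (ref_base, var_base) in enumerate(zip(reference, variant))
        if ref_base != var_base
    ]


# Made-up example: one G-to-A substitution at position 3.
changes = point_substitutions("ATGGCTTAA", "ATGACTTAA")
print(changes)
```

Detecting insertions and deletions requires sequence alignment, which is beyond this small example.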


Gene Expression Regulation: Transcription Elongation & Termination


Gene Expression Regulation: Transcription Factor

A transcription factor (sometimes called a sequence-specific DNA-binding factor) is a protein that binds to specific DNA sequences, thereby controlling the flow (or transcription) of genetic information from DNA to mRNA. Transcription factors perform this function alone or with other proteins in a complex, by promoting (as an activator), or blocking (as a repressor) the recruitment of RNA polymerase (the enzyme that performs the transcription of genetic information from DNA to RNA) to specific genes.

Domain architecture example: Lactose Repressor (LacI). The N-terminal DNA binding domain (labeled) of lac repressor binds its target DNA sequence (gold) in the major groove using a helix-turn-helix motif. Effector molecule binding (green) occurs in the core domain (labeled), a signal sensing domain. This triggers an allosteric response mediated by the linker region (labeled).
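Sequence-specific binding can be approximated, very crudely, as pattern matching on a DNA string. The sketch below scans one strand for a degenerate motif written with a few IUPAC ambiguity codes. The motif `TGANTCA` and the test sequence are illustrative assumptions, not the binding site of the lac repressor or of any factor discussed in the lecture.

```python
import re

# A small subset of IUPAC nucleotide codes, mapped to regex fragments.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "N": "[ACGT]",
}


def find_sites(sequence: str, motif: str) -> list[int]:
    """Return 0-based start positions where the motif matches the sequence.

    A lookahead is used so that overlapping matches are also reported.
    Only the given strand is scanned.
    """
    body = "".join(IUPAC[base] for base in motif.upper())
    pattern = re.compile(f"(?=({body}))")
    return [match.start() for match in pattern.finditer(sequence.upper())]


# Illustrative motif and sequence (assumptions for this example only).
sites = find_sites("CCTGACTCATTGAGTCA", "TGANTCA")
print(sites)
```

Real binding-site prediction usually relies on position weight matrices rather than exact patterns, and also scans the reverse strand.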


Gene Expression Regulation: Post-Transcriptional Modification

The pre-mRNA molecule undergoes three main modifications: 5' capping, 3' polyadenylation, and RNA splicing. All three occur in the cell nucleus before the RNA is translated.

Capping of the pre-mRNA involves the addition of 7-methylguanosine (m7G) to the 5' end. The 5' cap has 4 main functions: 1. Regulation of nuclear export; 2. Prevention of degradation by exonucleases; 3. Promotion of translation; 4. Promotion of 5' proximal intron excision.

The pre-mRNA processing at the 3' end of the RNA molecule involves cleavage of its 3' end and then the addition of about 250 adenosine (A) residues to form a poly(A) tail. The cleavage and adenylation reactions occur if a polyadenylation signal sequence (5'-AAUAAA-3') is located near the 3' end of the pre-mRNA molecule and is followed by another sequence, usually 5'-CA-3'. The second signal is the site of cleavage. A GU-rich sequence is also usually present further downstream on the pre-mRNA molecule.
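The 3' end processing rule described above can be sketched as a two-step string search: find the AAUAAA signal, then the first CA downstream of it. This is a toy model under stated assumptions: the search window `max_gap` (30 nt) is an assumed parameter, cleavage is placed immediately after the CA dinucleotide, and the GU-rich downstream element is ignored. The function names and the example sequence are hypothetical.

```python
from typing import Optional

POLY_A_SIGNAL = "AAUAAA"


def find_cleavage_site(pre_mrna: str, max_gap: int = 30) -> Optional[int]:
    """Return the index just after the CA that follows the last AAUAAA signal.

    Returns None when no signal, or no CA within max_gap bases after it,
    is found. max_gap is an assumed window, not a value from the lecture.
    """
    rna = pre_mrna.upper()
    signal_start = rna.rfind(POLY_A_SIGNAL)  # signal closest to the 3' end
    if signal_start == -1:
        return None
    search_from = signal_start + len(POLY_A_SIGNAL)
    ca_start = rna.find("CA", search_from, search_from + max_gap)
    if ca_start == -1:
        return None
    return ca_start + 2


def polyadenylate(pre_mrna: str, tail_length: int = 250) -> str:
    """Cleave at the predicted site and append a poly(A) tail."""
    site = find_cleavage_site(pre_mrna)
    if site is None:
        return pre_mrna.upper()  # no recognizable signal: left unprocessed
    return pre_mrna.upper()[:site] + "A" * tail_length


# Made-up sequence: signal at index 3, CA at index 15, GU-rich region after it.
example = "GGCAAUAAAGGUUCUCAGUGUGUU"
print(find_cleavage_site(example))
print(polyadenylate(example, tail_length=5))
```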


Gene Expression Regulation: Alternative Splicing

Alternative splicing produces two protein isoforms

Generalized splicing repression model
Splicing repressor proteins (mostly hnRNPs) bind to splicing silencers in exons (ESS) and introns (ISS). They inhibit the binding of U1 snRNP to the donor site, and of U2AFs and the U2 snRNP to the acceptor site and branch point. This results in skipping of the exon (yellow).

Generalized splicing activation model
Splicing activator proteins (mostly SR proteins) bind to splicing enhancers in exons (ESE) and introns (ISE). They assist in the binding of U1 snRNP to the donor site and of U2AFs and U2 snRNP to the acceptor site and branch point.

hnRNP = heterogeneous nuclear ribonucleoprotein; snRNP = small nuclear ribonucleoprotein; SR = serine/arginine-rich protein
ESS = exonic splicing silencer; ISS = intronic splicing silencer; ESE = exonic splicing enhancer; ISE = intronic splicing enhancer
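Exon skipping, the outcome of the repression model above, can be pictured as joining a list of exons while leaving one out. The sketch below is a deliberately simple data-structure view: exon names and sequences are invented, introns are assumed to be already removed, and the regulatory proteins are reduced to a set of exon names to skip.

```python
def splice(exons: list[tuple[str, str]], skipped: frozenset = frozenset()) -> str:
    """Join exon sequences in order, omitting any exon named in `skipped`."""
    return "".join(seq for name, seq in exons if name not in skipped)


# Invented three-exon gene; introns are assumed to be already removed.
exons = [
    ("exon1", "AUGGCU"),
    ("exon2", "GAUCCA"),
    ("exon3", "UUGUAA"),
]

full_isoform = splice(exons)                                   # all exons included
short_isoform = splice(exons, skipped=frozenset({"exon2"}))   # exon 2 skipped

print(full_isoform)
print(short_isoform)
```

The same pre-mRNA thus yields two different mature mRNAs, and therefore two protein isoforms, depending on whether the middle exon is retained.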


Gene Expression Regulation: mRNA Translation

In theory, the regulation of gene expression could be achieved by producing all possible mRNA species in every cell and selecting which were translated into protein in each individual cell type. But there is evidence that different cell types have very different cytoplasmic RNA populations, indicating that this extreme model is incorrect.

Nonetheless, the regulation of translation, such that a particular mRNA is translated into protein in one situation and not another, does occur in some special cases.

A prominent case is fertilization. In an unfertilized egg protein synthesis is slow, but upon fertilization of the egg by a sperm a tremendous increase in the rate of protein synthesis occurs. This increase does not require the production of new mRNAs after fertilization.

eIF = eukaryotic initiation factor
CPSF = cleavage & polyadenylation specificity factor


Gene Expression Regulation: Post-translational Modification

Protein Folding Pathway: Molecular Chaperone


Summary – Gene expression regulation

Biological systems are highly organized and complex. At the cellular level, gene expression is controlled and regulated by many factors.


Outline

Introduction: what is the goal?
Human genomic projects
Molecular biology for computational scientists
    Life from a taxonomy viewpoint
    Cell biology 101
    Genetics, genomics and epigenomics
    Gene expression regulation
    Biochemistry
Molecular biology technologies
    Traditional molecular biology techniques
    Genomics
    Proteomics
    Metabolomics


Biochemistry: the Chemical Elements of Life

Biochemistry, sometimes called biological chemistry, is the study of chemical processes within, and relating to, living organisms. By controlling information flow through biochemical signaling and the flow of chemical energy through metabolism, biochemical processes give rise to the complexity of life.

The main focus of today’s pure biochemistry is in understanding how biological molecules give rise to the processes that occur within living cells, which in turn relates greatly to the study and understanding of whole organisms.

Biochemistry is closely related to molecular biology, the study of the molecular mechanisms by which genetic information encoded in DNA is able to result in the processes of life. Depending on the exact definition of the terms used, molecular biology can be thought of as a branch of biochemistry, or biochemistry as a tool with which to investigate and study molecular biology.

Much of biochemistry deals with the structures, functions and interactions of biological macromolecules, such as proteins, nucleic acids, carbohydrates and lipids, which provide the structure of cells and perform many of the functions associated with life.

The chemistry of the cell also depends on the reactions of smaller molecules and ions. These can be inorganic, for example water and metal ions, or organic, for example the amino acids which are used to synthesize proteins.

The mechanisms by which cells harness energy from their environment via chemical reactions are known as metabolism. The findings of biochemistry are applied primarily in medicine, nutrition, and agriculture. In medicine, biochemists investigate the causes and cures of disease. In nutrition, they study how to maintain health and study the effects of nutritional deficiencies. In agriculture, biochemists investigate soil and fertilizers, and try to discover ways to improve crop cultivation, crop storage and pest control.


KEGG metabolic pathway


Take-Home Message

Biology and its sub-disciplines such as molecular biology are extremely complicated.

Biologists have uncovered many mysteries (rules and regulatory mechanisms) underlying normal and abnormal, developing and mature, healthy and diseased forms of life, especially in humans.

With the aid of modern experimental technologies, biologists and computational scientists are poised to accelerate the decoding process of life on Earth.

Further reading: Lawrence Hunter. 1993. Chapter 1. Molecular Biology for Computer Scientists. In: Artificial Intelligence & Molecular Biology.


Questions?