CS882: Advanced Topics in Bioinformatics History and Frontier of Bioinformatics

CS882: Advanced Topics in Bioinformatics

History and Frontier of Bioinformatics

Experience is a dear teacher, but fools will learn at no other.

-- Benjamin Franklin

Why Study History?

This Course

• I will try to introduce bioinformatics problems in the context of history.

– Developments in biology that led to bioinformatics

– Sequence comparison

– Genome sequencing

– Proteomics

• Bioinformatics is too broad an area to be fully surveyed in a course.

– This course is a sample of the work in the area.

• From the course, I hope you can learn

– Many bioinformatics research problems.

– How the bioinformatics area evolved.

– How to choose research problems.

– How to formulate interesting, useful, and solvable problems.

Evaluation

• Students choose between a seminar and a course project

• Seminars

– Read several (10+) related research articles.

– Write a survey, and comment on the significance and impact of the papers.

– Why (bother)? Who (did it)? What (was achieved)? When? How? So what (the impacts)?

– Predict the future developments.

• Projects

– A few small course projects available for selection.

– Coding involved.

– Write a report.

• Both need to do a presentation.

• Evaluation:

– participation (20%)

– verbal presentation (40%)

– written report (40%)

Ways to Find Survey Topics

• A research area (or a sub-area)

– Sequence comparison

– Genome sequencing

– Proteomics

– Phylogeny

– Gene expression

– Protein-protein interaction network

– Haplotype

– Protein structures

• A research problem

– Find a research paper (RECOMB 2013, 2014)

– Read its references.

– Read citations of the important references (Google Scholar is useful)

– Survey the history and development of that research problem.

Research Projects

• Select one of the following

– Study the genome and proteome similarity between organisms.

– Succinct representation of redundant data (compression isn’t the only goal; better access is important too).

• EST

• Protein

• NGS

• I’ll come up with more next week…

A QUICK REVIEW OF BACKGROUND

Important Biology Advancements that Lead to Bioinformatics

Central Dogma

Ancient Time

• Our ancestors knew something about genetics.

• Inheritance: Things produce children just like themselves.

• Selective breeding.

Image credit: The Cartoon Guide to Genetics by Gonick and Wheelis

Gregor Mendel (1822–1884)

A scientist and a monk from Austria.

Mendel studied pea plants.

Genotypes Control Phenotypes

Homozygous Heterozygous

• In this case, there are two alleles (variants) of the same gene, A and a, for the height of the pea plants.

• Each individual has two copies of the gene, from two parents.

• A dominates a.

Hybridization

Homozygous Heterozygous

Descendants of Heterozygous Pea
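
The cross of two heterozygous plants can be sketched in a few lines of Python. This is only an illustration of the two-allele model described above (each parent passes on one of its two copies, and A dominates a); the function names `cross` and `phenotype` are made up for this sketch.

```python
from collections import Counter
from itertools import product


def cross(parent1: str, parent2: str) -> Counter:
    """Count offspring genotypes when each parent passes on one of its two alleles."""
    # Sort each pair so that "aA" and "Aa" are counted as the same genotype.
    return Counter("".join(sorted(a + b)) for a, b in product(parent1, parent2))


def phenotype(genotype: str) -> str:
    """A dominates a: a single copy of A is enough for a tall plant."""
    return "tall" if "A" in genotype else "short"


# Two heterozygous parents (Aa x Aa).
offspring = cross("Aa", "Aa")

phenotypes = Counter()
for genotype, count in offspring.items():
    phenotypes[phenotype(genotype)] += count

print(dict(offspring))   # genotype counts: 1 AA, 2 Aa, 1 aa
print(dict(phenotypes))  # phenotype counts: 3 tall, 1 short
```

The 1:2:1 genotype ratio and the 3:1 phenotype ratio are the pattern Mendel observed in the descendants of heterozygous plants.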

A Comment

• Experiment → Data → Knowledge

• In the early days, the data was small enough to be processed by a human.

• But today it requires computers: that is bioinformatics.

Rediscovery of Mendel's Work

• Mendel’s work was ignored by the world until it was rediscovered around 1900 by Hugo de Vries and Carl Correns.

– 16 years after Mendel died.

Darwin (1809–1882)

“I have called this principle, by which each slight variation, if useful, is preserved, by the term Natural Selection.”

Evolution

Phylogenetic Trees

• In the past, people dug up fossils to study evolution.

• Now we use characteristics (e.g. DNA sequences) of today’s species to computationally reconstruct the evolutionary history.
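
As a rough illustration of building a tree computationally from sequences, here is a small Python sketch of UPGMA, one simple distance-based clustering method (it is an assumption of this sketch, not necessarily the method used in the course). The four sequences and the names `sp1` to `sp4` are made-up toy data, and the sequences are assumed to be already aligned and of equal length.

```python
from itertools import combinations


def p_distance(s: str, t: str) -> float:
    """Fraction of aligned positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(s, t)) / len(s)


def upgma(seqs: dict) -> tuple:
    """Repeatedly merge the two closest clusters; return the tree as nested tuples."""
    # Key: the subtree built so far (a name or a nested tuple).
    # Value: the list of leaf names contained in that subtree.
    clusters = {name: [name] for name in seqs}

    def dist(x, y) -> float:
        # Average distance over all pairs of leaves from the two clusters.
        pairs = [(a, b) for a in clusters[x] for b in clusters[y]]
        return sum(p_distance(seqs[a], seqs[b]) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        x, y = min(combinations(clusters, 2), key=lambda pair: dist(*pair))
        clusters[(x, y)] = clusters.pop(x) + clusters.pop(y)
    return next(iter(clusters))


# Toy, hand-written sequences (hypothetical data for illustration only).
toy_seqs = {
    "sp1": "ACGTACGT",
    "sp2": "ACGTACGA",
    "sp3": "ACTTCCGA",
    "sp4": "TGTTCCAA",
}
tree = upgma(toy_seqs)
print(tree)
```

In this toy input, sp1 and sp2 differ at a single position, so they are joined first; sp3 is joined next, and sp4 ends up as the most distant branch.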

Chromosomes

Walter Sutton (left) and Theodor Boveri (right) independently developed the chromosome theory of inheritance in 1902.

Humans have 23 pairs of chromosomes.

Genetic Map

If two genes are on the same chromosome, they tend to be inherited together.

(AB, ab) x (AB, ab) will not give Ab, if there’s no cross-over.

Chromosomes Crossover

With this model, can you suggest a way to build a genetic map (or linkage map)?
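
One classical answer is to use how often crossover separates two genes as a proxy for their distance: the farther apart two genes lie on a chromosome, the more often a crossover falls between them. The Python sketch below assumes this idea; the gamete counts and pairwise frequencies are invented numbers for illustration, and the function names are hypothetical.

```python
def recombination_frequency(gametes: list, parental: set) -> float:
    """Fraction of observed gametes that are not one of the parental combinations."""
    recombinants = sum(1 for g in gametes if g not in parental)
    return recombinants / len(gametes)


def order_three_genes(freq: dict) -> tuple:
    """Order three genes: the pair that recombines most often lies at the two ends."""
    left, right = max(freq, key=freq.get)
    genes = {gene for pair in freq for gene in pair}
    (middle,) = genes - {left, right}
    return left, middle, right


# Hypothetical counts: parental combinations AB and ab, recombinants Ab and aB.
gametes = ["AB"] * 42 + ["ab"] * 41 + ["Ab"] * 9 + ["aB"] * 8
rf_ab = recombination_frequency(gametes, {"AB", "ab"})
print(rf_ab)

# Hypothetical pairwise recombination frequencies for three linked genes.
pairwise = {("A", "B"): 0.17, ("B", "C"): 0.08, ("A", "C"): 0.24}
gene_order = order_three_genes(pairwise)
print(gene_order)
```

With these made-up numbers, A and C recombine most often, so B is placed between them. The sketch ignores double crossovers, which make large distances look smaller than they are.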

DNA

Base Pairs

• 4 different nucleotides in DNA.

– A, C, G, T

• A single strand is a sequence of A,C,G,T.

• The other complementary strand can be computed: A-T, and C-G. These are base pairs.
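
Computing the complementary strand is a direct lookup using the base pairs A-T and C-G. A minimal Python sketch follows; the names `COMPLEMENT`, `complement`, and `reverse_complement` are chosen for this example. Because the two strands run in opposite directions, the complement is often written reversed, which the second function does.

```python
# Base pairing rules: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Base-by-base complement, written in the same left-to-right order."""
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """Complement read in the opposite direction, as the other strand is usually written."""
    return complement(strand)[::-1]


print(complement("ATACTCAC"))          # TATGAGTG
print(reverse_complement("ATACTCAC"))  # GTGAGTAT
```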

DNA Replication

Before the Discovery of DNA

• In 1869, DNA was first isolated by the Swiss physician Friedrich Miescher. He called it "nuclein" because it is found in the nuclei of cells.

• In 1878, Albrecht Kossel isolated the non-protein component of "nuclein", nucleic acid, and later isolated its five primary nucleobases.

• In 1927 Nikolai Koltsov proposed that inherited traits would be inherited via a "giant hereditary molecule" which would be made up of "two mirror strands that would replicate in a semi-conservative fashion using each strand as a template".

First Confirmation of DNA’s Role

• 1928, Griffith concluded that the type II-R had been "transformed" into the lethal III-S strain by a "transforming principle" that was somehow part of the dead III-S strain bacteria

• 1944, Oswald Avery, Colin MacLeod, and Maclyn McCarty, confirmed that DNA was the “transforming principle”.

Griffith’s Experiment

Discovery of DNA Structure

• 1952, Rosalind Franklin and Raymond Gosling used X-ray crystallography to help visualize the structure of DNA.

• 1953, James D. Watson and Francis Crick suggested the first correct double-helix model of DNA structure.

• In 1962, after Franklin's death, Watson, Crick, and Wilkins jointly received the Nobel Prize in Physiology or Medicine.

Rosalind Franklin

DNA sequencing

• Sanger Sequencing was developed in 1977 and soon became the method of choice.

ATACTCAC…. DNA to be sequenced

Grow the other strand using the target DNA as a template

• Monomers: A, C, G, T, and a modified A.

• If a growth step uses the modified A, then the growth stops.

Sanger Sequencing

Do the experiment for all four bases, and separate different lengths with gel electrophoresis.
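
The logic of reading a sequence from the four reactions can be simulated in Python. This is a simplified sketch: it assumes ideal reactions in which every possible stopping point yields a fragment, and it ignores strand direction. The function names are made up for the example.

```python
# Base pairing rules used to grow the new strand from the template.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def termination_lengths(new_strand: str, base: str) -> list:
    """Fragment lengths produced by the reaction that uses the modified `base`."""
    # Growth can stop wherever the growing strand needs `base`.
    return [i + 1 for i, b in enumerate(new_strand) if b == base]


def read_gel(lanes: dict) -> str:
    """Read the four lanes from the shortest fragment to the longest."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


template = "ATACTCAC"  # DNA to be sequenced
new_strand = "".join(COMPLEMENT[b] for b in template)

# One reaction (one gel lane) per base.
lanes = {base: termination_lengths(new_strand, base) for base in "ACGT"}
recovered = read_gel(lanes)

print(lanes)
print(recovered)                                    # the grown strand
print("".join(COMPLEMENT[b] for b in recovered))    # back to the template
```

Each fragment length appears in exactly one lane, so sorting the bands by length spells out the grown strand one base at a time.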

Popularity of DNA Sequencing

• The private sector played an important role.

• Applied Biosystems made the first automated DNA sequencer and a lot of money.

ABI 3130

Applied Biosystems

• May 1981, the company was founded by two scientists/engineers from Hewlett-Packard, Sam Eletr and Andre Marion.

• 1982, first commercial instrument, the Model 470A Protein Sequencer. 40 employees, $400K revenue.

• 1983, employees = 80, IPO, revenues= US$5.9 million. Model 380A DNA Synthesizer. Licensed automated sequencing technology using fluorescent dyes from CIT.

• 1984, revenue US$18 million, 200 employees.

• 1985, revenue US$35 million.

• 1986, revenue US$52 million. The release of the Model 370A DNA Sequencing System, using fluorescent tags, revolutionized gene discovery.

• 1987, revenues US$85 million, 788 employees.

• 1988, revenue US$132 million, 1000 employees. In that year for the first time, genetic science reached the milestone of being able to identify individuals by their DNA.

• 1989, revenue reached nearly $160 million.

• 1990, the U.S. government approved financing to support the Human Genome Project.

• 1993, acquired by Perkin Elmer.

Human Genome Project

• The Human Genome Project (HGP) is an international project with the primary goal of sequencing and annotating the human genome.

– October 1990, launch.

– 2003, finished sequencing and initial analysis.

– Publicly funded, with more than 3 billion dollars spent.

Celera Corporation

• Founded 1998 by PE Corporation and Dr. J. Craig Venter.

• Craig Venter's team sequenced the Haemophilus influenzae genome (1995) at TIGR (The Institute for Genomic Research).

• Competing with the public effort on finishing human genome.

• 2003, finished the human genome at almost the same time (Venter announced the victory).

• Data:

– Public: 13 years, $3 billion

– Celera: 5 years, $300 million

• Celera has access to prior knowledge.

Competing in Bioinformatics

Gene Myers vs. Jim Kent

• “In a short time it will be hard to realize how we managed without the sequence data. Biology will never be the same again.”

-- N. Williams. Closing in on the complete yeast genome sequence. Science, 268:1560-61. 1995.

The rise of Bioinformatics!

Genome Sequenced, So What?

• It turns out that genome sequencing isn’t the end of the story.

• People called it the post-genome era after 2003.

• It was a landmark but didn’t solve all problems.

• First of all, your genome and my genome are different.

– 1000 Genomes Project.

• Second, genes are expressed differently at different times/cells/conditions.

– Gene expression (microarray)

– A flash in the pan.

• Third, proteins are not only expressed differently, but also modified.

– Proteomics (mass spectrometry).

– Human Proteome Organization (HUPO).

– Proteomics started to produce some “biomarkers”

• Metabolomics, Glycomics …

– Life is very complex.

Wrap Up

• Pre-bioinformatics developments in biology

• Emergence of bioinformatics

– Bioinformatics deals with data

• Initial impacts of bioinformatics

• Many more years of challenging problems for bioinformatics.

– Triggered by new measurement technologies

– Accelerates the development of new measurements.