Sequence Analysis


Page 1: Sequence Analysis

Sequence Analysis

Page 2: Sequence Analysis

Sequence Analysis - Topics

• Comparison of gene sequences for similarities and defining homologies from phylogenetic analysis

• Identification of gene structure, including reading frames, exon-intron distribution and regulatory elements

• Prediction of protein structural elements

• Genome mapping (linear arrangement of genes on chromosomes) and its assessment within the context of metabolic pathways

Page 3: Sequence Analysis

Sequence Analysis

• Helps understand evolution of life

• Expresses relationships between DNA sequences of different proteins and organisms

• Facilitates the collection, storage, organization and annotation of raw data and construction of secondary and tertiary databases

• Necessary to achieve the goal of bioinformatics

Page 4: Sequence Analysis

Goal of Bioinformatics

• Organization of sequence databases with bibliographic and biological annotations

• Support via software for the alignment of sequences

• Identification of genes

• Translation of DNA sequences into amino acid sequences

• Search for homologs (evolutionarily related sequences)
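The translation goal listed above can be illustrated with a minimal sketch. It assumes the standard genetic code, a coding-strand DNA input and reading frame 0; the function name `translate` and the use of `X` for unreadable codons are illustrative choices, not something the slides specify.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
# (first base varies slowest, third base fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA coding strand (frame 0) into one-letter amino acids.

    '*' marks a stop codon; codons containing letters other than
    A, C, G, T are reported as 'X'. Trailing bases that do not fill
    a whole codon are ignored.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


print(translate("ATGGCCTAA"))
```

Real tools also handle alternative genetic codes and all six reading frames; this sketch covers only the simplest case.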

Page 5: Sequence Analysis

History of Sequence Analysis

• Fifteen years ago, people read DNA or amino acid sequences to each other over the telephone

• This caused an estimated "mutation" rate that far exceeded that of the natural DNA replication or transcription process

Page 6: Sequence Analysis

Computational tools for Sequence Analysis

• Extremely easy

• Fast

• Virtually error free

Page 7: Sequence Analysis

Database Submissions

• Information is submitted to NCBI, EBI, DDBJ

• GenBank staff scientists assign accession numbers for immediate release to public

• Daily exchanges between GenBank, EBI and DDBJ ensure information is non redundant (submitted only once)

Page 8: Sequence Analysis

Database submissions

• Authors can update original information

• Specialized submission procedures include ESTs (expressed sequence tags), STSs (sequence tagged sites) and GSSs (genome survey sequences)

Page 9: Sequence Analysis

EST (Expressed sequence tags)

• ESTs are short sequences of 300-500 bp and represent actually expressed genes.

• These are markers that are helpful in locating (mapping) genes on chromosomes

• EST submissions therefore include both sequence and mapping information

Page 10: Sequence Analysis

STSs (Sequence tagged sites)

• Provide unique identifiers within a given genome, identifiable by PCR

• Similar to ESTs in length and in number of submitted sequences per batch

• STS sequences will soon outnumber ESTs because they also cover the non-coding regions of the genomes

Page 11: Sequence Analysis

Processing of submissions

• Submissions are processed on a daily basis, and sequences can be submitted before they are complete

• Processing at NCBI includes 3 phases: 1) unfinished, unordered sequences; 2) unfinished, ordered sequences; 3) high-quality finished sequences with no gaps

Page 12: Sequence Analysis

Annotation

• Annotation of sequences is important – it helps in predicting structures, drug discovery, establishing phylogenetic relationships, etc.

• Erroneous annotation results in erroneous interpretation and conclusions and reduces the reliability of data

• NCBI’s staff continuously screen biomedical journals for published sequence and structure data and use them for annotation purposes

Page 13: Sequence Analysis

Data Retrieval

• Data for DNA and Protein sequences – enormous – searching is dubbed “biological data mining”.

• Sequences are retrieved based on specific criteria (similarity or identity between sequences)

Page 14: Sequence Analysis

Search Engines

• Perform simple string searches for information retrieval of stored data (GenBank nucleotide and protein sequences; PubMed’s MEDLINE literature records; and the 3-D structure, genome and taxonomy databases)

• Perform similarity searches (e.g., BLAST) to retrieve, align and compare sequences or structures

Page 15: Sequence Analysis

Steps in Retrieval

• First step includes retrieving sequences based on specific criteria (similarity or identity between sequences)

• If no sequence is known or available, NCBI’s search engine can be queried at the nucleotide or protein level by typing in a keyword: the name of the protein, the author or the accession number
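A keyword search like the one described above can also be issued programmatically. The sketch below only builds a query URL for NCBI's E-utilities `esearch` endpoint with the `db`, `term` and `retmax` parameters; it sends no request. The helper name `build_esearch_url` and the example keyword are hypothetical, chosen for illustration.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities search service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(database: str, keyword: str, max_results: int = 20) -> str:
    """Build (but do not send) a keyword-search URL.

    database    -- e.g. "nucleotide" or "protein", matching the two
                   levels mentioned on the slide
    keyword     -- protein name, author or accession number
    max_results -- upper limit on the number of record IDs returned
    """
    params = {"db": database, "term": keyword, "retmax": max_results}
    return ESEARCH_URL + "?" + urlencode(params)


# Hypothetical example query: protein records matching "human insulin".
print(build_esearch_url("protein", "human insulin", 5))
```

Opening the printed URL in a browser, or fetching it with any HTTP client, returns the matching record identifiers.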

Page 16: Sequence Analysis

Results of Data retrieval

• The level of reported similarity indicates potential biological relationships across species and taxonomic divisions

• The significance of a match is reported as an E-value (expect value): the number of hits with an equal or better score expected by chance in a database of that size. E-values start at zero and can exceed one

• An E-value of one or more indicates potential randomness, while values of zero or close to zero are less likely to be random hits
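To make the E-value concrete, here is a small sketch based on the relation used in BLAST statistics, E = m × n × 2^(−S'), where S' is the bit score and m and n are the query and database lengths. The function names are illustrative, and real BLAST uses effective (edge-corrected) lengths, which this sketch ignores. The related P-value, 1 − e^(−E), is the quantity that always lies between zero and one.

```python
import math


def expect_value(bit_score: float, query_length: int, database_length: int) -> float:
    """Expected number of chance hits scoring at least bit_score.

    Uses E = m * n * 2**(-S'), with raw lengths standing in for the
    effective lengths that BLAST itself computes (an assumption).
    """
    return query_length * database_length * 2.0 ** (-bit_score)


def p_value(e_value: float) -> float:
    """Probability of seeing at least one chance hit: P = 1 - exp(-E)."""
    return 1.0 - math.exp(-e_value)


# A weak hit: many such matches are expected by chance.
weak = expect_value(bit_score=20, query_length=300, database_length=10**9)
# A strong hit: chance matches are very unlikely.
strong = expect_value(bit_score=80, query_length=300, database_length=10**9)
print(weak, p_value(weak))
print(strong, p_value(strong))
```

The weak hit gives an E-value far above one, which is why an E-value cannot be read as a probability.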

Page 17: Sequence Analysis

Sequence Alignment

• Pair-wise comparison of sequences

• First step in assessing the properties of a newly sequenced gene

• Finding homologs in other organisms

• Identifying new sequences as novel

• BLAST 2 – Compare two sequences

• ClustalW – Multiple sequence alignment
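As an illustration of pair-wise comparison, the following is a minimal global alignment sketch using the classic Needleman-Wunsch dynamic programming method. It is not how BLAST or ClustalW work internally (BLAST uses heuristic local alignment, and ClustalW builds multiple alignments progressively); the scoring values (match +1, mismatch −1, gap −2) are arbitrary assumptions chosen for the example.

```python
def global_align(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns (aligned_a, aligned_b, score); '-' marks a gap.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and row: aligning a prefix against nothing costs gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the table: best of diagonal (match/mismatch), up (gap in b), left (gap in a).
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            if score[i][j] == diag:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[len(a)][len(b)]


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Identical, non-gap columns as a percentage of alignment length."""
    if not aligned_a:
        return 0.0
    matches = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)


top, bottom, total = global_align("GATTACA", "GATACA")
print(top)
print(bottom)
print(total, percent_identity(top, bottom))
```

When several alignments tie for the best score, the traceback returns only one of them, so the gap position in the output can differ between implementations.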

Page 20: Sequence Analysis

Results of Sequence Alignment

• Several sequences can be submitted and different output settings can be selected

• Identities from pairwise alignments are shown

• Sequence pairs are also listed in order from most identical to least identical

• Phylogenetic trees (graphical description) are also included

Page 21: Sequence Analysis

What Sequence Reveals

• The Biological function of a Gene

• Related sequences in database

• Structure prediction / comparison with X-ray structure

• ORF (open reading frame) if function is unknown

• Domain structure

Page 22: Sequence Analysis

What Sequence Reveals

• Transmembrane segments

• Signal sequence

• Alternate nomenclature

• Genetic information – regulatory sequences

• Translation

• 2-D gels, pI (charge), molecular weight

• Bibliography

Page 23: Sequence Analysis

Identification of Gene

• Software identifies ORFs (Open reading frames) or URFs (unidentified reading frames)

• Searches for long stretches of sequence between a start and a stop codon

• The length of the ORF is directly related to the size or molecular weight of the coded protein

• The comparison of the similarity of two or more sequences is a good indicator of biological function of gene
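The ORF search described above can be sketched as a scan for ATG-to-stop stretches. This is a simplified illustration, not the method of any particular gene-finding program: it checks only the three forward-strand frames, treats ATG as the only start codon, and uses a hypothetical `min_codons` cutoff. The size estimate uses the common rule of thumb of roughly 110 Da per amino acid residue, which is an approximation.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"
AVERAGE_RESIDUE_MASS_DA = 110.0  # rule-of-thumb average, not an exact value


def find_orfs(dna: str, min_codons: int = 30):
    """Find open reading frames on the forward strand.

    Returns a list of (start, end, frame) tuples, where start is the
    index of the ATG, end is the index just past the stop codon, and
    frame is 0, 1 or 2. Only ORFs with at least min_codons coding
    codons (stop codon excluded) are reported.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # first ATG opens the ORF; later ATGs are internal
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


def estimated_mass_kda(n_residues: int) -> float:
    """Approximate protein mass in kDa from its length in residues."""
    return n_residues * AVERAGE_RESIDUE_MASS_DA / 1000.0


# Toy sequence: a 5-codon ORF (ATG + 4 x GCC) followed by a TAA stop.
example = "CC" + "ATG" + "GCC" * 4 + "TAA" + "GG"
print(find_orfs(example, min_codons=3))
print(estimated_mass_kda(300))
```

A long ORF is only a candidate gene; as the last bullet notes, similarity to known sequences is what suggests a biological function.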

Page 24: Sequence Analysis

Redundancy

• Scientists work independently – results in repetitive naming of identical genes and proteins

• Similar to having one person listed as 3 entries in a telephone book: under first, middle and last name

• Redundancy is useful - an unintentional quality control

Page 26: Sequence Analysis

Human Genome Project

• The ultimate physical map of the human genome is the complete DNA sequence: the determination of all base pairs on each chromosome. The completed map will provide biologists with a Rosetta stone for studying human biology and enable medical researchers to begin to unravel the mechanisms of inherited diseases.

• A major focus of the Human Genome Project is the development of automated sequencing technology that can accurately sequence 100,000 or more bases per day at a cost of less than $0.50 per base. Specific goals include the development of sequencing and detection schemes that are faster and more sensitive, accurate, and economical.

Page 27: Sequence Analysis

Human Genome Project

• Second-generation (interim) sequencing technologies will enable speed and accuracy to increase by an order of magnitude (i.e., 10 times greater) while lowering the cost per base. Some important disease genes will be sequenced with such technologies as

– (1) high-voltage capillary and ultra thin electrophoresis to increase fragment separation rate and

– (2) use of resonance ionization spectroscopy to detect stable isotope labels.

Page 28: Sequence Analysis

Human Genome Project

• Third-generation gel-less sequencing technologies, which aim to increase efficiency by several orders of magnitude, are expected to be used for sequencing most of the human genome. These developing technologies include

– (1) enhanced fluorescence detection of individual labeled bases in flow cytometry,

– (2) direct reading of the base sequence on a DNA strand with the use of scanning tunneling or atomic force microscopies,

– (3) enhanced mass spectrometric analysis of DNA sequence, and

– (4) sequencing by hybridization to short panels of nucleotides of known sequence.