Web-based/Open-source Tools for Bioinformatics and Genome Analysis

http://www.faculty.ucr.edu/~tgirke/Teaching/Gen240B_2003.ppt

Bioinformatics Areas

A. Traditional Bioinformatics
- Sequence analysis
- Gene expression analysis
- Proteomics
- Metabolic profiling
- Phenotypes
- Networks

B. Structural Bioinformatics
- Molecular modeling
- Drug design

C. Biological Databases

Systems Biology


Focus of this Seminar

1. Sequences

2. Structure

3. Expression

4. Functional Groups

Bio* Projects and Databases

1. Some Analysis Steps

- Fragment assembly: ESTs and genes
- Mapping
- Annotation
- Gene predictions: ORFs, UTRs, introns, exons, promoters (lots of errors in eukaryote genomes!)
- Similarity searches: BLAST, FASTA, Smith-Waterman
- Gene families: domain databases, multiple alignments
- Structure/function: 2D, 3D structure (availability?)
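The ORF part of the gene prediction step can be illustrated with a minimal Python sketch. It only scans the three forward-strand reading frames for ATG...stop stretches and ignores introns and the reverse strand, so it is much simpler than real eukaryotic gene prediction. The function name find_orfs and the min_codons parameter are illustrative assumptions, not part of any tool named in this seminar.

```python
# Minimal forward-strand ORF scan (illustrative sketch, not a gene predictor).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) tuples for ATG...stop ORFs.

    start and end are 0-based, end is exclusive and includes the stop codon.
    min_codons counts coding codons only (the stop codon is excluded).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the currently open ATG, if any
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


print(find_orfs("CCATGAAATAG"))  # [(2, 11, 2)]
```

Spliced eukaryotic genes will not be found this way, which is one reason the slide warns about annotation errors in eukaryote genomes.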

Important Sequence Databases: Selection

NCBI
- Entrez: http://www.ncbi.nlm.nih.gov/
- Batch Entrez: http://www.ncbi.nlm.nih.gov/entrez/batchentrez.cgi
- Downloads: ftp://ftp.ncbi.nih.gov/blast/db/

EMBL-EBI
- General: http://www.ebi.ac.uk/
- Downloads: http://www.ebi.ac.uk/FTP/

Swiss-Prot
- General: http://us.expasy.org/
- Downloads: http://us.expasy.org/expasy_urls.html

TIGR
- General: http://www.tigr.org/
- Downloads: ftp://ftp.tigr.org/pub/data/

Protein Data Bank (PDB)
- General: http://www.rcsb.org/pdb/
- Downloads: ftp://ftp.rcsb.org/pub/pdb/data


Example: NCBI

Sequence Database Searches

Important search algorithms: Smith-Waterman, FASTA, BLAST

BLAST flavors: http://www.ncbi.nlm.nih.gov/Sitemap/index.html#BLAST

- BLAST: BLASTN, BLASTP, TBLASTN, TBLASTX
- Psi-BLAST: Position-Specific Iterated BLAST
- RPS-BLAST: Reverse Position-Specific BLAST
- Phi-BLAST: Pattern Hit Initiated BLAST
- Mega-BLAST: 10x faster than BLASTN
- BLAST2: pairwise comparisons
- WU-BLAST: Washington University BLAST

Download of NCBI BLAST tools: ftp://ftp.ncbi.nih.gov/toolbox/
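Of the search algorithms listed above, Smith-Waterman is the exact dynamic-programming method that BLAST and FASTA approximate with heuristics. The following Python sketch computes only the best local alignment score, with a linear gap penalty and a simple match/mismatch scheme. The scoring values and the function name are assumptions chosen for illustration; real protein searches use substitution matrices and affine gap costs.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor at 0 is what makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("XXACGTXX", "YYACGTYY"))  # 8
```

Keeping only two rows of the matrix is enough for the score; recovering the alignment itself would require storing the full matrix and tracing back from the best cell.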

Homework Assignment (finish only one assignment!)

- Go to http://www.ncbi.nlm.nih.gov/, select the protein DB, run the query: P450 & hydroxylase & human [organism], and select SwissProt under ‘Limits’.
  - Report the final query syntax from the ‘Details’ page.
- Save the GIs from this final query to a file (select ‘GI List’ format under display).
  - Report how many GIs you retrieved.
- Retrieve the corresponding sequences through Batch Entrez (http://www.ncbi.nlm.nih.gov/entrez/batchentrez.cgi) using the GI list file as query input, and save the sequences in FASTA format.
- Generate a multiple alignment and tree of these sequences using MultAlin (http://prodes.toulouse.inra.fr/multalin/multalin.html).
  - Save the multiple alignment and tree to a file.
  - Identify the putative heme binding cysteine.
- Open the corresponding SwissProt page (http://us.expasy.org/sprot/) for the first P450 sequence in your list.
  - Compare the putative heme binding cysteine with the consensus pattern from the Prosite database.
  - Report the corresponding Pfam ID.
  - How many mouse (Mus musculus) sequences are in this family? (Use the ‘species tree’ on the Pfam db.)
- Run BLASTP against the nr database (again using the first P450 in your list), click on “See Conserved Domains from CDD” (this runs RPS-BLAST), then click on the red P450 domain.
  - Compare the resulting alignment with the result from MultAlin.
  - View the 3D structure in Cn3D, highlight the heme binding cysteine and save the structure (screen shot).
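For the Prosite comparison step, it can help to see how a Prosite-style consensus pattern maps onto a regular expression. The Python sketch below handles the common pattern elements (x, [..], {..}, repeat counts, and the < and > anchors). The pattern used in the example is made up for illustration and is not the actual P450 signature; the real pattern should be taken from the Prosite entry itself. The function names are hypothetical.

```python
import re

# One pattern element: optional '<', a residue/class/exclusion or x,
# an optional repeat count such as (2) or (2,4), and an optional '>'.
ELEMENT = re.compile(
    r"(<?)(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern):
    """Translate a Prosite-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        n_anchor, core, lo, hi, c_anchor = m.groups()
        if core == "x":                 # any residue
            core = "."
        elif core.startswith("{"):      # any residue except those listed
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if lo:
            repeat = "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(("^" if n_anchor else "") + core + repeat
                     + ("$" if c_anchor else ""))
    return "".join(parts)


def find_motif(sequence, pattern):
    """Return (1-based start, matched text) for each non-overlapping hit."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]


# Made-up pattern and sequence, for illustration only.
print(prosite_to_regex("C-x(2)-[ST]-{P}"))          # C.{2}[ST][^P]
print(find_motif("AAACGGSAAA", "C-x(2)-[ST]-{P}"))  # [(4, 'CGGSA')]
```

Positions are reported 1-based so they can be compared directly with residue numbering on a SwissProt page.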

Remote Homology Detection

- Psi-BLAST/RPS-BLAST
- HMMs: HMMER, SAM
- Domain databases
- Fold recognition approaches (Meta Servers)

Protein Domain Databases: Selection

- PFAM: http://pfam.wustl.edu/
- PROSITE: http://us.expasy.org/prosite/
- ProDom: http://prodes.toulouse.inra.fr/prodom/2002.1/html/home.php
- InterPro: http://www.ebi.ac.uk/interpro/

Selection of Tools for Promoter Analysis

Verbumculus, UC Riverside
• http://www.cs.ucr.edu/%7Estelo/Verbumculus/

AlignACE & ScanACE
• http://arep.med.harvard.edu/mrnadata/mrnasoft.html

MEME and META-MEME, San Diego Supercomputer Center
• http://www.sdsc.edu/Research/biology/

Regulatory Sequence Analysis Tools (RSA)
• http://rsat.ulb.ac.be/rsat/

Gibbs Motif Sampler, Cold Spring Harbor
• http://argon.cshl.org/ioschikz/gibbsDNA/mgibbsDNA-form.html

Motif Sampler, searches for over-represented motifs
• http://www.esat.kuleuven.ac.be/~thijs/Work/MotifSampler.html

Stanford, motif finding in upstream sequences
• http://genome-www4.stanford.edu/cgi-bin/ewing/oligoAnalysis.pl


Example: RSA

Promoter Databases: Selection

- Regulatory Sequence Analysis Tools (RSA): http://rsat.ulb.ac.be/rsat/
- Eukaryotic Promoter Database: http://www.epd.isb-sib.ch/
- Human Promoter Database: http://zlab.bu.edu/%7Emfrith/HPD.html
- Arabidopsis: http://exon.cshl.org/cgi-bin/atprobe/atprobe.pl

Alternative Homework (do only one assignment!)

- Work through the tutorial of the Regulatory Sequence Analysis Tools (http://rsat.ulb.ac.be/rsat/).
- Provide a short summary for the different tools.

2. Protein Modeling

Tool collection: http://faculty.ucr.edu/~tgirke/Links.htm

Databases:

Protein Data Bank
- General: http://www.rcsb.org/pdb/
- Downloads: ftp://ftp.rcsb.org/pub/pdb/data

More databases: http://faculty.ucr.edu/~tgirke/Links.htm#Databases

3. Microarrays and Chips

Definition: Hybridization-based technique that allows simultaneous analysis of thousands of samples on a solid substrate.

Applications (examples):
- Transcriptional profiling
- Gene copy number
- Resequencing
- Genotyping
- Single-nucleotide polymorphism
- DNA-protein interaction
- Insertional library screening
- Identification of new cell lines
- Etc.

Developing areas:
- Protein arrays
- Chemical arrays


Why Microarrays?

Simultaneous analysis of over 50,000 genes

Signaling and Metabolic Networks

Regulatory genes

First step in discovery of gene function

Prediction of limiting factors in biological processes

Rapid analysis of mutants and transgenics

Reduce time of costly clinical studies and field trials

[Diagram: DNA arrays (gene expression). Input samples: WT, mutants, transgenics, treatments (biotic, abiotic, chemicals). Outputs: prognosis, diagnosis, target identification.]

Basic Analysis Steps

- Image analysis
- Filtering, background correction
- Standardization, scaling and normalization
- Significance analysis (replicates)
- Cluster analysis (time series)
- Integration with sequence and functional information
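The background correction and normalization steps above can be sketched for a single two-channel array as follows. This is a deliberately simple global approach (log2 ratios shifted so that the median ratio is zero), swapped in for illustration; it is not the method of any particular tool named in this seminar, and the function name and the single constant background value are assumptions.

```python
import math
from statistics import median


def median_centered_log_ratios(sample, reference, background=0.0):
    """log2(sample/reference) per spot, shifted so the median ratio is 0.

    Spots whose background-corrected signal is not positive in either
    channel are returned as None (filtered out).
    """
    ratios = []
    for s, r in zip(sample, reference):
        s, r = s - background, r - background
        ratios.append(math.log2(s / r) if s > 0 and r > 0 else None)
    # Global normalization: assume most genes are unchanged, so the
    # median log ratio across the array should be 0.
    center = median(x for x in ratios if x is not None)
    return [None if x is None else x - center for x in ratios]


print(median_centered_log_ratios([200, 400, 800], [100, 100, 100]))
# [-1.0, 0.0, 1.0]
```

The centering step relies on the assumption that most genes do not change between the two samples; intensity-dependent methods relax that assumption further.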


Planning Steps of Transcriptional Profiling Experiments

1. Biological question(s), e.g.:

- Which genes are up or down-regulated in a mutant/transgenic line?

- Which genes cycle during a series of treatments?

2. Selection of best biological samples

- Minimize variability in sample collection.

3. Develop validation and follow-up strategy for expected expression hits

- e.g. real-time PCR and analysis of transgenics or mutants

4. Choose type of experiment

- pairwise: e.g. WT vs. mutant/transgenic

- series of time points or treatments: allows cluster analysis

5. Choose reference

- sample with maximum number of expressed genes (maximum biological information)

- pooled RNA of all points: less variability from reference, saves chips

[Diagram of experimental designs; sample labels: WTt1, WTt2, MTt1, MTt2; WTt1 to WTt5]


Planning Steps of Transcriptional Profiling Experiments

6. How many replicates?

- biological replicate: starts with sample collection

- technical replicate: starts usually with same RNA isolation

- dye-swaps: (1) WT-Cy3:MT-Cy5, (2) WT-Cy5:MT-Cy3

7. Management of sample collection and RNA isolation

- Define a “realistic” volume

- RNA quality tests!!!!

8. cDNA/cRNA labeling

- Which labeling technique? RNA amplification, reliability, sensitivity, etc.

9. Array hybridizations and post-processing

10. Array scanning
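The dye-swap pair listed under step 6 can be combined per gene as sketched below. The sketch assumes both arrays have already been reduced to log2(Cy5/Cy3) values in the same gene order, with array (1) labeled WT-Cy3:MT-Cy5 and array (2) labeled WT-Cy5:MT-Cy3; the function name is hypothetical.

```python
def combine_dye_swap(m_forward, m_swapped):
    """Combine a dye-swap pair of log2(Cy5/Cy3) values per gene.

    m_forward: array (1), WT-Cy3:MT-Cy5, so the value is log2(MT/WT) + bias
    m_swapped: array (2), WT-Cy5:MT-Cy3, so the value is log2(WT/MT) + bias
    Returns (effect, dye_bias) per gene, where effect estimates log2(MT/WT).
    """
    return [((f - s) / 2, (f + s) / 2) for f, s in zip(m_forward, m_swapped)]


print(combine_dye_swap([1.5], [-0.5]))  # [(1.0, 0.5)]
```

Taking half the difference cancels a gene-specific dye bias that is shared by both arrays, while half the sum estimates that bias.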

Important Pattern Recognition (Clustering) Methods

- Hierarchical clustering: single, average (UPGMA) and complete linkage
- Non-hierarchical clustering: Self Organizing Maps (SOM), k-means
- Dimension reduction analysis: Principal Component Analysis
- Neural networks & machine learning
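As an illustration of the first method in this list, the Python sketch below performs agglomerative clustering with average linkage (UPGMA) on small expression profiles. Euclidean distance is an assumption made here for simplicity (correlation-based distances are also common for expression data), and the naive search over all cluster pairs is only suitable for toy inputs. All names are hypothetical.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Euclidean distance between two equally long profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def average_linkage(profiles):
    """Agglomerative clustering with average linkage (UPGMA).

    profiles maps a gene name to its list of expression values.
    Returns the merge steps as (members_a, members_b, distance) tuples.
    """
    def linkage(pair):
        # Average of all pairwise distances between the two clusters.
        a, b = pair
        total = sum(euclidean(profiles[x], profiles[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=linkage)
        merges.append((a, b, linkage((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


steps = average_linkage({"g1": [0, 0], "g2": [0, 1], "g3": [5, 5]})
print(steps[0])  # (('g1',), ('g2',), 1.0)
```

Replacing the average in linkage with min or max would give single or complete linkage, the other two variants named on the slide.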

Tools for Microarray Analysis

- Image analysis: ScanAlyze
- Normalization: SNOMAD, R projects
- Mining/clustering: J-Express, R projects
- Much more: http://faculty.ucr.edu/%7Etgirke/Links.htm#Profiling


Example of an Integrated Clustering Tool: J-Express

Microarray Databases: Selection

- Stanford Microarray Database (SMD): http://genome-www5.stanford.edu/MicroArray/SMD/
- Gene Expression Omnibus (GEO): http://www.ncbi.nlm.nih.gov/geo/

Alternative Homework Assignment (do only one assignment!)

- Go to the SNOMAD page (Standardization and Normalization of Microarray Data): http://pevsnerlab.kennedykrieger.org/snomadinput.html

- Select “Use an Example dataset to see how SNOMAD works” and choose either option #2 (Incyte dataset) or #3 (Affymetrix dataset). If you prefer, you can use your own or other public data instead. A good resource for downloading public data is the Stanford site: http://genome-www5.stanford.edu/cgi-bin/SMD/publicData.pl

- Select all possible transformations and graphs and submit the data for processing.

- Report: Give a short description (one or two sentences) of each graph/transformation in the returned results.

4. Functional Groups

Assigning “Biological Meaning” to Profiling Data

Protein families
- COGs (43 genomes, NCBI): http://www.ncbi.nlm.nih.gov/COG/
- Protein domain databases (PFAM)

Gene Ontology Consortium
- Definition: controlled vocabulary for all organisms
- http://www.geneontology.org/

Pathways
- KEGG Metabolic Pathways: http://www.genome.ad.jp/kegg/kegg2.html
- WIT Database (39 genomes): http://wit.mcs.anl.gov/WIT2/

Toolboxes for Bioinformaticians

Popular scripting languages
- Perl: http://www.perl.com/
- Python: http://www.python.org/

Bio* modules for processing data from databases and applications
- BioPerl: http://bio.perl.org/
- BioPython: http://biopython.org/
- BioJava: http://www.biojava.org/
- BioRuby: http://bioruby.org/

Statistics
- R: http://www.R-project.org
- BioConductor (Microarray): http://www.bioconductor.org/

Database systems
- MySQL: http://www.mysql.com/
- PostgreSQL: http://www.postgresql.org/