Bioinformatics for Genomic and Proteomic data analysis


• Sequence Analysis

-- Predicting Function, domains etc.

-- Predicting physico-chemical properties of proteins (ProtParam).

-- Predicting signal peptides and transmembrane proteins (SignalP).

-- Finding homology between sequences, identifying repeats, etc. (DOTPLOT).

-- Major databases and retrieval techniques.

• Structure analysis

-- Gene Prediction

-- Phylogenetic analysis

-- Alignment techniques (BLAST, PSI-BLAST)

-- Analysis of Protein structure and conformation (Rasmol, SwissPDBViewer, VMD etc).

-- Protein structure prediction: homology modeling (SwissModel, Modeller).

• Some practical applications

-- Sequence analysis

-- Structure analysis

Major Bioinformatics databases, search engines and data formats.

By: Sachin Pundhir Bioinformatics sub-centre DAVV, Indore

Database

• Collection of records and files

• Organized for a particular purpose

• Tables

– Tuples (records)

» Attributes

» Values

BIO520 Student Database

1998

Name  ID   Grade
Amy   123  A
Joe   456  B
Sue   789  C

Table: the whole grid. Tuple: one row (e.g. Joe 456 B). Attribute: one column (e.g. Grade). Value: one cell (e.g. B).
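The four terms can be tried out in a few lines of Python. This is only a sketch: the names `Student` and `students` are made up for illustration, and a real database stores its tables on disk rather than in a list.

```python
from collections import namedtuple

# One tuple (record) type; its field names are the table's attributes.
Student = namedtuple("Student", ["name", "id", "grade"])

# The table is the collection of all tuples.
students = [
    Student("Amy", 123, "A"),
    Student("Joe", 456, "B"),
    Student("Sue", 789, "C"),
]

record = students[1]          # one tuple: Joe's row
attributes = Student._fields  # the attribute names
value = record.grade          # one value: Joe's grade

print(attributes, record, value)
```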

Database Operations

• Tables: create, delete

• Tuples (records): read, write, delete

• Search, sort, modify, print…

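The operations listed above can be sketched with Python's built-in `sqlite3` module on the student table. The table name `students` and the in-memory database are assumptions made for this example; the slides do not name a database engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# Table: create
cur.execute("CREATE TABLE students (name TEXT, id INTEGER PRIMARY KEY, grade TEXT)")

# Tuples: write
cur.executemany(
    "INSERT INTO students (name, id, grade) VALUES (?, ?, ?)",
    [("Amy", 123, "A"), ("Joe", 456, "B"), ("Sue", 789, "C")],
)

# Search: read the tuple whose id is 456
joe = cur.execute(
    "SELECT name, grade FROM students WHERE id = ?", (456,)
).fetchone()

# Modify: change one value in Sue's tuple
cur.execute("UPDATE students SET grade = ? WHERE name = ?", ("A", "Sue"))

# Tuples: delete
cur.execute("DELETE FROM students WHERE name = ?", ("Joe",))

# Sort: remaining tuples in reverse alphabetical order of name
remaining = cur.execute(
    "SELECT name, id, grade FROM students ORDER BY name DESC"
).fetchall()
print(joe, remaining)

# Table: delete
cur.execute("DROP TABLE students")
conn.close()
```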

International Nucleotide Sequence Database Collaboration (INSDC)

• Consists of:

– DDBJ (Japan)

– GenBank (USA)

– EMBL Nucleotide Sequence Database

• The three databases exchange new and updated data on a daily basis to achieve optimal synchronisation.

Bioinformatics databases

• Nucleotide sequence database:

– GenBank: nucleotide sequence database. Highly redundant.

– DDBJ: DNA Data Bank of Japan.

– EMBL: nucleotide sequence database.

– RefSeq: integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.

Primary databases

• Protein sequence database:

– GenPept: protein sequence database.

– UniProtKB/Swiss-Prot: curated protein sequence database with minimal redundancy and a high level of integration with other databases.

– UniProtKB/TrEMBL: computer-annotated supplement of Swiss-Prot that contains all translations of EMBL nucleotide sequence entries not yet integrated in Swiss-Prot.

– RefSeq: well-curated, non-redundant database.

• Structure database:

– PDB: Protein Data Bank

– MMDB: Molecular Modeling Database

Secondary database

GenBank Record

• Header: information that applies to the whole record

• Features: annotations on the record

• Sequence
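As a rough illustration of these three parts, the sketch below splits one flat-file record at the `FEATURES` and `ORIGIN` keywords and stops at the `//` end marker. The record in `SAMPLE` is made up and heavily shortened, and `split_genbank` is a hypothetical helper; for real work an established parser is the safer choice.

```python
# A made-up, heavily shortened record for illustration only (not real data).
SAMPLE = """\
LOCUS       EXAMPLE01                 24 bp    mRNA    linear   PRI 01-JAN-2000
DEFINITION  Hypothetical example record.
ACCESSION   EXAMPLE01
VERSION     EXAMPLE01.1
FEATURES             Location/Qualifiers
     source          1..24
                     /organism="Homo sapiens"
     CDS             1..24
ORIGIN
        1 atggctgcta gcgatcgatc gtaa
//
"""


def split_genbank(text):
    """Split one flat-file record into header lines, feature lines and sequence."""
    sections = {"header": [], "features": [], "sequence": []}
    current = "header"
    for line in text.splitlines():
        if line.startswith("FEATURES"):
            current = "features"   # feature table starts after this line
            continue
        if line.startswith("ORIGIN"):
            current = "sequence"   # sequence data starts after this line
            continue
        if line.startswith("//"):  # end-of-record marker
            break
        sections[current].append(line)
    # Sequence lines carry position numbers and spaces; keep letters only.
    seq = "".join(ch for ln in sections["sequence"] for ch in ln if ch.isalpha())
    return sections["header"], sections["features"], seq


header, features, sequence = split_genbank(SAMPLE)
print(len(header), len(features), sequence)
```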

GenBank Record: Header

• Locus Name

• Sequence Length

• Molecule Type

• GenBank Division

• Modification Date

• Accession Number

• Version Number
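Most of these header fields sit on the `LOCUS` line of the flat file, and the version number follows the accession on the `VERSION` line. The sketch below assumes a whitespace-separated `LOCUS` line that also carries a topology field (not listed on the slide); the sample lines are made up, and `parse_locus` and `parse_version` are hypothetical helper names.

```python
def parse_locus(line):
    """Parse a LOCUS line, assuming whitespace-separated fields in the order:
    name, length, 'bp', molecule type, topology, division, date."""
    parts = line.split()
    if len(parts) != 8 or parts[0] != "LOCUS":
        raise ValueError("unexpected LOCUS line layout")
    return {
        "locus_name": parts[1],
        "sequence_length": int(parts[2]),
        "molecule_type": parts[4],
        "topology": parts[5],
        "division": parts[6],
        "modification_date": parts[7],
    }


def parse_version(line):
    """Split a VERSION value of the form 'ACCESSION.N' into its two parts."""
    value = line.split()[1]
    accession, _, version = value.partition(".")
    return accession, int(version)


# Made-up sample lines, for illustration only.
locus = parse_locus(
    "LOCUS       EXAMPLE01                 24 bp    mRNA    linear   PRI 01-JAN-2000"
)
accession, version = parse_version("VERSION     EXAMPLE01.1")
print(locus, accession, version)
```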

GenBank Record: Features

Link to Seq

GenBank Record: Sequence

Using Entrez

An integrated database search and retrieval system

WWW Access

Entrez & BLAST

Genomes

Taxonomy

Entrez: Database Integration

PubMed abstracts

Nucleotide sequences

Protein sequences

3-D Structure

Word weight

VAST

BLAST

Phylogeny

Database Searching with Entrez

• Using limits and field restriction to find the human MutL homolog

• Linking and neighboring with MutL

Global Entrez Search

Document Summaries: MutL[All Fields]

Entrez Nucleotides: Limits & Preview/Index

Tabs

MutL

Entrez Nucleotides: Limits

Search fields: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title, Uid, Volume

Field Restriction

Exclude bulk sequences

MutL

Entrez Nucleotides: Limits

Title field == Definition line of the record

Exclude Bulk Sequences

Document Summaries: Limits

Adding Terms: Preview/Index

Search fields: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title, Uid, Volume
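Field restriction can also be written directly into the query string by attaching a field name in square brackets to each term, as in MutL[All Fields] above. The sketch below builds such a query and a matching URL for NCBI's E-utilities `esearch` endpoint without sending it. The helper names are made up, and the choice of the Title and Organism fields and of `db=nucleotide` is an assumption about how the search on the slides might be reproduced.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities search service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field_term(text, field):
    """Restrict a search term to one Entrez field, e.g. MutL[Title]."""
    return f"{text}[{field}]"


def build_query(*terms):
    """Combine several restricted terms so that all of them must match."""
    return " AND ".join(terms)


query = build_query(
    field_term("MutL", "Title"),
    field_term("human", "Organism"),
)

# Only the URL is constructed here; nothing is sent over the network.
url = ESEARCH + "?" + urlencode({"db": "nucleotide", "term": query})

print(query)
print(url)
```

Keeping query construction separate from the network call makes it easy to inspect the exact term string before running a search.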

Human MutL Search Results

Human MutL RefSeq

GenBank Records

NM_000249: Links

Literature Links

PubMed

OMIM

NM_000249: PubMed

Books

Books Link

OMIM: Human Disease Genes

Conserved Domain

Sequence Links

Nucleotide Protein

NM_000249: Related Sequences

similarity

Original GenBank mRNAs

Original GenBank genomic

Genome Project BAC

Taxonomy Link

The Tax Browser

NCBI’s Taxonomy

Taxonomy Link

NCBI Protein Databases

• GenPept: GenBank, EMBL, DDBJ CDS translations

• RefSeq: mRNA based (NP_) and genome based (XP_)

• Swiss-Prot: curated, high-quality protein reviews

• PIR: Protein Information Resource, Georgetown University

• PRF: Protein Research Foundation

• PDB: Protein Data Bank, sequences from structures
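The RefSeq accession prefix shows at a glance what kind of record an accession points to. The sketch below covers only the prefixes that appear in these slides (NM_ from the MLH1 mRNA record NM_000249, plus NP_ and XP_); the function name and the accession `NP_EXAMPLE` are made up for illustration.

```python
# Prefixes mentioned in these slides; RefSeq defines others not listed here.
REFSEQ_PREFIXES = {
    "NM_": "mRNA",
    "NP_": "protein (mRNA based)",
    "XP_": "protein (genome based)",
}


def classify_accession(accession):
    """Return the record type implied by the first three characters."""
    prefix = accession[:3].upper()
    return REFSEQ_PREFIXES.get(prefix, "prefix not covered in this sketch")


for acc in ["NM_000249", "NP_EXAMPLE", "XP_EXAMPLE", "EXAMPLE01"]:
    print(acc, "->", classify_accession(acc))
```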

Protein Link

BLAST Link

Conserved Domains

Related Proteins: Redundancy

Redundant Sequences

Sequence from MutL structure

Related Proteins: Links

BLink: non-redundant relatives

Arabidopsis homolog

Conserved Domain

MLH1 Domain Structure: CDD

ATPase Domain

Mismatch Repair Domain

MLH1: ATPase Domain

ATPase structural alignment

ATP Binding site helix

Genome Resources

NM_000249: Genome Links

Higher Genome Resources

MLH1: UniGene Cluster

ESTs in UniGene

The New Homologene

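[Figure: globin gene tree. A gene duplication splits an early globin gene into an A-chain gene and a B-chain gene. Leaves: frog A, chick A, mouse A, mouse B, chick B, frog B. The A-chain genes of different species are orthologs, as are the B-chain genes; A-chain and B-chain genes are paralogs.]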

• No longer UniGene based

• Protein similarities first

• Guided by taxonomic tree

• Includes orthologs and paralogs
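The ortholog/paralog distinction in the globin example on this slide can be sketched as a simple rule: genes on the same chain were separated by speciation (orthologs), while genes on different chains trace back to the gene duplication (paralogs). The rule below is valid only for this one tree, where the duplication predates all the speciation events; it is not how HomoloGene computes its groups.

```python
from itertools import combinations

# Leaves of the globin tree on the slide: (species, chain).
genes = [
    ("frog", "A"), ("chick", "A"), ("mouse", "A"),
    ("mouse", "B"), ("chick", "B"), ("frog", "B"),
]


def relationship(g1, g2):
    """Same chain: diverged by speciation (orthologs).
    Different chain: diverged at the gene duplication (paralogs)."""
    return "orthologs" if g1[1] == g2[1] else "paralogs"


pairs = {(a, b): relationship(a, b) for a, b in combinations(genes, 2)}

for (a, b), rel in pairs.items():
    print(f"{a[0]} {a[1]} / {b[0]} {b[1]}: {rel}")
```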

The New Homologene

Entrez Genes: integrated gene-based access

• LocusLink

• Complete Genomes

– eukaryotic

– microbial

– organelle

Genes MLH1: Central Resource

QUESTIONS!!!
