Practical 1



  • Discussion: Practical 1

  • Features of major databases (PubMed and NCBI Protein Db)

  • Anatomy of PubMed Db

  • Epub ahead of print and journal impact factor
    How to get the impact factor of any journal:
    Direct source: Web of Science database

    Indirect source: e.g. blogs, sites etc. (do a Google search)
    Adapted from: http://admin-apps.isiknowledge.com/JCR/JCR?RQ=LIST_SUMMARY_JOURNAL

  • Anatomy of a PubMed record

  • Demo on downloading articles

  • Anatomy of a Protein Db

  • Accession numbers and GenInfo Identifiers
    Accession: NM_000546
    Version: NM_000546.3
    GI (GenInfo Identifier): 120407067
    Source: RefSeq database

    Other popular sources:
    dbj  DDBJ (DNA Data Bank of Japan) database
    emb  European Molecular Biology Laboratory (EMBL) database
    prf  Protein Research Foundation database
    sp   SwissProt
    gb   GenBank
    pir  Protein Information Resource

  • Why do we need an accession number and a GI for one record?
    1) What is the difference between an accession and a GI?

    2) Why do we need both when both seem to be accession numbers?

  • Why do we need an accession number and a GI for one record?
    Q1) Which revision will NCBI show if you search by the accession only, without the version number?

    Revision     Accession  Version      GI
    Sequence_v1  NM_000546  NM_000546.1  4507636
    Sequence_v2  NM_000546  NM_000546.2  8400737
    Sequence_v3  NM_000546  NM_000546.3  120407067

    Each sequence update yields a new version and a new GI; the accession stays the same.

  • Accession numbers
    The unique identifier for a sequence record.

    An accession number applies to the complete record.

    Accession numbers do not change, even if information in the record is changed at the author's request.

    Sometimes, however, an original accession number might become secondary to a newer accession number, if the authors make a new submission that combines previous sequences, or if for some reason a new submission supersedes an earlier record.

  • GenInfo Identifiers
    GenInfo Identifier (GI): a sequence identification number.

    If a sequence changes in any way, a new GI number is assigned.

    A separate GI number is also assigned to each protein translation within a nucleotide sequence record.

    A new GI is assigned if the protein translation changes in any way.

    GI sequence identifiers run parallel to the newer accession.version system of sequence identifiers.

  • Version
    A nucleotide sequence identification number that represents a single, specific sequence in the GenBank database.

    If there is any change to the sequence data (even a single base), the version number is increased, e.g. U12345.1 becomes U12345.2, but the accession portion remains stable.

    The accession.version system of sequence identifiers runs parallel to the GI number system, i.e. when any change is made to a sequence, it receives a new GI number AND an increase to its version number.

    A Sequence Revision History tool (http://www.ncbi.nlm.nih.gov/entrez/sutils/girevhist.cgi) is available to track the various GI numbers, version numbers, and update dates for sequences that appeared in a specific GenBank record.
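As a rough illustration of how the accession, version, and GI relate, the sketch below models the NM_000546 revision history from this practical as a small lookup table. The function names `split_accession` and `resolve` are hypothetical, and the rule that a bare accession resolves to the highest version is an assumption made for the sketch, not a description of NCBI's implementation.

```python
# Revision history for one accession, using the NM_000546 example from
# this practical. Each entry is (accession.version, GI number).
HISTORY = [
    ("NM_000546.1", 4507636),
    ("NM_000546.2", 8400737),
    ("NM_000546.3", 120407067),
]


def split_accession(identifier):
    """Split 'NM_000546.3' into ('NM_000546', 3); version is None if absent."""
    accession, dot, version = identifier.partition(".")
    return accession, (int(version) if dot else None)


def resolve(identifier, history=HISTORY):
    """Return the (accession.version, GI) pair an identifier points to.

    Assumption for this sketch: an accession given without a version
    resolves to the highest version on record.
    """
    accession, version = split_accession(identifier)
    # Collect (version number, accession.version, GI) for this accession.
    matches = [
        (split_accession(acc_ver)[1], acc_ver, gi)
        for acc_ver, gi in history
        if split_accession(acc_ver)[0] == accession
    ]
    if version is not None:
        matches = [m for m in matches if m[0] == version]
    if not matches:
        raise KeyError(identifier)
    _, acc_ver, gi = max(matches)  # highest version among the candidates
    return acc_ver, gi
```

Calling `resolve("NM_000546.1")` looks up one specific revision, while `resolve("NM_000546")` falls back to the most recent one in the table. Either way, exactly one GI comes back, which is the point of the two parallel systems: the accession names the record, and the version or GI pins down one exact sequence.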

  • Anatomy of a Protein Db record

  • Fasta Sequence

  • Fasta Format
    Text-based format for representing nucleic acid sequences or peptide sequences (single letter codes).
    Sequences in this format are easy to manipulate and easy for programs to parse.

    >SEQUENCE_1
    MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
    LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
    IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
    MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
    >SEQUENCE_2
    SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
    ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH

    In each record, the row starting with ">" is the description line/row; the rows that follow are the sequence data line(s).

  • Fasta Format (cont.)
    Begins with a single-line description, followed by lines of sequence data.

    Description line
    Distinguished from the sequence data by a greater-than (">") symbol.
    The word following the ">" symbol in the same row is the identifier of the sequence. There should be no space between the ">" and the first letter of the identifier.
    Keep the identifier short and clear; some old programs accept identifiers of at most 10 characters. For example: >gi|5524211|Human or >HumanP53

    Sequence line(s)
    Ensure that the sequence data starts in the row following the description row (be careful of the word wrap feature).
    The sequence ends if another line starting with a ">" appears; this indicates the start of another sequence.
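The rules above can be sketched as a short Python parser. This is a minimal illustration, not a replacement for an established library; the names `parse_fasta` and `check_description_line` and the `max_id_length` parameter are hypothetical, and the default limit of 10 characters simply mirrors the advice about old programs.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) pairs from FASTA text."""
    records = []
    description = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new ">" row ends the previous sequence and starts another.
            if description is not None:
                records.append((description, "".join(chunks)))
            description = line[1:]
            chunks = []
        elif description is not None:
            chunks.append(line)  # sequence data may span several rows
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


def check_description_line(line, max_id_length=10):
    """Return a list of problems found in one FASTA description line."""
    if not line.startswith(">"):
        return ["description line must start with '>'"]
    problems = []
    rest = line[1:]
    if not rest or rest[0].isspace():
        problems.append("identifier must follow '>' with no space")
    fields = rest.split()
    identifier = fields[0] if fields else ""
    if len(identifier) > max_id_length:
        problems.append(
            "identifier is longer than %d characters" % max_id_length
        )
    return problems
```

For example, `parse_fasta(">seq1\nMTEIT\nAAMVK\n>seq2\nSATVS\n")` joins the two data rows of the first record into one sequence, and `check_description_line(">HumanP53")` returns an empty list because the identifier is short and directly follows the ">".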

  • Amino acids

  • IUPAC One Letter Amino Acid Code
    Code  Amino acid                     Mnemonic
    A     Alanine
    B     Asx
    C     Cysteine
    D     Aspartic acid                  Aspar(D)ic acid
    E     Glutamic acid
    F     Phenylalanine                  (F)enylalanine
    G     Glycine
    H     Histidine
    I     Isoleucine
    J     Leucine or isoleucine
    K     Lysine
    L     Leucine
    M     Methionine
    N     Asparagine                     Asparagi(N)e
    O     Pyrrolysine (Pyl, 22nd)        Pyrr(O)lysine
    P     Proline
    Q     Glutamine                      (Q)lutamine
    R     Arginine                       (R)ginine
    S     Serine
    T     Threonine
    U     Selenocysteine (Sec, 21st)
    V     Valine
    W     Tryptophan                     T(W)ptophan
    X     Unspecified or unknown
    Y     Tyrosine                       T(Y)rosine
    Z     Glx

  • Note

    Amino acid                            Three letter code  Single letter code
    Asparagine or aspartic acid           Asx                B
    Glutamine or glutamic acid            Glx                Z
    Leucine or isoleucine                 Xle                J
    Unspecified or unknown amino acid     Xaa                X
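One way to practise the codes is to translate a one-letter sequence into three-letter codes. The sketch below uses the standard IUPAC mapping, including the ambiguity codes from the note above; the names `ONE_TO_THREE` and `to_three_letter` are hypothetical.

```python
# One-letter to three-letter amino acid codes (IUPAC), including the
# 21st and 22nd amino acids and the four ambiguity codes.
ONE_TO_THREE = {
    "A": "Ala", "C": "Cys", "D": "Asp", "E": "Glu", "F": "Phe",
    "G": "Gly", "H": "His", "I": "Ile", "K": "Lys", "L": "Leu",
    "M": "Met", "N": "Asn", "P": "Pro", "Q": "Gln", "R": "Arg",
    "S": "Ser", "T": "Thr", "V": "Val", "W": "Trp", "Y": "Tyr",
    "U": "Sec", "O": "Pyl",                          # selenocysteine, pyrrolysine
    "B": "Asx", "Z": "Glx", "J": "Xle", "X": "Xaa",  # ambiguity codes
}


def to_three_letter(sequence):
    """Convert a one-letter peptide sequence to hyphen-joined three-letter codes.

    Raises KeyError if the sequence contains a character that is not a
    recognised one-letter code.
    """
    return "-".join(ONE_TO_THREE[residue] for residue in sequence.upper())
```

For instance, `to_three_letter("MTEIT")` gives `"Met-Thr-Glu-Ile-Thr"`, the first five residues of SEQUENCE_1 from the FASTA example.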

  • Advice
    We highly recommend that you memorize the amino acid codes and their structures.
    Memorizing the codes, and in particular the structures, will be very useful for this module and other modules, especially for research purposes. It is not compulsory to memorize them for this module.

  • Features of major database (Gene Db)

  • Anatomy of Gene Db

  • Anatomy of a Gene Db record

  • A section of a Gene Db record: Reference Sequences
    mRNA accession number
    Protein accession number

  • Take-home messages for databases
    Bioinformatics = databases + tools
    General databases versus specialized databases
    Databases come and go (especially the small ones)
    Database redundancy: many databases cover the same topic (use the most comprehensive one; failing that, use all of them for comprehensiveness)
    Database accuracy: published ones are more reliable; nevertheless, they are still prone to errors. It is always good to spend some time assessing the reliability of your data of interest by cross-referencing to the literature or to other databases
    Fortunately, most databases are cross-referenced
    Unfortunately, there is no common standard format; you need to spend some time familiarizing yourself with each one, which becomes easy after some practice
    Finding databases relevant to you: NAR Database catalogue, PubMed, Google
    2 main methods for searching databases (each with its own pros and cons):
    1. Keyword search (covered today)
    2. Sequence search (day 2)
