DATA ACQUISITION FROM BIO-DATABASES AND BLAST Natapol Pornputtapong 18 January 2018



DATABASE

• Collections of data

• To share – multi-user interface

• To prevent data loss

• To make sure users get the right data

Bioinformatics for Phylogenetic Analysis Workshop 2


LIBRARY -> DIGITAL LIBRARY


DATABASE: A LIBRARY OF DATA

Database

• Files, tables, records

• Data structure

• Database management system

• Programming interface

• User interface

Library

• Books

• Building, shelves

• Librarian

• Protocols, SOPs

• Services


ADVANTAGES OF A DATABASE

• Data integrity

• Smaller space

• Data availability

• Speed


DATABASE FOR USERS

[Diagram: Users interact with the Database through Search, Download, and Submission]

HOW TO CHOOSE A DATABASE?

• 1,695 bio-databases are listed in the NAR online Molecular Biology Database Collection, grouped into 15 categories


DATA CONTENT

• Literature

• DNA sequence

• Protein sequence

[Figure: example databases shown on the slide: GenBank, RefSeq, TrEMBL]

CONCEPTS OF DATABASE

[Diagram: sources, databases, and database interfaces]

• Primary database

• Secondary database

PRIMARY & SECONDARY DB

Primary database

• Synonyms: archival database

• Source of data: direct submission of experimentally derived data from researchers

• Examples: ENA, GenBank and DDBJ (nucleotide sequence); ArrayExpress Archive and GEO (functional genomics data); Protein Data Bank (PDB; coordinates of three-dimensional macromolecular structures)

Secondary database

• Synonyms: curated database; knowledgebase

• Source of data: results of analysis, literature research and interpretation, often of data in primary databases

• Examples: InterPro (protein families, motifs and domains); UniProt Knowledgebase (sequence and functional information on proteins); Ensembl (variation, function, regulation and more layered onto whole genome sequences)


DATA COLLECTION CRITERIA

GenBank and RefSeq

GenBank® is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences.


ACCESSIBILITY: TOOLS & INTERFACES

• NCBI Entrez

• RESTful interface to the ENA
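As a rough sketch of what programmatic access can look like, NCBI Entrez is also reachable through its E-utilities web service. The example below only builds an `efetch` request URL for one nucleotide record in FASTA format and does not contact the server. The helper name `build_efetch_url` and the accession used are illustrative assumptions and are not part of the workshop material.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities web service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_efetch_url(accession, db="nuccore", rettype="fasta"):
    """Return an efetch URL for one record (hypothetical helper)."""
    params = {
        "db": db,            # Entrez database to read from
        "id": accession,     # accession or UID of the record
        "rettype": rettype,  # record format, e.g. "fasta" or "gb"
        "retmode": "text",   # plain-text response
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    # Example accession chosen only for illustration.
    print(build_efetch_url("NM_000546"))
```

The resulting URL can be opened in a browser or passed to any HTTP client. Keeping URL construction separate from the download makes the request easy to inspect before it is sent.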


NCBI SEARCH TOOL


SIMPLE SEARCH


BOOLEAN OPERATOR


FILTER

• Limit with filter

• Advanced search builder
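Boolean operators and field limits can be combined into a single Entrez search term, which is roughly what the advanced search builder produces. The sketch below assembles such a term as a string. The helper names `field` and `combine` and the example search terms are assumptions made for illustration; the field tags shown (`[Title]`, `[Organism]`) are standard Entrez tags.

```python
def field(term, tag):
    """Attach an Entrez field tag, e.g. '"Homo sapiens"[Organism]'."""
    # Quote multi-word terms so they are searched as a phrase.
    if " " in term:
        term = f'"{term}"'
    return f"{term}[{tag}]"


def combine(operator, *clauses):
    """Join clauses with an upper-case Boolean operator (AND, OR, NOT)."""
    operator = operator.upper()
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    # Parentheses make the grouping explicit when terms are nested.
    return "(" + f" {operator} ".join(clauses) + ")"


# Hypothetical search: a gene name in the title, limited to two organisms.
query = combine(
    "AND",
    field("cytochrome b", "Title"),
    combine(
        "OR",
        field("Homo sapiens", "Organism"),
        field("Pan troglodytes", "Organism"),
    ),
)
print(query)
```

The printed string can be pasted into the NCBI search box. Entrez expects the Boolean operators in upper case, which is why `combine` normalizes them.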


RESULTS


BLAST: BASIC LOCAL ALIGNMENT SEARCH TOOL


MAJOR BLAST PROGRAMS
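The choice among the standard BLAST programs depends on whether the query and the database hold nucleotide or protein sequences. The sketch below encodes that rule of thumb as a lookup table. The function name `choose_blast_program` and its parameters are assumptions for illustration only; the program names themselves are the standard ones.

```python
# (query type, database type) -> BLAST program
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",   # query is translated
    ("protein", "nucleotide"): "tblastn",  # database is translated
}


def choose_blast_program(query_type, db_type, translate_both=False):
    """Pick a BLAST program from the sequence types (hypothetical helper)."""
    if translate_both:
        # tblastx compares a translated nucleotide query against a
        # translated nucleotide database.
        if (query_type, db_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs nucleotide query and database")
        return "tblastx"
    return BLAST_PROGRAMS[(query_type, db_type)]


print(choose_blast_program("protein", "nucleotide"))
```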


BLAST SEARCH


OTHER BLAST PROGRAMS


WORLD OF FILES

• Text files

• Binary files


TEXT FILES: WORLD OF FORMATS

• MS Word: .doc, .docx, .rtf, .txt

• Sequence: FASTA (.fasta), GenBank (.gbk)

• Protein structure: PDB (.pdb)


FASTA FORMAT

>P01013 GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMP
FHVTKQESKPVQMMCMNNSFNVATLPAEKMKILELPFASGDL
SMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVY
LPQMKIEEKYNLTSVLMALGMTDLFIPSANLTGISSAESLKI
SQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
FLFLIKHNPTNTIVYFGRYWSP
>…
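A FASTA record is a header line starting with `>` followed by one or more sequence lines. As a minimal sketch of how such a file can be read, the function below collects each header with its joined sequence. The name `parse_fasta` is an assumption for illustration, and the sample text is only an excerpt (the first two sequence lines) of the record shown on the slide.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between sequence lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Excerpt of the slide's record, shortened for illustration.
example = """>P01013 GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMP
FHVTKQESKPVQMMCMNNSFNVATLPAEKMKILELPFASGDL
"""

for header, sequence in parse_fasta(example):
    accession = header.split()[0]  # first word of the header line
    print(accession, len(sequence))
```

For real analyses an established parser such as the one in Biopython is usually the safer choice. The sketch is only meant to show how little structure the format has.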


GENBANK FORMAT


NEXUS FORMAT

#NEXUS
BEGIN DATA;
DIMENSIONS NTAX=8 NCHAR=1202;
FORMAT MISSING=? DATATYPE=PROTEIN GAP=-;
OPTIONS GAPMODE=MISSING;
MATRIX
[ 10 20 ...]
[ ---------|---------|-...]
Homo_sapiens_4379045 TERLVLPPPDPLDLPLRAVEL...
Pan_troglodytes_114606536 TERLVLPPPDPLDLPLRAVEL...
Ailuropoda_melanoleuca_301788522 TERLVLPPPDPLDLPLRPVEL...
Mus_musculus_87252727 TERLVLPPLDPLNLPLRALEV...
Danio_rerio_113678409 MDKIDLPPVGPDDLPLSLLEM...
Xenopus_tropicalis_301627725 MNTLDLSNRDPLDLPLSVLEL...
Monodelphis_domestica_126309591 TERLVLPPRGPLDLPLCALEL...
Canis_familiaris_73972333 TERLALPPPDPLDLPLRPVEL...;
END;
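In a NEXUS DATA block, `DIMENSIONS` declares the number of taxa (`NTAX`) and characters (`NCHAR`), text in square brackets is a comment, and the `MATRIX` command lists one taxon name with its sequence per row until a closing semicolon. The sketch below reads those pieces from a small block. The function name `parse_nexus_data` and the two-taxon sample are hypothetical, and the code assumes a simple, non-interleaved matrix rather than the full NEXUS grammar.

```python
import re


def parse_nexus_data(text):
    """Read NTAX, NCHAR and taxon rows from a simple NEXUS DATA block."""
    # Bracketed text is a comment in NEXUS, so drop it first.
    text = re.sub(r"\[.*?\]", "", text, flags=re.DOTALL)
    ntax = int(re.search(r"NTAX\s*=\s*(\d+)", text, re.IGNORECASE).group(1))
    nchar = int(re.search(r"NCHAR\s*=\s*(\d+)", text, re.IGNORECASE).group(1))
    # The matrix runs from the MATRIX keyword to the next semicolon.
    matrix = re.search(
        r"MATRIX\s+(.*?);", text, re.IGNORECASE | re.DOTALL
    ).group(1)
    rows = {}
    for line in matrix.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            # Assumes one taxon per line (non-interleaved matrix).
            rows[parts[0]] = "".join(parts[1:])
    return ntax, nchar, rows


# Hypothetical two-taxon block, shortened for illustration.
example = """#NEXUS
BEGIN DATA;
  DIMENSIONS NTAX=2 NCHAR=4;
  FORMAT MISSING=? DATATYPE=PROTEIN GAP=-;
  MATRIX
  [ ruler comment ]
  Taxon_A TERL
  Taxon_B TE-L
  ;
END;
"""

ntax, nchar, rows = parse_nexus_data(example)
print(ntax, nchar, sorted(rows))
```

Comparing `len(rows)` with `ntax` and each sequence length with `nchar` is a quick sanity check that a file matches its own declared dimensions.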


NEXT

[Diagram: Inputs, Analysis, Results]

QUESTIONS?
