SRAdb - an R/Bioconductor Package

Jack Zhu

SRAdb Bioconductor Package Overview


NCBI SRA

• SRA: Sequence Read Archive
  – Archive of high-throughput sequencing data
• What is stored?
  – Raw sequence data
  – Now stores alignment information in sra format
  – EBI: still hosts fastq files
• The international partnership (INSDC):
  – SRA: NCBI Sequence Read Archive
  – ERA: EBI Sequence Read Archive
  – DRA: DDBJ Sequence Read Archive
  – All data is shared and synchronized between SRA, ERA and DRA.

http://www.ncbi.nlm.nih.gov/Traces/sra

NCBI SRA Web Site

SRAdb Bioconductor Package

• SRAdb SQLite database:
  – SRA metadata: faithfully parsed from NCBI SRA XML files
  – Main tables: submission, study, sample, experiment, run
  – MySQL database → SQLite database
  – Portable and local
  – Programmatic access to data: R and SQL
  – Updated weekly

SRAdb Download Stats

SRAdb Entities/Data Types

SRAdb Functions

Function        Category   Description
sraConvert      Query      Cross-reference between SRA data types
getFASTQinfo    Query      Get SRA fastq file information and associated metadata from EBI ENA
getSRAinfo      Query      Get SRA data file information from NCBI SRA
listSRAfile     Query      List sra, sra-lite or fastq data file names associated with input SRA accessions

SRAdb Functions – cont.

Function        Category   Description
getSRA          Query      Full-text search of SRA metadata using the SQLite fts3 module
getSRAdbFile    Download   Download and unzip the latest version of SRAmetadb.sqlite.gz from the server
getSRAfile      Download   Download SRA data files through ftp or fasp
ascpR           Download   Fasp file downloading using the ascp command line program
ascpSRA         Download   Fasp SRA data file downloading using the ascp command line program
entityGraph     Graph      Create a new graphNEL object from an input entity matrix or data.frame
sraGraph        Graph      Create a new graphNEL object of SRA accessions from SRA full text search

SRAdb Functions – cont.

Function        Category   Description
startIGV        IGV        Start IGV from R with a choice of maximum memory settings
IGVclear        IGV        Clear the tracks loaded in IGV
IGVcollapse     IGV        Collapse tracks in IGV
IGVgenome       IGV        Set the IGV genome
IGVgoto         IGV        Go to a specified region in IGV
IGVload         IGV        Load data into IGV via a remote port call
IGVsession      IGV        Create an IGV session file
IGVsnapshot     IGV        Save a snapshot of the current IGV screen to a file
IGVsocket       IGV        Create a socket connection to IGV
IGVsort         IGV        Sort an alignment track by the specified option

Getting Started

> library(SRAdb)
Loading required package: RSQLite
Loading required package: DBI
Loading required package: graph
Loading required package: RCurl
Loading required package: bitops
Setting options('download.file.method.GEOquery'='auto')

> sqlfile <- getSRAdbFile()
trying URL 'http://gbnci.abcc.ncifcrf.gov/backup/SRAmetadb.sqlite.gz'
Content type 'application/x-gzip' length 916403786 bytes (874.0 MB)
==================================================
downloaded 874.0 MB

Unzipping...

> # Open a connection to the downloaded SQLite file; later slides use it as sra_con
> sra_con <- dbConnect(SQLite(), sqlfile)

SQL Query

> rs <- dbGetQuery( sra_con, paste( "SELECT study_type AS StudyType,
+ count( * ) AS Number FROM `study` GROUP BY study_type order
+ by Number DESC ", sep="") )

> rs
                 StudyType Number
1  Whole Genome Sequencing  26563
2                    Other  13908
3   Transcriptome Analysis   7179
4             Metagenomics   4500
5                     <NA>   3117
6              Epigenetics    845
7      Population Genomics    692
8         Exome Sequencing    141
9          Cancer Genomics     77
10 Pooled Clone Sequencing     32
11      Synthetic Genomics      9
12                  RNASeq      3
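The same kind of aggregate query can be tried without the 874 MB download. The sketch below is only an illustration: it builds a small in-memory SQLite database whose `study` table, rows and `study_accession` values are made up for the example, then counts studies per type with DBI/RSQLite in the same way as the slide.

```r
library(DBI)
library(RSQLite)

# Toy stand-in for the real database: an in-memory SQLite database with a
# made-up four-row 'study' table (hypothetical data, not real SRA records).
con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(con, "study", data.frame(
  study_accession = c("SRP000001", "SRP000002", "SRP000003", "SRP000004"),
  study_type      = c("Metagenomics", "Other", "Metagenomics", NA),
  stringsAsFactors = FALSE
))

# Count studies per type, most frequent first; NA (SQL NULL) forms its own group.
counts <- dbGetQuery(con, "
  SELECT study_type AS StudyType, count(*) AS Number
  FROM study
  GROUP BY study_type
  ORDER BY Number DESC, StudyType")
print(counts)

dbDisconnect(con)
```

Against the real metadata file, only the connection changes: the file path returned by getSRAdbFile() takes the place of ":memory:".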

Accession Conversion

> conversion <- sraConvert( c('SRP001007','SRP000931'), sra_con )

> conversion[1:3,]
      study submission    sample experiment       run
1 SRP000931  SRA009053 SRS003453  SRX006122 SRR018256
2 SRP000931  SRA009053 SRS003454  SRX006123 SRR018257
3 SRP000931  SRA009053 SRS003464  SRX006135 SRR018269
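The column names in this output line up with the third letter of each accession (P = study, A = submission, S = sample, X = experiment, R = run). As a rough sketch, a base-R helper can classify accessions by that letter. The function name `sra_entity_type` is hypothetical and not part of SRAdb, and accepting the E and D prefixes for ERA and DRA records is an assumption based on the INSDC partnership described earlier.

```r
# Hypothetical helper (not an SRAdb function): infer the entity type of an
# SRA-style accession from its third letter.
sra_entity_type <- function(acc) {
  types <- c(A = "submission", P = "study", S = "sample",
             X = "experiment", R = "run")
  # Assumed pattern: S/E/D for the archive, R, a type letter, then digits.
  ok <- grepl("^[SED]R[APSXR][0-9]+$", acc)
  out <- rep(NA_character_, length(acc))
  out[ok] <- unname(types[substr(acc[ok], 3, 3)])
  out
}

entity_types <- sra_entity_type(c("SRP000931", "SRA009053", "SRS003453",
                                  "SRX006122", "SRR018256", "GSM12345"))
print(entity_types)
```

Anything that does not match the pattern, such as a GEO accession, comes back as NA instead of being guessed.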

Full Text Search

> # Both words must occur somewhere in the record
> rs <- getSRA( search_terms = "breast cancer", out_types = c('run','study'), sra_con )
> dim(rs)
[1] 11081    23

> # Exact phrase: the inner double quotes are part of the search string
> rs <- getSRA( search_terms = '"breast cancer"', out_types = c('run','study'), sra_con )
> dim(rs)
[1]  9803    23

> # Either word
> rs <- getSRA( search_terms = "breast OR cancer", out_types = c('run','study'), sra_con )
> dim(rs)
[1] 74250    23
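The three result counts differ because of how SQLite full-text MATCH queries treat plain terms, quoted phrases and OR. The following is a minimal sketch of that behaviour on a four-row in-memory table. The table name `toy_ft`, its single column and the sample sentences are invented for the illustration, and it assumes the SQLite build bundled with RSQLite has the fts3 module enabled.

```r
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), ":memory:")

# Hypothetical miniature full-text table; name, column and rows are made up.
dbExecute(con, "CREATE VIRTUAL TABLE toy_ft USING fts3(description)")
docs <- c("breast cancer cell line RNA-seq",
          "cancer of the breast, exome capture",
          "lung cancer whole genome",
          "breast tissue methylation")
for (d in docs) {
  dbExecute(con, "INSERT INTO toy_ft (description) VALUES (?)",
            params = list(d))
}

# Number of rows matching a full-text query string.
n_hits <- function(query) {
  dbGetQuery(con, "SELECT count(*) AS n FROM toy_ft WHERE toy_ft MATCH ?",
             params = list(query))$n
}

n_all    <- n_hits("breast cancer")     # both words, in any position
n_phrase <- n_hits('"breast cancer"')   # the two words as an adjacent phrase
n_either <- n_hits("breast OR cancer")  # at least one of the words

dbDisconnect(con)
```

As on the slide, the phrase query is the narrowest and the OR query the broadest.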

Fasp Protocol Downloading

> ascpCMD <- "ascp -QT -l 300m -i '/Users/zhujack/Applications/AsperaConnect.app/Contents/Resources/asperaweb_id_dsa.putty'"

> getSRAfile( c("SRX000122"), sra_con, fileType = 'sra', srcType = 'fasp', ascpCMD = ascpCMD )
Files are saved to: '/Users/zhujack/Documents/R_WD'

Completed: 130939K bytes transferred in 7 seconds (148,846K bits/sec), in 1 file.
Completed: 422K bytes transferred in 0 seconds (4,067K bits/sec), in 1 file.
Completed: 843K bytes transferred in 1 seconds (5,337K bits/sec), in 1 file.
Completed: 159492K bytes transferred in 14 seconds (90,572K bits/sec), in 1 file.
----

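Interaction with IGV

> startIGV("mm")                 # launch IGV; "mm" selects one of the preset maximum-memory sizes
> sock <- IGVsocket()            # open a socket connection to the running IGV
> IGVgenome(sock, 'hg19')        # set the genome
> IGVload(sock, exampleBams)     # exampleBams: vector of BAM file paths (not defined on this slide)
> IGVgoto(sock, 'chr1:1-1000')   # go to the given region
> IGVsnapshot(sock)              # save a snapshot of the current view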
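IGV itself accepts plain-text batch commands (genome, load, goto, snapshot) on a local port, and the calls above correspond closely to those commands. The sketch below shows the idea in base R only; it is not SRAdb's implementation. The helper names `igv_commands` and `send_to_igv` and the BAM path are hypothetical, and the port number 60151 is IGV's usual default, assumed to be enabled in the IGV preferences.

```r
# Hypothetical helper: build the text commands for one IGV session.
igv_commands <- function(genome, files, region) {
  c(paste("genome", genome),
    paste("load", files),      # one load command per file
    paste("goto", region),
    "snapshot")
}

# Hypothetical helper: send each command and read IGV's one-line reply.
# Assumes IGV is running locally with its batch port enabled.
send_to_igv <- function(cmds, host = "localhost", port = 60151) {
  sock <- socketConnection(host, port, open = "r+", blocking = TRUE)
  on.exit(close(sock))
  vapply(cmds, function(cmd) {
    writeLines(cmd, sock)
    readLines(sock, n = 1)
  }, character(1))
}

cmds <- igv_commands("hg19", "/path/to/example.bam", "chr1:1-1000")
print(cmds)
# send_to_igv(cmds)  # uncomment once IGV is running and listening
```

Building the command vector separately from sending it keeps the example runnable without IGV installed.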

Acknowledgements

• Paul Meltzer
• Sean Davis
• All members of the Meltzer lab
• NCI SRA Team
  – O'Sullivan, Christopher
  – Ben Busby