
The ENCODE Portal REST API


Accessing ENCODE project data using a REST API and JSON objects.

Cricket A Sloan1, Esther T Chan1, Venkat S Malladi1, Jean M Davidson1, Eurie L Hong1, J Seth Strattan1, Laurence D Rowe1, Ben C Hitz1, Nikhil R Podduturi1, Forrest Tanaka1, Brian T Lee2, Marcus Ho1, Stuart Miyasato1, Matt Simison1, W James Kent2, J Michael Cherry1

1Stanford University School of Medicine, Department of Genetics, Stanford, CA; 2University of California at Santa Cruz, Center for Biomolecular Science and Engineering, Santa Cruz, CA

The Encyclopedia of DNA Elements project (ENCODE) has been producing data for over eight years to investigate DNA and RNA binding proteins, chromatin structure, transcriptional activity and DNA methylation in a variety of human and mouse tissues and cell lines. As the complexity and diversity of the data grow, the tools required to organize, search and access the data in meaningful ways need to become more sophisticated. The ENCODE Data Coordination Center (DCC) has incorporated a representational state transfer application programming interface (REST API) with JSON (JavaScript Object Notation) objects to facilitate access to ENCODE experimental metadata through a web portal. Metadata can be accessed and data can be searched at http://www.encodeproject.org/ using an HTTP request from a script or the curl command. We further expand the access capability by allowing the metadata to be filtered with search URLs. This system allows external researchers to write their own interfaces to access, analyze and visualize the ENCODE data. It also facilitates the integration of the ENCODE data with other similar large-scale data sets such as Epigenomics Roadmap and modENCODE. Here we will present our JSON schemas, examples of the REST API and use cases for the search functions. Our goal is for the genomics community to use the released ENCODE data available through these methods for data mining and integration. Data from the ENCODE project can be accessed via the ENCODE portal (http://www.encodeproject.org), and documentation for the REST API can be accessed at https://www.encodeproject.org/help/rest-api.

@ENCODE-DCC

[email protected]

ENCODE DCC: https://www.encodeproject.org

Metadata returned in JSON format

Sample code: https://github.com/ENCODE-DCC/submission_sample_scripts

ENCODE REST API Documentation: https://www.encodeproject.org/help/rest-api

Each search or page is a JSON object

curl -H "Accept: application/json" -X GET "https://www.encodeproject.org/search/?type=experiment&assay_term_name=RNA-seq&organ_slims=lung&replicates.library.biosample.life_stage=fetal"

Use search URLs with any HTTP GET access

A search produces a JSON object with an “@graph” field, which is a list of minimal identifying information about each result in the search.
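The same request can be made from a script. The following is a minimal Python sketch using only the standard library. The function names `fetch_json` and `graph_ids` are illustrative, the sample response is a made-up stand-in rather than real portal output, and the presence of an "@id" key in each "@graph" entry is an assumption.

```python
import json
from urllib.request import Request, urlopen

# The search URL from the curl example above.
SEARCH_URL = (
    "https://www.encodeproject.org/search/"
    "?type=experiment&assay_term_name=RNA-seq"
    "&organ_slims=lung&replicates.library.biosample.life_stage=fetal"
)


def fetch_json(url):
    """GET a portal URL, asking for JSON via the Accept header."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def graph_ids(search_result):
    """Return the "@id" of each entry in the "@graph" list.

    Assumes entries carry an "@id" key; entries without one are skipped.
    """
    return [
        item["@id"]
        for item in search_result.get("@graph", [])
        if "@id" in item
    ]


# Made-up stand-in for a search response, for illustration only.
sample = {
    "@graph": [
        {"@id": "/experiments/EXAMPLE1/"},
        {"@id": "/experiments/EXAMPLE2/"},
    ]
}
print(graph_ids(sample))
```

Calling `graph_ids(fetch_json(SEARCH_URL))` would run the live query against the portal.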

A summary page includes all of the details and sub-objects in its JSON object

More search examples

Every object that matches the string “CTCF”:
https://www.encodeproject.org/search/?searchTerm=CTCF&format=json&frame=object

All the fastq file objects from a particular experiment ENCSR000AKS (with reference objects embedded):
https://www.encodeproject.org/search/?type=file&dataset=/experiments/ENCSR000AKS/&file_format=fastq&format=json&frame=embedded&limit=all

All biosamples (abbreviated metadata):
https://www.encodeproject.org/search/?type=biosample&limit=all&format=json

All biosamples (full metadata with object references):
https://www.encodeproject.org/search/?type=biosample&frame=object&limit=all&format=json
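Search URLs like these can also be assembled from key/value pairs instead of being typed by hand. The following is a small Python sketch; the helper name `search_url` is hypothetical, and the parameter names and values are taken from the example URLs in this section.

```python
from urllib.parse import urlencode

BASE = "https://www.encodeproject.org/search/"


def search_url(params):
    """Build a search URL from a dict of query parameters.

    "/" is left unescaped so that dataset paths stay readable.
    """
    return BASE + "?" + urlencode(params, safe="/")


# Every object that matches the string "CTCF".
ctcf = search_url({"searchTerm": "CTCF", "format": "json", "frame": "object"})

# All fastq file objects from experiment ENCSR000AKS, with references embedded.
fastqs = search_url({
    "type": "file",
    "dataset": "/experiments/ENCSR000AKS/",
    "file_format": "fastq",
    "format": "json",
    "frame": "embedded",
    "limit": "all",
})

# All biosamples, abbreviated metadata.
biosamples = search_url({"type": "biosample", "limit": "all", "format": "json"})

for url in (ctcf, fastqs, biosamples):
    print(url)
```

A dict is used rather than keyword arguments because some field names, such as replicates.library.biosample.life_stage, contain dots.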

Schema access

Schema profiles can be found on the site at https://www.encodeproject.org/profiles/*.json, where * is replaced by the name of the object of interest. A table of the most relevant objects can be found below. A complete listing of all the current schemas can be found in our GitHub repository, https://github.com/ENCODE-DCC/encoded/blob/master/src/encoded/schemas/ .
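As a rough illustration of the URL pattern, the Python sketch below builds a profile URL and summarises a schema document. It assumes that the profile name matches the object type used in searches (for example "biosample") and that a profile is a JSON Schema document with "properties" and "required" keys. The helper names and the sample schema are made up for illustration.

```python
PROFILES = "https://www.encodeproject.org/profiles/"


def profile_url(object_name):
    """Replace the * in the profile URL pattern with an object name."""
    return f"{PROFILES}{object_name}.json"


def describe(schema):
    """List the property names and required properties of a schema.

    Assumes JSON Schema style "properties" and "required" keys.
    """
    return {
        "properties": sorted(schema.get("properties", {})),
        "required": sorted(schema.get("required", [])),
    }


# Made-up schema fragment, not a real ENCODE profile.
example_schema = {
    "properties": {"accession": {}, "donor": {}, "life_stage": {}},
    "required": ["donor"],
}

print(profile_url("biosample"))
print(describe(example_schema))
```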

Object relationships in the metadata model

Many objects are needed to describe the varied assays, biosamples, and data processing steps that are involved in the ENCODE project. We are incorporating JSON-LD to link and embed these relationships. By having separate objects for donors, biosamples, antibodies, etc., we can model precisely where the same object is shared.
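The difference between linking and embedding can be sketched in Python as follows. This is an illustration under stated assumptions, not the portal's implementation: objects are assumed to refer to each other by "@id" path strings, and the donor and biosample records, their field names and the `embed` helper are all hypothetical.

```python
def embed(obj, lookup):
    """Return a copy of obj with link strings replaced by the linked objects.

    A value is treated as a link if it is a string that is a key of lookup.
    The object's own "@id" is left alone. Cyclic links are not handled.
    """
    def resolve(value):
        if isinstance(value, str) and value in lookup:
            return embed(lookup[value], lookup)
        if isinstance(value, list):
            return [resolve(v) for v in value]
        return value

    return {
        key: (value if key == "@id" else resolve(value))
        for key, value in obj.items()
    }


# Hypothetical records: two biosamples that share exactly the same donor.
donor = {"@id": "/donors/D1/", "sex": "female"}
biosample_a = {"@id": "/biosamples/B1/", "donor": "/donors/D1/"}
biosample_b = {"@id": "/biosamples/B2/", "donor": "/donors/D1/"}
lookup = {o["@id"]: o for o in (donor, biosample_a, biosample_b)}

print(biosample_a)                 # linked form: donor is a path
print(embed(biosample_a, lookup))  # embedded form: donor is the full object
```

Because both biosamples point at the same donor path, the sharing relationship is explicit in the linked form and is expanded only when an embedded view is requested.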

Batch download of files

File metadata, including the href access information for the file itself, is found by querying the ENCODE portal for the file JSON object. If the file accession is ENCFF002CTW, then the metadata object can be found at https://www.encodeproject.org/files/ENCFF002CTW/ . The href field in that object, /files/ENCFF002CTW/@@download/ENCFF002CTW.narrowPeak.gz, is appended to the site URL to give the download location of the file itself, https://www.encodeproject.org/files/ENCFF002CTW/@@download/ENCFF002CTW.narrowPeak.gz .
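The href step described above can be sketched in Python as follows. The helper names are hypothetical, and the file object shown contains only the fields mentioned in the text.

```python
from urllib.parse import urljoin

SITE = "https://www.encodeproject.org"


def file_metadata_url(accession):
    """URL of the JSON metadata object for a file accession."""
    return urljoin(SITE, f"/files/{accession}/")


def download_url(file_object):
    """Append the object's href to the site URL."""
    return urljoin(SITE, file_object["href"])


def batch_download_urls(file_objects):
    """Download URLs for every file object that has an href."""
    return [download_url(f) for f in file_objects if "href" in f]


# Minimal file object using the accession and href from the text above.
file_object = {
    "accession": "ENCFF002CTW",
    "href": "/files/ENCFF002CTW/@@download/ENCFF002CTW.narrowPeak.gz",
}

print(file_metadata_url("ENCFF002CTW"))
print(batch_download_urls([file_object]))
```

For a batch download, the list passed to `batch_download_urls` could be the "@graph" list of a file search, assuming those entries include the href field.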