Analysis of GEO datasets using GEO2R

Parthav Jailwala

CCR Collaborative Bioinformatics Resource

CCR/NCI/NIH

Outline

• Background on GEO datasets
• What is GEO2R and how can it help you
• How to use GEO2R
• Options and features
• Limitations and caveats
• Hands-on exercise

• The Gene Expression Omnibus (GEO) is an international public repository that archives and freely distributes high-throughput microarray and NGS data submitted by the scientific community

• About a billion individual gene expression measurements, derived from over 100 organisms and addressing a wide range of biological questions

• Data can be explored, queried and visualized using user-friendly Web-based tools

GEO data organization

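• GPLxxx: Platform record (the array or sequencer used)
• GSMxxx: Sample record (the measurements for one sample)
• GSExxx: Series record (a set of related Samples submitted as one experiment)
• GDSxxx: DataSet record (a curated collection of comparable Samples assembled by GEO)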
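The accession prefix identifies the record type, which matters later because GEO2R expects a Series (GSE) accession. A minimal sketch of that lookup follows; the function name `classify_accession` and the strict "prefix plus digits" pattern are illustrative assumptions, not part of any GEO tool.

```python
import re

# Prefix-to-record-type mapping for GEO accessions.
GEO_RECORD_TYPES = {
    "GPL": "Platform",
    "GSM": "Sample",
    "GSE": "Series",
    "GDS": "DataSet",
}

# Assumed shape of an accession: a known prefix followed by digits only.
_ACCESSION_RE = re.compile(r"^(GPL|GSM|GSE|GDS)(\d+)$")


def classify_accession(accession):
    """Return the GEO record type for an accession, or None if unrecognized."""
    match = _ACCESSION_RE.match(accession.strip().upper())
    if match is None:
        return None
    return GEO_RECORD_TYPES[match.group(1)]


print(classify_accession("GSE18388"))  # prints "Series"
```

A check like this can be used to reject a Sample or Platform accession before attempting a Series-level analysis.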

What kinds of data does GEO host?

• GEO was designed around the common features of most of the high-throughput and parallel molecular abundance-measuring technologies in use today. These include:

– Gene expression profiling by microarray or next-generation sequencing
– Non-coding RNA profiling by microarray or next-generation sequencing
– Chromatin immunoprecipitation (ChIP) profiling by microarray or next-generation sequencing
– Genome methylation profiling by microarray or next-generation sequencing
– Genome variation profiling by array (arrayCGH)
– SNP arrays
– Serial Analysis of Gene Expression (SAGE)
– Protein arrays

What is GEO2R?

• Interactive web tool that allows users to compare two or more groups of Samples in a GEO Series in order to identify genes that are differentially expressed across experimental conditions

• Uses the GEOquery and limma R packages from the Bioconductor project

• Simple interface that allows users to perform R statistical analysis without command line expertise

• Does not rely on curated ‘DataSets’ and interrogates the original Series Matrix data file directly
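To make "comparing two groups of Samples" concrete, here is a rough per-gene sketch in Python. It is a simplification and not what GEO2R runs: limma fits linear models and uses moderated t-statistics, whereas this example uses a plain Welch t-statistic. The function name and the expression values are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance


def compare_groups(group_a, group_b):
    """Return (log fold change, Welch t statistic) for one gene.

    Each argument is a list of log2 expression values, one per sample.
    Needs at least two samples per group and non-zero variance overall.
    """
    log_fc = mean(group_b) - mean(group_a)
    standard_error = sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return log_fc, log_fc / standard_error


# Hypothetical log2 values for a single gene in two sample groups.
control = [7.1, 6.9, 7.0]
treated = [8.0, 8.2, 8.1]

log_fc, t_stat = compare_groups(control, treated)
print(round(log_fc, 2), round(t_stat, 2))
```

A large absolute t-statistic with a sizeable log fold change is the kind of signal that puts a gene near the top of a differential expression table.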

How to use GEO2R

• Enter a Series accession number

– Follow a link from a Series record, OR
– Enter a Series accession number

• Define Sample groups

– At least 2 and up to 10 groups can be defined

• Assign Samples to each group

– Not all samples in a series need to be selected

• Perform the test

– Assess sample value distributions
– Edit default test parameters

• Interpret the results

– Table of the top 250 genes ranked by p-value
– Select columns to be included in the output table
– Edit the test parameters, then click Recalculate to apply the edits
– Download the tab-delimited table and open it in Excel
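The last step, a table ranked by p-value with user-selected columns and exported as tab-delimited text, can be sketched as follows. This is only an illustration of the idea: the row layout, the column names (`gene`, `log_fc`, `p_value`) and the helper names are assumptions, not the actual GEO2R output format.

```python
import csv
import io


def top_table(rows, n=250):
    """Return the n rows with the smallest p-values, smallest first."""
    return sorted(rows, key=lambda row: row["p_value"])[:n]


def to_tsv(rows, columns):
    """Serialize only the selected columns as tab-delimited text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=columns,
        delimiter="\t",
        extrasaction="ignore",  # silently drop columns that were not selected
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical results for three genes.
results = [
    {"gene": "GeneA", "log_fc": 1.2, "p_value": 0.03},
    {"gene": "GeneB", "log_fc": -2.0, "p_value": 0.001},
    {"gene": "GeneC", "log_fc": 0.1, "p_value": 0.6},
]

print(to_tsv(top_table(results, n=2), columns=["gene", "p_value"]))
```

The resulting text opens directly in a spreadsheet program because each column is separated by a tab.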

Options and features

• Value distribution

– Number summary or boxplot
– Median-centered values indicate that the data are normalized and cross-comparable

• Options

– Apply adjustment of p-values
– Apply log transformation to the data
– Category of Platform annotation to display on results: NCBI generated (preferred) or Submitter supplied

• Profile graph

• R script
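As an illustration of what "adjustment of p-values" does, the sketch below implements the Benjamini-Hochberg false discovery rate procedure, one commonly used adjustment method. It is written from the textbook definition and is not taken from the GEO2R R script; the function name and the example p-values are illustrative.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
```

Adjusted values are never smaller than the raw ones, which is why fewer genes pass a given cutoff once the adjustment is applied.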

Limitations & caveats

• Check that Sample values are comparable

– Assess the value distribution boxplot
– Review the GEO Series experiment description

• Data type restriction

– Some GEO data do not have data tables (e.g. high-throughput sequencing or genome tiling arrays)

• Within-Series restriction

– No cross-series comparisons

• 255 Sample limit

• 10-minute timeout
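The comparability check and the Sample limit above can be approximated offline before running an analysis. The sketch below computes the five-number summary that a boxplot displays and tests whether sample medians are roughly aligned. The `tolerance` value of 0.5, the sample names and the helper names are arbitrary assumptions for illustration; GEO2R itself leaves this judgment to the user.

```python
from statistics import median, quantiles

MAX_SAMPLES = 255  # Sample limit quoted in the slides


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the numbers a boxplot displays."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


def medians_comparable(samples, tolerance=0.5):
    """True if all sample medians lie within `tolerance` of each other."""
    medians = [median(values) for values in samples.values()]
    return max(medians) - min(medians) <= tolerance


def within_sample_limit(samples):
    """True if the number of selected samples does not exceed the limit."""
    return len(samples) <= MAX_SAMPLES


# Hypothetical expression values; sample_3 looks like it is on a different scale.
samples = {
    "sample_1": [6.0, 7.0, 8.0, 9.0, 10.0],
    "sample_2": [6.2, 7.1, 8.1, 9.0, 10.3],
    "sample_3": [60.0, 70.0, 80.0, 90.0, 100.0],
}

for name, values in samples.items():
    print(name, five_number_summary(values))
print(medians_comparable(samples), within_sample_limit(samples))
```

A sample whose median sits far from the others, like `sample_3` here, suggests the values were not normalized together or need a log transformation before they can be compared.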

Summary statistics from Limma

Hands-on exercise

• Google: GSE18388
• Microarray Analysis of Space-flown Murine Thymus Tissue
