Analysis of GEO datasets using GEO2R Parthav Jailwala CCR Collaborative Bioinformatics Resource CCR/NCI/NIH


Analysis of GEO datasets using GEO2R

Parthav Jailwala

CCR Collaborative Bioinformatics Resource

CCR/NCI/NIH

Outline

• Background on GEO datasets
• What is GEO2R and how can it help you
• How to use GEO2R
• Options and features
• Limitations and caveats
• Hands-on exercise

• GEO (Gene Expression Omnibus) is an international public repository that archives and freely distributes high-throughput microarray and NGS data submitted by the scientific community

• Holds about a billion individual gene expression measurements, derived from over 100 organisms and addressing a wide range of biological questions

• Data can be explored, queried and visualized using user-friendly Web-based tools

GEO data organization

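• GPLxxx: Platform record (describes the array or sequencer)
• GSMxxx: Sample record
• GSExxx: Series record (a set of related Samples from one study)
• GDSxxx: DataSet record (a curated collection of Samples)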
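The accession prefixes above identify the kind of GEO record. As a minimal sketch, the mapping could be expressed like this; the function name `classify_accession` is hypothetical, and every accession number other than GSE18388 is an arbitrary placeholder.

```python
import re

# Prefix-to-record-type mapping for GEO accessions.
GEO_RECORD_TYPES = {
    "GPL": "Platform",
    "GSM": "Sample",
    "GSE": "Series",
    "GDS": "DataSet",
}

# A GEO accession is one of the four prefixes followed by digits.
_ACCESSION_RE = re.compile(r"^(GPL|GSM|GSE|GDS)(\d+)$")


def classify_accession(accession):
    """Return the GEO record type for an accession, or None if unrecognized."""
    match = _ACCESSION_RE.match(accession.strip().upper())
    if match is None:
        return None
    return GEO_RECORD_TYPES[match.group(1)]


if __name__ == "__main__":
    # GSE18388 is the Series used in the hands-on exercise;
    # the other accessions are placeholders for illustration.
    for acc in ["GSE18388", "GPL1", "gsm1", "XYZ1"]:
        print(acc, "->", classify_accession(acc))
```

GEO2R takes a Series (GSE) accession as input, so a check like this is one way to catch a Sample or Platform accession entered by mistake.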

What kinds of data does GEO host?

• GEO was designed around the common features of most of the high-throughput and parallel molecular abundance-measuring technologies in use today. These include:

– Gene expression profiling by microarray or next-generation sequencing
– Non-coding RNA profiling by microarray or next-generation sequencing
– Chromatin immunoprecipitation (ChIP) profiling by microarray or next-generation sequencing
– Genome methylation profiling by microarray or next-generation sequencing
– Genome variation profiling by array (arrayCGH)
– SNP arrays
– Serial Analysis of Gene Expression (SAGE)
– Protein arrays

What is GEO2R?

• Interactive web tool that allows users to compare two or more groups of Samples in a GEO Series in order to identify genes that are differentially expressed across experimental conditions

• Uses the GEOquery and limma R packages from the Bioconductor project

• Simple interface that allows users to perform R statistical analysis without command line expertise

• Does not rely on curated ‘DataSets’ and interrogates the original Series Matrix data file directly

How to use GEO2R

• Enter a Series accession number

– Follow a link from a Series record OR
– Enter the Series accession number directly

• Define Sample groups

– At least 2 and up to 10 groups can be defined

• Assign Samples to each group

– Not all Samples in a Series need to be selected

• Perform the test

– Assess sample value distributions
– Edit default test parameters

• Interpret the results

– Table of the top 250 genes ranked by p-value
– Select columns to be included in the output table
– Edit the test parameters, then click Recalculate to apply the edits
– Download the tab-delimited table and open it in Excel
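The downloaded tab-delimited table can also be filtered in a script instead of Excel. The sketch below is illustrative only: it assumes the table has columns named `ID`, `adj.P.Val`, `P.Value`, and `logFC` (check the header of your own download), the rows in `SAMPLE_TABLE` are made-up values, and the cutoffs are arbitrary.

```python
import csv
import io

# Hypothetical excerpt of a results table; all values are made up.
# The column names are an assumption; check the header of your own file.
SAMPLE_TABLE = (
    "ID\tadj.P.Val\tP.Value\tlogFC\tGene.symbol\n"
    "probe_1\t0.004\t0.00001\t1.8\tGeneA\n"
    "probe_2\t0.20\t0.01\t-0.4\tGeneB\n"
    "probe_3\t0.03\t0.0005\t-1.2\tGeneC\n"
)


def significant_rows(handle, max_adj_p=0.05, min_abs_logfc=1.0):
    """Keep rows passing both cutoffs, sorted by raw p-value (smallest first)."""
    reader = csv.DictReader(handle, delimiter="\t")
    kept = []
    for row in reader:
        adj_p = float(row["adj.P.Val"])
        logfc = float(row["logFC"])
        if adj_p <= max_adj_p and abs(logfc) >= min_abs_logfc:
            kept.append(row)
    kept.sort(key=lambda r: float(r["P.Value"]))
    return kept


if __name__ == "__main__":
    # For a real download, replace the StringIO with open("your_file.tsv", newline="").
    for row in significant_rows(io.StringIO(SAMPLE_TABLE)):
        print(row["ID"], row["Gene.symbol"], row["logFC"], row["adj.P.Val"])
```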

Options and features

• Value distribution

– Number summary or boxplot
– Median-centered values indicate that the data are normalized and cross-comparable

• Options

– Apply adjustment of p-values
– Apply log transformation to the data
– Category of Platform annotation to display on results: NCBI generated (preferred) or Submitter supplied

• Profile graph

• R script
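To make the p-value adjustment option concrete, here is a minimal sketch of the Benjamini-Hochberg false discovery rate procedure, a widely used adjustment method. It assumes this is the method selected in the Options tab and is not taken from the GEO2R R script; the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = p_values[idx] * m / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    # Made-up raw p-values for illustration.
    raw = [0.01, 0.04, 0.03, 0.005]
    print(benjamini_hochberg(raw))
```

With many genes tested at once, ranking by adjusted rather than raw p-values helps limit the number of false positives among the genes called significant.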

Limitations & caveats

• Check that Sample values are comparable

– Assess the value distribution boxplot
– Review the GEO Series experiment description

• Data type restriction

– Some GEO data do not have data tables (e.g. high-throughput sequencing or genome tiling arrays)

• Within-Series restriction

– No cross-Series comparisons

• 255 Sample limit

• 10-minute timeout
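The comparability check in the first caveat can be approximated numerically: if per-sample medians sit close together, the values are likely normalized across Samples. The sketch below is a rough illustration, not what GEO2R computes; the sample names, values, and the `tolerance` of 0.5 are all arbitrary assumptions.

```python
from statistics import median


def sample_medians(values_by_sample):
    """Return the median expression value for each sample."""
    return {name: median(vals) for name, vals in values_by_sample.items()}


def medians_are_aligned(values_by_sample, tolerance=0.5):
    """True if the spread between sample medians is within the tolerance."""
    meds = list(sample_medians(values_by_sample).values())
    return max(meds) - min(meds) <= tolerance


if __name__ == "__main__":
    # Made-up log-scale values; sample_3 is deliberately shifted upward.
    data = {
        "sample_1": [7.0, 8.0, 9.0],
        "sample_2": [7.1, 8.1, 9.2],
        "sample_3": [9.0, 11.0, 12.0],
    }
    print(sample_medians(data))
    print("aligned:", medians_are_aligned(data))
```

A failed check is a prompt to look at the boxplot and the Series description before trusting the test results, not proof that the data are unusable.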

Summary statistics from Limma

Hands-on exercise

• Google: GSE18388
• Microarray Analysis of Space-flown Murine Thymus Tissue
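As an alternative to searching, the Series record and GEO2R page can be opened directly by accession. This sketch assumes the NCBI URL patterns shown in the code, which may change over time; the function names are hypothetical.

```python
from urllib.parse import urlencode

# Assumed NCBI URL patterns; verify them in a browser before relying on them.
SERIES_BASE = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi"
GEO2R_BASE = "https://www.ncbi.nlm.nih.gov/geo/geo2r/"


def series_url(accession):
    """Build the URL of a GEO Series record page."""
    return SERIES_BASE + "?" + urlencode({"acc": accession})


def geo2r_url(accession):
    """Build the URL that opens a Series in GEO2R."""
    return GEO2R_BASE + "?" + urlencode({"acc": accession})


if __name__ == "__main__":
    print(series_url("GSE18388"))
    print(geo2r_url("GSE18388"))
```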
