ChIP%seq)Analysis)barc.wi.mit.edu/education/hot_topics/ChIPseq_2015/... · Experimental)design) • Include)acontrol)sample.) • If)the)protein)of)interestbinds)repeBBve)regions,)using)

ChIP-‐seq Analysis

BaRC Hot Topics -‐ Feb 24th 2015

BioinformaBcs and Research CompuBng Whitehead InsBtute

hGp://barc.wi.mit.edu/hot_topics/

Before we start:

1.  Log into tak (step 0 on the exercises) 2.  Go to your lab space and create a folder for

the class (see separate hand out) 3.  Connect to your lab space through the wi-‐

htdata network and open the folder you created

2

Outline •  ChIP-‐seq overview •  Quality control •  Experimental design •  Mapping

– Map reads –  Remove unmapped reads and convert to bam files –  Check the profile of the mapped reads

•  Peak calling •  More detail about MACS opBons •  Linking peaks to genes •  Visualizing ChIP-‐seq data with ngsplot

3

ChIP-Seq overview

4 Park, P. J., ChIP-‐seq: advantages and challenges of a maturing technology, Nat Rev Genet. Oct;10(10):669-‐80 (2009)

Steps in data analysis

5

1.  Quality control

2.  EffecBve mapping Treat IP and control the same way

(preprocessing and mapping) 3. Peak calling i) Read extension and signal profile

generaBon ii) Peak assignment

Pepke, S. et al. ComputaBon for ChIP-‐seq and RNA-‐seq studies, Nat Methods. Nov (2009)

Experimental design

•  Include a control sample. •  If the protein of interest binds repeBBve regions, using paired–end sequencing may reduce the mapping ambiguity. Otherwise single reads should be fine.

•  Include at least two biological replicates. If you have replicates you may want to use the parameter IDR “irreproducible discovery rate”. See us for details.

•  If only a small percentage of the reads maps to the genome, you may have to troubleshoot your ChIP protocol.

6

Data that we will use on the hands on the exercises

•  ENCODE human ChIP-‐seq experiment – EncodeBroadHistoneHepg2H3k4me3 – EncodeBroadHistoneHepg2Control

•  For quality control and mapping we will use a subset of the data (10%)

•  For running MACS we will use the full set but only chr1 reads

7

Quality control (before read mapping)

See Hot Topics: Quality Control and Mapping Reads

•  Run fastqc –  Look at output and determine

1.  What type of quality encoding (Illumina versions) your reads have

2.  If you need to remove adapter, trim reads, filter reads based on quality etc.

Input qualiBes Illumina versions

-‐-‐solexa-‐quals <= 1.2

-‐-‐phred64 1.3-‐1.7

-‐-‐phred33 >= 1.8

8

Hands-‐on Step 1: Quality control

Run step 1 commands (fastqc) bsub fastqc Hepg2Control_subset.fastq bsub fastqc Hepg2H3k4me3_subset.fastq

9

Fastqc Output

10

Mapping

•  We have to map reads to the genome. The mapping solware doesn’t have to know about splicing. – BowBe, bowBe2 (allows gaps) are good choices – Bwa (allows gaps) is fine too

•  Treat IP and control samples the same way during preprocessing and mapping.

11

Hands-‐on Step 2: Mapping •  bsub bowBe2 -‐-‐phred33-‐quals -‐N 1 -‐x /nfs/genomes/

human_gp_feb_09_no_random/bowBe/hg19 -‐U Hepg2Control_subset.fastq -‐S Hepg2Control_subset_hg19.N1.sam

•  bsub bowBe2 -‐-‐phred33-‐quals -‐N 1 -‐x /nfs/genomes/human_gp_feb_09_no_random/bowBe/hg19 -‐U Hepg2H3k4me3_subset.fastq -‐S Hepg2H3k4me3_subset_hg19.N1.sam

-‐-‐phred33-‐quals Because we have Illumina 1.9 -‐N <int> max # mismatches in seed alignment; can be 0 or 1 (default=0) -‐x path to the bowBe2 index -‐U input sequence -‐S output.sam

12

Hands-‐on Steps 3: Remove unmapped reads and convert to BAM. Sort and index the bam file.

Remove unmapped reads •  bsub "samtools view -‐hS -‐F 4 Hepg2Control_subset_hg19.N1.sam | samtools view -‐b -‐S -‐ >

Hepg2Control_subset_hg19.N1.mapped_only.bam" •  bsub "samtools view -‐hS -‐F 4 Hepg2H3k4me3_subset_hg19.N1.sam | samtools view -‐b -‐S -‐ >

Hepg2H3k4me3_subset_hg19.N1.mapped_only.bam" Sort reads •  bsub samtools sort -‐T Control_subset -‐o Control_subset_hg19.N1.mapped_only.sorted.bam

Hepg2Control_subset_hg19.N1.mapped_only.bam •  bsub samtools sort -‐T H3k4me3_subset -‐o H3k4me3_subset_hg19.N1.mapped_only.sorted.bam

Hepg2H3k4me3_subset_hg19.N1.mapped_only.bam -‐T PREFIX Write temporary files to PREFIX.nnnn.bam

-‐o FILE Write final output to FILE rather than standard output Index reads •  bsub samtools index Control_subset_hg19.N1.mapped_only.sorted.bam •  bsub samtools index H3k4me3_subset_hg19.N1.mapped_only.sorted.bam

13

Strand cross-‐correlaBon analysis

14

Strand cross-‐correlaBon is computed as the Pearson correlaBon between the posiBve and the negaBve strand profiles at different strand shil distances, k

Bailey, et al. PracBcal Guidelines for the Comprehensive Analysis of ChIP-‐seq Data, PLOS ComputaPonal Biology Nov 2013

Strand cross-‐correlaBon analysis outputs

15

Hands-‐on Steps 4: Quality control aler read mapping

Check the strand cross-‐correlaBon peak and predominant fragment length.

•  bsub /nfs/BaRC_Public/phantompeakqualtools/run_spp.R -‐c=Control_chr1.bam -‐savp -‐out=Control_chr1.run_spp.out

•  bsub /nfs/BaRC_Public/phantompeakqualtools/run_spp.R -‐c=H3k4me3_chr1.bam -‐savp -‐out=H3k4me3_chr1.run_spp.out -‐savp, save cross-‐correlaBon plot

16

Peak calling

i) Read extension and signal profile generaBon §  strand cross-‐correlaBon can be used to

calculate fragment length ii) Peak evaluaBon

§  Look for fold enrichment of the sample over input or expected background

§  EsBmate the significance of the fold enrichment using: •  Poisson distribuBon •  negaBve binomial distribuBon •  background distribuBon from input

DNA •  model background data to adjust for

local variaBon (MACS)

17

Pepke, S. et al. ComputaBon for ChIP-‐seq and RNA-‐seq studies, Nat Methods. Nov. 2009

Peak calling: MACS •  MACS uses a two-‐step strategy:

1.  Builds a model using high quality peaks and infers shil size from it. 2.  Shils the reads and does a second round to call the final peaks.

•  It uses a Poisson distribuBon to assign p-‐values to peaks. But the distribuBon has a dynamic parameter, local lambda, to capture the influence of local biases.

•  MACS default is to filter out redundant tags at the same locaBon and with the same strand by allowing at most 1 tag. This works well.

•  -‐g: You need to set up this parameter accordingly: EffecBve genome size. It can be 1.0e+9 or 1000000000, or shortcuts: 'hs' for human (2.7e9), 'mm' for mouse (1.87e9), 'ce' for C. elegans (9e7) and 'dm' for fruit fly (1.2e8), Default:hs

•  For broad peaks like some histone modificaBons it is recommended to use --nomodel and if there is not input sample to use --nolambda.

18

Hands-‐on Steps 5: MACS •  Input files: Hepg2H3k4me3_chr1.bam Control_chr1.bam •  MACS command bsub macs2 callpeak -‐t H3k4me3_chr1.bam -‐c Control_chr1.bam -‐-‐name H3k4me3_chr1

-‐-‐format BAM

PARAMETERS •  -‐t TFILE Treatment file •  -‐c CFILE Control file •  -‐-‐name NAME Experiment name, which will be used to generate output file names. DEFAULT:

“NA” •  -‐-‐format FORMAT Format of tag file, “BED” or “SAM” or “BAM” or “BOWTIE”. DEFAULT:

“BED” •  -‐-‐extsize EXTSIZE The arbitrary extension size in bp. When nomodel is true, MACS will use this value as fragment size to extend each read towards 3' end, then , pile them up. You can use the value from the strand cross-‐correla3on analysis.

19

MACS output Output files:

1. Excel peaks file contains the following columns Chr, start, end, length, abs_summit, pileup, -LOG10(pvalue), -LOG10(qvalue), name

2. Bed peaks file chr, start, end, -LOG10(qvalue)

3. Summits.bed: contains the peak summits locaBons for every peaks. The 5th column in this file is -‐log10qvalue the same as NAME_peaks.bed

4. Peaks.narrowPeak is BED6+4 format file. Contains the peak locaBons together with peak summit, fold-‐change, pvalue and qvalue.

5. Model.r is an R script which you can use to produce a PDF image about the model based on your data. (Rscript NAME_model.r)

To look at the peaks on a genome browser you can upload one of the output bed

files or you can also make a bedgraph file with columns (step 6 of hands on): chr, start, end, fold_enrichment

20

Visualize peaks in IGV

21

Hepg2H3k4me3_chr1.bedgraph

Hepg2H3k4me3_chr1.sorted.bam Coverage

Hepg2H3k4me3_chr1.sorted.bam

Control_chr1.sorted.bam

Control_chr1.sorted.bam Cov

Other recommendaBons

•  Look at your mapped reads and peaks in a genome browser to verify peak calling thresholds

•  OpBonal: remove reads mapping to the ENCODE and 1000 Genomes blacklisted regions hGps://sites.google.com/site/anshulkundaje/projects/blacklists

22

Linking peaks to genes: Bed tools

23

intersectBed

closestBed

slopBed

coverageBed

groupBy It groups rows based on the value of a given column/s and it summarizes the other columns

Hands-‐on 8: Link peaks to nearby genes

•  Take all genes and add 3Kb up and down with slopBed slopBed -‐b 3000 -‐i GRCh37.p13.HumanENSEMBLgenes.bed -‐g /nfs/

genomes/human_gp_feb_09_no_random/anno/chromInfo.txt > HumanGenesPlusMinus3kb.bed

•  Intersect the slopped genes with peaks and get the list of unique genes overlapping intersectBed -‐wa -‐a HumanGenesPlusMinus3kb.bed -‐b peaks.bed | awk '{print $4}' | sort -‐u > Genesat3KborlessfromPeaks.txt

intersectBed -‐wa -‐a HumanGenesPlusMinus3kb.bed -‐b peaks.bed | head -‐3 chr1 45956538 45968751 ENSG00000236624_CCDC163P chr1 45956538 45968751 ENSG00000236624_CCDC163P chr1 51522509 51528577 ENSG00000265538_MIR4421

24

closestBed -‐d -‐a peaks.bed -‐b GRCh37.p13.HumanENSEMBLgenes.bed |head chr1 20870 21204 H3k4me3_chr1_peak_1 5.77592 chr1 14363 29806 ENSG00000227232_WASH7P 0

chr1 28482 30214 H3k4me3_chr1_peak_2 374.48264 chr1 29554 31109 ENSG00000243485_MIR1302-10 0

chr1 28482 30214 H3k4me3_chr1_peak_2 374.48264 chr1 14363 29806 ENSG00000227232_WASH7P 0

#the next two steps can also be done on excel

closestBed -‐d -‐a peaks.bed -‐b GRCh37.p13.HumanENSEMBLgenes.bed | groupBy -‐g 9,10 -‐c 6,7,8, -‐o disBnct,disBnct,disBnct | head -‐3

ENSG00000227232_WASH7P 0 chr1 14363 29806 ENSG00000243485_MIR1302-10 0 chr1 29554 31109

ENSG00000227232_WASH7P 0 chr1 14363 29806

closestBed -‐d -‐a peaks.bed -‐b GRCh37.p13.HumanENSEMBLgenes.bed | groupBy -‐g 9,10 -‐c 6,7,8, -‐o disBnct,disBnct,disBnct | awk 'BEGIN {OFS="\t"}{ if ($2<3000) {print $3,$4,$5,$1,$2} } ' | head -‐5

chr1 14363 29806 ENSG00000227232_WASH7P 0 chr1 29554 31109 ENSG00000243485_MIR1302-10 0 chr1 14363 29806 ENSG00000227232_WASH7P 0

chr1 134901 139379 ENSG00000237683_AL627309.1 0 chr1 135141 135895 ENSG00000268903_RP11-34P13.15 0

25

Hands-‐on 9: Link peaks to closest genes For each region find the closest gene and filter based on the distance to the gene

Hands-‐on 9: Link peaks to closest genes (1 command) For each region find the closest gene and filter based on the distance to the gene

closestBed -‐d -‐a peaks.bed -‐b GRCh37.p13.HumanENSEMBLgenes.bed | groupBy -‐g 9,10 -‐c 6,7,8, -‐o disBnct,disBnct,disBnct | awk 'BEGIN {OFS="\t"}{ if ($2<3000) {print $3,$4,$5,$1,$2} }' > closestGeneAt3KborLess.bed closestBed -‐d print the distance to the feature in -‐b groupBy -‐g columns to group on -‐c columns to summarize -‐o operaBon to use to summarize

26

Visualizing ChIP-‐seq reads with ngsplot See Hot Topics: ngsplot

•  Hands-‐on 10: bsub ngs.plot.r -‐G hg19 -‐R tss -‐C H3k4me3_chr1.bam -‐O H3k4me3_chr1.tss -‐T H3K4me3 -‐L 3000 -‐FL 300

27

References §  Previous Hot Topics

Quality Control and Mapping Reads hGp://jura.wi.mit.edu/bio/educaBon/hot_topics/NGS_QC_mapping_Feb2015/NGS_QC_Mapping2015_1perPage.pdf

§  SOPs hGp://barcwiki.wi.mit.edu/wiki/SOPs/chip_seq_peaks

§  Reviews and benchmark papers: ChIP-‐seq: advantages and challenges of a maturing technology (Oct 09) (hGp://www.nature.com/nrg/journal/v10/n10/full/

nrg2641.html) ComputaBon for ChIP-‐seq and RNA-‐seq studies (Nov 09) (hGp://www.nature.com/nmeth/journal/v6/n11s/full/nmeth.

1371.html) PracBcal Guidelines for the Comprehensive Analysis of ChIP-‐seq Data. PLoS Comput. Biol. 2013 A computaBonal pipeline for comparaBve ChIP-‐seq analyses. Nat. Protoc. 2011 ChIP-‐seq guidelines and pracBces of the ENCODE and modENCODE consorBa. Genome Res. 2012.

§  Quality control and strand cross-‐correlaBon hGp://code.google.com/p/phantompeakqualtools/

§  MACS: Model-‐based Analysis of ChIP-‐Seq (MACS). Genome Biol 2008 hGp://liulab.dfci.harvard.edu/MACS/index.html

Using MACS to idenBfy peaks from ChIP-‐Seq data. Curr Protoc BioinformaPcs. 2011 hGp://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0214s34/pdf §  Bedtools:

hGps://code.google.com/p/bedtools/ hGp://bioinformaBcs.oxfordjournals.org/content/26/6/841.abstract §  ngsplot

hGps://code.google.com/p/ngsplot/ Shen, L.*, Shao, N., Liu, X. and Nestler, E. (2014) ngs.plot: Quick mining and visualizaBon of next-‐generaBon sequencing data by integraBng genomic databases, BMC Genomics, 15, 284. 28

Documents

ChIP%seq)Analysis)barc.wi.mit.edu/education/hot_topics/ChIPseq_2015/... · Experimental)design) • Include)acontrol)sample.) • If)the)protein)of)interestbinds)repeBBve)regions,)using)