Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
ChIP-‐seq Analysis
BaRC Hot Topics -‐ Feb 24th 2015
BioinformaBcs and Research CompuBng Whitehead InsBtute
hGp://barc.wi.mit.edu/hot_topics/
Before we start:
1. Log into tak (step 0 on the exercises) 2. Go to your lab space and create a folder for
the class (see separate hand out) 3. Connect to your lab space through the wi-‐
htdata network and open the folder you created
2
Outline • ChIP-‐seq overview • Quality control • Experimental design • Mapping
– Map reads – Remove unmapped reads and convert to bam files – Check the profile of the mapped reads
• Peak calling • More detail about MACS opBons • Linking peaks to genes • Visualizing ChIP-‐seq data with ngsplot
3
ChIP-Seq overview
4 Park, P. J., ChIP-‐seq: advantages and challenges of a maturing technology, Nat Rev Genet. Oct;10(10):669-‐80 (2009)
Steps in data analysis
5
1. Quality control
2. EffecBve mapping Treat IP and control the same way
(preprocessing and mapping) 3. Peak calling i) Read extension and signal profile
generaBon ii) Peak assignment
Pepke, S. et al. ComputaBon for ChIP-‐seq and RNA-‐seq studies, Nat Methods. Nov (2009)
Experimental design
• Include a control sample. • If the protein of interest binds repeBBve regions, using paired–end sequencing may reduce the mapping ambiguity. Otherwise single reads should be fine.
• Include at least two biological replicates. If you have replicates you may want to use the parameter IDR “irreproducible discovery rate”. See us for details.
• If only a small percentage of the reads maps to the genome, you may have to troubleshoot your ChIP protocol.
6
Data that we will use on the hands on the exercises
• ENCODE human ChIP-‐seq experiment – EncodeBroadHistoneHepg2H3k4me3 – EncodeBroadHistoneHepg2Control
• For quality control and mapping we will use a subset of the data (10%)
• For running MACS we will use the full set but only chr1 reads
7
Quality control (before read mapping)
See Hot Topics: Quality Control and Mapping Reads
• Run fastqc – Look at output and determine
1. What type of quality encoding (Illumina versions) your reads have
2. If you need to remove adapter, trim reads, filter reads based on quality etc.
Input qualiBes Illumina versions
-‐-‐solexa-‐quals <= 1.2
-‐-‐phred64 1.3-‐1.7
-‐-‐phred33 >= 1.8
8
Hands-‐on Step 1: Quality control
Run step 1 commands (fastqc) bsub fastqc Hepg2Control_subset.fastq bsub fastqc Hepg2H3k4me3_subset.fastq
9
Fastqc Output
10
Mapping
• We have to map reads to the genome. The mapping solware doesn’t have to know about splicing. – BowBe, bowBe2 (allows gaps) are good choices – Bwa (allows gaps) is fine too
• Treat IP and control samples the same way during preprocessing and mapping.
11
Hands-‐on Step 2: Mapping • bsub bowBe2 -‐-‐phred33-‐quals -‐N 1 -‐x /nfs/genomes/
human_gp_feb_09_no_random/bowBe/hg19 -‐U Hepg2Control_subset.fastq -‐S Hepg2Control_subset_hg19.N1.sam
• bsub bowBe2 -‐-‐phred33-‐quals -‐N 1 -‐x /nfs/genomes/human_gp_feb_09_no_random/bowBe/hg19 -‐U Hepg2H3k4me3_subset.fastq -‐S Hepg2H3k4me3_subset_hg19.N1.sam
-‐-‐phred33-‐quals Because we have Illumina 1.9 -‐N <int> max # mismatches in seed alignment; can be 0 or 1 (default=0) -‐x path to the bowBe2 index -‐U input sequence -‐S output.sam
12
Hands-‐on Steps 3: Remove unmapped reads and convert to BAM. Sort and index the bam file.
Remove unmapped reads • bsub "samtools view -‐hS -‐F 4 Hepg2Control_subset_hg19.N1.sam | samtools view -‐b -‐S -‐ >
Hepg2Control_subset_hg19.N1.mapped_only.bam" • bsub "samtools view -‐hS -‐F 4 Hepg2H3k4me3_subset_hg19.N1.sam | samtools view -‐b -‐S -‐ >
Hepg2H3k4me3_subset_hg19.N1.mapped_only.bam" Sort reads • bsub samtools sort -‐T Control_subset -‐o Control_subset_hg19.N1.mapped_only.sorted.bam
Hepg2Control_subset_hg19.N1.mapped_only.bam • bsub samtools sort -‐T H3k4me3_subset -‐o H3k4me3_subset_hg19.N1.mapped_only.sorted.bam
Hepg2H3k4me3_subset_hg19.N1.mapped_only.bam -‐T PREFIX Write temporary files to PREFIX.nnnn.bam
-‐o FILE Write final output to FILE rather than standard output Index reads • bsub samtools index Control_subset_hg19.N1.mapped_only.sorted.bam • bsub samtools index H3k4me3_subset_hg19.N1.mapped_only.sorted.bam
13
Strand cross-‐correlaBon analysis
14
Strand cross-‐correlaBon is computed as the Pearson correlaBon between the posiBve and the negaBve strand profiles at different strand shil distances, k
Bailey, et al. PracBcal Guidelines for the Comprehensive Analysis of ChIP-‐seq Data, PLOS ComputaPonal Biology Nov 2013
Strand cross-‐correlaBon analysis outputs
15
Hands-‐on Steps 4: Quality control aler read mapping
Check the strand cross-‐correlaBon peak and predominant fragment length.
• bsub /nfs/BaRC_Public/phantompeakqualtools/run_spp.R -‐c=Control_chr1.bam -‐savp -‐out=Control_chr1.run_spp.out
• bsub /nfs/BaRC_Public/phantompeakqualtools/run_spp.R -‐c=H3k4me3_chr1.bam -‐savp -‐out=H3k4me3_chr1.run_spp.out -‐savp, save cross-‐correlaBon plot
16
Peak calling
i) Read extension and signal profile generaBon § strand cross-‐correlaBon can be used to
calculate fragment length ii) Peak evaluaBon
§ Look for fold enrichment of the sample over input or expected background
§ EsBmate the significance of the fold enrichment using: • Poisson distribuBon • negaBve binomial distribuBon • background distribuBon from input
DNA • model background data to adjust for
local variaBon (MACS)
17
Pepke, S. et al. ComputaBon for ChIP-‐seq and RNA-‐seq studies, Nat Methods. Nov. 2009
Peak calling: MACS • MACS uses a two-‐step strategy:
1. Builds a model using high quality peaks and infers shil size from it. 2. Shils the reads and does a second round to call the final peaks.
• It uses a Poisson distribuBon to assign p-‐values to peaks. But the distribuBon has a dynamic parameter, local lambda, to capture the influence of local biases.
• MACS default is to filter out redundant tags at the same locaBon and with the same strand by allowing at most 1 tag. This works well.
• -‐g: You need to set up this parameter accordingly: EffecBve genome size. It can be 1.0e+9 or 1000000000, or shortcuts: 'hs' for human (2.7e9), 'mm' for mouse (1.87e9), 'ce' for C. elegans (9e7) and 'dm' for fruit fly (1.2e8), Default:hs
• For broad peaks like some histone modificaBons it is recommended to use --nomodel and if there is not input sample to use --nolambda.
18
Hands-‐on Steps 5: MACS • Input files: Hepg2H3k4me3_chr1.bam Control_chr1.bam • MACS command bsub macs2 callpeak -‐t H3k4me3_chr1.bam -‐c Control_chr1.bam -‐-‐name H3k4me3_chr1
-‐-‐format BAM
PARAMETERS • -‐t TFILE Treatment file • -‐c CFILE Control file • -‐-‐name NAME Experiment name, which will be used to generate output file names. DEFAULT:
“NA” • -‐-‐format FORMAT Format of tag file, “BED” or “SAM” or “BAM” or “BOWTIE”. DEFAULT:
“BED” • -‐-‐extsize EXTSIZE The arbitrary extension size in bp. When nomodel is true, MACS will use this value as fragment size to extend each read towards 3' end, then , pile them up. You can use the value from the strand cross-‐correla3on analysis.
19
MACS output Output files:
1. Excel peaks file contains the following columns Chr, start, end, length, abs_summit, pileup, -LOG10(pvalue), -LOG10(qvalue), name
2. Bed peaks file chr, start, end, -LOG10(qvalue)
3. Summits.bed: contains the peak summits locaBons for every peaks. The 5th column in this file is -‐log10qvalue the same as NAME_peaks.bed
4. Peaks.narrowPeak is BED6+4 format file. Contains the peak locaBons together with peak summit, fold-‐change, pvalue and qvalue.
5. Model.r is an R script which you can use to produce a PDF image about the model based on your data. (Rscript NAME_model.r)
To look at the peaks on a genome browser you can upload one of the output bed
files or you can also make a bedgraph file with columns (step 6 of hands on): chr, start, end, fold_enrichment
20
Visualize peaks in IGV
21
Hepg2H3k4me3_chr1.bedgraph
Hepg2H3k4me3_chr1.sorted.bam Coverage
Hepg2H3k4me3_chr1.sorted.bam
Control_chr1.sorted.bam
Control_chr1.sorted.bam Cov
Other recommendaBons
• Look at your mapped reads and peaks in a genome browser to verify peak calling thresholds
• OpBonal: remove reads mapping to the ENCODE and 1000 Genomes blacklisted regions hGps://sites.google.com/site/anshulkundaje/projects/blacklists
22
Linking peaks to genes: Bed tools
23
intersectBed
closestBed
slopBed
coverageBed
groupBy It groups rows based on the value of a given column/s and it summarizes the other columns
Hands-‐on 8: Link peaks to nearby genes
• Take all genes and add 3Kb up and down with slopBed slopBed -‐b 3000 -‐i GRCh37.p13.HumanENSEMBLgenes.bed -‐g /nfs/
genomes/human_gp_feb_09_no_random/anno/chromInfo.txt > HumanGenesPlusMinus3kb.bed
• Intersect the slopped genes with peaks and get the list of unique genes overlapping intersectBed -‐wa -‐a HumanGenesPlusMinus3kb.bed -‐b peaks.bed | awk '{print $4}' | sort -‐u > Genesat3KborlessfromPeaks.txt
intersectBed -‐wa -‐a HumanGenesPlusMinus3kb.bed -‐b peaks.bed | head -‐3 chr1 45956538 45968751 ENSG00000236624_CCDC163P chr1 45956538 45968751 ENSG00000236624_CCDC163P chr1 51522509 51528577 ENSG00000265538_MIR4421
24
closestBed -‐d -‐a peaks.bed -‐b GRCh37.p13.HumanENSEMBLgenes.bed |head chr1 20870 21204 H3k4me3_chr1_peak_1 5.77592 chr1 14363 29806 ENSG00000227232_WASH7P 0
chr1 28482 30214 H3k4me3_chr1_peak_2 374.48264 chr1 29554 31109 ENSG00000243485_MIR1302-10 0
chr1 28482 30214 H3k4me3_chr1_peak_2 374.48264 chr1 14363 29806 ENSG00000227232_WASH7P 0
#the next two steps can also be done on excel
closestBed -‐d -‐a peaks.bed -‐b GRCh37.p13.HumanENSEMBLgenes.bed | groupBy -‐g 9,10 -‐c 6,7,8, -‐o disBnct,disBnct,disBnct | head -‐3
ENSG00000227232_WASH7P 0 chr1 14363 29806 ENSG00000243485_MIR1302-10 0 chr1 29554 31109
ENSG00000227232_WASH7P 0 chr1 14363 29806
closestBed -‐d -‐a peaks.bed -‐b GRCh37.p13.HumanENSEMBLgenes.bed | groupBy -‐g 9,10 -‐c 6,7,8, -‐o disBnct,disBnct,disBnct | awk 'BEGIN {OFS="\t"}{ if ($2<3000) {print $3,$4,$5,$1,$2} } ' | head -‐5
chr1 14363 29806 ENSG00000227232_WASH7P 0 chr1 29554 31109 ENSG00000243485_MIR1302-10 0 chr1 14363 29806 ENSG00000227232_WASH7P 0
chr1 134901 139379 ENSG00000237683_AL627309.1 0 chr1 135141 135895 ENSG00000268903_RP11-34P13.15 0
25
Hands-‐on 9: Link peaks to closest genes For each region find the closest gene and filter based on the distance to the gene
Hands-‐on 9: Link peaks to closest genes (1 command) For each region find the closest gene and filter based on the distance to the gene
closestBed -‐d -‐a peaks.bed -‐b GRCh37.p13.HumanENSEMBLgenes.bed | groupBy -‐g 9,10 -‐c 6,7,8, -‐o disBnct,disBnct,disBnct | awk 'BEGIN {OFS="\t"}{ if ($2<3000) {print $3,$4,$5,$1,$2} }' > closestGeneAt3KborLess.bed closestBed -‐d print the distance to the feature in -‐b groupBy -‐g columns to group on -‐c columns to summarize -‐o operaBon to use to summarize
26
Visualizing ChIP-‐seq reads with ngsplot See Hot Topics: ngsplot
• Hands-‐on 10: bsub ngs.plot.r -‐G hg19 -‐R tss -‐C H3k4me3_chr1.bam -‐O H3k4me3_chr1.tss -‐T H3K4me3 -‐L 3000 -‐FL 300
27
References § Previous Hot Topics
Quality Control and Mapping Reads hGp://jura.wi.mit.edu/bio/educaBon/hot_topics/NGS_QC_mapping_Feb2015/NGS_QC_Mapping2015_1perPage.pdf
§ SOPs hGp://barcwiki.wi.mit.edu/wiki/SOPs/chip_seq_peaks
§ Reviews and benchmark papers: ChIP-‐seq: advantages and challenges of a maturing technology (Oct 09) (hGp://www.nature.com/nrg/journal/v10/n10/full/
nrg2641.html) ComputaBon for ChIP-‐seq and RNA-‐seq studies (Nov 09) (hGp://www.nature.com/nmeth/journal/v6/n11s/full/nmeth.
1371.html) PracBcal Guidelines for the Comprehensive Analysis of ChIP-‐seq Data. PLoS Comput. Biol. 2013 A computaBonal pipeline for comparaBve ChIP-‐seq analyses. Nat. Protoc. 2011 ChIP-‐seq guidelines and pracBces of the ENCODE and modENCODE consorBa. Genome Res. 2012.
§ Quality control and strand cross-‐correlaBon hGp://code.google.com/p/phantompeakqualtools/
§ MACS: Model-‐based Analysis of ChIP-‐Seq (MACS). Genome Biol 2008 hGp://liulab.dfci.harvard.edu/MACS/index.html
Using MACS to idenBfy peaks from ChIP-‐Seq data. Curr Protoc BioinformaPcs. 2011 hGp://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi0214s34/pdf § Bedtools:
hGps://code.google.com/p/bedtools/ hGp://bioinformaBcs.oxfordjournals.org/content/26/6/841.abstract § ngsplot
hGps://code.google.com/p/ngsplot/ Shen, L.*, Shao, N., Liu, X. and Nestler, E. (2014) ngs.plot: Quick mining and visualizaBon of next-‐generaBon sequencing data by integraBng genomic databases, BMC Genomics, 15, 284. 28