ChIP-seq Methods & Analysis


Gavin Schnitzler

Asst. Prof. Medicine TUSM, Investigator at MCRI, TMC

[email protected]


• Day 1: ChIP techniques, library production, UCSC browser tracks

• Day 2: QC on reads, mapping binding site peaks, examining read density maps.

• Day 3: Analyzing peaks in relation to genomic features, etc.

• Day 4: Analyzing peaks for transcription factor binding site consensus sequences.

• Day 5: Variants & advanced approaches.

ChIP-seq COURSE OUTLINE

• Introduction to variations on ChIP-seq methods

• Extensions & variations on TFBS analysis

• Analyzing published data & across platforms

• Downloading & installing programs

• Writing your own programs

Day 5 Outline

Next-Generation Sequencing Analysis

“ChIP-Seq is the best thing that happened to ChIP since the antibody. It is 100x better than ChIP-Chip since it escapes most of the problems of microarray probe hybridization. Plus it is cheaper, and genome wide. But ChIP-Seq is only the tip of the iceberg - there are many inventive ways to use a sequencer.”

Quote from intro to Homer software at: http://biowhat.ucsd.edu/homer/ngs/index.html

Extensions of ChIP-seq

ChIP-Seq: Isolation and sequencing of genomic DNA "bound" by a specific transcription factor, covalently modified histone, or other nuclear protein. This methodology provides genome-wide maps of factor binding. Most of HOMER's routines cater to the analysis of ChIP-Seq data.

DNase-Seq: Treatment of nuclei with an endonuclease such as DNase I results in cleavage of DNA at accessible regions. Isolating these regions and detecting them by sequencing allows the creation of DNase hypersensitivity maps, which show which regulatory elements are accessible in the genome. (A variant technique is called FAIRE-seq.)

MNase-Seq: Micrococcal nuclease (MNase) is a nuclease that degrades genomic DNA not wrapped around histones. The remaining DNA represents nucleosomal DNA, and can be sequenced to reveal nucleosome positions along the genome. This method can also be combined with ChIP to map nucleosomes that contain specific histone modifications.

RNA-Seq: Extraction, fragmentation, and sequencing of RNA populations within a sample. The replacement for gene expression measurements by microarray. There are many variants on this, such as Ribo-Seq (isolation of ribosomes translating RNA), small RNA-Seq (to identify miRNAs), etc.

GRO-Seq: RNA-Seq of nascent RNA. Transcription is halted, nuclei are isolated, labeled nucleotides are added back, and transcription briefly restarted resulting in labeled RNA molecules. These newly created, nascent RNAs are isolated and sequenced to reveal "rates of transcription" as opposed to the total number of stable transcripts measured by normal RNA-seq.

Hi-C: Genomic interaction assay for understanding genome 3D structure. This assay is much more specialized - For more information about how to use HOMER to analyze Hi-C data, check out the Hi-C analysis section.

http://biowhat.ucsd.edu/homer/interactions/index.html

Examining long-range interactions by ChIP-seq

Two DNA fragments associated with the same IP’d protein are ligated together.

Sequencing identifies both short-range and long range interactions.

Nature Reviews Genetics 2012 13:840

Fine-scale information from DNase-seq

Nature Reviews Genetics 2012 13:840

Sequencing the ends of DNase cuts identifies regions of bare DNA.

Fine scale analysis of this data can identify individual TF binding sites.

Capturing allele-specific information using SNPs in reads

CTCF binds better to the A variant

Bormann Chung CA, Boyd VL, McKernan KJ, Fu Y, et al. (2010) Whole Methylome Analysis by Ultra-Deep Sequencing Using Two-Base Encoding. PLoS ONE 5(2): e9320. doi:10.1371/journal.pone.0009320 http://www.plosone.org/article/info:doi/10.1371/journal.pone.0009320

Mapping CpG DNA methylation patterns

Approaches:

•IP of DNA fragments using antibodies against meC or meCpG binding proteins.

•Selection of DNA fragments using methyl-sensitive restriction enzymes.

•Whole genome bisulfite sequencing.


Mapping nucleosome positions

Approaches:

• 1) Fragmentation to mononucleosome size by sonication or micrococcal nuclease (MNase)

• 2) ChIP w/ antibody against histone modification (H3K4me1): can map positions of nucleosomes with this mark.

• 3) Whole genome sequencing.

Nat Struct Mol Biol. 2011 June; 18(6): 742–746.

http://www.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&retmode=ref&cmd=prlinks&id=21623366



[Figure: ChIP-seq reads/10 bp/promoter (y-axis, 0 to 60) versus bp from TSSes of gene group (x-axis, -2000 to 2000). Series: LiERBS_v_LiD, LiERBS_v_LiU, LiERBS_v_LiNon-regl., LiERBS_v_LiNon-expr., AoERBS_v_AoD, AoERBS_v_AoU, AoERBS_v_AoNon-regl.]

Plotting ChIP-seq read density versus genomic features

Taking average normalized .bedgraph data relative to TSSes…
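A minimal Python sketch of this kind of averaging may help make the idea concrete. It assumes bedGraph intervals are already loaded as (chrom, start, end, value) tuples and TSSes as (chrom, position, strand) tuples; the function name `average_tss_profile` and the toy data in the usage note are hypothetical, while the 10 bp bins and the ±2000 bp window follow the plot axes. It is not the program used to make the plots.

```python
def average_tss_profile(bedgraph, tsses, flank=2000, bin_size=10):
    """Average binned signal around TSSes.

    bedgraph: iterable of (chrom, start, end, value), 0-based half-open.
    tsses: iterable of (chrom, pos, strand) with strand '+' or '-'.
    Returns a list of (bin_start_offset, summed_signal_per_promoter).
    """
    tsses = list(tsses)
    if not tsses:
        raise ValueError("need at least one TSS")

    # Per-base signal lookup; fine for a sketch, too slow for a whole genome.
    signal = {}
    for chrom, start, end, value in bedgraph:
        for pos in range(start, end):
            signal[(chrom, pos)] = value

    n_bins = (2 * flank) // bin_size
    totals = [0.0] * n_bins
    for chrom, tss, strand in tsses:
        for offset in range(-flank, flank):
            # Flip the offset on the minus strand so upstream stays negative.
            pos = tss + offset if strand == "+" else tss - offset
            totals[(offset + flank) // bin_size] += signal.get((chrom, pos), 0.0)

    # Divide by the number of promoters to get the per-promoter average.
    return [(-flank + i * bin_size, totals[i] / len(tsses)) for i in range(n_bins)]
```

For example, `average_tss_profile([("chr1", 1000, 1010, 2.0)], [("chr1", 1000, "+")], flank=20, bin_size=10)` places all of the signal in the bin that starts at offset 0.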

Using input chromatin read density to measure nucleosome densities

Initial support: Average normalized .bedgraph data from INPUT sample relative to TSSes recapitulates the low nucleosome occupancy seen genomewide over promoters.

[Figure: Li INPUT shifted reads/10 bp/promoter (y-axis, 0 to 8) versus bp from Li Down promoter TSSes (x-axis, -2000 to 2000). Series: LiINPUT_v_LiD_pros (norm'd), LiINPUT_v_LiU_pros (norm'd), LiINPUT_v_AoD_pros (norm'd), LiINPUT_v_AoU_pros (norm'd)]

Hypothesis: Sonication mostly cuts in nucleosome free regions or inter-nucleosomal spacers. Thus, read positions give information about nucleosome positions.




Hannenhalli S Bioinformatics 2008;24:1325-1331

Many approaches to TFBS analysis

Also, Ladunga I. An overview of the computational analyses and discovery of transcription factor binding sites. Methods Mol Biol. 2010;674:1-22. doi: 10.1007/978-1-60761-854-6_1. An introduction to a set of about a dozen methods papers.


http://www.ncbi.nlm.nih.gov/pubmed/20827582

The Gibbs sampler approach

Objective: Find conserved segment of length k in n unrelated sequences

[Figure: n sequences (labeled 1, 2, ... n), each with a candidate segment spanning positions 1 to k]

From: Lawrence, C. et al. (1993) Detecting Subtle Sequence Signals: A Gibbs Sampler approach to Multiple Alignment. Science 262.208-

De Novo Search Algorithms

The program will need to run once for each k: e.g. 6 bp, 7 bp, 8 bp sequences, etc. (either automatically, or by hand).

Adapted from: www.stats.ox.ac.uk/~hein/Signals.ppt
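The Gibbs sampler objective above can be illustrated with a short Python sketch. This is a simplified site sampler written for illustration, not the published implementation: it assumes exactly one motif occurrence per sequence, uppercase A/C/G/T input, and a uniform background (a simplification). The function name `gibbs_motif_search` and all parameter names are hypothetical.

```python
import random

def gibbs_motif_search(seqs, k, n_iter=2000, pseudocount=1.0, seed=0):
    """Site-sampler sketch: one motif occurrence of width k per sequence."""
    rng = random.Random(seed)
    alphabet = "ACGT"
    # Start from a random motif position in every sequence.
    starts = [rng.randrange(len(s) - k + 1) for s in seqs]

    for _ in range(n_iter):
        # Hold one sequence out and build a count profile from the others.
        held = rng.randrange(len(seqs))
        counts = [{b: pseudocount for b in alphabet} for _ in range(k)]
        for i, (s, st) in enumerate(zip(seqs, starts)):
            if i == held:
                continue
            for j in range(k):
                counts[j][s[st + j]] += 1
        total = (len(seqs) - 1) + pseudocount * len(alphabet)

        # Score every window of the held-out sequence under the profile,
        # relative to a uniform background (assumption: 0.25 per base).
        s = seqs[held]
        weights = []
        for st in range(len(s) - k + 1):
            w = 1.0
            for j in range(k):
                w *= (counts[j][s[st + j]] / total) / 0.25
            weights.append(w)

        # Sample the new start position in proportion to those weights.
        starts[held] = rng.choices(range(len(weights)), weights=weights)[0]

    return starts, [s[st:st + k] for s, st in zip(seqs, starts)]
```

As the slide notes, a search like this is run once per k. The sketch returns the last sampled alignment; a fuller version would keep the best-scoring alignment seen during sampling.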

The EM approach (in MEME etc.)

The Expectation Maximization algorithm proceeds in iterations until E & M converge. For an explanation of the process see Nature Biotechnology 26, 897 - 899 (2008).

Two de novo search methods

• DME is part of the same CREAD package that Storm is in (run in UNIX)

• SEME uses some of the same refinements as CentDist to do de novo searches: http://biogpu.d1.comp.nus.edu.sg/~chipseq/webseqtools2/

Regulatory Modules: De novo cis-regulatory module elicitation for eukaryotic genomes. Proc Nat’l Acad Sci USA, 102, 7079-84

[Figure: module diagram. Labels: Start, M1, M2, M3, Stop, p12, p21, Gene A, Gene B]

Composite Patterns: BioOptimizer: the Bayesian Scoring Function Approach to Motif Discovery Bioinformatics


Extensions to Basic Models

Combining Signals and other Data

Expression and Motif Regression: Integrating Motif Discovery and Expression Analysis. Proc. Natl. Acad. Sci. 100, 3339-44

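1. Rank genes by E = log2(expression fold change)

2. Find “many” (hundreds) candidate motifs

3. For each motif pattern m, compute the vector Sm of matching scores for genes with the pattern

4. Regress E on Sm:

$$Y_g = \alpha + \beta_m S_{mg} + \varepsilon_g$$

where $Y_g$ is the expression measure E for gene g and $S_{mg}$ is the matching score of motif m for gene g.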
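The regression step (regress E on Sm) can be sketched in a few lines of Python. This is ordinary least squares for a single motif, shown only to make the model concrete; the published method considers many candidate motifs, and the function name `regress_expression_on_motif` and the toy numbers in the usage note are hypothetical.

```python
def regress_expression_on_motif(E, S):
    """Ordinary least squares fit of E_g = alpha + beta * S_g + error.

    E: expression values (e.g. log2 fold change), one per gene.
    S: matching scores of one motif for the same genes, in the same order.
    Returns (alpha, beta, r_squared).
    """
    n = len(E)
    mean_e = sum(E) / n
    mean_s = sum(S) / n
    # Sums of squares and cross-products around the means.
    sxx = sum((s - mean_s) ** 2 for s in S)
    sxy = sum((s - mean_s) * (e - mean_e) for s, e in zip(S, E))
    beta = sxy / sxx
    alpha = mean_e - beta * mean_s
    # Fraction of the variance in E explained by the motif score.
    residual = sum((e - (alpha + beta * s)) ** 2 for s, e in zip(S, E))
    total = sum((e - mean_e) ** 2 for e in E)
    return alpha, beta, 1.0 - residual / total
```

Calling `regress_expression_on_motif([1, 3, 5, 7], [0, 1, 2, 3])` fits a slope of 2 and an intercept of 1. A motif whose slope differs clearly from zero is a candidate for explaining the expression change.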

[Figure labels: Motifs; Coding regions]

ChIP-on-chip - 1-2 kb information on protein/DNA interaction: An Algorithm for Finding Protein-DNA Interaction Sites with Applications to Chromatin Immunoprecipitation Microarray Experiments Nature Biotechnology, 20, 835-39

[Figure labels: Protein binding in neighborhood; Coding regions]


For use of evolutionary conservation information w/ individual motifs see: Das & Dai 2007 BMC Bioinformatics 8:S21.

For regulatory modules see: Su J, Teichmann SA, Down TA (2010) Assessing Computational Methods of Cis-Regulatory Module Prediction. PLoS Comput Biol 6(12): e1001020. doi:10.1371/journal.pcbi.1001020 http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1001020

Assessment of evolutionary conservation

Modules shared across species are most highly rated.



Integrating data from multiple sources w/ permutation of average ranks

Let’s say we want to combine data from several sources or metrics to decide which are the most relevant enriched TFs, e.g. 1) p.value in CentDist, 2) p.value in Storm & 3) p.value of homologous sequence in DME

Establish a ranking metric for each (e.g. 1 best to 10 worst). It doesn’t have to be the same for 1, 2 & 3, but you need to apply the same rank system across different biological conditions.

For each TF compute the average rank.

(1)  (2)  (3)  (avg)
1    3    2    2
4    5    8    5.7
5    4    2    3.7
8    9    7    8
.    .    .    .

Permutation of average ranks

Now take the same columns of ranks for (1), (2) & (3) and randomize each one separately.

(1)  (2)  (3)  (avg)
8    5    2    5
4    4    8    5.3
5    3    7    5
1    9    2    4
.    .    .    .

Repeat this many times (until you have thousands of random average ranks) & plot frequency vs avg. rank…

[Figure: frequency versus permuted average rank; x-axis ticks 10, 7, 5, 3, 1]

The number of times a given value (or a better one) is observed, divided by the total number of permuted averages, gives an estimate of the false discovery rate.

Example: an average rank of 2.0 is observed 34/10,000 times in the permuted averages. Estimated FDR ~3.4e-3
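The permutation procedure can be sketched in Python as follows. The rank columns in the usage note come from the example table; the function names `permuted_average_ranks` and `estimated_fdr` are hypothetical, and counting permuted averages that are less than or equal to the observed value (rather than exactly equal) is an assumption about how "observed" should be counted.

```python
import random

def permuted_average_ranks(rank_columns, n_iter=10000, seed=0):
    """Shuffle each rank column independently and collect row averages."""
    rng = random.Random(seed)
    n_rows = len(rank_columns[0])
    permuted = []
    for _ in range(n_iter):
        # Randomize every column separately, as described above.
        shuffled = [rng.sample(col, len(col)) for col in rank_columns]
        for row in range(n_rows):
            permuted.append(sum(col[row] for col in shuffled) / len(shuffled))
    return permuted

def estimated_fdr(observed_avg, permuted):
    """Fraction of permuted averages at least as good (low) as observed."""
    hits = sum(1 for value in permuted if value <= observed_avg)
    return hits / len(permuted)
```

With the example columns, `permuted = permuted_average_ranks([[1, 4, 5, 8], [3, 5, 4, 9], [2, 8, 2, 7]])` followed by `estimated_fdr(2.0, permuted)` gives the fraction of random averages that reach the best observed average rank. The same rank columns must be used for the observed and the permuted averages.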



Getting data from SRA

Go to: http://www.ncbi.nlm.nih.gov/sra

Find an experiment by searching, e.g. “encode h1-hesc h3k4me3”

Click on the name to the left of the smaller file (1.9M) & then on the downloads tab.

Right click on the ftp link for the run & copy the link location.

Open putty & login to your account at cluster.uit.tufts.edu

Go to your /cluster/shared/[userID]/chip directory & do:

wget [pasted URL]

What if you want to know something from a published dataset, but they’ve only provided the raw data on SRA?


Decoding the .sra format

The SRR227387.sra file you now have is in a special file format, but it does have all the original .fastq information in it.

To get that info do:

bsub /cluster/tufts/cbi*/Ch*/ESC*/sra*/bin/fastq-dump SRR227387.sra

[fastq-dump is part of a package of programs for handling .sra files that you can download, unpack & run immediately from your shared directory, at least as far as simple files like fastq-dump are concerned]

This gives you the same .fastq format you’re familiar with.

Use head to confirm the format, but then you might as well delete the file with rm so as not to clutter up the cluster.
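If you would rather confirm the format programmatically than by eye, a small Python sketch like the one below checks the four-line FASTQ record structure. The function name `read_fastq_records` is hypothetical, and the output file name in the usage note (SRR227387.fastq) is an assumption about what fastq-dump wrote.

```python
def read_fastq_records(lines):
    """Yield (name, sequence, quality) from an iterable of FASTQ lines."""
    lines = iter(lines)
    for header in lines:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        # A truncated record yields empty strings here and fails the checks.
        seq = next(lines, "").rstrip("\n")
        plus = next(lines, "").rstrip("\n")
        qual = next(lines, "").rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"not a FASTQ record: {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch: {header!r}")
        yield header[1:], seq, qual
```

Typical use would be `with open("SRR227387.fastq") as fh: first = next(read_fastq_records(fh))`, which reads only the first record rather than the whole file.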

After this week you are now ready to do any analysis you want on this data, from mapping reads to the genome (w/ bowtie) to peak calling (w/ MACS), to TFBS analysis.

“Liftover” programs to convert between genomes & builds

Several useful tools for this in Cistrome/Galaxy (Liftover/Others):

• Convert between RefSeq, Gene Symbols to Entrez IDs using Bioconductor.

• Liftover Wig Files: Liftover wig files [Galaxy]

• Convert genome coordinates between assemblies and genomes

• Extract data from Wiggle: Extract data for certain chromosome from a wiggle file

• Extract data from Bed: Extract data for certain chromosome from a BED file

In the UCSC genome browser:

• Tools -> Liftover

• Choose the starting genome/build & the one you want to convert to.

• Upload a .bed file w/ the ranges you want & hit go (only works for .bed files; it may work with bedGraph, although I haven’t confirmed this)




Don’t be intimidated!

There’s nothing to prevent you from installing a program you want to run in your cluster account.

Before you begin, though, type “module available” to see if it’s already installed as a module. Also go to /cluster/tufts/ngsp/ngsp/ to see if it’s installed there.

• Read the documentation from the creator’s lab, download, unzip &/or unpack the file, read the INSTALL or README files included, & give it a try.

• You may need to be running a specific version of perl or python, etc. If so, check “module available” to see if it’s installed on the cluster & use “module load [name]” to add it.

• You may also need to set system variables using “export VARIABLE=$VARIABLE:/new/path”. README files should tell you enough to know what to try.

• If you get stuck, the cluster support folks are friendly & helpful (and respond moderately fast). Contact them at: [email protected].

A different integrated package of tools to run in UNIX

http://biowhat.ucsd.edu/homer/ngs/index.html

HOMER

Software for motif discovery and next-gen sequencing analysis

• Mapping to the genome (NOT performed by HOMER, but important to understand)

• Creation of tag directories, quality control, and normalization (makeTagDirectory)

• UCSC visualization (makeUCSCfile, makeBigWig.pl)

• Peak finding / Transcript detection / Feature identification (findPeaks)

• Motif analysis (findMotifsGenome.pl)

• Annotation of Peaks (annotatePeaks.pl)

• Quantification of Data at Peaks/Regions in the Genome/Histograms and Heatmaps (annotatePeaks.pl)

• Quantification of Transcripts (analyzeRNA.pl)

Additional analysis strategies:

• General sequence manipulation tools (homerTools)

• Miscellaneous Tools for Sharing Data between programs, etc. (tagDir2bed.pl, bed2pos.pl, pos2bed.pl ...)

• Finding overlapping or differentially bound peaks (mergePeaks, getDifferentialPeaks)

• ChIP-Seq analysis automation (analyzeChIP-Seq.pl)

• Description of file formats

Could be very useful… & with (only a bit of) luck, you’ll be able to install & run them yourself.

http://biowhat.ucsd.edu/homer/ngs/alignment.html

http://biowhat.ucsd.edu/homer/ngs/tagDir.html

http://biowhat.ucsd.edu/homer/ngs/ucsc.html

http://biowhat.ucsd.edu/homer/ngs/peaks.html

http://biowhat.ucsd.edu/homer/ngs/peakMotifs.html

http://biowhat.ucsd.edu/homer/ngs/annotation.html

http://biowhat.ucsd.edu/homer/ngs/quantification.html


http://biowhat.ucsd.edu/homer/ngs/analyzeRNA.html

http://biowhat.ucsd.edu/homer/ngs/homerTools.html

http://biowhat.ucsd.edu/homer/ngs/miscellaneous.html

http://biowhat.ucsd.edu/homer/ngs/mergePeaks.html

http://biowhat.ucsd.edu/homer/ngs/analyzeChIP-Seq.html



http://biowhat.ucsd.edu/homer/ngs/formats.html

http://biowhat.ucsd.edu/homer/index.html

Installing a program in R

Check out the Key R Commands link at http://sites.tufts.edu/cbi/resources/chip-seq/

This is not an introduction to programming in R! Instead it gives basic instructions for how to:

1) install & run R packages that may be needed for your research,

2) move data files into R,

3) perform simple edits on this data that may be required by the package, &

4) output your results.

Note: I find that the documentation for R packages is generally quite good.




Mastering simple UNIX tools

Find out more at: http://sites.tufts.edu/cbi/resources/rna-seq-course/unix-resources/

find, awk, grep, sort, sed & more

One-line commands to let you search and manipulate large data files w/o writing a program or trying to use the kludgy and limited tools in Galaxy.

Programming: Get your feet wet

Perl Tutorials: learn.perl.org/tutorials/
Many tutorials are available if you are interested in learning Perl. These tutorials are introductions.

Beginning Perl (free): www.perl.org/books/beginning-perl/
This book is for those new to programming who want to learn with Perl.

A ton of Perl programs for you to use/adapt/modify: http://www.bioperl.org/wiki/Main_Page

For learning R: Check out Josh’s links at: http://sites.tufts.edu/cbi/resources/rna-seq-course/r-resources/

Also check out my notes on using R (specifically geared to the minimum you need to install & use existing programs) & a brief reference sheet on Perl at http://sites.tufts.edu/cbi/resources/chip-seq/

Look at examples, check the web…

If you’re looking for a command in UNIX, R, Perl, Python, etc. do a Google search (for R add “statistical” to your search to specify what you mean).

If you’re wondering how to get a program to do something, look at other programs & see how they did it.

You don’t need to memorize the language, beyond a few basics, just look at what you (or someone else) did before & copy it.

Questions?

What would you like to explore?

What’s the next bioinformatics challenge in your research?

Course evaluation forms…
