ChIP-seq Methods & Analysis
Gavin Schnitzler, Asst. Prof. Medicine TUSM, Investigator at MCRI, TMC
[email protected] 617-636-0615
ChIP-seq COURSE OUTLINE
• Day 1: ChIP techniques, library production, UCSC browser tracks
• Day 2: QC on reads, mapping binding site peaks, examining read density maps.
• Day 3: Analyzing peaks in relation to genomic features, etc.
• Day 4: Analyzing peaks for transcription factor binding site consensus sequences.
• Day 5: Variants & advanced approaches.
Day 5 Outline
• Introduction to variations on ChIP-seq methods
• Extensions & variations on TFBS analysis
• Analyzing published data & across platforms
• Downloading & installing programs
• Writing your own programs
Next-Generation Sequencing Analysis
“ChIP-Seq is the best thing that happened to ChIP since the antibody. It is 100x better than ChIP-Chip since it escapes most of the problems of microarray probe hybridization. Plus it is cheaper, and genome wide. But ChIP-Seq is only the tip of the iceberg - there are many inventive ways to use a sequencer.”
Quote from intro to Homer software at: http://biowhat.ucsd.edu/homer/ngs/index.html
Extensions of ChIP-seq
ChIP-Seq: Isolation and sequencing of genomic DNA "bound" by a specific transcription factor, covalently modified histone, or other nuclear protein. This methodology provides genome-wide maps of factor binding. Most of HOMER's routines cater to the analysis of ChIP-Seq data.
DNase-Seq: Treatment of nuclei with a nuclease such as DNase I will result in cleavage of DNA at accessible regions. Isolation of these regions and their detection by sequencing allows the creation of DNase hypersensitivity maps, providing information about which regulatory elements are accessible in the genome. (Variant technique: FAIRE-seq.)
MNase-Seq: Micrococcal Nuclease (MNase) is a nuclease that degrades genomic DNA not wrapped around histones. The remaining DNA represents nucleosomal DNA, and can be sequenced to reveal nucleosome positions along the genome. This method can also be combined with ChIP to map nucleosomes that contain specific histone modifications.
RNA-Seq: Extraction, fragmentation, and sequencing of RNA populations within a sample. The replacement for gene expression measurements by microarray. There are many variants on this, such as Ribo-Seq (isolation of ribosomes translating RNA), small RNA-Seq (to identify miRNAs), etc.
GRO-Seq: RNA-Seq of nascent RNA. Transcription is halted, nuclei are isolated, labeled nucleotides are added back, and transcription briefly restarted resulting in labeled RNA molecules. These newly created, nascent RNAs are isolated and sequenced to reveal "rates of transcription" as opposed to the total number of stable transcripts measured by normal RNA-seq.
Hi-C: Genomic interaction assay for understanding genome 3D structure. This assay is much more specialized - For more information about how to use HOMER to analyze Hi-C data, check out the Hi-C analysis section.
Examining long-range interactions by ChIP-seq
Two DNA fragments associated with the same IP’d protein are ligated together.
Sequencing identifies both short-range and long range interactions.
Nature Reviews Genetics 2012 13:840
Fine scale information from DNase-seq
Nature Reviews Genetics 2012 13:840
Sequencing the ends of DNase cuts identifies regions of bare DNA.
Fine scale analysis of this data can identify individual TF binding sites.
Capturing allele-specific information using SNPs in reads
CTCF binds better to the A variant
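One simple way to put a number on an allelic imbalance like the CTCF example is to count the ChIP-seq reads carrying each allele at a heterozygous SNP and test the counts against a 50:50 split. The sketch below is an illustration only: the function name and the read counts are hypothetical, the exact two-sided binomial test is our choice rather than a method named on the slide, and it ignores mapping bias toward the reference allele.

```python
from math import comb

def allele_binomial_p(ref_reads: int, alt_reads: int) -> float:
    """Two-sided exact binomial p-value for a 50:50 allele ratio."""
    n = ref_reads + alt_reads
    if n == 0:
        return 1.0  # no reads cover the SNP, so there is no evidence either way
    # Under the null (p = 0.5) every outcome k has probability C(n, k) / 2**n.
    observed = comb(n, ref_reads)
    # Add up all outcomes that are no more likely than the observed one.
    tail = sum(comb(n, k) for k in range(n + 1) if comb(n, k) <= observed)
    return tail / 2 ** n

# Hypothetical counts: 30 reads with the A variant, 10 with the other allele.
p_value = allele_binomial_p(30, 10)
```

A small p-value suggests the factor prefers one allele; with only a handful of reads over the SNP the test has little power.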
Mapping CpG DNA methylation patterns
Approaches:
• IP of DNA fragments using antibodies against meC or meCpG binding proteins.
• Selection of DNA fragments using methyl-sensitive restriction enzymes.
• Whole genome bisulfite sequencing.
Bormann Chung CA, Boyd VL, McKernan KJ, Fu Y, et al. (2010) Whole Methylome Analysis by Ultra-Deep Sequencing Using Two-Base Encoding. PLoS ONE 5(2): e9320. doi:10.1371/journal.pone.0009320 http://www.plosone.org/article/info:doi/10.1371/journal.pone.0009320
Mapping nucleosome positions
Approaches:
•1) Fragmentation to mononucleosome size by sonication or micrococcal nuclease (MNase)
ChIP w/ antibody against histone modification (H3K4me1) – can map positions of nucleosomes with this mark.
Whole genome sequencing.
Nat Struct Mol Biol. 2011 June; 18(6): 742–746.
Plotting ChIP-seq read density versus genomic features
Taking average normalized .bedgraph data relative to TSSes…
[Figure: ChIP-seq reads/10bp/promoter (y-axis, 0 to 60) vs. BP from TSSes of gene group (x-axis, -2000 to 2000). Series: LiERBS_v_LiD, LiERBS_v_LiU, LiERBS_v_LiNon-regl., LiERBS_v_LiNon-expr., AoERBS_v_AoD, AoERBS_v_AoU, AoERBS_v_AoNon-regl.]
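The averaging behind a plot like this can be sketched in a few lines of Python. This is a minimal illustration, not the script used for the slide: the function name `tss_profile`, the tuple layouts and the toy data are assumptions, and the per-base dictionary is only practical for small regions (a real genome needs an indexed or binned structure).

```python
def tss_profile(bedgraph, tss_list, flank=2000, bin_size=10):
    """Average bedGraph signal in bins around a set of TSSes.

    bedgraph: iterable of (chrom, start, end, value), 0-based half-open.
    tss_list: list of (chrom, position, strand), strand '+' or '-'.
    Returns [(bin_start_offset, mean_signal_per_bp_per_promoter), ...].
    """
    # Per-base signal lookup (sketch only; memory-hungry on real data).
    signal = {}
    for chrom, start, end, value in bedgraph:
        for pos in range(start, end):
            signal[(chrom, pos)] = value

    n_bins = 2 * flank // bin_size
    totals = [0.0] * n_bins
    for chrom, tss, strand in tss_list:
        for offset in range(-flank, flank):
            # Flip minus-strand genes so negative offsets are always upstream.
            pos = tss + offset if strand == "+" else tss - offset
            totals[(offset + flank) // bin_size] += signal.get((chrom, pos), 0.0)

    n_tss = len(tss_list)
    profile = []
    for i in range(n_bins):
        mean = totals[i] / (bin_size * n_tss) if n_tss else 0.0
        profile.append((-flank + i * bin_size, mean))
    return profile

# Toy example: one 10-bp block of signal starting exactly at a plus-strand TSS.
toy_profile = tss_profile([("chr1", 100, 110, 4.0)], [("chr1", 100, "+")],
                          flank=20, bin_size=10)
```

Running the same function once per gene group (up-regulated, down-regulated, non-regulated, and so on) gives one curve per group, which is how the series in the figure are distinguished.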
Using input chromatin read density to measure nucleosome densities
Initial support: Average normalized .bedgraph data from INPUT sample relative to TSSes recapitulates the low nucleosome occupancy seen genomewide over promoters.
[Figure: Li INPUT shifted reads/10 bp/promoter (y-axis, 0 to 8) vs. BP from Li Down promoter TSSes (x-axis, -2000 to 2000). Series: LiINPUT_v_LiD_pros (norm'd), LiINPUT_v_LiU_pros (norm'd), LiINPUT_v_AoD_pros (norm'd), LiINPUT_v_AoU_pros (norm'd)]
Hypothesis: Sonication mostly cuts in nucleosome free regions or inter-nucleosomal spacers. Thus, read positions give information about nucleosome positions.
Many approaches to TFBS analysis
Hannenhalli S Bioinformatics 2008;24:1325-1331
Also, Ladunga I. An overview of the computational analyses and discovery of transcription factor binding sites. Methods Mol Biol. 2010;674:1-22. doi: 10.1007/978-1-60761-854-6_1. Introduction to a set of about a dozen methods papers.
The Gibbs sampler approach
Objective: Find a conserved segment of length k in n unrelated sequences.
[Figure: sequences 1, 2, ... n, each with a candidate segment spanning positions 1 to k]
From: Lawrence, C. et al. (1993) Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment. Science 262:208-
De Novo Search Algorithms
The program will need to run once for each k:e.g. 6 bp, 7 bp, 8 bp sequences, etc. (either automatically, or by hand).
Adapted from: www.stats.ox.ac.uk/~hein/Signals.ppt
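To make the Gibbs sampler idea concrete, here is a stripped-down site sampler in Python. It is a teaching sketch under stated assumptions, not the published algorithm in full: it assumes exactly one motif occurrence per sequence, a uniform background, uppercase A/C/G/T input with every sequence at least k long, and it omits refinements such as phase shifts. The function name, parameters and toy sequences are hypothetical.

```python
import random

def gibbs_motif_search(seqs, k, iterations=200, seed=0, pseudocount=1.0):
    """Return one candidate motif start position per sequence."""
    rng = random.Random(seed)
    alphabet = "ACGT"
    # Start from a random window in every sequence.
    starts = [rng.randrange(len(s) - k + 1) for s in seqs]
    total = len(seqs) - 1 + len(alphabet) * pseudocount
    for _ in range(iterations):
        for i, seq in enumerate(seqs):
            # Build position-specific counts from all the other sequences.
            counts = [{b: pseudocount for b in alphabet} for _ in range(k)]
            for j, other in enumerate(seqs):
                if j != i:
                    for pos in range(k):
                        counts[pos][other[starts[j] + pos]] += 1
            # Score every window of the held-out sequence under that profile
            # (with a uniform background the background term is a constant).
            weights = []
            for start in range(len(seq) - k + 1):
                w = 1.0
                for pos in range(k):
                    w *= counts[pos][seq[start + pos]] / total
                weights.append(w)
            # Sample the new start in proportion to its weight.
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

# Toy input with a shared 6-mer; as noted above, run once per k of interest.
toy_seqs = ["ACGTTGACCTGA", "TTGACCAAGTCA", "GGATTGACCATC"]
toy_starts = {k: gibbs_motif_search(toy_seqs, k) for k in (6, 7)}
```

Because the sampler is stochastic, real tools restart it from several random seeds and keep the best-scoring alignment.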
The EM approach (in MEME etc.)
Expectation Maximization algorithm: alternates expectation (E) and maximization (M) steps until the motif model converges. For an explanation of the process see Nature Biotechnology 26, 897 - 899 (2008).
Two de novo search methods
• DME is part of the same CREAD package that storm is in (run in UNIX)
• SEME uses some of the same refinements as CentDist to do de novo searches: http://biogpu.d1.comp.nus.edu.sg/~chipseq/webseqtools2/
Extensions to Basic Models
Regulatory Modules: De novo cis-regulatory module elicitation for eukaryotic genomes. Proc Nat’l Acad Sci USA, 102, 7079-84
[Figure: module model with Start, M1, M2, M3 and Stop states and transition probabilities p12, p21; Gene A, Gene B]
Composite Patterns: BioOptimizer: the Bayesian Scoring Function Approach to Motif Discovery. Bioinformatics
Adapted from: www.stats.ox.ac.uk/~hein/Signals.ppt
Combining Signals and other Data
Expression and Motif Regression: Integrating Motif Discovery and Expression Analysis. Proc.Natl.Acad.Sci. 100.3339-44
1. Rank genes by E = log2(expression fold change)
2. Find “many” (hundreds) candidate motifs
3. For each motif pattern m, compute the vector Sm of matching scores for genes with the pattern
4. Regress E on Sm:
$Y_g = \alpha + \beta_m S_{mg} + \varepsilon_g$
where $Y_g$ is the expression value E of gene g and $S_{mg}$ is the matching score of motif m for gene g.
[Figure: motifs and coding regions]
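Step 4 is an ordinary least-squares fit, done once per candidate motif. The sketch below shows the single-motif fit in plain Python; the function name and the toy numbers are hypothetical, and the published method goes further (significance filtering and stepwise selection across motifs), which is not reproduced here.

```python
def regress_expression_on_motif(scores, expression):
    """Least-squares fit of Y_g = alpha + beta_m * S_mg for one motif m.

    scores: matching scores S_mg, one per gene.
    expression: log2 fold changes Y_g for the same genes, in the same order.
    Returns (alpha, beta_m).
    """
    n = len(scores)
    if n != len(expression) or n < 2:
        raise ValueError("need two equally long vectors with at least 2 genes")
    mean_s = sum(scores) / n
    mean_y = sum(expression) / n
    sxx = sum((s - mean_s) ** 2 for s in scores)
    if sxx == 0:
        raise ValueError("all motif scores are identical; slope is undefined")
    sxy = sum((s - mean_s) * (y - mean_y) for s, y in zip(scores, expression))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_s
    return alpha, beta

# Toy data: expression rises by 2 log2 units per unit of motif score.
alpha, beta = regress_expression_on_motif([0, 1, 2, 3], [1, 3, 5, 7])
```

A motif whose slope is large and significantly different from zero is a candidate regulator of the expression change; a slope near zero means the motif score carries no information about E.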
ChIP-on-chip - 1-2 kb information on protein/DNA interaction: An Algorithm for Finding Protein-DNA Interaction Sites with Applications to Chromatin Immunoprecipitation Microarray Experiments Nature Biotechnology, 20, 835-39
[Figure: protein binding in the neighborhood of coding regions]
Adapted from: www.stats.ox.ac.uk/~hein/Signals.ppt
Assessment of evolutionary conservation
For use of evolutionary conservation information w/ individual motifs see: Das & Dai 2007 BMC Bioinformatics 8:S21.
For regulatory modules see: Su J, Teichmann SA, Down TA (2010) Assessing Computational Methods of Cis-Regulatory Module Prediction. PLoS Comput Biol 6(12): e1001020. doi:10.1371/journal.pcbi.1001020 http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1001020
Modules shared across species are most highly rated.
Adapted from: www.stats.ox.ac.uk/~hein/Signals.ppt
Integrating data from multiple sources w/ permutation of average ranks
Let’s say we want to combine data from several sources or metrics to decide which are the most relevant enriched TFs, e.g. 1) p.value in CentDist, 2) p.value in Storm & 3) p.value of homologous sequence in DME.
Establish a ranking metric for each (e.g. 1 best to 10 worst). It doesn’t have to be the same for 1, 2 & 3, but you need to apply the same rank system across different biological conditions.
For each TF compute the average rank.
(1)  (2)  (3)  (avg)
1    3    2    2
4    5    8    5.7
5    4    2    3.7
8    9    7    8
.    .    .    .
Permutation of average ranks
Now take the same columns of ranks for (1), (2) & (3) and randomize each one separately.
(1)  (2)  (3)  (avg)
8    5    2    5
4    4    8    5.3
5    3    7    5
1    9    2    4
.    .    .    .
Repeat this many times (until you have thousands of random average ranks) & plot frequency vs. avg. rank…
[Figure: histogram of permuted average ranks; x-axis ticks 10, 7, 5, 3, 1]
The number of times a given average rank (or a better one) is observed among the permuted averages, divided by the total number of permuted averages, gives an estimate of the false discovery rate (strictly speaking, an empirical p-value).
2.0 observed 34/10,000 times in permuted averages. Estimated FDR ~3.4e-3
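The permutation procedure is short enough to script directly. Below is a minimal Python sketch using the four example rows from the table above; the function names are hypothetical, and `empirical_rate` counts permuted averages that are as good as or better than (that is, less than or equal to) the observed one, which is a slightly more conservative reading than counting exact matches.

```python
import random

def permuted_average_ranks(rank_columns, n_permutations=1000, seed=0):
    """Shuffle each rank column independently and pool the row averages."""
    rng = random.Random(seed)
    pooled = []
    for _ in range(n_permutations):
        # rng.sample(col, len(col)) returns a shuffled copy of the column.
        shuffled = [rng.sample(col, len(col)) for col in rank_columns]
        pooled.extend(sum(row) / len(row) for row in zip(*shuffled))
    return pooled

def empirical_rate(observed_avg, pooled):
    """Fraction of permuted averages at least as good (low) as the observed one."""
    return sum(avg <= observed_avg for avg in pooled) / len(pooled)

# Rank columns (1), (2) and (3) for the four example TFs.
columns = [[1, 4, 5, 8], [3, 5, 4, 9], [2, 8, 2, 7]]
observed = [sum(row) / len(row) for row in zip(*columns)]
pooled = permuted_average_ranks(columns, n_permutations=1000)
rates = [empirical_rate(avg, pooled) for avg in observed]
```

With only four TFs the estimates are coarse; the approach becomes informative when each column holds ranks for dozens or hundreds of TFs.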
Getting data from SRA
What if you want to know something from a published dataset, but they’ve only provided the raw data on SRA?
Go to: http://www.ncbi.nlm.nih.gov/sra
Find an experiment by searching, e.g. “encode h1-hesc h3k4me3”
Click on the name to the left of the smaller file (1.9M) & then on the downloads tab.
Right click on the ftp link for the run & copy the link location.
Open PuTTY & log in to your account at cluster.uit.tufts.edu
Go to your /cluster/shared/[userID]/chip directory & do:
wget [pasted URL]
Decoding the .sra format
The SRR227387.sra file you now have is in a special file format, but it does have all the original .fastq information in it.
To get that info do:
bsub /cluster/tufts/cbi*/Ch*/ESC*/sra*/bin/fastq-dump SRR227387.sra
[fastq-dump is part of a package of programs for handling .sra files that you can download, unpack & run immediately from your shared directory, at least as far as simple programs like fastq-dump are concerned]
This gives you the same .fastq format you’re familiar with.
Use head to confirm the format, but then you might as well delete the file with rm so as not to clutter up the cluster.
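If you would rather check the format programmatically than by eye, a few lines of Python can validate the first records. This is a sketch only: the function name is hypothetical, it assumes the simple four-lines-per-record FASTQ layout, and the output file name SRR227387.fastq in the usage comment is an assumption about what fastq-dump wrote.

```python
def read_fastq_records(lines, limit=4):
    """Yield (name, sequence, quality) for the first `limit` FASTQ records."""
    it = iter(lines)
    for _ in range(limit):
        header = next(it, None)
        if header is None:
            return  # fewer than `limit` records in the input
        seq, plus, qual = next(it, ""), next(it, ""), next(it, "")
        header, seq, plus, qual = (x.rstrip("\r\n") for x in (header, seq, plus, qual))
        # A record is: @name / sequence / + / one quality character per base.
        if not header.startswith("@") or not plus.startswith("+") or len(seq) != len(qual):
            raise ValueError(f"not a valid FASTQ record: {header!r}")
        yield header[1:], seq, qual

# Usage on the cluster (file name assumed):
#     with open("SRR227387.fastq") as fh:
#         for name, seq, qual in read_fastq_records(fh):
#             print(name, len(seq))
example = list(read_fastq_records(["@read1\n", "ACGT\n", "+\n", "IIII\n"]))
```

Only the first few records are read, so the check is fast even on a multi-gigabyte file.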
After this week you are now ready to do any analysis you want on this data, from mapping reads to the genome (w/ bowtie) to peak calling (w/ MACS), to TFBS analysis.
“Liftover” programs to convert between genomes & builds
Several useful tools for this in Cistrome/Galaxy (under Liftover/Others):
• Convert between RefSeq, Gene Symbols to Entrez IDs using Bioconductor.
• Liftover Wig Files: Liftover wig files
• [Galaxy] Convert genome coordinates between assemblies and genomes
• Extract data from Wiggle: Extract data for certain chromosome from a wiggle file
• Extract data from Bed: Extract data for certain chromosome from a BED file
In the UCSC genome browser:
• Tools -> Liftover
• Choose the starting genome/build & the one you want to convert to.
• Upload a .bed file w/ the ranges you want & hit go (only works for bed files… may work with bedGraph, although I haven’t confirmed this)
Don’t be intimidated!
There’s nothing to prevent you from installing a program you want to run in your cluster account.
Before you begin, though, type “module available” to see if it’s already installed as a module.
Also go to /cluster/tufts/ngsp/ngsp/ to see if it’s installed there.
• Read the documentation from the creator’s lab, download, unzip &/or unpack the file, read the INSTALL or README files included, & give it a try.
• You may need to be running a specific version of perl or python, etc. If so, check “module available” to see if it’s installed on the cluster & use “module load [name]” to add it.
• You may also need to set system variables using “export VARIABLE=$VARIABLE:/new/path”. README files should tell you enough to know what to try.
• If you get stuck, the cluster support folks are friendly & helpful (and respond moderately fast). Contact them at: [email protected].
A different integrated package of tools to run in UNIX
http://biowhat.ucsd.edu/homer/ngs/index.html
HOMER
Software for motif discovery and next-gen sequencing analysis
• Mapping to the genome (NOT performed by HOMER, but important to understand)
• Creation of Tag directories, quality control, and normalization (makeTagDirectory)
• UCSC visualization (makeUCSCfile, makeBigWig.pl)
• Peak finding / Transcript detection / Feature identification (findPeaks)
• Motif analysis (findMotifsGenome.pl)
• Annotation of Peaks (annotatePeaks.pl)
• Quantification of Data at Peaks/Regions in the Genome/Histograms and Heatmaps (annotatePeaks.pl)
• Quantification of Transcripts (analyzeRNA.pl)
Additional analysis strategies:
• General sequence manipulation tools (homerTools)
• Miscellaneous Tools for Sharing Data between programs, etc. (tagDir2bed.pl, bed2pos.pl, pos2bed.pl ...)
• Finding overlapping or differentially bound peaks (mergePeaks, getDifferentialPeaks)
• ChIP-Seq analysis automation (analyzeChIP-Seq.pl)
• Description of file formats
Could be very useful… & with (only a bit of) luck, you’ll be able to install & run them yourself.
Installing a program in R
Check out the Key R Commands link at http://sites.tufts.edu/cbi/resources/chip-seq/
This is not an introduction to programming in R! Instead it gives basic instructions for how to:
1) install & run R packages that may be needed for your research,
2) move data files into R,
3) perform simple edits on this data that may be required by the package,
& 4) output your results.
Note: I find that the documentation for R packages is generally quite good.
Mastering simple UNIX tools
Find out more at: http://sites.tufts.edu/cbi/resources/rna-seq-course/unix-resources/
find, awk, grep, sort, sed & more
One-line commands to let you search and manipulate large data files w/o writing a program or trying to use the kludgy and limited tools in Galaxy.
Programming: Get your feet wet
Perl Tutorials (learn.perl.org/tutorials/): Many tutorials are available if you are interested in learning Perl. These tutorials are introductions.
Beginning Perl (free) (www.perl.org/books/beginning-perl/): This book is for those new to programming who want to learn with Perl.
A ton of Perl programs for you to use/adapt/modify: http://www.bioperl.org/wiki/Main_Page
For learning R: Check out Josh’s links at: http://sites.tufts.edu/cbi/resources/rna-seq-course/r-resources/
Also check out my notes on using R (specifically geared to the minimum you need to install & use existing programs) & a brief reference sheet on Perl at http://sites.tufts.edu/cbi/resources/chip-seq/
Look at examples, check the web…
If you’re looking for a command in UNIX, R, Perl, Python, etc. do a Google search (for R add “statistical” to your search to specify what you mean).
If you’re wondering how to get a program to do something, look at other programs & see how they did it.
You don’t need to memorize the language, beyond a few basics, just look at what you (or someone else) did before & copy it.
Questions?
What would you like to explore?
What’s the next bioinformatics challenge in your research?
Course evaluation forms…