Processing, integrating and analysing chromatin immunoprecipitation followed by sequencing (ChIP-seq) data
DATA SCIENCE AND MACHINE LEARNING FOR BIOINFORMATICS
ALEX ESSEBIER
The regulatory system
Dynamic, sequential gene expression changes required for normal development
• Histone modifications
• Chromatin accessibility
• Chromatin state
• Sequence features
• Transcription factor binding
• Cis-regulatory modules
• Gene expression
• Distal or proximal interactions
High throughput sequencing (HTS) data
• Chromatin immunoprecipitation (ChIP-seq)
– To determine protein binding sites in vivo
– Transcription factor binding & histone modifications
• DNase I hypersensitivity (DHS) (also ATAC-seq and FAIRE-seq)
– Chromatin accessibility
– Protein binding footprints
• RNA-seq and cap analysis of gene expression (CAGE)
– Gene expression
– Alternative transcription start site usage
– Changes in expression (temporal or perturbation)
• Chromosome conformation capture
– DNA looping
– Long distance enhancer/promoter interactions
– 3C, 4C, 5C, Hi-C, CHi-C, ChIA-PET…
ChIP-seq
GENERATING AND INTERPRETING CHIP-SEQ DATA
ChIP-seq experiment
Wet lab
• Extract DNA fragments bound by protein of interest
ChIP-seq sequence analysis
• Sequence depth
– Depends on size of genome and type of protein
– E.g. mammalian transcription factor: 20 million reads
• Sequence quality
– Read quality summary
– Unusual base pair patterns
– Adapter sequences
– E.g. FastQC tool
Sequencing & quality control
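The read-level checks listed above can be illustrated with a small pure-Python sketch. It is not FastQC and does not reproduce its reports; it only shows the kind of summary involved. The function names, the mean-quality threshold of 20, the Phred+33 encoding and the adapter prefix used in the example are assumptions made for illustration.

```python
def parse_fastq(text):
    """Yield (name, sequence, quality) from FASTQ text with 4 lines per record."""
    lines = text.strip().splitlines()
    for i in range(0, len(lines), 4):
        yield lines[i][1:], lines[i + 1], lines[i + 3]


def mean_phred(quality, offset=33):
    """Mean Phred score of one read, assuming Phred+33 encoding."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def qc_summary(text, adapter, min_mean_quality=20):
    """Count reads, low-quality reads and reads containing the adapter."""
    summary = {"reads": 0, "low_quality": 0, "with_adapter": 0}
    for _name, sequence, quality in parse_fastq(text):
        summary["reads"] += 1
        if mean_phred(quality) < min_mean_quality:
            summary["low_quality"] += 1
        if adapter in sequence:
            summary["with_adapter"] += 1
    return summary


# Two made-up reads: one clean, one low quality with adapter read-through.
ADAPTER = "AGATCGGAAGAGC"  # illustrative adapter prefix, an assumption
read2_seq = "ACGT" + ADAPTER
example_fastq = "\n".join([
    "@read1", "ACGTACGTAC", "+", "I" * 10,
    "@read2", read2_seq, "+", "#" * len(read2_seq),
])
print(qc_summary(example_fastq, ADAPTER))
```

Records are read in fixed groups of four lines because a quality string may itself begin with "@".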
ChIP-seq alignment
Alignment & quality control
• Alignment summary
– Generated by alignment tools
– Uniquely aligned reads
• Read distribution quality
– ChIP-seq peaks create a bimodal pattern (forward- and reverse-strand reads pile up on either side of the binding site)
– Strand cross correlation analysis (SCCA)
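The idea behind strand cross correlation can be sketched as follows: count read starts per position on each strand, shift the reverse-strand profile towards the forward-strand profile, and record the Pearson correlation at each shift. The shift with the highest correlation approximates the fragment length. This is a minimal pure-Python sketch under simplifying assumptions (one short chromosome, positions inside `[0, length)`, made-up read positions); the function names are hypothetical and dedicated tools should be used for real data.

```python
def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def strand_cross_correlation(fwd_positions, rev_positions, length, max_shift):
    """Correlation between strand profiles for every shift in 0..max_shift."""
    fwd = [0] * length
    rev = [0] * length
    for pos in fwd_positions:
        fwd[pos] += 1
    for pos in rev_positions:
        rev[pos] += 1
    # At a given shift, forward position i is compared with reverse position i + shift.
    return [
        pearson(fwd[: length - shift], rev[shift:])
        for shift in range(max_shift + 1)
    ]


# Synthetic example: three binding sites, reads 50 bp either side of each site.
sites = [300, 900, 1500]
fwd_reads = [site - 50 for site in sites] * 5
rev_reads = [site + 50 for site in sites] * 5
scores = strand_cross_correlation(fwd_reads, rev_reads, length=2000, max_shift=200)
best_shift = max(range(len(scores)), key=scores.__getitem__)
print("estimated fragment length:", best_shift)
```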
Peak calling
GENERATING AND EXPLORING CHIP-SEQ PEAKS
Basic principles of peak calling
• Sample: exposed to antibody
• Input: no antibody exposure
• Sample is compared to input to generate peaks with statistical significance
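The sample-versus-input comparison can be sketched as a per-window test: scale the input count to the sample's sequencing depth, treat it as the expected background, and ask how surprising the sample count is under a Poisson model. This is a deliberately simplified illustration and not the algorithm of any particular peak caller; the window counts, the `min_lambda` floor and the `alpha` cut-off are assumptions, and no multiple-testing correction is applied.

```python
import math


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), summed directly from the tail."""
    if k <= 0:
        return 1.0
    # Start at P(X = k), computed in log space to avoid overflow.
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    i = k
    while True:
        total += term
        i += 1
        term *= lam / i  # P(X = i) from P(X = i - 1)
        if i > lam and term <= total * 1e-15:
            break
    return min(1.0, total)


def call_enriched_windows(sample_counts, input_counts, sample_total,
                          input_total, alpha=1e-5, min_lambda=1.0):
    """Return indices of windows where the sample is enriched over input."""
    scale = sample_total / input_total  # depth normalisation
    enriched = []
    for index, (sample, control) in enumerate(zip(sample_counts, input_counts)):
        expected = max(control * scale, min_lambda)
        if poisson_upper_tail(sample, expected) < alpha:
            enriched.append(index)
    return enriched


# Made-up counts for four windows; only the third stands out from input.
sample_windows = [2, 3, 40, 2]
input_windows = [2, 3, 4, 2]
print(call_enriched_windows(sample_windows, input_windows, 1000, 1000))
```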
Peak calling tools
• Everyone seems to have built their own!
• OMICtools reports 86 ChIP-seq tools
– 51 in 2016
• In-house tools
• Find or adapt a tool that fits your needs
• Potentially, develop your own
Be aware! Not all tools are well documented or tested
https://xkcd.com/927/
So, how do I choose a tool?
• Choice of tool depends on problem
– Based on experiment itself
– Based on sequence data and read distribution
Combining tools and replicates
Peak caller    Total peaks    Unique peaks
MACS2          42,536         12%
HOMER          45,044         19%
SPP            19,474         0.7%
• Use multiple peak callers
– Combine outcomes (e.g. common peaks)
– Take advantage of strengths of different peak callers
– Bolster weaknesses
• Biological replicates can vary significantly
– Call peaks for replicates individually
– Compare/overlap to achieve ‘gold standard’
– Comparisons are dominated by the poor replicate (e.g. 10,000 peaks vs 3,000 peaks)
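Finding common peaks between two peak callers, or between two replicates, comes down to interval overlap. The sketch below keeps the peaks from one set that overlap at least one peak in the other. Peaks are assumed to be `(chromosome, start, end)` tuples with half-open coordinates; the function names and the example coordinates are hypothetical.

```python
from bisect import bisect_left


def merge_by_chrom(peaks):
    """Merge overlapping or touching peaks into sorted intervals per chromosome."""
    merged = {}
    for chrom, start, end in sorted(peaks):
        intervals = merged.setdefault(chrom, [])
        if intervals and start <= intervals[-1][1]:
            intervals[-1][1] = max(intervals[-1][1], end)
        else:
            intervals.append([start, end])
    return merged


def peaks_supported_by(peaks_a, peaks_b):
    """Return the peaks in peaks_a that overlap at least one peak in peaks_b."""
    merged_b = merge_by_chrom(peaks_b)
    starts_b = {chrom: [iv[0] for iv in ivs] for chrom, ivs in merged_b.items()}
    supported = []
    for chrom, start, end in peaks_a:
        starts = starts_b.get(chrom, [])
        # Index of the last merged interval that begins before this peak ends.
        index = bisect_left(starts, end) - 1
        if index >= 0 and merged_b[chrom][index][1] > start:
            supported.append((chrom, start, end))
    return supported


# Made-up peak sets standing in for the output of two peak callers.
caller_a = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
caller_b = [("chr1", 150, 250), ("chr2", 300, 400)]
common = peaks_supported_by(caller_a, caller_b)
unique_fraction = 1 - len(common) / len(caller_a)
print(common, round(unique_fraction, 2))
```

The same function can be applied to peak sets called separately on each biological replicate.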
Peak quality control
• Number of peaks
• Motif enrichment
• Read coverage
– Fraction of reads in peaks (FRiP) > 1%
– Generally observe > 10%
– E.g. 6 of 50 reads in peaks -> FRiP of 12%
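FRiP is the number of reads that fall inside called peaks divided by the total number of reads. A minimal sketch, assuming each read is reduced to a single `(chromosome, position)` pair and peaks are half-open `(chromosome, start, end)` tuples; the function name and the example data are hypothetical.

```python
def fraction_of_reads_in_peaks(reads, peaks):
    """Fraction of reads whose position lies inside any peak."""
    peaks_by_chrom = {}
    for chrom, start, end in peaks:
        peaks_by_chrom.setdefault(chrom, []).append((start, end))
    in_peaks = sum(
        1
        for chrom, position in reads
        if any(start <= position < end
               for start, end in peaks_by_chrom.get(chrom, []))
    )
    return in_peaks / len(reads)


# Reproduce the 6-of-50 example: 6 reads inside one peak, 44 elsewhere.
example_peaks = [("chr1", 100, 200)]
example_reads = [("chr1", 100 + i) for i in range(6)]
example_reads += [("chr1", 1000 + i) for i in range(44)]
frip = fraction_of_reads_in_peaks(example_reads, example_peaks)
print(f"FRiP = {frip:.0%}")
```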
5T0U (PDB), Hashimoto et al. 2017
CTCF and DNA
ChIP-seq complications
• ChIP-seq generates peaks for all of these events
Integrating data
COMBINING DATA SETS TO IMPROVE OUTCOMES
Data integration
• Experiments capture dependent regulatory events
– ChIP-seq: regulatory elements
– DHS: chromatin accessibility
– RNA-seq: expression patterns
• Consider multiple datasets to:
– Improve confidence
– Improve understanding
– Support hypotheses
https://marketoonist.com/2014/01/big-data.html
Supporting histone modifications
• Explore chromatin environment
– Layer/overlap histone modifications
– DHS: chromatin accessibility
Supporting transcription factors
• Transcription factors preferentially bind open/active chromatin and regulatory regions
• Alter expression of genes
• RNA-seq on knock-out of transcription factor
– Identify genes with significant change in expression
System complexity
• Only a small number of differentially expressed genes are bound by the target transcription factor
• System redundancy
• Indirect changes in expression
Ma et al. 2014
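Comparing knock-out expression changes with binding data is a set overlap: which differentially expressed genes also have a peak assigned to them? The sketch below reports that overlap and a one-sided hypergeometric p-value for whether it is larger than expected by chance. The gene names, the universe size and the function name are hypothetical, and how peaks are assigned to genes is outside the scope of the example.

```python
from math import comb


def hypergeom_upper_tail(k, universe, bound, drawn):
    """P(X >= k) when drawing `drawn` genes from `universe`, `bound` of which are bound."""
    upper = min(bound, drawn)
    favourable = sum(
        comb(bound, i) * comb(universe - bound, drawn - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(universe, drawn)


# Made-up gene sets: 10 genes in total, 3 bound, 4 differentially expressed.
all_genes = {f"gene{i}" for i in range(10)}
bound_genes = {"gene1", "gene7", "gene8"}
differential_genes = {"gene0", "gene1", "gene2", "gene3"}

direct_targets = differential_genes & bound_genes
fraction_bound = len(direct_targets) / len(differential_genes)
p_value = hypergeom_upper_tail(
    len(direct_targets), len(all_genes), len(bound_genes), len(differential_genes)
)
print(sorted(direct_targets), fraction_bound, round(p_value, 3))
```

Differentially expressed genes outside `direct_targets` are candidates for the indirect effects and redundancy listed above.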
More complicated relationships
• Overlapping is simple
• Next steps: pattern identification, prediction, classification
• Machine learning approaches
• Hidden Markov model to classify epigenetic state
• Bayesian network to predict transcription factor binding events
• Random forest of decision trees to predict long distance enhancer/promoter interactions
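As an illustration of the hidden Markov model idea, the toy sketch below decodes the most likely sequence of two chromatin states from the presence or absence of a single mark in consecutive genomic bins, using the Viterbi algorithm. Real chromatin-state models work with many marks at once and learn their parameters from data; here the state names, the probabilities and the observations are all made up for illustration.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely state path for a discrete HMM, computed in log space."""
    scores = [{
        state: math.log(start_p[state]) + math.log(emit_p[state][observations[0]])
        for state in states
    }]
    back_pointers = []
    for obs in observations[1:]:
        current, pointers = {}, {}
        for state in states:
            # Best previous state to arrive from.
            previous = max(
                states,
                key=lambda p: scores[-1][p] + math.log(trans_p[p][state]),
            )
            current[state] = (
                scores[-1][previous]
                + math.log(trans_p[previous][state])
                + math.log(emit_p[state][obs])
            )
            pointers[state] = previous
        scores.append(current)
        back_pointers.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for pointers in reversed(back_pointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


STATES = ["active", "inactive"]
START = {"active": 0.5, "inactive": 0.5}
TRANSITION = {
    "active": {"active": 0.9, "inactive": 0.1},
    "inactive": {"active": 0.1, "inactive": 0.9},
}
# Probability of observing the mark (1) or not (0) in each state.
EMISSION = {
    "active": {1: 0.8, 0: 0.2},
    "inactive": {1: 0.1, 0: 0.9},
}

mark_present = [1, 1, 1, 0, 0, 0, 0, 1, 1, 1]
state_path = viterbi(mark_present, STATES, START, TRANSITION, EMISSION)
print(state_path)
```

Because the transition probabilities favour staying in the same state, a single bin without the mark would not be enough to switch state; a run of several such bins is.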
Take home messages
• Understand your data and how best to use it
• Quality control!
• Peak calling
– Use multiple tools where possible
– Keep up to date with advances
• Data integration
– Combine resources and data to gain a more complete picture
Resources
• Data/figures
– Klisch, T. J., Xi, Y., Flora, A., Wang, L., Li, W., & Zoghbi, H. Y. (2011). In vivo Atoh1 targetome reveals how a proneural transcription factor regulates cerebellar development. Proceedings of the National Academy of Sciences, 108(8), 3288-3293.
– Frank, C. L., Liu, F., Wijayatunge, R., Song, L., Biegler, M. T., Yang, M. G., ... & West, A. E. (2015). Regulation of chromatin accessibility and Zic binding at enhancers in the developing cerebellum. Nature Neuroscience, 18(5), 647-656.
– Ma, S., Kemmeren, P., Gresham, D., & Statnikov, A. (2014). De-novo learning of genome-scale regulatory networks in S. cerevisiae. PLoS ONE, 9(9), e106479.
– Hashimoto, H., Wang, D., Horton, J. R., Zhang, X., Corces, V. G., & Cheng, X. (2017). Structural basis for the versatile and methylation-dependent binding of CTCF to DNA. Molecular Cell, 66(5), 711-720.
• Useful papers
– Bailey, T., Krajewski, P., Ladunga, I., Lefebvre, C., Li, Q., Liu, T., ... & Zhang, J. (2013). Practical guidelines for the comprehensive analysis of ChIP-seq data. PLoS Computational Biology, 9(11), e1003326.
– Landt, S. G., Marinov, G. K., Kundaje, A., Kheradpour, P., Pauli, F., Batzoglou, S., ... & Chen, Y. (2012). ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Research, 22(9), 1813-1831.
– Farnham, P. J. (2009). Insights from genomic profiling of transcription factors. Nature Reviews Genetics, 10(9), 605-616.
– Zhang, Y., Liu, T., Meyer, C. A., Eeckhoute, J., Johnson, D. S., Bernstein, B. E., ... & Liu, X. S. (2008). Model-based analysis of ChIP-Seq (MACS). Genome Biology, 9(9), 1.
– Heinz, S., Benner, C., Spann, N., Bertolino, E., Lin, Y. C., Laslo, P., ... & Glass, C. K. (2010). Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Molecular Cell, 38(4), 576-589.
ChIP-seq complications
• Possible to observe multiple states at one genomic location
• False negatives
– Can’t detect small sub-populations
• False positives
– General non-specific chromatin being pulled down
– Bias not removed by input
• Replicates can resolve variation
Transcription Factors
• Confirm in vitro and in silico results
– Overlapping peaks with motifs
• Identify consensus motif
– For transcription factors which do not have an existing/known motif
– To identify variations in motif
• Differential peak binding
– To identify differences in binding patterns
– Compare cell types or time points
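Overlapping peaks with a known motif can be sketched as a scan of each peak sequence for the consensus on either strand. The example below converts an IUPAC consensus into a regular expression and reports the fraction of peak sequences that contain it. It does not compare against background sequences, so it is not an enrichment test on its own. The motifs, sequences and function names are hypothetical, and only a subset of IUPAC codes is supported.

```python
import re

# Subset of IUPAC nucleotide codes and their complements.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}
COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A",
    "R": "Y", "Y": "R", "W": "W", "S": "S",
    "K": "M", "M": "K", "N": "N",
}


def motif_regex(motif):
    """Compile a regex matching the motif or its reverse complement."""
    forward = "".join(IUPAC[base] for base in motif)
    reverse = "".join(IUPAC[COMPLEMENT[base]] for base in reversed(motif))
    return re.compile(f"{forward}|{reverse}")


def fraction_with_motif(sequences, motif):
    """Fraction of sequences containing the motif on either strand."""
    pattern = motif_regex(motif.upper())
    hits = sum(1 for seq in sequences if pattern.search(seq.upper()))
    return hits / len(sequences)


# Made-up peak sequences and two illustrative consensus motifs.
peak_sequences = ["TTCAGCTGAA", "GGGGGGGG", "CCTTATCGG"]
print(fraction_with_motif(peak_sequences, "CANNTG"))
print(fraction_with_motif(peak_sequences, "GATAA"))
```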
Histone Modifications
• Epigenetic analysis
– Generate epigenetic profiles
• Identify chromatin states genome wide
– E.g. ChromHMM
• Identify regulatory modules
– E.g. promoters or enhancers
• Differential peak binding
– Identify differences in epigenetic patterns
Long distance regulation
• Chromosome conformation capture reports different types of interactions
• Histone modifications can identify enhancer-promoter interactions
• Filter out structural interactions