EMBNET course
Bioinformatics of transcriptional regulation
Jan 31 2008
Christoph Schmid
ChIP-seq technology and data formats
Copyright restrictions may apply.
Hawkins, R. D. et al. Hum. Mol. Genet. 2006 15:R1-7R; doi:10.1093/hmg/ddl043
Histones and their modifications as epigenetic markers
Human:5 histone families with over 50 genes in total
Chromatin ImmunoPrecipitation
Solexa System Workflow: target customers: scientist?
Sequencing Technology Overview
Amplification
Sequencing
Four-color DNA sequencing by synthesis using cleavable fluorescent nucleotide reversible terminators
Ju, Jingyue et al. (2006) Proc. Natl. Acad. Sci. USA, 103, 19635-19640
Data analysis
• Image acquisition
• Base calling
  quality scores (?)
• Sequence analysis:
  short reads (~30 bp) with >2% error rate
  map sequence reads to the (assembled) genomes
  algorithm 'Eland': unique match to genome with up to 2 mismatches in tag sequence
=> project-specific downstream analysis
'Solexa Analysis Pipeline'
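The "unique match with up to 2 mismatches" rule can be illustrated with a brute-force scan. This is only a sketch of the rule, not the Eland algorithm itself: the function names are hypothetical, the genome is a toy string, and only the forward strand is scanned.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def classify_tag(tag, genome, max_mismatches=2):
    """Return (match_code, counts, position) for a tag.

    counts[d] is the number of genome windows matching with exactly d
    mismatches. match_code is 'U0', 'U1' or 'U2' when the best match is
    unique, otherwise None. position is 1-based, as in Eland output.
    Hypothetical helper: forward strand only, brute force.
    """
    counts = [0] * (max_mismatches + 1)
    first_pos = [None] * (max_mismatches + 1)
    for i in range(len(genome) - len(tag) + 1):
        d = hamming(tag, genome[i:i + len(tag)])
        if d <= max_mismatches:
            counts[d] += 1
            if first_pos[d] is None:
                first_pos[d] = i + 1
    for d, n in enumerate(counts):
        if n == 1:
            return f"U{d}", counts, first_pos[d]
        if n > 1:
            return None, counts, None  # best match is not unique
    return None, counts, None  # no match within the mismatch limit


toy_genome = "ACGTACGTTTGACCA"
print(classify_tag("TTGAC", toy_genome))
print(classify_tag("TTGAG", toy_genome))
```

A real mapper indexes the genome instead of scanning every window; the sketch only shows how the counts per mismatch level decide between a unique and a non-unique best match.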
Different systems
Solexa
• Company: Solexa + Lynx -> Illumina
• Method of amplification: all 4 nucleotides added at once
• Visualization: color produced by fluorescence
• Length of contiguous sequence and final read lengths: Solexa templates = ~35 bases
1. Barski, A., et al. (2007) High-resolution profiling of histone methylations in the human genome. Cell, 129, 823-837.
454
• Company: 454 Life Sciences Corp., Connecticut
• Method of amplification: only one type of nucleotide is added at a time
• Visualization: light produced from chemical reactions (PPi, sulfurylase, ATP, Luciferase, Luciferin, light emission)
• Length of contiguous sequence and final read lengths: 454 templates = ~100 bases
1. Margulies, M., et al. (2005) Genome sequencing in microfabricated high-density picolitre reactors. Nature, 437, 376-380.
Offline Data Analysis Requirements
-> crucial to plan for and manage the massive amounts of genetic information
Experiment Data Volume (>20 Million sequence tags)
Cycles per run   Run Time (hrs)   Raw Data (Gb)   Results Data (Gb)
18               42.0             360             270
26               60.7             520             390
36               84.0             720             540
-> need for network storage resources
Data transfer -> Network Bandwidth
Data archival -> Retention Policy: which data are kept for how long?
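The table scales almost linearly with the number of cycles, which makes planning estimates easy. The sketch below derives the per-cycle figures from the three rows and estimates a transfer time. Two assumptions are not in the slides: "Gb" is read as gigabytes, and the 100 Mbit/s link speed is only an example value.

```python
# Rows from the table: (cycles, run time in hours, raw data, results data)
RUNS = [(18, 42.0, 360, 270), (26, 60.7, 520, 390), (36, 84.0, 720, 540)]


def per_cycle(runs):
    """Hours, raw data and results data per sequencing cycle for each row."""
    return [(c, round(h / c, 2), raw / c, res / c) for c, h, raw, res in runs]


def transfer_hours(size_gb, bandwidth_mbit_s):
    """Hours needed to move size_gb gigabytes over a link of the given speed.

    Assumes decimal units (1 gigabyte = 8000 megabits) and ignores
    protocol overhead.
    """
    return size_gb * 8000 / bandwidth_mbit_s / 3600


for cycles, hours, raw, results in per_cycle(RUNS):
    print(cycles, hours, raw, results)

# Example: raw data of a 36-cycle run over an assumed 100 Mbit/s link
print(transfer_hours(720, 100))
```

Under these assumptions each cycle costs about 2.3 hours, 20 Gb of raw data and 15 Gb of results data, so storage and bandwidth needs can be extrapolated to other run lengths.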
Data formats
-> indicate millions of positions on the genome
• Eland output
• BED format *
• sga format
• wiggle format *
* display in UCSC genome browser
Eland format
>1-1-201-448 GGATAATAGGCCTTTATCAGACATGTT U0 1 2 1 Homo_sapiens.NCBI36.42.dna.chromosome.6.fa 150586572 F ..
>1-1-199-437 GGTTGCATTGGTGCAATTGTT U1 0 1 3 Homo_sapiens.NCBI36.42.dna.chromosome.6.fa 50704915 F .. 19A
>1-1-94-309 GTTTATCCAATTTTGCTTTTAGTATCA U0 1 0 0 Homo_sapiens.NCBI36.42.dna.chromosome.16.fa 45926410 R ..
>1-1-812-242 GTTATTGGTTATGGAGAGCTTATGTTA U0 1 0 0 Homo_sapiens.NCBI36.42.dna.chromosome.10.fa 104455324 F ..
>1-1-835-525 GGTTTTCTGGAGCAAAATCTCTTGAGC U0 1 0 0 Homo_sapiens.NCBI36.42.dna.chromosome.7.fa 25555335 R ..
>1-1-164-118 GATATGCATTGACTAACAGAAAACCAC U0 1 0 0 Homo_sapiens.NCBI36.42.dna.chromosome.4.fa 116410190 F ..
>1-1-244-71 GATATTAAAGACACCTTATTATTTGGA U0 1 0 0 Homo_sapiens.NCBI36.42.dna.chromosome.5.fa 16286199 F ..
>1-1-48-511 GCTGAATTCTTCGAAGTTAATTCTTTA U0 1 0 0 Homo_sapiens.NCBI36.42.dna.chromosome.13.fa 95075033 F ..
>1-1-797-577 GTTATCAAAAGATAGATATCATTTCTG U0 1 0 0 Homo_sapiens.NCBI36.42.dna.chromosome.1.fa 82693040 F ..
Eland output field definitions
1. tag AC (incremental integer)
2. Sequence
3. Type of match:
U0 - Best match found was a unique exact match.
U1 - Best match found was a unique 1-error match.
U2 - Best match found was a unique 2-error match.
4. Number of exact matches found.
5. Number of 1-error matches found.
6. Number of 2-error matches found.
Rest of fields are only seen if a unique best match was found (i.e. the match code in field 3 begins with "U").
7. chromosome number in which match was found.
8. Position of match (bases in file are numbered starting at 1).
9. Direction of match (F=forward strand, R=reverse).
10. How N characters in read were interpreted: ("."=not applicable, "D"=deletion, "I"=insertion).
Rest of fields are only seen in the case of a unique inexact match (i.e. the match code was U1 or U2).
11. Position and type of first substitution error (e.g. 12A: base 12 was A, not whatever it was in read).
12. Position and type of second substitution error, as above.
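The field definitions above translate directly into a small parser. This is a minimal sketch, assuming whitespace-separated fields as in the example lines; the `ElandHit` type and `parse_eland_line` function are hypothetical names, not part of the Solexa pipeline.

```python
from typing import NamedTuple, Optional, Tuple


class ElandHit(NamedTuple):
    tag_id: str                       # field 1, without the leading '>'
    sequence: str                     # field 2
    match_code: str                   # field 3, e.g. 'U0', 'U1', 'U2'
    n_exact: int                      # field 4
    n_one_error: int                  # field 5
    n_two_error: int                  # field 6
    target: Optional[str] = None      # field 7, only for unique matches
    position: Optional[int] = None    # field 8, 1-based
    strand: Optional[str] = None      # field 9, 'F' or 'R'
    n_interpretation: Optional[str] = None  # field 10
    substitutions: Tuple[str, ...] = ()     # fields 11-12, only U1/U2


def parse_eland_line(line: str) -> ElandHit:
    """Parse one Eland output line into an ElandHit record."""
    fields = line.split()
    if len(fields) < 6:
        raise ValueError(f"expected at least 6 fields, got {len(fields)}")
    hit = ElandHit(fields[0].lstrip(">"), fields[1], fields[2],
                   int(fields[3]), int(fields[4]), int(fields[5]))
    # Fields 7-12 are present only when the best match is unique.
    if hit.match_code.startswith("U") and len(fields) >= 10:
        hit = hit._replace(target=fields[6], position=int(fields[7]),
                           strand=fields[8], n_interpretation=fields[9],
                           substitutions=tuple(fields[10:12]))
    return hit


example = (">1-1-199-437 GGTTGCATTGGTGCAATTGTT U1 0 1 3 "
           "Homo_sapiens.NCBI36.42.dna.chromosome.6.fa 50704915 F .. 19A")
print(parse_eland_line(example))
```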
BED format
chrom chromStart chromEnd name score strand
chr20 2971283 2971306 U0 0 +
chr7 158080784 158080807 U0 0 -
chr19 43423764 43423787 U0 0 -
chr10 112107693 112107716 U0 0 -
chrX 13440269 13440292 U0 0 -
chr12 122107256 122107279 U0 0 +
chr8 42315863 42315886 U0 0 -
chr8 120598412 120598435 U0 0 -
chr6 33396062 33396085 U0 0 -
chr1 19441013 19441036 U0 0 -
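A unique Eland hit can be turned into a BED line like the ones above. The sketch below rests on assumptions that the slides do not state: that the Eland position is the leftmost base of the match on the forward strand for both F and R hits, and that the chromosome name can be taken from the `chromosome.<name>.fa` part of the file name. BED itself uses 0-based, half-open coordinates, so the 1-based Eland position is shifted by one. The function name is hypothetical.

```python
import re


def eland_to_bed(target, position, read_length, direction, match_code, score=0):
    """Format one unique Eland hit as a tab-separated BED line.

    target      sequence file name from Eland field 7
    position    1-based match position from Eland field 8
    read_length length of the tag sequence
    direction   'F' or 'R' from Eland field 9
    """
    m = re.search(r"chromosome\.([^.]+)\.fa$", target)
    if m is None:
        raise ValueError(f"cannot derive chromosome name from {target!r}")
    if direction not in ("F", "R"):
        raise ValueError(f"unknown direction {direction!r}")
    chrom = "chr" + m.group(1)
    start = position - 1          # BED start is 0-based
    end = start + read_length     # BED end is exclusive
    strand = "+" if direction == "F" else "-"
    return f"{chrom}\t{start}\t{end}\t{match_code}\t{score}\t{strand}"


print(eland_to_bed("Homo_sapiens.NCBI36.42.dna.chromosome.6.fa",
                   150586572, 27, "F", "U0"))
```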
sga format
Seq-name feature position Strand signal intensity (tags)
NC_000001.9 stim 185 - 1
NC_000001.9 stim 245 + 1
NC_000001.9 stim 246 + 1
NC_000001.9 stim 248 + 1
NC_000001.9 stim 557 - 2
NC_000001.9 stim 696 - 1
NC_000001.9 stim 6106 + 1
NC_000001.9 stim 6319 - 1
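In the sga lines above, tags that share sequence, position and strand are collapsed into one line with a count. A minimal sketch of that aggregation follows; the function name is hypothetical, and which base of a tag serves as its position is left to the caller because the slides do not specify it.

```python
from collections import Counter


def tags_to_sga(tags, feature):
    """Collapse (seq_name, position, strand) tags into sga-style lines.

    Identical tags are counted, and the output is sorted by sequence
    name, position and strand (an assumed ordering).
    """
    counts = Counter(tags)
    return [f"{seq}\t{feature}\t{pos}\t{strand}\t{n}"
            for (seq, pos, strand), n in sorted(counts.items())]


tags = [
    ("NC_000001.9", 557, "-"),
    ("NC_000001.9", 185, "-"),
    ("NC_000001.9", 557, "-"),
    ("NC_000001.9", 245, "+"),
]
for line in tags_to_sga(tags, "stim"):
    print(line)
```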
wiggle format
track type=wiggle_0 name="STAT1_stim" description="STAT1_stim_chr1" visibility=full autoScale=on viewLimits=0:10 color=0,200,100 maxHeightPixels=100:50:20 graphType=bar priority=30
fixedStep chrom=chr1 start=115 step=100 span=100
1
1
2
1
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
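A fixedStep wiggle block can be produced by summing tag counts in consecutive windows. The sketch below is an illustration under assumptions: both strands are pooled, the function name is hypothetical, and the input positions are taken from the sga example, so the output is not meant to reproduce the values of the STAT1 track shown here.

```python
def to_fixed_step(chrom, sites, start, step, n_windows):
    """Sum (position, count) pairs into fixedStep wiggle lines.

    Window k covers positions start + k*step .. start + (k+1)*step - 1.
    Sites outside the covered range are ignored.
    """
    values = [0] * n_windows
    for pos, count in sites:
        if pos < start:
            continue
        k = (pos - start) // step
        if k < n_windows:
            values[k] += count
    header = f"fixedStep chrom={chrom} start={start} step={step} span={step}"
    return [header] + [str(v) for v in values]


sites = [(185, 1), (245, 1), (246, 1), (248, 1), (557, 2), (696, 1)]
for line in to_fixed_step("chr1", sites, start=115, step=100, n_windows=6):
    print(line)
```

Setting span equal to step makes each value cover its whole window, which matches the 100 bp step and span used in the track definition.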
ChIP-seq vs. ChIP-chip
ChIP-seq (Barski et al., 2007):
• primary CD4 T cells
• Solexa sequencing
• sum counts of tags in window of 100 bp
ChIP-chip (Heintzman et al., 2007):
• HeLa cells, PCR array and Nimblegen tiling array
• negative log base 10 of the p-value