Upload
others
View
0
Download
0
Embed Size (px)
Citation preview
DNA methylation and the analysis of CpG Islands in genomes
M. F. WojciechowskiMAT 35125 March 2005
Nucleotides
Base pairing
*
*
DNA methylation
In mammalian genomes, methylation takes place at cytosine residuesthat are located 5’ to a guanosine, i.e., 5’CpGin plants, the C in the sequence 5’CpNpG is often methylated
the dinucleotide CpG not randomly distributed in most genomes these C residues represent a target for covalent modification of DNA
Cytosine is one of two bases found commonly methylated in nature5’-methylcytosine (5mC) and N-6-methyladenosine
N-6-methyladenosine
Only 5mC is found inDNA of eukaryoticorganisms
3-D models of B DNA
minor groove
major groove
Methyl groups on 5mC protrude into the ‘major’ groove of DNA and havetwo main effects:1. displaces transcription factors that normally bind to DNA, and2. attracts methyl-binding proteins which are associated with gene silencing and chromatin compaction
DNA methylation
DNA methylation is essential for development in mammals, but despite25 yrs of research we still do not know exactly why
Post-replicational addition of methyl groups to 5’ position of cytosinealters the appearance of the major groove of DNA, and thusthe nature of its interactions w/ DNA-binding proteins
Such epigenetic ‘markers’ on DNA generated by methylation, lead tochanges in gene regulation, can be copied and inherited
Methylation of CpG-rich promoters is used to prevent transcriptionalinitiation and ensure the silencing of genes on the inactiveX-chromosome, imprinted genes, and silencing of intragenomicparasites, may also play significant role in carcinogenesis (viaaberrant, tumor-specific CpG island methylation, e.g. BRCA1)
CpG islands
CpG dinucleotides are the site of >90% of all methylation in mammalsbut are underrepresented in much of the genome
short regions of 0.5 to 4 kb in length - “CpG islands” - are rich inG+C content and the CpG dinucleotide
CpG islands are often found associated with proximal regions of genesesp. in promoters and first exons
Salient property of a CpG island is that it is unmethylated in the germline (and most somatic tissues), thus ensuring its continuedexistence in spite of mutagenic pressure on methylated Cs
CpG islands often function as strong promoters, proposed to functionas replication origins in mammals
CpG islands initially discovered on the basis of discordant patterns ofdigestion of genomic DNA by restriction enzyme ‘isoschizomers’ thatdiffered only in their sensitivity to cytosine methylation.
e.g., 5’-ATGTCCGGTACG-3’ 3’-TACAGGCCATGC-5’
5’-ATGTC CGGTACG-3’3’-TACAGGC CATGC-5’
Both enzymes MspI and HpaII digest(cut) DNA at the indicated“restriction” site, to producefragments with overlapping ends.
5’-ATGTCC*GGTACG-3’3’-TACAGGC*CATGC-5’
But HpaII is sensitive to methylation,and will not digest DNA at therestriction site if the C is methylatedat the 5’ position (5mC*)
HpaII or MspI
5’-ATGTC CGGTACG-3’3’-TACAGGC CATGC-5’
HpaII, MspI
Maintenance methylation & de novo methylation
T C G A G G A C G T A C
A G C T C C T G C A T G
CH3
5’
5’
5’
T C G A G G A C G T A C
A G C T C C T G C A T G
CH3
CH3
5’
5’
5’
T C G A G G A C G T A C
A G C T C C T G C A T G
CH3
5’
5’
T C G A G G A C G T A C
A G C T C C T G C A T G
CH3
5’
5’
5’T C G A G G A C G T A C
A G C T C C T G C A T G
CH3
5’
CH3
CH3
DNA replication
Methylation
CH3
de novo* methylation
CH3
CH3
CH3
parental strandnew strand
*Methyltransferases with de novo methylation activity have been detected
CpG islands
Restriction analyses indicated: 55-70% of CCGG sites are methylated in animal genomes ‘hypomethylated’ portion of genomes, those digested w/ HpaII
were esp. high in G+C content and CpG rich
Gardiner-Garden & Frommer (1987) - vertebrates in Genbank - proposedgenomic criteria for recognizing CpG islands:sequence must have:G+C content of 0.50 or greaterobserved to expected CpG dinucleotide ratio of 0.60 or greaterboth occurring within a sequence window of 200 bp or greater
But CpGs are vastly underrepresented in genomes compared to what would be expected by chance (0.625) - 0.23 in human, 0.19 in mouse
Why? Methylated cytosines are mutational hotspots - deamination of5mC leads to depletion of CpG dinucleotides through evolution
Outcomes of deamination
Uracil is seen as anabnormal base inDNA and is easilydetected andremoved by excisionrepair mechanisms.
Thymine is seen as anormal base in DNAand repaired bymismatch repairwhich is a veryinefficient mechanismof DNA repair.
CpG islands
Distribution (using previous definition): preferentially at promotor sites of ca. 40% of genes in human genome genes are expressed in most/all tissues, i.e., “housekeeping genes” promoters of some endogenous retroelements defined as CpG islands Genome analyses estimate 55-70% of HpaII sites are methylated in animals
ca. 50% of these are located within transposable elements (humans)while ca. 35% are in the mouse genome
Suggests a variable but substantial proportion of HpaII sites within uniquesequences are methylated in any given cell type
But only a few HpaII sites in the human (22%) & mouse (14%) genomes arelocated in CpG islands
Most NotI sites (5’-GCGGCGCC-3’), 90%, in both species are located withinCpG islands, and RLGC analyses show tissue-dependent methylation
Molecular techniques for detecting CpG islands
Restriction landmark genomic screening (RLGS)Costello, J. F., et al. (2000), Nature Genetics 24: 132-138
Methylated CpG island amplification-representational difference analysis (MCA-RDA)
Toyota, M., et al. (1999), Cancer Research 59: 2307-2312
Methylation-sensitive arbitrarily primed PCR (MS-AP-PCR)Gonzalgo, M. L., et al. (1997), Cancer Research 57: 594-599
Differential methylation hybridization (DMH)Huang, T. H., et al. (1999), Human Molecular Genetics 8: 459-470
Most involve restriction analysis combined with polymerase chain reactionmethods to identify specific methylation sensitive and insensitive sites.
MSEEQFGGDGAAAAATAAVGGSAGEQEGAMVAATQGAAAAAGSGAGTGGGTASGGTEGGSAESEGAKIDASKNEEDEGHSNSSPRHSEAATAQREEWKMFIGGLSWDTTKKDLKDYFSKFGEVVDCTLKLDPITGRSRGFGFVLFKESESVDKVMDQKEHKLNGKVIDPKRAKAMKTKEPVKKIFVGGLSPDTPEEKIREYFGGFGEVESIELPMDNKTNKRRGFCFITFKEEEPVKKIMEKKYHNVGLSKCEIKVAMSKEQYQQQQQWGSRGGFAGRARGRGGGPSQNWNQGYSNYWNQGYGNYGYNSQGYGGYGGYDYTGYNNYYGYGDYSNQQSGYGKVSRRGGHQNSYKPY
cagcgggagagagcgggagtgtgcgccgcgcgagagtgggaggcgaagggggcaggccagggagaggcgcaggagcctttgcagccacgcgcgcgccttccctgtcttgtgtgcttcgcgaggtagagcgggcgcgcggcagcggcggggattactttgctgctagtttcggttcgcggcagcggcgggtgtagtctcggcggcagcggcggagacactagcactATGtcggaggagcagttcggcggggacggggcggcggcagcggcaacggcggcggtaggcggctcggcgggcgagcaggagggagccatggtggcggcgacacagggggcagcggcggcggcgggaagcggagccgggaccgggggcggaaccgcgtctggaggcaccgaagggggcagcgccgagtcggagggggcgaagattgacgccagtaagaacgaggaggatgaaggccattcaaactcctccccacgacactctgaagcagcgacggcacagcgggaagaatggaaaatgtttataggaggccttagctgggacactacaaagaaagatctgaaggactacttttccaaatttggtgaagttgtagactgcactctgaagttagatcctatcacagggcgatcaaggggttttggctttgtgctatttaaagaatcggagagtgtagataaggtcatggatcaaaaagaacataaattgaatgggaaggtgattgatcctaaaagggccaaagccatgaaaacaaaagagccggttaaaaaaatttttgttggtggcctttctccagatacacctgaagagaaaataagggagtactttggtggttttggtgaggtggaatccatagagctccccatggacaacaagaccaataagaggcgtgggttctgctttattacctttaaggaagaagaaccagtgaagaagataatggaaaagaaataccacaatgttggtcttagtaaatgtgaaataaaagtagccatgtcgaaggaacaatatcagcaacagcaacagtggggatctagaggaggatttgcaggaagagctcgtggaagaggtggtggccccagtcaaaactggaaccagggatatagtaactattggaatcaaggctatggcaactatggatataacagccaaggttacggtggttatggaggatatgactacactggttacaacaactactatggatatggtgattatagcaaccagcagagtggttatgggaaggtatccaggcgaggtggtcatcaaaatagctacaaaccatacTAAattattccatttgcaacttatccccaacaggtggtgaagcagtattttccaatttgaagattcatttgaaggtggctcctgccacctgctaatagcagttcaaactaaattttt
Take a gene, any gene . . . e.g., human nuclear ribonucleoprotein D (AU-rich element RNA binding protein 1)
ATG - start codon; TAA - stop codon
cagCGggagagagCGggagtgtgCGcCGCGCGagagtgggaggCGaagggggcaggccagggagaggCGcaggagcctttgcagccaCGCGCGCGccttccctgtcttgtgtgcttCGCGaggtagagCGggCGCGCGgcagCGgCGgggattactttgctgctagtttCGgttCGCGgcagCGgCGggtgtagtctCGgCGgcagCGgCGgagacactagcactATGtCGgaggagcagttCGgCGgggaCGgggCGgCGgcagCGgcaaCGgCGgCGgtaggCGgctCGgCGggCGagcaggagggagccatggtggCGgCGacacagggggcagCGgCGgCGgCGggaagCGgagcCGggacCGggggCGgaacCGCGtctggaggcacCGaagggggcagCGcCGagtCGgagggggCGaagattgaCGccagtaagaaCGaggaggatgaaggccattcaaactcctccccaCGacactctgaagcagCGaCGgcacagCGggaagaatggaaaatgtttataggaggccttagctgggacactacaaagaaagatctgaaggactacttttccaaatttggtgaagttgtagactgcactctgaagttagatcctatcacagggCGatcaaggggttttggctttgtgctatttaaagaatCGgagagtgtagataaggtcatggatcaaaaagaacataaattgaatgggaaggtgattgatcctaaaagggccaaagccatgaaaacaaaagagcCGgttaaaaaaatttttgttggtggcctttctccagatacacctgaagagaaaataagggagtactttggtggttttggtgaggtggaatccatagagctccccatggacaacaagaccaataagaggCGtgggttctgctttattacctttaaggaagaagaaccagtgaagaagataatggaaaagaaataccacaatgttggtcttagtaaatgtgaaataaaagtagccatgtCGaaggaacaatatcagcaacagcaacagtggggatctagaggaggatttgcaggaagagctCGtggaagaggtggtggccccagtcaaaactggaaccagggatatagtaactattggaatcaaggctatggcaactatggatataacagccaaggttaCGgtggttatggaggatatgactacactggttacaacaactactatggatatggtgattatagcaaccagcagagtggttatgggaaggtatccaggCGaggtggtcatcaaaatagctacaaaccatacTAAattattccatttgcaacttatccccaacaggtggtgaagcagtattttccaatttgaagattcatttgaaggtggctcctgccacctgctaatagcagttcaaactaaattttt
human nuclear ribonucleoprotein D (AU-rich element RNA binding protein 1)
ATG - start codonTAA - stop codon1068 bpCG - CpG dinucleotide
1st half of gene: 56% G+C, 39 CpGs2nd half of gene: 43% G+C, 6 CpGs1st 1/4 of gene: 71% G+C, 34 CpGs200 bp upstream: 69% G+C, 26 CpGs
Example of a genomic scale analysis of CpG islands inhuman DNA
Takai and Jones (2002), in PNAS - human chromosomes 21 and 22
Revisited the generally accepted definition of what constitutes a CpGisland which was derived before the sequencing of mammalian and othergenomes (Gardiner-Garden and Frommer, 1987).
Takai and Jones hypothesis was that any definition of a CpG island issomewhat arbitrary, but the current one would include many sequencesthat were not necessarily associated with controlling regions of genes butrather are associated with repetitive sequences such as intragenomicparasites.
Computational analysis of CpG islands
Gardiner-Garden & Frommer (1987), using vertebrate sequences defined a CpG island as a 200-bp region of DNA w/ high G+C content (>50%)
and observed CpG/expected CpG ratio > or = 0.6 definition somewhat arbitrary, depends on what sequences are included difficult to distinguish bona fide CpG islands from repetitive sequences
e.g., nearly 1,000,000 Alu copies per haploid genome in humansshort, interspersed elements, 280-bp consensus w/ high %GC
Takai and Jones (2002) complete sequences of human chromos 21 & 22, ca. 2% of genome refined the algorithm to search for & describe CpG islands results led to re-definition of CpG island reduced # of CpG islands on these chromos from 14,062 to 1,101
more consistent with expected number of genes (~750)
A. Set a 200 bp window @ beginning ofcontig, compute %GC and ObsCG/ExpCG. Shiftwindow 1 bp until window meets criteria ofCpG island.B. If window meets criteria, shift window200 bp and evaluate again.C. & D. Repeat 200 bp shifts until windowdoes not meet criteria.E., F., & G. Repeat from other end.H. If large CpG island does not meetcriteria, trim 1 bp from each end until itdoes.I. Two individual CpG islands areconnected if they were separated by lessthan 100 bp.
Schematic for the algorithms for CpG island extraction from human genome sequences (from Takai and Jones, 2002)
To exclude “mathematical CpG islands”, criterion added that there be at least sevenCpGs in the 200-bp window: there would be 12.5 CpGs (200/16) in a random DNAfragment, with ObsCG/ExpCG ratio of 0.6 would = 7.5; thus, 7 used as cutoff.
Table 1. Number of CpG islands in chromosomes 21 and 22 based on Gardiner-Garden & Fromme (1987) criteria
Category 21 22 21+225’ regions 57 138 195Exon 334 423 757Alu repeats 2,520 5,131 7,651Unknown 2,128 3,331 5,459Total 5,039 9,023 14,062
Takai and Jones (2002) analyses of CpG islands in human sequences
Data (table 1 & figure 2) suggested majority (>13,000) of CpG islands by these criteriawere associated with Alu repeats, or unknown, and not with genes, and that criteriamight be too lenient to be useful.
For example, majority had %GC <59%, ObsCG/ExpCG of <0.72 and a length of <600 bp.
Table 2. Number of CpG islands-distribution of %GC, andObsCG/ExpCG, and lengths of CpGislands in human 21 & 22.
CpG islands divided into fourcategories:5’ region (incl. first coding exon)Exon (subsequent introns/exons)Alu (other than exonic regions)Unknown (any other region)
Results. CpG islands associatedw/5’ regions of genes showedmarkedly different distributioncompared with that of Alus -higher GC content, biphasicdistribution of occurrence ofObsCG/ExpCG, and biphasicdistribution w/ respect to lengths.
%GC ObsCG/ExpCG Length
60 0.7 400-599
Takai & Jones, Fig. 2 (2002)
When the criteria were modified to a %GC > 55% and a length of > 500 bp withan ObsCG/ExpCG ratio > 0.65, this resulted in exclusion of vast majority of Alusequences and the unknowns, with only slight decreases in number of CpGislands within the 5’ region of genes.
Example of one gene, NHP2L1, on chromosome 22 is shown below. In thisexample, the modified criteria helped remove Alu sequences previouslyidentified as part of 5’ region islands, reducing the size of the island from1,233 bp to 620 bp.
Takai & Jones, Fig. 3 (2002)
Comparison of consecutive 500-bp windows of genomes from other species (%GC vs.ObsCG/ExpCG) reveals strong suppression of CpG in human 21 & 22. This suppression iscaused not only by CpG depletion through evolution but also by high content of simplerepetitive sequences and low rate of sequence utilization for genes (i.e., low density).Note low GC content of Arabidopsis and C. elegans - few fragments would fulfill criteriafor a CpG island, whereas Drosophila contains large # of CpG islands as defined.
Takai & Jones, Fig. 4 (2002)%GC
Obs
CG/E
xpCG
0.65
0.65
55%
Takai & Jones, Fig. 4G (2002)
CpG distribution in other organisms
Nearest-neighbor sequence analysis of model species also shows that the frequency ofthe CpG dinucleotide is suppressed in several organisms, including those that are notknow to have DNA methylation (e.g., Drosophila). In fact, with exception of E. coli, allshow the CpG dinucleotide is the most infrequent dinucleotide in their genomes. E. colidoes not have a recognizable sequences for a CpG methyltransferase in its genome.
Some relevant references
Fazzari, M. J., and J. M. Greally. 2004. Epigenomic: beyond CpG islands. Nature Reviews,Genetics 5: 446-455.
Jones, P. A., and S. B. Baylin. 2002. The fundamental role of epigenetic events in cancer.Nature Reviews, Genetics 3; 415-428.
Takai, D., and P. A. Jones. 2002. Comprehensive analysis of CpG islands in humanchromosomes 21 and 22. Proc. Natl. Acad. Sci. USA 99: 3740-3745.
Yoder, J. A., C. P. Walsh, and T. H. Bestor. 1997. Cytosine methylation and the ecology ofintragenomic parasites. Trends in Genetics 13: 335-340.