DNA methylation and the analysis of CpG Islands in genomesandrzej/teachold/351Spr05/teach/...2. attracts methyl-binding proteins which are associated with gene silencing and chromatin

DNA methylation and the analysis of CpG Islands in genomes

M. F. WojciechowskiMAT 35125 March 2005

Nucleotides

Base pairing

*

*

DNA methylation

In mammalian genomes, methylation takes place at cytosine residuesthat are located 5’ to a guanosine, i.e., 5’CpGin plants, the C in the sequence 5’CpNpG is often methylated

the dinucleotide CpG not randomly distributed in most genomes these C residues represent a target for covalent modification of DNA

Cytosine is one of two bases found commonly methylated in nature5’-methylcytosine (5mC) and N-6-methyladenosine

N-6-methyladenosine

Only 5mC is found inDNA of eukaryoticorganisms

3-D models of B DNA

minor groove

major groove

Methyl groups on 5mC protrude into the ‘major’ groove of DNA and havetwo main effects:1. displaces transcription factors that normally bind to DNA, and2. attracts methyl-binding proteins which are associated with gene silencing and chromatin compaction

DNA methylation

DNA methylation is essential for development in mammals, but despite25 yrs of research we still do not know exactly why

Post-replicational addition of methyl groups to 5’ position of cytosinealters the appearance of the major groove of DNA, and thusthe nature of its interactions w/ DNA-binding proteins

Such epigenetic ‘markers’ on DNA generated by methylation, lead tochanges in gene regulation, can be copied and inherited

Methylation of CpG-rich promoters is used to prevent transcriptionalinitiation and ensure the silencing of genes on the inactiveX-chromosome, imprinted genes, and silencing of intragenomicparasites, may also play significant role in carcinogenesis (viaaberrant, tumor-specific CpG island methylation, e.g. BRCA1)

CpG islands

CpG dinucleotides are the site of >90% of all methylation in mammalsbut are underrepresented in much of the genome

short regions of 0.5 to 4 kb in length - “CpG islands” - are rich inG+C content and the CpG dinucleotide

CpG islands are often found associated with proximal regions of genesesp. in promoters and first exons

Salient property of a CpG island is that it is unmethylated in the germline (and most somatic tissues), thus ensuring its continuedexistence in spite of mutagenic pressure on methylated Cs

CpG islands often function as strong promoters, proposed to functionas replication origins in mammals

CpG islands initially discovered on the basis of discordant patterns ofdigestion of genomic DNA by restriction enzyme ‘isoschizomers’ thatdiffered only in their sensitivity to cytosine methylation.

e.g., 5’-ATGTCCGGTACG-3’ 3’-TACAGGCCATGC-5’

5’-ATGTC CGGTACG-3’3’-TACAGGC CATGC-5’

Both enzymes MspI and HpaII digest(cut) DNA at the indicated“restriction” site, to producefragments with overlapping ends.

5’-ATGTCC*GGTACG-3’3’-TACAGGC*CATGC-5’

But HpaII is sensitive to methylation,and will not digest DNA at therestriction site if the C is methylatedat the 5’ position (5mC*)

HpaII or MspI

5’-ATGTC CGGTACG-3’3’-TACAGGC CATGC-5’

HpaII, MspI

Maintenance methylation & de novo methylation

T C G A G G A C G T A C

A G C T C C T G C A T G

CH3

5’

5’

5’



CH3

CH3

5’

5’

5’



CH3

5’

5’



CH3

5’

5’

5’T C G A G G A C G T A C


CH3

5’

CH3

CH3

DNA replication

Methylation

CH3

de novo* methylation

CH3

CH3

CH3

parental strandnew strand

*Methyltransferases with de novo methylation activity have been detected

CpG islands

Restriction analyses indicated: 55-70% of CCGG sites are methylated in animal genomes ‘hypomethylated’ portion of genomes, those digested w/ HpaII

were esp. high in G+C content and CpG rich

Gardiner-Garden & Frommer (1987) - vertebrates in Genbank - proposedgenomic criteria for recognizing CpG islands:sequence must have:G+C content of 0.50 or greaterobserved to expected CpG dinucleotide ratio of 0.60 or greaterboth occurring within a sequence window of 200 bp or greater

But CpGs are vastly underrepresented in genomes compared to what would be expected by chance (0.625) - 0.23 in human, 0.19 in mouse

Why? Methylated cytosines are mutational hotspots - deamination of5mC leads to depletion of CpG dinucleotides through evolution

Outcomes of deamination

Uracil is seen as anabnormal base inDNA and is easilydetected andremoved by excisionrepair mechanisms.

Thymine is seen as anormal base in DNAand repaired bymismatch repairwhich is a veryinefficient mechanismof DNA repair.

CpG islands

Distribution (using previous definition): preferentially at promotor sites of ca. 40% of genes in human genome genes are expressed in most/all tissues, i.e., “housekeeping genes” promoters of some endogenous retroelements defined as CpG islands Genome analyses estimate 55-70% of HpaII sites are methylated in animals

ca. 50% of these are located within transposable elements (humans)while ca. 35% are in the mouse genome

Suggests a variable but substantial proportion of HpaII sites within uniquesequences are methylated in any given cell type

But only a few HpaII sites in the human (22%) & mouse (14%) genomes arelocated in CpG islands

Most NotI sites (5’-GCGGCGCC-3’), 90%, in both species are located withinCpG islands, and RLGC analyses show tissue-dependent methylation

Molecular techniques for detecting CpG islands

Restriction landmark genomic screening (RLGS)Costello, J. F., et al. (2000), Nature Genetics 24: 132-138

Methylated CpG island amplification-representational difference analysis (MCA-RDA)

Toyota, M., et al. (1999), Cancer Research 59: 2307-2312

Methylation-sensitive arbitrarily primed PCR (MS-AP-PCR)Gonzalgo, M. L., et al. (1997), Cancer Research 57: 594-599

Differential methylation hybridization (DMH)Huang, T. H., et al. (1999), Human Molecular Genetics 8: 459-470

Most involve restriction analysis combined with polymerase chain reactionmethods to identify specific methylation sensitive and insensitive sites.

MSEEQFGGDGAAAAATAAVGGSAGEQEGAMVAATQGAAAAAGSGAGTGGGTASGGTEGGSAESEGAKIDASKNEEDEGHSNSSPRHSEAATAQREEWKMFIGGLSWDTTKKDLKDYFSKFGEVVDCTLKLDPITGRSRGFGFVLFKESESVDKVMDQKEHKLNGKVIDPKRAKAMKTKEPVKKIFVGGLSPDTPEEKIREYFGGFGEVESIELPMDNKTNKRRGFCFITFKEEEPVKKIMEKKYHNVGLSKCEIKVAMSKEQYQQQQQWGSRGGFAGRARGRGGGPSQNWNQGYSNYWNQGYGNYGYNSQGYGGYGGYDYTGYNNYYGYGDYSNQQSGYGKVSRRGGHQNSYKPY

cagcgggagagagcgggagtgtgcgccgcgcgagagtgggaggcgaagggggcaggccagggagaggcgcaggagcctttgcagccacgcgcgcgccttccctgtcttgtgtgcttcgcgaggtagagcgggcgcgcggcagcggcggggattactttgctgctagtttcggttcgcggcagcggcgggtgtagtctcggcggcagcggcggagacactagcactATGtcggaggagcagttcggcggggacggggcggcggcagcggcaacggcggcggtaggcggctcggcgggcgagcaggagggagccatggtggcggcgacacagggggcagcggcggcggcgggaagcggagccgggaccgggggcggaaccgcgtctggaggcaccgaagggggcagcgccgagtcggagggggcgaagattgacgccagtaagaacgaggaggatgaaggccattcaaactcctccccacgacactctgaagcagcgacggcacagcgggaagaatggaaaatgtttataggaggccttagctgggacactacaaagaaagatctgaaggactacttttccaaatttggtgaagttgtagactgcactctgaagttagatcctatcacagggcgatcaaggggttttggctttgtgctatttaaagaatcggagagtgtagataaggtcatggatcaaaaagaacataaattgaatgggaaggtgattgatcctaaaagggccaaagccatgaaaacaaaagagccggttaaaaaaatttttgttggtggcctttctccagatacacctgaagagaaaataagggagtactttggtggttttggtgaggtggaatccatagagctccccatggacaacaagaccaataagaggcgtgggttctgctttattacctttaaggaagaagaaccagtgaagaagataatggaaaagaaataccacaatgttggtcttagtaaatgtgaaataaaagtagccatgtcgaaggaacaatatcagcaacagcaacagtggggatctagaggaggatttgcaggaagagctcgtggaagaggtggtggccccagtcaaaactggaaccagggatatagtaactattggaatcaaggctatggcaactatggatataacagccaaggttacggtggttatggaggatatgactacactggttacaacaactactatggatatggtgattatagcaaccagcagagtggttatgggaaggtatccaggcgaggtggtcatcaaaatagctacaaaccatacTAAattattccatttgcaacttatccccaacaggtggtgaagcagtattttccaatttgaagattcatttgaaggtggctcctgccacctgctaatagcagttcaaactaaattttt

Take a gene, any gene . . . e.g., human nuclear ribonucleoprotein D (AU-rich element RNA binding protein 1)

ATG - start codon; TAA - stop codon

cagCGggagagagCGggagtgtgCGcCGCGCGagagtgggaggCGaagggggcaggccagggagaggCGcaggagcctttgcagccaCGCGCGCGccttccctgtcttgtgtgcttCGCGaggtagagCGggCGCGCGgcagCGgCGgggattactttgctgctagtttCGgttCGCGgcagCGgCGggtgtagtctCGgCGgcagCGgCGgagacactagcactATGtCGgaggagcagttCGgCGgggaCGgggCGgCGgcagCGgcaaCGgCGgCGgtaggCGgctCGgCGggCGagcaggagggagccatggtggCGgCGacacagggggcagCGgCGgCGgCGggaagCGgagcCGggacCGggggCGgaacCGCGtctggaggcacCGaagggggcagCGcCGagtCGgagggggCGaagattgaCGccagtaagaaCGaggaggatgaaggccattcaaactcctccccaCGacactctgaagcagCGaCGgcacagCGggaagaatggaaaatgtttataggaggccttagctgggacactacaaagaaagatctgaaggactacttttccaaatttggtgaagttgtagactgcactctgaagttagatcctatcacagggCGatcaaggggttttggctttgtgctatttaaagaatCGgagagtgtagataaggtcatggatcaaaaagaacataaattgaatgggaaggtgattgatcctaaaagggccaaagccatgaaaacaaaagagcCGgttaaaaaaatttttgttggtggcctttctccagatacacctgaagagaaaataagggagtactttggtggttttggtgaggtggaatccatagagctccccatggacaacaagaccaataagaggCGtgggttctgctttattacctttaaggaagaagaaccagtgaagaagataatggaaaagaaataccacaatgttggtcttagtaaatgtgaaataaaagtagccatgtCGaaggaacaatatcagcaacagcaacagtggggatctagaggaggatttgcaggaagagctCGtggaagaggtggtggccccagtcaaaactggaaccagggatatagtaactattggaatcaaggctatggcaactatggatataacagccaaggttaCGgtggttatggaggatatgactacactggttacaacaactactatggatatggtgattatagcaaccagcagagtggttatgggaaggtatccaggCGaggtggtcatcaaaatagctacaaaccatacTAAattattccatttgcaacttatccccaacaggtggtgaagcagtattttccaatttgaagattcatttgaaggtggctcctgccacctgctaatagcagttcaaactaaattttt

human nuclear ribonucleoprotein D (AU-rich element RNA binding protein 1)

ATG - start codonTAA - stop codon1068 bpCG - CpG dinucleotide

1st half of gene: 56% G+C, 39 CpGs2nd half of gene: 43% G+C, 6 CpGs1st 1/4 of gene: 71% G+C, 34 CpGs200 bp upstream: 69% G+C, 26 CpGs

Example of a genomic scale analysis of CpG islands inhuman DNA

Takai and Jones (2002), in PNAS - human chromosomes 21 and 22

Revisited the generally accepted definition of what constitutes a CpGisland which was derived before the sequencing of mammalian and othergenomes (Gardiner-Garden and Frommer, 1987).

Takai and Jones hypothesis was that any definition of a CpG island issomewhat arbitrary, but the current one would include many sequencesthat were not necessarily associated with controlling regions of genes butrather are associated with repetitive sequences such as intragenomicparasites.

Computational analysis of CpG islands

Gardiner-Garden & Frommer (1987), using vertebrate sequences defined a CpG island as a 200-bp region of DNA w/ high G+C content (>50%)

and observed CpG/expected CpG ratio > or = 0.6 definition somewhat arbitrary, depends on what sequences are included difficult to distinguish bona fide CpG islands from repetitive sequences

e.g., nearly 1,000,000 Alu copies per haploid genome in humansshort, interspersed elements, 280-bp consensus w/ high %GC

Takai and Jones (2002) complete sequences of human chromos 21 & 22, ca. 2% of genome refined the algorithm to search for & describe CpG islands results led to re-definition of CpG island reduced # of CpG islands on these chromos from 14,062 to 1,101

more consistent with expected number of genes (~750)

A. Set a 200 bp window @ beginning ofcontig, compute %GC and ObsCG/ExpCG. Shiftwindow 1 bp until window meets criteria ofCpG island.B. If window meets criteria, shift window200 bp and evaluate again.C. & D. Repeat 200 bp shifts until windowdoes not meet criteria.E., F., & G. Repeat from other end.H. If large CpG island does not meetcriteria, trim 1 bp from each end until itdoes.I. Two individual CpG islands areconnected if they were separated by lessthan 100 bp.

Schematic for the algorithms for CpG island extraction from human genome sequences (from Takai and Jones, 2002)

To exclude “mathematical CpG islands”, criterion added that there be at least sevenCpGs in the 200-bp window: there would be 12.5 CpGs (200/16) in a random DNAfragment, with ObsCG/ExpCG ratio of 0.6 would = 7.5; thus, 7 used as cutoff.

Table 1. Number of CpG islands in chromosomes 21 and 22 based on Gardiner-Garden & Fromme (1987) criteria

Category 21 22 21+225’ regions 57 138 195Exon 334 423 757Alu repeats 2,520 5,131 7,651Unknown 2,128 3,331 5,459Total 5,039 9,023 14,062

Takai and Jones (2002) analyses of CpG islands in human sequences

Data (table 1 & figure 2) suggested majority (>13,000) of CpG islands by these criteriawere associated with Alu repeats, or unknown, and not with genes, and that criteriamight be too lenient to be useful.

For example, majority had %GC <59%, ObsCG/ExpCG of <0.72 and a length of <600 bp.

Table 2. Number of CpG islands-distribution of %GC, andObsCG/ExpCG, and lengths of CpGislands in human 21 & 22.

CpG islands divided into fourcategories:5’ region (incl. first coding exon)Exon (subsequent introns/exons)Alu (other than exonic regions)Unknown (any other region)

Results. CpG islands associatedw/5’ regions of genes showedmarkedly different distributioncompared with that of Alus -higher GC content, biphasicdistribution of occurrence ofObsCG/ExpCG, and biphasicdistribution w/ respect to lengths.

%GC ObsCG/ExpCG Length

60 0.7 400-599

Takai & Jones, Fig. 2 (2002)

When the criteria were modified to a %GC > 55% and a length of > 500 bp withan ObsCG/ExpCG ratio > 0.65, this resulted in exclusion of vast majority of Alusequences and the unknowns, with only slight decreases in number of CpGislands within the 5’ region of genes.

Example of one gene, NHP2L1, on chromosome 22 is shown below. In thisexample, the modified criteria helped remove Alu sequences previouslyidentified as part of 5’ region islands, reducing the size of the island from1,233 bp to 620 bp.

Takai & Jones, Fig. 3 (2002)

Comparison of consecutive 500-bp windows of genomes from other species (%GC vs.ObsCG/ExpCG) reveals strong suppression of CpG in human 21 & 22. This suppression iscaused not only by CpG depletion through evolution but also by high content of simplerepetitive sequences and low rate of sequence utilization for genes (i.e., low density).Note low GC content of Arabidopsis and C. elegans - few fragments would fulfill criteriafor a CpG island, whereas Drosophila contains large # of CpG islands as defined.

Takai & Jones, Fig. 4 (2002)%GC

Obs

CG/E

xpCG

0.65

0.65

55%

Takai & Jones, Fig. 4G (2002)

CpG distribution in other organisms

Nearest-neighbor sequence analysis of model species also shows that the frequency ofthe CpG dinucleotide is suppressed in several organisms, including those that are notknow to have DNA methylation (e.g., Drosophila). In fact, with exception of E. coli, allshow the CpG dinucleotide is the most infrequent dinucleotide in their genomes. E. colidoes not have a recognizable sequences for a CpG methyltransferase in its genome.

Some relevant references

Fazzari, M. J., and J. M. Greally. 2004. Epigenomic: beyond CpG islands. Nature Reviews,Genetics 5: 446-455.

Jones, P. A., and S. B. Baylin. 2002. The fundamental role of epigenetic events in cancer.Nature Reviews, Genetics 3; 415-428.

Takai, D., and P. A. Jones. 2002. Comprehensive analysis of CpG islands in humanchromosomes 21 and 22. Proc. Natl. Acad. Sci. USA 99: 3740-3745.

Yoder, J. A., C. P. Walsh, and T. H. Bestor. 1997. Cytosine methylation and the ecology ofintragenomic parasites. Trends in Genetics 13: 335-340.

Documents

DNA methylation and the analysis of CpG Islands in genomesandrzej/teachold/351Spr05/teach/...2. attracts methyl-binding proteins which are associated with gene silencing and chromatin