A framework for the DNA-protein recognition code of the probe helix in transcription factors: the chemical and stereochemical rules

Masashi Suzuki
MRC Laboratory of Molecular Biology, Hills Road, Cambridge, CB2 2QH, UK

Background: Understanding the general mechanisms of sequence-specific DNA recognition by proteins is a major challenge in structural biology. The existence of a 'DNA recognition code' for proteins, by which certain amino acid residues on a protein surface confer specificity for certain DNA bases, has been the subject of much discussion. However, no simple code has yet been established.

Results: The principles of DNA recognition can be described at two levels. The 'chemical' rules describe the partnerships between amino acid side chains and DNA bases making favourable interactions in the major groove of DNA. Here I analyze the occurrence of nucleotide-amino acid contacts in previously determined crystal structures of DNA-protein complexes and find that simple rules pertain. I also describe 'stereochemical' rules for the probe helix type of DNA-binding motif found in certain transcription factors including leucine zipper and homeodomain proteins. These are a consequence of the binding geometry, and specify the amino acid and base positions used for the contacts, and the sizes of residues in the contact interface.

Conclusions: The chemical rules can be generalized for any DNA-binding motif, while the stereochemical rules are specific to a particular DNA-binding motif. The recognition code for a particular binding motif can be described by combining the two sets of rules.

Structure 15 April 1994, 2:317-326

Key words: DNA binding, DNA-protein interaction, homeodomain, leucine zipper, transcription factor

© Current Biology Ltd ISSN 0969-2126

Introduction

The discovery of the 'triplet code' was epoch-making in relating DNA sequences to protein sequences. A problem of similar importance is the recognition of specific DNA sequences by transcription factors. No simple algorithm has yet been found to explain how proteins discriminate between DNA sequences. The first step towards understanding a DNA recognition code was taken by Seeman et al. [1]. They studied the crystal structure of tRNA and proposed that asparagine/glutamine generally contact adenine and arginine generally contacts guanine. Similar contacts were later found in crystal structures of DNA-protein complexes. The term '(DNA) recognition code' was first used by Pabo and Sauer [2] and was defined by Matthews [3] as "a code whereby certain DNA base pairs are recognized by certain amino acids", although he doubted the existence of such a code. A number of experiments have been carried out towards understanding such a code. Pioneering experiments by Müller-Hill and coworkers [4,5] on helix-turn-helix proteins, and the later work on the zinc finger DNA-binding motif by others [6,7], showed that certain amino acid and base positions are used for contacts by these proteins.

Although these early studies found some similarities in the DNA binding geometry of factors of the same family (see, for example, [8,9]), they were not specific enough to provide a code. Indeed, it now seems to be widely believed that the mode of interaction with DNA can vary substantially even among the same family of factors [3] and that there is no simple code describing the contacts (see, for example, discussions on helix-turn-helix proteins [3,8] and zinc finger proteins [10]).

This paper focuses on the two major requirements for understanding a recognition code. First, a systematic analysis of the possible amino acid-DNA base partners for the contacts is presented. Sequence-specific DNA-protein interactions can be reduced, at a very basic level, to contacts between side chains of amino acid residues and DNA bases. Although examples of particular contacts have been noted repeatedly in the reports of crystal structures (see, for example, Fig. 16 of [11]) there has been little or no systematic analysis. A table listing the possible side chain-base contacts (the chemical rules; see Table 1) is necessary but does not solely constitute a recognition code. A second requirement is a stereochemical analysis of the kind of contacts which are tolerated at particular positions on the interaction surface; that is, an understanding of stereochemical restrictions placed on each contact. In an earlier paper, I have noted that certain eukaryotic transcription factors, including those of the basic domain-leucine zipper (bZip) and homeobox families, use a similar α-helical segment for DNA recognition, and have suggested that all these 'probe helices' recognize DNA according to certain recognition rules [12]. These rules are expected to explain why, for example, asparagine is allowed at position 1 of the probe helix but glutamine is not.


This paper sets out how the rules for DNA recognition can be understood through a process of selection; amino acid residues and DNA base partners are chosen from the candidates that provide the appropriate hydrogen bond (H-bond) or hydrophobic interaction to satisfy certain stereochemical restrictions.

Results and discussion

Chemical rules for the recognition of DNA bases by amino acid side chains

The first step towards determining the chemical rules for side chain-base contacts was to make a systematic analysis of the contacts observed in the crystal structures of known protein-DNA complexes and then to establish the nature of the chemical features important for the contacts.

Contacts in crystal structures

The crystal structures of DNA-transcription factor complexes investigated were HNF3 [13], Max [14], Gal4 [15], E2 [16], GCN4 [17,18], Matα2 [19], engrailed (Engl) [20], Zif268 (Zif) [21], glucocorticoid receptor (GlcR) [22], tryptophan repressor (TrpR) [23,24], catabolite activator protein (CAP) [25], 434 repressor (434R) [26,27], 434 cro (434C) [28,29], lambda repressor (λR) [30], met repressor (MetR) [31], EcoRI [32,33], EcoRV [34], GLI [10], oestrogen receptor (EstR) [35] and tramtrack (TTK) [36]. In almost all these crystal structures, the sequence-specific interactions occur in the major groove of DNA and it is these major groove interactions that are focused on here. (In the following discussion, the standard three letter abbreviations are used for amino acid residues and A, C, G, and T for DNA bases.)

In the crystal structures, some residues bind to just one or two types of bases but others do not discriminate: Ala binds only to T; Arg and Lys bind predominantly to G; Asp and Glu bind to C or A; Asn and Gln bind to all bases but most often to A; and some residues, such as Ser, bind to all four bases (see Table 2). The origins of these differences are discussed below.

Hydrogen bonding

Contacts between amino acid residues and DNA bases are achieved by H-bonding or hydrophobic packing interactions. The H-bond donors (dn) and acceptors (ac) in the major groove (M) are illustrated in Fig. 1a. These chemical groups are available for contact through H-bonds. H-bond contacts between a single base and a single amino acid residue can be predicted on the basis of their hydrogen bonding potential (Table 2).

Amino acid residues Glu and Asp have H-bond acceptors but not donors, therefore they can form H-bonds only with the C or A bases. Residues such as Ser, Cys, and Thr have a hydroxyl or sulphydryl group which can be used simultaneously as a donor and as an acceptor, and they can therefore contact any base. Amino acid residues such as Arg and Lys have a H-bond donor but no acceptor group in their side chains and therefore cannot contact the C base because it has no acceptor.
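The donor/acceptor reasoning above can be sketched in a few lines of Python. This is only an illustration: the dictionaries transcribe the groups described for Fig. 1a and the residues named in the paragraph, and the names `MAJOR_GROOVE`, `SIDE_CHAIN` and `hbond_partners` are hypothetical, introduced here for the example.

```python
# Major-groove H-bond groups per base, transcribed from the description of Fig. 1a.
MAJOR_GROOVE = {
    "A": {"donors": 1, "acceptors": 1},
    "T": {"donors": 0, "acceptors": 1},
    "G": {"donors": 0, "acceptors": 2},
    "C": {"donors": 1, "acceptors": 0},
}

# Side-chain H-bond capability for the residues named in the text
# (an illustrative subset, not a complete table).
SIDE_CHAIN = {
    "Asp": {"donor": False, "acceptor": True},
    "Glu": {"donor": False, "acceptor": True},
    "Arg": {"donor": True, "acceptor": False},
    "Lys": {"donor": True, "acceptor": False},
    "Ser": {"donor": True, "acceptor": True},
    "Cys": {"donor": True, "acceptor": True},
    "Thr": {"donor": True, "acceptor": True},
}


def hbond_partners(residue):
    """Return the bases offering at least one complementary H-bond group."""
    side = SIDE_CHAIN[residue]
    partners = []
    for base in "ATGC":
        groove = MAJOR_GROOVE[base]
        donates = side["donor"] and groove["acceptors"] > 0
        accepts = side["acceptor"] and groove["donors"] > 0
        if donates or accepts:
            partners.append(base)
    return partners


if __name__ == "__main__":
    for name in ("Glu", "Arg", "Ser"):
        print(name, hbond_partners(name))
```

The sketch considers H-bond chemistry only; it ignores the ionic and geometric effects that narrow the observed preferences.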

Hydrophobic interactions

The main target for hydrophobic interaction in the major groove of DNA is the single methyl group of the T base (Fig. 1a). Hydrophobic residues such as Ala, Val, Ile, Leu, Met, Phe, Tyr, Trp and Thr can recognize the methyl on T. The two ring CH groups of the C base are also possible targets for recognition by hydrophobic residues [18]. However, interaction with the hydrogen atoms of C will not be as strong as with the methyl group of thymine. Alanine appears to be insufficiently hydrophobic to contact the C base (Table 2).

The stem of the side chains of polar residues can also contact the methyl group of T. Interactions between Cβ of Ser and T [17], Cβ and Cγ of Gln and T [26],

Table 1. Possible contacts between amino acids and DNA bases in the major groove of DNA.

(a)

| Amino acid (a) | Base (b) | Specificity (c) |
|---|---|---|
| Ala (s) | T | ++ |
| Val (m) | T/C | + |
| Leu (l), Ile (m), Phe (a) | T | |
| Met (l), Trp (a) | T/C A | - |
| Tyr (a) | T/C > A/G | |
| Asp (m), Glu (l) | C≥A | + |
| Asn (m), Gln (l) | A > T/G/C | + |
| Cys (s), Ser (s), Thr (s) | A/T/G/C | |
| Arg (l), Lys (l) | G>>T>A | ++ |
| His (m) | G T≥A C | - |

(b)

| Base | Amino acid | Type |
|---|---|---|
| A | Asn | m |
| | Gln | l |
| T | Ala | s |
| | Val | m |
| | Ile | m |
| | Leu | l |
| | Met | l |
| | Tyr | a |
| | Phe | a |
| | Trp | a |
| G | His | m |
| | Lys | l |
| | Arg | l |
| C | Val | s |
| | Ile | s/m/l |
| | Leu | m/l |
| | Met | m/l |
| | Asp | m |
| | Glu | l |
| | Tyr | a |
| | Phe | a |
| | Trp | a |

(a) Contacts between single residues and single bases. (b) The specific contacts listed in (a) are rearranged according to the sizes of the residues and their partner bases. (a) The sizes of amino acid residues are shown in parentheses: small (s), medium (m), large (l) and aromatic (a). (b) Bold signifies a bidentate interaction. (c) ++ = highly specific, + = specific and - = less specific.


Fig. 1. Chemical groups of DNA in the major groove. (a) The H-bond acceptors (ac) and donors (dn) on the G:C and A:T pairs are shown. The C base has a H-bond donor in the major groove (M), while T has an acceptor. The A base has a donor and an acceptor and G has two acceptors. The methyl group of T, which is available for hydrophobic interaction, is also shown. (b) A shorter bridge can be made between two features of C(i) and W(i + 1), but not between those of W(i) and C(i + 1), because of the clockwise rotation of base pairs around the helix axis (upper). For simplification, the two base pairs are drawn without the rotation (lower) and in (c). (c) Possible cross-bridges of two base pairs, by a single amino acid residue, at pyrimidine-pyrimidine/purine-purine, pyrimidine-purine and purine-pyrimidine steps are shown. The chemical groups are numbered, so that a group closer to the reader has a lower number.

and the imidazole ring of His and T [13], have been reported. However, the features of these amino acid residues used for base recognition are predominantly the H-bond donors or acceptors (Table 2).

'Bidentate' interactions

Some residues can recognize two chemical groups on the same DNA base [1,26] and such interactions would be stronger than single contacts. For example, Lys and Arg can make two H-bond contacts to G, and Asn and Gln can contact A through two H-bonds, one from A and the other to A. Other interactions involving two H-bonds, such as from Ser, Thr, Tyr or Cys to A (one H-bond to and one from A) [2], and Thr to T through one H-bond and a hydrophobic interaction, whilst theoretically possible, have not been found so far. Histidine has both a H-bond donor (NH) and a H-bond acceptor (N) which point in almost opposite directions. Thus, only one of the two groups seems likely to be available for base recognition at any one time, making a bidentate contact difficult.
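As a rough illustration of the bidentate argument, the sketch below asks which bases present two major-groove groups that a given side chain could satisfy at once. The group counts per side chain are a simplifying assumption made for this example (for instance, Arg and Lys are modelled as offering two donors), and geometric limitations such as the one described for histidine are deliberately not modelled. All names are hypothetical.

```python
# (donors, acceptors) exposed by each base in the major groove (see Fig. 1a).
MAJOR_GROOVE = {"A": (1, 1), "T": (0, 1), "G": (0, 2), "C": (1, 0)}

# (donors, acceptors) a side chain is assumed to offer to a single base.
# These counts are a modelling assumption for the sketch, not measured values.
SIDE_CHAIN_GROUPS = {
    "Arg": (2, 0),
    "Lys": (2, 0),
    "Asn": (1, 1),
    "Gln": (1, 1),
    "Asp": (0, 2),
    "Glu": (0, 2),
}


def bidentate_bases(residue):
    """Bases whose two major-groove groups can both be matched by the residue."""
    donors, acceptors = SIDE_CHAIN_GROUPS[residue]
    result = []
    for base in "ATGC":
        base_donors, base_acceptors = MAJOR_GROOVE[base]
        has_two_groups = base_donors + base_acceptors >= 2
        # Residue donors pair with base acceptors and vice versa.
        if has_two_groups and donors >= base_acceptors and acceptors >= base_donors:
            result.append(base)
    return result


if __name__ == "__main__":
    for name in SIDE_CHAIN_GROUPS:
        print(name, bidentate_bases(name))
```

Under these assumptions the sketch reproduces the two bidentate pairings named in the text (Arg/Lys with G, Asn/Gln with A) and returns no bidentate partner for Asp or Glu.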

Ionic effects on contacts

Theoretically, Arg and Lys can bind to any of the bases, other than C, by a H-bond. However, these two residues almost always bind to G (Table 2) even through a single contact. This is probably due to ionic effects on the contacts. The G base has two acceptors and a partial negative charge, while the T base is less negative, as it has only one acceptor. The A base has an acceptor and a donor and is nearly neutral, therefore Arg and Lys bind to G much more often than to T, and to T more often than to A. Histidine is less charged than Arg and Lys and thus its binding preference might be expected to be weaker. For similar reasons, Asp and Glu might prefer C to A.

Bridging of two bases by the same residue

In the crystal structures, some residues bridge two consecutive base pair steps (Table 3). The way in which two base pairs stack is a consequence of many structural factors [37] and can also be influenced by contact with an amino acid residue. However, a simple model can predict most of the possible interactions of a single amino acid residue with a pair of consecutive base pair steps.

Two base pairs stacking onto each other are rotated about 36° clockwise around the helix axis (Fig. 1b). Therefore, a shorter bridge can be made between


Table 2. Contacts observed in crystal structures between single residues and single bases.

| Amino acid | Base (a) | Interaction (b) | Examples (c) |
|---|---|---|---|
| Ala, Val, Leu, Ile, Phe | T (1) | hydrophobic | Ala (GCN4), Ala (GCN4), Ala (GLI), Val (GlcR), Ile (Engl), Phe (E2) |
| Met | T (1) | hydrophobic | |
| | A (1) | ac | |
| | C (1) | ac | |
| Thr, Tyr | T (2) | hydrophobic and dn | |
| | A (2) | dn and ac | |
| | T (1) | hydrophobic | |
| | C (1) | hydrophobic | Tyr (GLI) |
| | T (1) | dn | Thr (EcoRV) |
| | G (1) | dn | |
| | C (1) | ac | |
| | A (1) | dn/ac | Thr (MetR) |
| Asp, Glu | C (1) | ac | Glu (CAP), Glu (Max), Glu (EstR), Asp (Zif), Asp (TTK), Asp (GLI), Asp (GLI), Asp (GLI) |
| | A (1) | ac | Glu (Max), Asp (Zif), Asp (Zif) |
| Asn, Gln | A (2) | dn and ac | Asn (HNF3), Asn (Engl), Asn (Matα2), Asn (EcoRV), Asn (TTK), Gln (434R), Gln (434C), Gln (λR) |
| | A (1) | dn/ac | Asn (E2), Asn (TTK) |
| | T (1) | dn | Gln (Engl), Gln (434R), Gln (434C), Asn (GCN4) |
| | G (1) | dn | Gln (434R), Gln (434C), Asn (λR) |
| | C (1) | ac | Asn (GCN4), Asn (E2) |
| Cys, Ser | A (2) | dn and ac | |
| | T (1) | dn | Ser (Matα2), Ser (TTK), Ser (GLI) |
| | G (1) | dn | Ser (λR), Ser (GLI), Ser (GLI), Cys (E2) |
| | C (1) | ac | |
| | A (1) | dn/ac | Cys (E2), Ser (GLI) |
| Arg, Lys | G (2) | dn | Arg (GCN4), Arg (Zif), Arg (Zif), Arg (GlcR), Arg (λR), Arg (λR), Arg (Max), Arg (CAP), Arg (EcoRI), Arg (TTK), Arg (TTK), Arg (TrpR), Lys (E2), Lys (MetR) |
| | G (1) | dn | Arg (Matα2), Arg (CAP), Arg (GLI), Arg (EstR), Lys (GlcR), Lys (EstR), Lys (λR), Lys (Gal4), Lys (E2), Lys (GLI), Lys (GLI) |
| | T (1) | dn | Lys (EstR), Arg (CAP) |
| | A (1) | dn | |
| His | T (1) | dn | |
| | G (1) | dn | His (Max), His (Zif) |
| | C (1) | ac | |
| | A (1) | dn/ac | |
| Trp | T (1) | dn | |
| | T (1) | hydrophobic | |
| | A (1) | dn | |
| | G (1) | dn | |

(a) Number of contacts made between the amino acid and the base is given in parentheses. (b) Contact type: dn = hydrogen bond from an amino acid to a DNA base; ac = hydrogen bond from a DNA base to an amino acid. Bold type indicates a bidentate interaction. (c) Some proteins have, for example, two arginines binding to different G bases. These are treated as two independent entries.

Table 3. Contacts observed in crystal structures between single residues and two bases.

Columns: Amino acid; C(i)-C(i + 1) or W(i + 1)-W(i) (a); Example; C(i)-W(i + 1) or W(i + 1)-C(i) (b); Example.

Asp, Glu
A A  Glu (EcoRI)  A C
C A  Glu (Max)  A A
C C  Asp (GLI)  C C  CO (Gal4)

Asn, Gln, Ser, Cys, Thr, Tyr
A A  A A
A C  Asn (E2)  A T
A G  A G  Cys (E2)
T A  Gln (434R)  water (EstR)
Gln (434C)  T C  Asn (GCN4)
T C  G C
C A
G C
C T
C G

Lys, Arg
G G  Arg (EcoRI)  G G
Lys (E2)
Lys (Gal4)
A G  water (TrpR)  T G  water (EstR)
water (GlcR)  Arg (CAP)
T G  Ser (GLI) (c)  A C
G T  Lys (EstR)  T T
G A
A A
A T
T T

Leu, Val, Ile
T T
T C
C C

(a) Bridging of two bases on the same strand. (b) Bridging of two bases on different strands. (c) Indicates interaction through bifurcated H-bonds.

two bases, C(i) and W(i + 1), than between W(i) and C(i + 1) [where the sequence of base pairs i, i + 1 etc., is in the direction 5' to 3' for the Crick (C) strand and 3' to 5' for the Watson (W) strand; Fig. 1b]. Thus, when the H-bonding groups of the base pairs are systematically numbered, so that the number of a H-bonding group closer to the reader in Fig. 1c becomes smaller, it is likely that in the two base pairs bridged by the same amino acid residue, the number, J(i), of the group on base pair i, is the same as or smaller than that, J(i + 1), of the feature on base pair i + 1.

All the bridges observed in the crystal structures occur when 0 ≤ [J(i + 1) - J(i)] ≤ 3, including bridges by water molecules (TrpR, EstR) and by a main chain carbonyl group (Gal4), which can be regarded as models for bridging by the side chain of an amino acid residue. All the possible bridges, according to this rule, are drawn in Fig. 1c and are listed in Table 3 with examples observed in the crystal structures.

The residues that bridge two base pairs by H-bonds can be classified into three groups: the double acceptor (AA) type (Asp and Glu); the double donor (DD) type (Arg and Lys); and the acceptor donor (AD) type (Asn, Gln, Cys, Ser, Thr and Tyr). If a bifurcated H-bond is used, Met and Trp can act as the DD type, and Asn, Gln, His, Ser, Cys, Thr and Tyr can act either as AA or DD. However, bifurcated H-bonds are probably weaker than a combination of two 'proper' H-bonds and only one example of such a bond is found in the crystal structures analyzed here (Ser of GLI [10]).
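The AA/DD/AD classification lends itself to a small enumeration. The sketch below lists the two-base combinations that are chemically compatible with each residue type, before any geometric filtering by the J(i) rule, so it yields a superset of the bridges that are actually possible. The names `BRIDGE_TYPE` and `compatible_steps` are hypothetical, and bifurcated H-bonds are not considered.

```python
from itertools import product

# Bases that expose an H-bond donor or acceptor in the major groove (Fig. 1a).
DONOR_BASES = ("A", "C")
ACCEPTOR_BASES = ("A", "T", "G")

# Bridging type of each residue, as classified in the text.
BRIDGE_TYPE = {
    "Asp": "AA", "Glu": "AA",
    "Arg": "DD", "Lys": "DD",
    "Asn": "AD", "Gln": "AD", "Cys": "AD",
    "Ser": "AD", "Thr": "AD", "Tyr": "AD",
}


def compatible_steps(residue):
    """Return base pairs (first, second) that match the residue's H-bond groups.

    Only chemistry is checked; the stereochemical rule on J(i) and J(i + 1)
    would remove some of these combinations.
    """
    kind = BRIDGE_TYPE[residue]
    if kind == "AA":
        # Two acceptors on the residue need a donor on each base.
        steps = set(product(DONOR_BASES, DONOR_BASES))
    elif kind == "DD":
        # Two donors on the residue need an acceptor on each base.
        steps = set(product(ACCEPTOR_BASES, ACCEPTOR_BASES))
    else:
        # One acceptor and one donor: either base may be the donor partner.
        steps = set(product(DONOR_BASES, ACCEPTOR_BASES))
        steps |= set(product(ACCEPTOR_BASES, DONOR_BASES))
    return sorted(steps)


if __name__ == "__main__":
    for name in ("Glu", "Arg", "Asn"):
        print(name, ["".join(step) for step in compatible_steps(name)])
```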

Fig. 2 summarizes some other types of bridging contacts. Parts (a) to (d) illustrate the bridging of T with any of the other bases by a single residue making a hydrophobic interaction with the T base through its stem, plus another interaction (H-bond or hydrophobic) through its side chain. Parts (f) to (i) show bridging of two bases by two residues interacting with each other, and bridging of a base and a phosphate group is shown in (l) to (q).

Sizes of amino acid residues

When replacing one amino acid residue involved in contacts with DNA by another, so as to produce a different binding specificity, not only the chemistry of the interaction but also the shape of the side chain must be considered. The new residue must be chosen so that it continues to fill the space between the protein and DNA. The shapes of side chains can be roughly divided into four groups on the basis of two related properties: the distance between the Cα of the amino acid residue and the chemical groups in its side chain which contact DNA bases, and the volume of the side chain [38]. The members of the four groups (small, medium, large and aromatic) are indicated in Fig. 3.
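For illustration, the four size classes can be held in a simple lookup. The assignments below are transcribed from the size labels given in Table 1; the names `SIZE_CLASS`, `same_size_class` and `residues_of_size` are hypothetical helpers introduced for this sketch. Glycine and proline are left out, as they are not classified.

```python
# Size class of each residue, following the labels in Table 1
# (s = small, m = medium, l = large, a = aromatic).
SIZE_CLASS = {
    "Ala": "small", "Cys": "small", "Ser": "small", "Thr": "small",
    "Val": "medium", "Ile": "medium", "Asp": "medium",
    "Asn": "medium", "His": "medium",
    "Leu": "large", "Met": "large", "Glu": "large",
    "Gln": "large", "Arg": "large", "Lys": "large",
    "Phe": "aromatic", "Tyr": "aromatic", "Trp": "aromatic",
}


def same_size_class(residue_a, residue_b):
    """True when a substitution keeps the residue within the same size class."""
    return SIZE_CLASS[residue_a] == SIZE_CLASS[residue_b]


def residues_of_size(size):
    """Alphabetical list of residues belonging to one size class."""
    return sorted(name for name, cls in SIZE_CLASS.items() if cls == size)


if __name__ == "__main__":
    print(residues_of_size("small"))
    print(same_size_class("Asn", "Gln"))
```

A lookup of this kind only captures the size criterion; whether a substitution is tolerated also depends on the chemistry of the contact.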

It is expected that for proteins of the same family with the same geometry relative to DNA, contact residues at homologous positions will be of similar size, unless the bases change in size. Thus, although Ala (small)


Fig. 2. Bridging contacts between two features of DNA. (a) to (e) Bridging of T-N base steps by single amino acid residues. (f) to (i) Bridging of two bases by two residues. (j), (k) Binding to single bases by two amino acid residues. (l) to (q) Bridging of a base and a phosphate by one or two amino acid residues. Observed bridgings are indicated (obs). The others are predicted based on the observed examples. Residues/bases which might be used in the interactions, but which are of different types from the others, are shown in parentheses. Observed bridgings are as follows: (a) Gln-TG/CA (434R); (f) Asn-Arg-AC/GT (λR); (h) Arg-Asp-G(G/T)/(C/A)C (GLI, Zif); (j) Lys-Asn-G (λR); (l) Ser-T (GCN4) and Tyr-C (GLI); (m,n) Asn-AC (E2); (o) Gln-Gln-A (434R, 434C, λR); (q) Arg-G (Max). The contacts listed here and in Tables 1 and 2 are the contacts through the side chains. However, if a recognition helix lies perpendicular to the major groove rather than parallel, contacts from the main chain amide and carboxyl groups can occur at the amino- and carboxy-terminal edges of the helix, respectively. At the carboxy-terminal edge of the recognition helix of Gal4, a main chain carbonyl group bridges C[C(i + 1)] and C[W(i + 2)], while its side chain, NH of Lys, bridges two bases, G[W(i)] and G[W(i + 1)], and so the single Lys residue recognizes four bases in the three base pairs, GCC/CGG [15]. At the amino-terminal edge of the recognition helix of TrpR, two main chain amide groups bind to bases through water molecules (see Fig. 3.6 of [46]). One water molecule bridges A[C(i)] and G[C(i + 1)], while the other binds to the same G[C(i + 1)]. A similar interaction might be achieved even without the water molecules, i.e. by direct contacts from main chain atoms. However, DNA sequences which can be recognized by the main chain in these ways appear to be limited.


Fig. 3. Amino acid sizes. With the exceptions of glycine and proline, amino acid residues can be classified into four groups, small, medium, large, and aromatic, according to their volume, V (Å³) [38] (x-axis), and the number of atoms from the Cα atom to those interacting with DNA bases (y-axis). Different symbols represent the different atom groups available for interaction: H-bond donor; H-bond acceptor; donor and acceptor; hydrophobic interaction.

and Val (medium) are different in size, the two interactions Ala-T and Val-C seem to be interchangeable. The bulkier side chain of valine can fill the space occupied by the CH3 of T in the Ala-T interaction, and can contact the two ring CH groups of C [18].

The chemical characteristics, binding specificity and size of residues can be summarized in a single table (Table 1b). This set of rules or 'chemical code' is likely to be generally applicable for any DNA-protein interactions taking place in the major groove of DNA.

Stereochemical rules for the recognition of DNA by probe helices

The geometry of a protein surface relative to the DNA surface is fixed by structural elements. A probe helix [12] is a type of recognition helix found in eukaryotic transcription factors including basic leucine zipper and homeodomain proteins. It has Lys/Arg residues which fix its inclination by binding to the phosphate groups of DNA. The recognition helix of a helix-turn-helix protein or a hormone receptor has only limited rotational freedom in the major groove of DNA. This is because there is another helix bound to the far side of the recognition helix which runs in an almost perpendicular direction and this helix would collide with the DNA if the recognition helix rotated too far. In this way, different DNA-binding motifs fix the recognition helix into particular binding geometries. In turn, the binding geometry fixes the positions both of the amino acids in the recognition helix that are used for base recognition and of the bases that are contacted. At some of these contact positions, a residue which has a long side chain can reach further, so the position of the base contacted is dependent on the size of the residue. At other positions, the side chain is so close to the DNA surface that the introduction of any large residue would ruin the packing.

To determine the stereochemical rules for DNA recognition by the probe helix, the crystal structures of complexes of DNA with transcription factors of the probe helix type (see above) were examined (E2 [16], engrailed (Engl) [20], Matα2 [19], GCN4 [17,18] and Max [14]).

Binding geometry of the probe helix

A probe helix is composed of 12 residues (Table 4). The inclination of the helix in the major groove of DNA is fixed: the helix lies at an angle so that the residues at its carboxyl terminus lie further away from the groove than the residues at its amino terminus. This is achieved by basic/hydrophilic residues at helix positions 2, 7, 9, 11 and 12, which bind to DNA phosphates [12]. As a consequence of this inclination, a set of amino acid residues at positions 1, 4, 5 and 8 (aa1, aa4, etc.) face the DNA and contact essentially three base pairs (Fig. 4), and in one case four (Fig. 4h). Three base pairs are recognized because a helix spans 10.5 Å from aa1 to aa8 whilst neighbouring base pairs in the major groove of DNA are approximately 5 Å apart.
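The arithmetic behind the three-base-pair estimate can be checked with a short calculation. The 1.5 Å rise per residue is the standard value for an α-helix and is an assumption added here, since the text only quotes the resulting 10.5 Å span; the 5 Å spacing is taken from the text. Variable names are illustrative.

```python
# Standard alpha-helix rise per residue in angstroms (assumption, not from the text).
RISE_PER_RESIDUE = 1.5
# Approximate spacing of neighbouring base pairs in the major groove (from the text).
BASE_PAIR_SPACING = 5.0


def helix_span(first_position, last_position):
    """Axial distance in angstroms between two helix positions."""
    return (last_position - first_position) * RISE_PER_RESIDUE


def base_pairs_spanned(span):
    """Number of base pairs that fit within a span, counting both ends."""
    return int(span // BASE_PAIR_SPACING) + 1


if __name__ == "__main__":
    span = helix_span(1, 8)
    print(f"aa1 to aa8 span: {span} angstroms")
    print(f"base pairs spanned: {base_pairs_spanned(span)}")
```

Seven residue steps of 1.5 Å give the quoted 10.5 Å, which covers two 5 Å intervals and hence three base pairs.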

Relative positions of the residues and bases that form contacts

Four positions in the probe helix (1, 4, 5 and 8) are used for base recognition. Amino acid residues found at position 1 in the crystal structures are Ile, Asn, and His (Table 4). Valine is found instead of Ile in some homeodomains. These four residues all belong to the 'medium' size class (Table 1). The residue at position 1 is close to the bottom of the major groove of DNA. It therefore appears that large amino acid residues will not fit and that small ones do not reach far enough to form contacts with the bases.
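As a small illustration, the residues at the four base-contacting positions can be read directly from the 12-residue probe helix sequences of Table 4. In the sketch below the sequences are written in upper case, so the bold/lower-case annotation of the table is dropped, only one GCN4 entry is included, and `Matalpha2` is an ASCII spelling of Matα2. The names `PROBE_HELICES` and `contact_residues` are hypothetical.

```python
# Probe helix sequences (positions 1 to 12) transcribed from Table 4,
# in one-letter code and upper case.
PROBE_HELICES = {
    "E2": "NQVKCYRFRVKK",
    "GCN4": "NTEAARRSRARK",
    "Matalpha2": "NWVSNRRRKEKT",
    "Engl": "IWFQNKRAKIKK",
    "Max": "HNALERKRRDHI",
}

# Helix positions that face the bases, numbered from 1 as in the text.
CONTACT_POSITIONS = (1, 4, 5, 8)


def contact_residues(sequence):
    """Map each base-contacting helix position to its residue."""
    if len(sequence) != 12:
        raise ValueError("a probe helix is expected to have 12 residues")
    return {pos: sequence[pos - 1] for pos in CONTACT_POSITIONS}


if __name__ == "__main__":
    for protein, seq in PROBE_HELICES.items():
        print(protein, contact_residues(seq))
```

Collecting position 1 across the five helices gives N, I and H, in agreement with the residues listed for that position.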

Residue 1 contacts base 2 on the Watson strand (W2), and can also be close to base C1 (see Fig. 4a for the notation of the two DNA strands and the base numbering). It can therefore cross-bridge the two strands, or bridge W2 and W3 on the same strand (Fig. 4b). These bridges can be made by asparagine but probably not


Fig. 4. Contacts between residues of probe helices and DNA bases found in the crystal structures. (a) Schematic template. The four blocks in the central line show the residues in the probe helix at positions 1, 4, 5, 8. The three blocks in the lower and upper lines show DNA bases on the Crick (C) and Watson (W) strands, respectively. Schemes (b) to (e) show the positions of all the bases recognized by amino acid residues at positions 1, 4, 5 and 8 of the probe helix, respectively. The sizes of the residues are also shown: s = small; m = medium; l = large and a = aromatic. The hatched blocks are the major interaction sites. Parts (f) to (j) show the contacts observed in each crystal structure. The specific contacts (see Table 1) are marked with a circle. The dashed contact involving the stem of Ser (GCN4) appears to be non-specific (see text). Asn (aa1) of Matα2 has been reported to point towards DNA but does not contact a DNA base in the crystal. However, it seems only reasonable to expect that this residue would bind to T(C1) and C(W2) like Asn (aa1) of GCN4, or C(W2) and A(W3) like Asn (aa1) of E2. Also Leu (aa4) of Max is free in the crystal structure but might be able to contact C(C1) and/or C(C2).

by histidine (see 'Bidentate' interactions section above). The two bases T(W2) and T(W3) might be bridged by Ile and Val, especially by Ile as its side chain is longer than that of Val (Fig. 2e).

Both small and large residues are used for base recognition at positions 4, 5 and 8 (Table 4). The size of the residue determines which base positions are contacted. The contacts for positions 4, 5 and 8 are summarized in Figs 4c, 4d and 4e, respectively. At position 4, small residues, such as Ser and Ala, contact C1, whilst larger residues (e.g. Gln and Lys) contact the region C0-C2. At position 5, small and medium-sized residues, Ala (small), Cys (small) and Asn (medium), contact W3. In one case, cysteine of E2 also bridges to C2 (Fig. 4f). A large residue, Glu, contacts C2 and extends further to C3 (Fig. 4j). At position 8, a large residue, Arg, contacts C2 (Fig. 4i), and an aromatic residue, Phe, contacts C3 (Fig. 4f). The behaviour of small and medium residues at position 8 is not well characterized in the crystal structures.
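The position and size dependence described above can be summarized as a lookup keyed by helix position and size class. The sketch below is a simplified reading of the paragraph, not a complete model: only the principal contact at position 1 is listed, the region contacted by large residues at position 4 is expanded to the individual bases C0, C1 and C2 (reading the region as starting at C0 is an assumption), and combinations that the text does not characterize return `None`. All names are hypothetical.

```python
# Size classes for the residues used in the examples (labels as in Table 1).
SIZE_CLASS = {
    "Ala": "small", "Cys": "small", "Ser": "small",
    "Asn": "medium", "Ile": "medium", "His": "medium", "Val": "medium",
    "Gln": "large", "Lys": "large", "Glu": "large", "Arg": "large",
    "Phe": "aromatic",
}

# (helix position, size class) -> base positions contacted, from the text.
CONTACT_RULES = {
    (1, "medium"): ["W2"],              # principal contact only
    (4, "small"): ["C1"],
    (4, "large"): ["C0", "C1", "C2"],   # "region C0-C2"; C0 is an assumed reading
    (5, "small"): ["W3"],
    (5, "medium"): ["W3"],
    (5, "large"): ["C2", "C3"],
    (8, "large"): ["C2"],
    (8, "aromatic"): ["C3"],
}


def predicted_contacts(position, residue):
    """Base positions expected for a residue at a helix position, or None."""
    size = SIZE_CLASS[residue]
    return CONTACT_RULES.get((position, size))


if __name__ == "__main__":
    print(predicted_contacts(5, "Glu"))
    print(predicted_contacts(8, "Phe"))
    print(predicted_contacts(8, "Ala"))
```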

Arginine at position 9 usually binds to a phosphate. However, when G occurs at position 4 on the Watson strand (W4), Arg at position 9 binds occasionally to the G, for example in one of the three structures of GCN4 bound to its DNA half site (see footnote to Table 4)

Table 4. Sequences of the probe helices found in crystals.

Protein       1  2  3  4  5  6  7  8  9  10 11 12   Binding site (C)   Resolution   R-factor   Reference
E2            n  q  V  K  C  Y  r  F  r  V  K  K    C G T              1.7 Å        0.16       [16]
GCN4 (K)      N  t  E  A  A  r  r  s  r  A  r  k    T G A              3.0 Å        0.216      [18]
GCN4 (EC)     N  t  E  A  A  r  r  s  R  A  r  k    T G A              2.9 Å        0.23       [17]
GCN4 (ECG)    n  t  E  A  A  R  r  s  r  A  r  k    T G A
Matα2         N  w  V  S  N  R  r  R  k  E  k  T    T G T              2.7 Å        0.23       [19]
Engl          I  w  F  Q  N  K  r  A  k  I  k  K    A A T              2.8 Å        0.24       [20]
Max           H  n  A  L  E  R  K  r  r  D  H  I    C C A              2.9 Å        0.232      [14]

Residues interacting with a DNA base are shown in bold, while those interacting with a phosphate group are shown in lower case. The sequences (on the Crick strand) are given for the binding sites of the probe helices. Two crystal structures of the GCN4 dimer complexed with DNA have been determined. In one [18], the two half sites are identical (K), while in the other [17] the two half sites are slightly different (EC and ECG) but the three core bases are the same in both.

324 Structure 1994, Vol 2 No 4

and in Max. This behaviour of residue 9 does not alter the previous discussion as the base which it contacts is positioned outside the three core base pairs.

Arrangement of residues into a single helix

Steric hindrance prevents two residues on a helix from contacting two bases in a way that would require them to cross one another. Two residues might change or increase their binding specificity by directly interacting with each other or indirectly by sharing the same base pair. The combination of Asn (aa1) and Ala (aa4) in GCN4 is such an example. Asn (aa1) binds to C1 and W2. Although nine combinations of bases at C1 and W2 can be recognized by asparagine (Table 3), Ala (aa4) fixes C1 as T so that only two of these combinations, T(C1)-C(W2) and T(C1)-A(W2), remain.

It should also be noted that to discriminate between base pairs a protein does not need to bind to both bases in the pair. If it binds to just one base, the other base is automatically fixed by the base pairing rules.
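The point about base pairing can be made concrete with a few lines of code. This is only an illustrative sketch using the standard Watson-Crick pairs (A with T, G with C); the function names `partner_base` and `base_pair_from_contact` are hypothetical and are not taken from any existing program.

```python
# Standard Watson-Crick pairing: A pairs with T, G pairs with C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def partner_base(base):
    """Return the base that pairs with the given base."""
    return PAIR[base.upper()]


def base_pair_from_contact(strand, base):
    """Given a contacted base and its strand ('W' or 'C'),
    return the whole base pair as (Watson base, Crick base)."""
    base = base.upper()
    other = partner_base(base)
    return (base, other) if strand == "W" else (other, base)


if __name__ == "__main__":
    # A residue that fixes T on the Crick strand also fixes A on the Watson strand.
    print(base_pair_from_contact("C", "T"))
```

A single specific contact to either strand is therefore enough to determine the identity of the base pair at that position.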

DNA-binding specificity of other probe helices

The DNA-binding characteristics of other transcription factors with probe helices can be examined to see if they conform to the stereochemical rules derived from the five structures of Table 4. The bZip proteins all form dimers using the same type of 'zipper', and thus the two half sites on the DNA form an inverted repeat with a similar separation. It is therefore easy to identify the core three base pairs recognized by these proteins (see, for example, [18,39]). Also the correct orientation (with respect to the amino and carboxyl termini) of the helices lying in the binding site is readily determined.

The binding specificities of the bZip proteins TEF, C/EBP, CYS3 and TAF1 conform well to the rules described in the previous section (see Figs 5c to 5f).

In addition to Asn (aa1), Asn (aa2) has been shown to be important for the recognition of T(C2)-A(W2) in certain bZip proteins [40]. This is unusual as aa2 is expected to bind to a phosphate but not to a base. This sequence specificity could well be due to side chain to side chain interactions of Asn (aa1) and Asn (aa2). Similar interactions are seen between two glutamine residues of some helix-turn-helix proteins [8]; one of them binds to the A base, while the other binds to a DNA phosphate.

Most homeodomain proteins function as monomers. The N to C direction of bicoid (Bcd) protein on the DNA seems to be wrongly assigned in the previous work [41-45]. The antennapedia-type (A-type) homeodomain proteins, including engrailed, have glutamine at position 4, and bind to the sequence TAATnonC [41], whereas the bicoid-type (B-type) have lysine at position 4, and bind to TAATCC/GGATTA [42,43]. The difference in binding specificity depends on position 4 [44,45]. It was therefore believed that lysine would interact with the G bases positioned at W4 and W5 (Fig. 5i). However, these contacts seem to be stereochemically impossible. An alternative interpretation becomes possible, if the N to C direction of the helix is reversed, such that GGATTA becomes the Crick strand (Fig. 5h). The G bases are then positioned appropriately for contacting Lys (aa4) without changing the sequence of base pairs 2 and 3.
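The two ways of writing the bicoid site, TAATCC and GGATTA, are the two strands of the same duplex read in opposite directions, which is why reversing the helix direction simply swaps which of them serves as the Crick strand. The short sketch below checks this; the helper name `reverse_complement` is hypothetical and the code is meant purely as an illustration.

```python
# Translation table for Watson-Crick complements.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("TAATCC"))  # prints GGATTA
```

Applying the function twice returns the original sequence, so either strand can be taken as the starting point.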

Fig. 5. Contacts predicted for bZip [(b) to (f)] and homeodomain [(g) to (i)] proteins. Specific contacts (see Table 1) are indicated with a circle. Those for CYS3 are predicted from [48,49]. For bicoid (Bcd) protein, two modes are shown; (h) shows the arrangement proposed here and (i) shows the one proposed in earlier work [41-45] but which seems unlikely.

DNA recognition code Suzuki 325

DNA recognition code of transcription factors

The DNA recognition rules for the probe helix can be described as a combination of the chemical code (Table 1b) and the stereochemical rules (Figs 4b to 4e). A computer program has been developed on the basis of these two sets of rules to predict the binding site on DNA from the protein sequence and to design a protein sequence which will bind to a particular DNA sequence (Suzuki and Yagi, unpublished program). For each of the transcription factors of the probe helix type mentioned in this paper, the proper binding sequence on DNA is among the DNA sequences to which the program gives the highest matching score. The score here is essentially the number of contacts predicted to be made between the protein and the DNA and reflects the binding energy. In other words, by using the program one can exclude about 98% of all the possible combinations of DNA bases, and the proper binding sequence is found in the remaining 2%. The reason why sometimes more than one DNA sequence is given the highest score is unknown. Sequence-dependent structural features of DNA, for example small deviations from the canonical B-DNA structure, may have some influence. Alternatively, the transcription factor might actually bind to more than one DNA sequence.
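The scoring idea described above, counting predicted contacts and keeping the top-scoring sequences, can be sketched as follows. This is not the unpublished program itself. It is a toy illustration under loud assumptions: `TOY_CODE` contains only the Asn/Gln-adenine and Arg-guanine partnerships attributed to Seeman et al. [1] rather than the full chemical code, strands are ignored, and each contact is a hypothetical (residue, site index) pair standing in for the stereochemical rules.

```python
from itertools import product

# Toy chemical code (assumption): residue -> bases it can contact favourably.
TOY_CODE = {"Asn": {"A"}, "Gln": {"A"}, "Arg": {"G"}}


def score(site, contacts, code=TOY_CODE):
    """Count how many predicted contacts are satisfied by a candidate site.

    contacts is a list of (residue, index) pairs, where index is the
    position in the site that the residue is placed to reach.
    """
    return sum(1 for res, i in contacts if site[i] in code.get(res, set()))


def best_sites(contacts, length, code=TOY_CODE):
    """Score every possible site of the given length and return
    (top-scoring sites, top score, fraction of all sites retained)."""
    scored = {
        "".join(s): score(s, contacts, code)
        for s in product("ACGT", repeat=length)
    }
    top = max(scored.values())
    winners = sorted(s for s, v in scored.items() if v == top)
    return winners, top, len(winners) / len(scored)


if __name__ == "__main__":
    # Hypothetical helix: Asn reaches base 0 and Arg reaches base 1 of a 3-base site.
    winners, top, fraction = best_sites([("Asn", 0), ("Arg", 1)], 3)
    print(winners, top, fraction)
```

In this toy case the third base is not contacted, so four sequences share the highest score. This mirrors the observation in the text that more than one DNA sequence may receive the highest score, although the sketch says nothing about why that happens for the real program.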

The rules described in this paper appear to be sufficient to explain the binding specificity of the probe helix type of transcription factors examined here. It therefore seems fair to state that the DNA recognition code does exist and is summarized by the chemical and stereochemical rules described here. The rules are deduced from a limited number of examples, so whilst the framework of the code will not change, some supplementary rules, such as those governing water-mediated contacts, may become apparent from future studies.

The analytical approach to DNA recognition described in this paper has also been extended to hormone receptors, helix-turn-helix and standard zinc finger proteins. The recognition helices in these proteins pack against DNA in orientations which differ from that of the probe helix and from each other. This means that, whilst sharing the chemical code, each group of recognition proteins has a different set of stereochemical rules. These will be reported elsewhere.

Biological implications

The ability of transcription factors to recognize specific DNA sequences with high fidelity is essential for the regulation of gene transcription, and hence for the control of cell growth and differentiation. The crystal structures of about twenty complexes of DNA with transcription factors have been determined, and for individual complexes it has been possible to understand the nature of the interaction between the two molecules in some detail.

The possibility that the specificity of interaction may be governed by a few simple rules, or a 'code', has been discussed for many years, but no such code has been found to cover the known facts. In this paper, I describe a novel approach which separates the problem into two component parts: chemical recognition rules, which relate to the ability of certain amino acids to make favourable contacts such as hydrogen bonds and hydrophobic interactions with specific bases, and stereochemical rules, which are different for each of the different families of transcription factors, and depend on the geometry of the recognition helix. Using this approach, I propose a DNA-protein recognition code for proteins of the probe helix type including the basic-leucine zipper and homeobox families. It should be possible to use this code to predict DNA or protein sequences that are used in recognition for complexes whose structures have not yet been determined. A direct test of these rules would be to use the chemical code table I supply here to select an amino acid change that should change the binding specificity of a transcription factor without altering the binding geometry.

Acknowledgements: I thank Drs A Klug, C Chothia, J Finch and N Yagi for their critical reading of the manuscript and for their useful comments. I thank Drs L Fairall and J Schwabe for communicating their experimental results before publication.

References

1. Seeman, N.C., Rosenberg, J.M. & Rich, A. (1976). Sequence specific recognition of double helical nucleic acids by proteins. Proc. Natl. Acad. Sci. USA 73, 804-808.

2. Pabo, C.O. & Sauer, R.T. (1984). Protein-DNA interactions. Annu. Rev. Biochem. 53, 293-321.

3. Matthews, B.W. (1988). Protein-DNA interaction: no code for recognition. Nature 335, 294-295.

4. Lehming, N., Sartorius, J., Kisters-Woike, B., von Wilcken-Bergmann, B. & Müller-Hill, B. (1991). Rules for protein DNA recognition for a family of helix-turn-helix proteins. In Nucleic Acids and Molecular Biology 5. (Eckstein, F. & Lilley, D.M.J., eds), pp. 114-125, Springer-Verlag, Heidelberg.

5. Kisters-Woike, B., Lehming, N., Sartorius, J., von Wilcken-Bergmann, B. & Müller-Hill, B. (1991). A model of the lac repressor-operator complex based on physical and genetic data. Eur. J. Biochem. 198, 411-419.

6. Desjarlais, J.R. & Berg, J.M. (1993). Use of a zinc-finger consensus sequence framework and specificity rules to design specific DNA binding proteins. Proc. Natl. Acad. Sci. USA 90, 2256-2260.

7. Klevit, R.E. (1991). Recognition of DNA by Cys2, His2 zinc fingers. Science 253, 1367-1393.

8. Pabo, C.O., Aggarwal, A.K., Jordan, S.R., Beamer, L.J., Obeysekare, U.R. & Harrison, S.C. (1990). Conserved residues make similar contacts in two repressor-operator complexes. Science 247, 1210-1213.

9. Brennan, R.C. (1991). Interactions of the helix-turn-helix binding domain. Curr. Opin. Struct. Biol. 1, 80-88.

10. Pavletich, N.P. & Pabo, C.O. (1993). Crystal structure of a five-finger GLI-DNA complex: new perspectives on Zn fingers. Science 261, 1701-1707.

11. Pabo, C.O. & Sauer, R.T. (1992). Transcription factors: structural families and principles of DNA recognition. Annu. Rev. Biochem. 61, 1053-1095.

12. Suzuki, M. (1993). Common features in DNA recognition helices of eukaryotic transcription factors. EMBO J. 12, 3221-3226.

13. Clark, M.L., Halay, E.D., Lai, E. & Burley, S.K. (1993). Co-crystal structure of the HNF-3/fork head DNA-recognition motif resembles histone H5. Nature 364, 412-420.

14. Ferré-D'Amaré, A.R., Prendergast, G.C., Ziff, E.B. & Burley, S.K. (1993). Recognition by Max of its cognate DNA through a dimeric b/HLH/z domain. Nature 363, 38-45.

15. Marmorstein, R., Carey, M., Ptashne, M. & Harrison, S.C. (1992). DNA recognition by Gal4: structure of a protein-DNA complex. Nature 356, 408-414.

16. Hegde, R.S., Grossman, S.R., Laimins, L.A. & Sigler, P.B. (1992). Crystal structure at 1.7 Å of the bovine papillomavirus-1 E2 DNA-binding domain bound to its DNA target. Nature 359, 505-512.

17. Ellenberger, T.E., Brandl, C.S., Struhl, K. & Harrison, S.C. (1992). The GCN4 basic region leucine zipper binds DNA as a dimer of uninterrupted α helices: crystal structure of the protein-DNA complex. Cell 71, 1223-1237.

18. König, P. & Richmond, T. (1993). The X-ray structure of the GCN4-bZip bound to ATF/CREB site DNA shows the complex depends on DNA flexibility. J. Mol. Biol. 233, 139-154.

19. Wolberger, C., Vershon, A.K., Liu, B., Johnson, A.D. & Pabo, C.O. (1991). Crystal structure of a Matα2 homeodomain-operator complex suggests a general model for homeodomain-DNA interactions. Cell 67, 517-528.

20. Kissinger, C.R., Liu, B., Martin-Blanco, E., Kornberg, T.B. & Pabo, C.O. (1990). Crystal structure of an engrailed homeodomain-DNA complex at 2.8 Å resolution: a framework for understanding homeodomain-DNA interactions. Cell 63, 579-590.

21. Pavletich, N.P. & Pabo, C.O. (1991). Zinc finger-DNA recognition: crystal structure of a Zif268-DNA complex at 2.1 Å. Science 252, 809-817.

22. Luisi, B.F., Xu, W.X., Otwinowski, Z., Freedman, L.P., Yamamoto, K.R. & Sigler, P.B. (1991). Crystallographic analysis of the interaction of the glucocorticoid receptor with DNA. Nature 352, 497-505.

23. Otwinowski, Z., et al., & Sigler, P.B. (1988). Crystal structure of trp repressor/operator complex at atomic resolution. Nature 335, 321-329.

24. Lawson, C.L. & Carey, J. (1993). Tandem binding in crystals of a trp repressor/operator half-site complex. Nature 366, 178-182.

25. Shultz, S.C., Shields, G.C. & Steitz, T.A. (1991). Crystal structure of a CAP-DNA complex: the DNA is bent by 90°. Science 253, 1001-1007.

26. Anderson, J.E., Ptashne, M. & Harrison, S.C. (1987). Structure of the repressor-operator complex of bacteriophage 434. Nature 326, 846-852.

27. Aggarwal, A.K., Rodgers, D.W., Drottar, M., Ptashne, M. & Harrison, S.C. (1988). Recognition of a DNA operator by the repressor of phage 434: a view at high resolution. Science 242, 899-907.

28. Wolberger, C., Dong, Y., Ptashne, M. & Harrison, S.C. (1988). Structure of a phage 434 Cro/DNA complex. Nature 335, 789-795.

29. Mondragon, A. & Harrison, S.C. (1991). The phage 434 Cro/OR1 complex at 2.5 Å resolution. J. Mol. Biol. 219, 321-334.

30. Jordan, S.R. & Pabo, C.O. (1988). Structure of the lambda complex at 2.5 Å resolution: details of the repressor-operator interactions. Science 242, 893-899.

31. Somers, W.S. & Phillips, S.E.V. (1992). Crystal structure of the met repressor-operator complex at 2.8 Å resolution reveals DNA recognition by β-strands. Nature 359, 387-393.

32. McClarin, J.A., et al., & Rosenberg, J.M. (1986). Structure of the DNA-EcoRI endonuclease recognition complex at 3 Å resolution. Science 234, 1526-1541.

33. Kim, Y., Grable, J.C., Love, R., Greene, P.J. & Rosenberg, J.M. (1990). Refinement of EcoRI endonuclease crystal structures: a revised protein chain tracing. Science 249, 1307-1309.

34. Winkler, F., et al., & Wilson, K.S. (1993). The crystal structure of EcoRV endonuclease and of its complex with cognate and non-cognate DNA fragments. EMBO J. 12, 1781-1795.

35. Schwabe, J.W., Chapman, L., Finch, J.T. & Rhodes, D. (1993). The crystal structure of the complex between the oestrogen receptor DNA-binding domain and DNA at 2.4 Å: how receptors discriminate between their response elements. Cell 75, 567-578.

36. Fairall, L., Schwabe, J.W.R., Chapman, L., Finch, J.T. & Rhodes, D. (1993). The crystal structure of a two zinc-finger peptide reveals an extension to the rules for zinc finger/DNA recognition. Nature 366, 483-487.

37. Calladine, C.R. & Drew, H.R. (1992). Understanding DNA. Academic Press, London.

38. Chothia, C. (1984). Principles that determine the structure of proteins. Annu. Rev. Biochem. 53, 537-572.

39. Hu, J.C. & Sauer, R.T. (1992). The basic-region leucine-zipper family of DNA binding proteins. In Nucleic Acids and Molecular Biology 6. (Eckstein, F. & Lilley, D.M.J., eds), pp. 82-101, Springer-Verlag, Berlin-Heidelberg.

40. Suckow, M., von Wilcken-Bergmann, B. & Müller-Hill, B. (1993). Identification of three residues in the basic regions of the bZip proteins GCN4, C/EBP and TAF-1 that are involved in specific DNA binding. EMBO J. 12, 1193-1200.

41. Treissman, J., Harris, E., Wilson, D. & Desplan, C. (1992). The homeodomain: a new face for the helix-turn-helix? Bioessays 14, 145-150.

42. Driever, W. & Nüsslein-Volhard, C. (1989). The bicoid protein is a positive regulator of hunchback transcription in early Drosophila embryos. Nature 337, 138-143.

43. Finkelstein, R. & Perrimon, N. (1990). The orthodenticle gene is regulated by bicoid and torso and specifies Drosophila head development. Nature 346, 485-488.

44. Treissman, J., Goenczy, P., Vashishtha, M., Harris, E. & Desplan, C. (1989). A single amino acid can determine the DNA binding specificity of homeodomain proteins. Cell 59, 553-562.

45. Hanes, S.D. & Brent, R. (1991). A genetic model for interaction of the homeodomain recognition helix with DNA. Science 251, 426-430.

46. Travers, A.A. (1993). DNA-Protein Interactions. Chapman and Hall, London.

47. Kanaan, M.N. & Marzluf, G.A. (1991). Mutational analysis of the DNA-binding domain of the CYS3 recognition protein of Neurospora crassa. Mol. Cell. Biol. 9, 4356-4362.

48. Suckow, M., von Wilcken-Bergmann, B. & Müller-Hill, B. (1993). The DNA binding specificity of the basic region of the yeast transcriptional activator GCN4 can be changed by substitution of a single amino acid. Nucleic Acids Res. 21, 2081-2086.

Received: 22 Dec 1993; revisions requested: 12 Jan 1994; revisions received: 21 Feb 1994. Accepted: 8 Mar 1994.