5
Proc. Natil. Acad. Sci. USA Vol. 81, pp. 3781-3785, June 1984 Evolution Segmental homology and internal repetitiousness identified in putative nucleic acid polymerase and human hepatitis B surface antigen of human hepatitis B virus (oligomeric repeats/alternative open reading frame/oligopeptidic repeats/segmental homology) SUSUMU OHNO Beckman Research Institute of the City of Hope, Duarte, CA 91010 Contributed by Susumu Ohno, March 5, 1984 ABSTRACT In a previous paper, it was argued that only those coding sequences descended from oligomeric repeats (the number of bases in the oligomeric unit not being a multiple of 3) can retain sufficiently long alternative open reading frames, and that such alternative open reading frames serve as the res- ervoir for the sudden generation of new polypeptide chains with novel functions. It was suggested that plasmid-encoded 6- amino hexanoic acid linear oligomer hydrolase that suddenly endowed Flavobacterium sp. K172 with the capacity to live off nylon by-products arose by the above mechanism. A corollary to the above argument is the expectation that those viral base sequences that are known to use two of the three alternative reading frames to encode two different polypeptide chains should invariably contain recognizable remains of the oligo- meric tandem repeats, and as a consequence, various oligopep- tidic repeats should also be present in the amino acid sequence of each. Furthermore, two polypeptide chains encoded by the same base sequence translated in different reading frames should show segmental homology of the type depicted previ- ously. In the present paper, the base sequence of human hepa- titis B virus ayw subtype that encodes an 832 amino acid resi- due long putative nucleic acid polymerase in one reading frame and a 226 residue long human hepatitis B surface anti- gen in the other reading frame was examined. All three predic- tions noted above were satisfied. The mechanism of gene duplication as the means to generate a new gene encoding a functionally novel protein from a re- dundant copy of the preexisting gene is a rather slow (each 1% amino acid sequence diversification from the ancestral protein requires roughly one million years) and ineffectual (most of the redundant copies become degenerate pseudo- genes) process (1). It thus appears improbable that this means alone sufficed early in the evolution of living crea- tures, for the requirement of that critical stage was the al- most simultaneous creation of a sufficiently diverse variety of proteins. Bearing the above in mind, it was proposed that the first set of coding sequences that emerged several billion years ago almost had to be repeats of various base oligomers. Pro- vided that the number of bases in the oligomeric unit is not a multiple of three, a high percentage of the randomly generat- ed oligomeric repeats (e.g., 59.14% of the monodecameric repeats) has not one but all three reading frames open for indefinite lengths. Initially, three polypeptide chains en- coded by such an oligomeric repeat had to be identical, but the accumulation of randomly sustained base substitutions would soon have diversified their amino acid sequences, en- abling certain descendants of each oligomeric repeat to spec- ify at the most three and more often two functionally diverse polypeptide chains simultaneously (2, 3). Certain of the mod- ern genes retaining enough of the original internal repeti- tiousness should maintain a long alternative open reading frame, and such unused open reading frames may serve as the reservoir from which new proteins with novel functions can be generated in one fell swoop. As an example, it was suggested that the plasmid-encoded enzyme 6-amino hexa- noic acid linear oligomer hyt,;-olase, which conferred the ca- pacity to metabolize nylon !-y-products to Flav'obacteriurm sp. K172 within 30 years of its encounter with man-made nylon (4-6), was derived from an unused open reading frame of the coding sequence that originally specified a 472 amino acid residue long arginine-rich protein (3). In the exceptionally small genome of certain viruses, two alternate open reading frames of the same base sequence are often used to yield two polypeptide chains of dissimilar func- tions. The customary explanation that the very smallness of their genome dictated such an efficient utilization of the same base sequence is in reality a mere a posteriori state- ment constituting no explanation at all, for natural selection is bound to ignore an unused open reading frame of the exist- ing coding sequence until it too begins to specify a polypep- tide chain of relevant function. It follows then that, unless a given base sequence was a priori endowed with more than one translatable open reading frame, the protective surveil- lance by natural selection could not have covered both of its reading frames. In this paper, the portion of the human hepatitis B viral genome that encodes an 832 amino acid residue long putative DNA polymerase (or reverse transcriptase?) in one reading frame and a 226 amino acid residue long human hepatitis B surface antigen (HBS antigen) in the other is analyzed to show that an a priori characteristic of this portion of the viral genome is sufficient retention of original internal repetitious- ness. Odds Against the Coding Sequence Having a Long Alternative Open Reading Frame The complete genomic DNA base sequences of human hepa- titis B viruses of subtypes ayw (3182 base pairs long), adw (3200 base pairs long), and adr (3188 base pairs long) were determined by Galibert et al. (7) and by Ono et al. (8). Shown in Fig. 1 is the 2519-base-long segment of one DNA strand of subtype ayw (7) that encodes an 832 amino acid residue long protein suspected of being either DNA polymer- ase (8-10) or reverse transcriptase (11-14). Thus, the amino acid sequence shown below the base sequence of Fig. 1 is identified as putative nucleic acid polymerase (Pol). It has been determined that the alternative reading frame of the Abbreviations: Pol, putative nucleic acid polymerase; HBS antigen, human hepatitis B surface antigen; RSV, Rous sarcoma virus; bp, base pair(s). 3781 The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. §1734 solely to indicate this fact. Downloaded by guest on June 29, 2021

Segmental in B antigen - PNAS · 3782 Evolution: Ohno Proc. Nati. Acad2 Sci. USA81(1984) base sequencespecifyingaminoacid residues 334-570ofPol the coding sequence extremely rich

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

  • Proc. Natil. Acad. Sci. USAVol. 81, pp. 3781-3785, June 1984Evolution

    Segmental homology and internal repetitiousness identified inputative nucleic acid polymerase and human hepatitis B surfaceantigen of human hepatitis B virus

    (oligomeric repeats/alternative open reading frame/oligopeptidic repeats/segmental homology)

    SUSUMU OHNOBeckman Research Institute of the City of Hope, Duarte, CA 91010

    Contributed by Susumu Ohno, March 5, 1984

    ABSTRACT In a previous paper, it was argued that onlythose coding sequences descended from oligomeric repeats (thenumber of bases in the oligomeric unit not being a multiple of3) can retain sufficiently long alternative open reading frames,and that such alternative open reading frames serve as the res-ervoir for the sudden generation of new polypeptide chainswith novel functions. It was suggested that plasmid-encoded 6-amino hexanoic acid linear oligomer hydrolase that suddenlyendowed Flavobacterium sp. K172 with the capacity to live offnylon by-products arose by the above mechanism. A corollaryto the above argument is the expectation that those viral basesequences that are known to use two of the three alternativereading frames to encode two different polypeptide chainsshould invariably contain recognizable remains of the oligo-meric tandem repeats, and as a consequence, various oligopep-tidic repeats should also be present in the amino acid sequenceof each. Furthermore, two polypeptide chains encoded by thesame base sequence translated in different reading framesshould show segmental homology of the type depicted previ-ously. In the present paper, the base sequence of human hepa-titis B virus ayw subtype that encodes an 832 amino acid resi-due long putative nucleic acid polymerase in one readingframe and a 226 residue long human hepatitis B surface anti-gen in the other reading frame was examined. All three predic-tions noted above were satisfied.

    The mechanism of gene duplication as the means to generatea new gene encoding a functionally novel protein from a re-dundant copy of the preexisting gene is a rather slow (each1% amino acid sequence diversification from the ancestralprotein requires roughly one million years) and ineffectual(most of the redundant copies become degenerate pseudo-genes) process (1). It thus appears improbable that thismeans alone sufficed early in the evolution of living crea-tures, for the requirement of that critical stage was the al-most simultaneous creation of a sufficiently diverse varietyof proteins.

    Bearing the above in mind, it was proposed that the firstset of coding sequences that emerged several billion yearsago almost had to be repeats of various base oligomers. Pro-vided that the number of bases in the oligomeric unit is not amultiple of three, a high percentage of the randomly generat-ed oligomeric repeats (e.g., 59.14% of the monodecamericrepeats) has not one but all three reading frames open forindefinite lengths. Initially, three polypeptide chains en-coded by such an oligomeric repeat had to be identical, butthe accumulation of randomly sustained base substitutionswould soon have diversified their amino acid sequences, en-abling certain descendants of each oligomeric repeat to spec-ify at the most three and more often two functionally diverse

    polypeptide chains simultaneously (2, 3). Certain of the mod-ern genes retaining enough of the original internal repeti-tiousness should maintain a long alternative open readingframe, and such unused open reading frames may serve asthe reservoir from which new proteins with novel functionscan be generated in one fell swoop. As an example, it wassuggested that the plasmid-encoded enzyme 6-amino hexa-noic acid linear oligomer hyt,;-olase, which conferred the ca-pacity to metabolize nylon !-y-products to Flav'obacteriurmsp. K172 within 30 years of its encounter with man-madenylon (4-6), was derived from an unused open reading frameof the coding sequence that originally specified a 472 aminoacid residue long arginine-rich protein (3).

    In the exceptionally small genome of certain viruses, twoalternate open reading frames of the same base sequence areoften used to yield two polypeptide chains of dissimilar func-tions. The customary explanation that the very smallness oftheir genome dictated such an efficient utilization of thesame base sequence is in reality a mere a posteriori state-ment constituting no explanation at all, for natural selectionis bound to ignore an unused open reading frame of the exist-ing coding sequence until it too begins to specify a polypep-tide chain of relevant function. It follows then that, unless agiven base sequence was a priori endowed with more thanone translatable open reading frame, the protective surveil-lance by natural selection could not have covered both of itsreading frames.

    In this paper, the portion of the human hepatitis B viralgenome that encodes an 832 amino acid residue long putativeDNA polymerase (or reverse transcriptase?) in one readingframe and a 226 amino acid residue long human hepatitis Bsurface antigen (HBS antigen) in the other is analyzed toshow that an a priori characteristic of this portion of the viralgenome is sufficient retention of original internal repetitious-ness.

    Odds Against the Coding Sequence Having a LongAlternative Open Reading Frame

    The complete genomic DNA base sequences of human hepa-titis B viruses of subtypes ayw (3182 base pairs long), adw(3200 base pairs long), and adr (3188 base pairs long) weredetermined by Galibert et al. (7) and by Ono et al. (8).Shown in Fig. 1 is the 2519-base-long segment of one DNAstrand of subtype ayw (7) that encodes an 832 amino acidresidue long protein suspected of being either DNA polymer-ase (8-10) or reverse transcriptase (11-14). Thus, the aminoacid sequence shown below the base sequence of Fig. 1 isidentified as putative nucleic acid polymerase (Pol). It hasbeen determined that the alternative reading frame of the

    Abbreviations: Pol, putative nucleic acid polymerase; HBS antigen,human hepatitis B surface antigen; RSV, Rous sarcoma virus; bp,base pair(s).

    3781

    The publication costs of this article were defrayed in part by page chargepayment. This article must therefore be hereby marked "advertisement"in accordance with 18 U.S.C. §1734 solely to indicate this fact.

    Dow

    nloa

    ded

    by g

    uest

    on

    June

    29,

    202

    1

  • 3782 Evolution: Ohno Proc. Nati. Acad2 Sci. USA 81 (1984)

    base sequence specifying amino acid residues 334-570 of Pol the coding sequence extremely rich in G+C has a betteralso encodes HBS surface antigen of hepatitis B Dane parti- probability of having a long alternative open reading frame.dles (7, 8, 15). In Fig. 1, the 226 amino acid residue long However, the AT/GC ratio of the base sequence shown insequence of HBS antigen is shown above the base sequence Fig. 1 was found to be 1.03/1.00. The 228 codons specifyingidentified as HBS (fourth row from the bottom of Fig. la to amino acid residues 343-570 of Pol (fourth row from the bot-eighth row of Fig. lb). Inasmuch as the AT/GC ratio of three tom of Fig. la to eighth row of Fig. lb) included 27 codonschain-terminating base triplets TGA, TAG, and TAA is 7/2, that ended in TG (-TG), 6 codons that ended in TA (-TA), 48

    aA TG CC C CT A TC CTATCA ACAC TTCC G GAGA CTACTG T T G TTAGACG A CGAGG C A GG T CC CPOL 1 MET PRO LEU SER TYR GLN HIS PHE ARG ARG LEU LEU LEU LEU ASP ASP GLU ALA GLY PRO

    C TAGAG GA A GAACT CC CT CG C CTC G CAGACG AAGC T CTCAATC G CCGC GT CGCAGAAGATPOL 21 LEU GLU GLU GLU LEU PRO ARG LEU ALA ASP GLU ALA LEU ASN ARG ARG VAL ALA GLU ASP

    C T CAAT CTCG GGAA T CTCAA TG TTAG TAT TCCT TG G CA TCA TA A GGTGG G GAACT T TACTPOL 141 LEU ASN LEU GLY ASN LEU ASN VALNW~ ILE PRO TRP HIS HISS VAL GLY ASN PHE THR

    POL 61 GLY LEU TYR SER SER THR VAL PRO VAL PHEASN PRO HIS TRP LYS THR PRO SER PHE PRO- I~~M

    A A TATACAT T TACACCA AGACATTA T CAAAAAA TG TGAACAGT T T GTAGG CC CAC T CACA

    POL 8l jrJ ILE HIS LEU HIS GLN ASP ILE I LE LYS LYS CYS U GLN PHE VAL GLY PRO LEU THRG TTAATGAGAAAAGA AGATT G CAAT TGATT A TG CC T CCCAGG T T TTAT C CAAAGG T TAC C

    POL 101 VALMIJ'~, U LYS ARG ARG LEU GIN LEU I LE MET PRO PRO ARG PHE TYR PRO LYS VAL. THR

    POL 121 Lys TYR LEU PRO LEU ASP S GLY ILE YS PRO TYR TYR PRO GLU HIS LEU VAL1j1 1 HIS

    POL 1141 TYR PHE GIN THIj HS YR LE HI TR LU TP LY ALA GLY lILE LEU TYRj~S ARG

    POL 161 GLU THR THR HISEWR ALA SER PHE CYS GLY SER PRO TYR SER TRP GLU GIN ASP LEU GIN

    HBs-163 MET SLY GLN A'SN LEL 'SLR THK SLR ASN 'RO LII GLY 'lIE PH5 PRO ASP HIS GLN LEU ASP

    POL 181 HIS GLY ALA GLU SER PHE HIS GIN GIN SER SER GLY ILE LEU SER ARG PRO PRO VAL GLY

    ItBS-143 PRO ALA PHlE AR(; ALA ASN SIHR ALA ASN PRO ASP TRI' ASP PHE ASN PRO ASS LYS ASP THRT C CAGC CT T CAGA G CAA ACA C CG CAAAT C CAGAT T G GGACT T CAA TC C CAACAAG GACA C

    POL 201 SER SER LEU GIN SER LYS HIS ARG LYS SER ARO LEU GLY LEU GIN SER GIN GIN GLY HIS

    HJBS-123 TRP PRO ASP ALA ASN LYS VAL I;LY ALA IL) A1.LAPiL. (;LY L.L GSLY Pli I. TSR PRO PRO HIS

    POL 22 LEU ALA ARG ARG GIN GIN GLY0j3 SER TRP SER ILE ARG ALA GLY PHE HIS PRO THR ALA

    HBS-103 GLY GLY LEU LEU IL) rRI' SLR PRO I;N AlA IN (I1.) ILL LII' GLN TSR ILL PRO ALA ASN

    POL 2141 ARG ARG PRO PHE GLY VAL GLU PRO SER GLY SER GLY HIS THR THR ASN PHE ALA SER LYS

    KBS -83 PRO PRO PRO ALA SIR [SHR AsS ARG GLN SI R (LS ARI USLY RI) rtSR PRO LLL ',LR PRO PRO

    POL 261 SER ALA SER CYS LEO HIS GIN SER PRO VAL ARG LYS ALA ALA TYR PRO ALA VAL SER THR

    IJE -63 LEU AR(; ASN SIR HIS PRO 11.LN ALA MI.T (LN RI'K1 A.5> S1' IR15 'AiR PHlL 115 OLN STR LLL

    POL 281 PH'in J LYS HIS SER SER SER GLY HIS ALA VAL GLU PHE HIS ASN LEO PRO PRO ASN SER

    H~s -143 GLN) ASP' 'RO SRI; VAL ARG IL) LEL' [YR PHII IRS ALA (LI ILS SLFR SLR SER GLY TIR SAL

    POL 301 ALA ARG SER GIN SERO J ARG PRO VAL PHE PRO CYS TRP TRP LEU GIN PHE ARG ASN SERS

    HBS -23 AS?) PRO SAL LiLU PIE iIR ALA SIFR IRo III >10K 'AR ILl. 1II "SL AR(G ILL GLY AS)' PRO

    POL 321 PRO CYS SER P TYR CYS LEU SER LEO ILE VAL ASN LEO LEO GLU ASP TRP GLY PRO

    H~s -3 ALA LI L ASN MET GLU ASN ILE THR SER GLY PHE LEU GLY PRO LEU LEO VAL LEO GIN ALAT GC G CT GA ACATGGAGA ACAT CACA TCAG GA TT C CTA GGACC C CTT CT C GT G TTACAGG C

    POL 3141 cys 4ALAW HIS GLY GLU HIS HIS ILE ARG ILE PR01Ga THR PRO SER ARG VAL THR GLY

    HEs 18 GLY PHE PHE LEU LEU THRI ARG ILE LEU TieR I LE PRO GIN SER LEU ASP SER TRP TRP THRGG G GT T TT TC TT GT TGACA AGAAT C CT CACA A TACC G CAGA GT CT AGACT C GT G GTGGA C

    POL 361 GLY VAL PHE LEU VALW j' LYS ASN PRO HIS ASN THR ALA GLU SER111G LEO VAL VAL ASP

    'lBs 38 SER LEU ASN PHE LEU GLY GLY THR THR VAL CYS LEO GLY GIN ASN SER GIN SER PRO THRT T CT C TCAATT T TCTAGGG G GA A CTAC C GT GT GT CTTGGCCAAAA TT CGCAGT CC CCAA C

    POL 381 PHE SER GIN PHE SERing GLY ASN TYR ARG VAL SER TRP PRO LYS PHE ALA VAL PRO AS1)

    HBs 58 SER ASN HIS SER PRO THR SER CYS PRO PRO THR CYS PRO GLY TYR ARG TRP MET CYS LEOC TC CAA T CAC T CA CCAACC T CT T GTC C TCCAACT TGTC CT G GT TAT C GC TGGA T GTGTC T

    POL 1401 LEU GIN SER LEU THR ASN LEU LEU SER SER ASN LEU SER TRP LEO SER LEU ASP VAL SER

    FIG. 1. The 2519-base-long cDNA sequence of human hepatitis B subtype ayw genome that encodes the 832 amino acid residue long Pol inone reading frame and the 226 amino acid residue long HBS antigen in the other (7). This corresponds to =80% of the one genomic DNA strand.The Pol coding sequence is identified by the amino acid residues that it specifies placed immediately below the rows of base sequences, whereasthe sequence for HBS antigen is identified by the corresponding amino acid residues placed immediately above the rows of base sequence.Amino acid residues at both ends of each row are numbered. In the case of the ayw subtype, HBS antigen coding sequence is preceded in its 5'end by an additional 163-codon-long potential coding sequence that begins with a chain initiator ATG (extreme left of row 10 in a). Amino acid

    Dow

    nloa

    ded

    by g

    uest

    on

    June

    29,

    202

    1

  • Evolution: Ohno Proc. Natl. Acad. Sci. USA 81 (1984) 3783

    codons that started with A (A--), and 54 codons that started which becomes (0.9633)226. Thus, the random probability ofwith G (G--). Thus, the random probability of the above cod- the above-noted Pol coding segment also encoding the 226ing segment having the alternative reading frame open for amino acid residue long HBS antigen by its one particular226 amino acid residues ofHBS antigen can be calculated as: alternative reading frame becomes 2.15 x 10-4. This revela-

    /2748 226 tion makes us realize a peculiarity in the original construc-1i- ~-2-x7 _4 + (6 102 tion of the above-noted Pol coding segment in that, although

    228( 28 2) 28 228 there was no shortage of codons ending in TG (-TG), suchb HBs 7° ARG ARG PHE I LE I LE PHE LEU PHE I LE LEU LEU LEU CYS LEU I LE PHE LEU LEU VAL LEU

    G C G G C G T T T T A T C A T C T T C C T C T T C A T C C T G C TG C T A TG C C T C A T C T T C T T G T T G G T T C TPOL421 ALA ALA PHE TYR HIS LEU PRO LEU HIS PRO ALA ALA MET PRO HIS LEU LEU VAL GLY SER

    HBS 98 LEU ASP TYR GLN GLY MET LEU PRO VAL CYS PRO LEU ILE PRO GLY SER SER THR THR SERT C T G G A C T A T C A A G G T A T G T T G C C C G T T T G T C C T C T A A T T C C A G G A T C C T C A A C A A C C A G

    PoL 441 SER GLY LEU SER ARG TYR VAL ALA ARG LEU SER SERN& SER ARG ILE LEU ASN ASN GLN

    HBS 118 THR GLY PRO CYS ARG THR CYS MET THR THR ALA GLN GLY THR SER MET TYR PRO SER CYSC A C G G G A C C A T G C C G G A C C T G C A T G A C T A C T G C T C A A G G A A C C T C T A T G T A T C C C T C C T G

    POL 461 HIS GLY THR MET PRO ASP LEU HIS TPYR CYS SER ARG ASN LEU TYR VAL SER LEU LEU

    BS 138 CYS CYS THR LYS PRO SER ASP GLY ASN CYS THR CYS I LE PRO I LE PRO SER SER TRP ALAT T G C T G T A C C A A A C C T T C G G A C G G A A A T T G C A C C T G T A T T C C C A T C C C A T C A T C C T G G G C

    POL 481 LEU LEU TYR GLN THR PHE GLY ARG LYS LEU HIS LEU TYR SER HIS PRO ILE ILE LEU GLY

    HBs 158 PHE GLY LYS PHE LEU TRP GLU TRP ALA SER ALA ARG PHE SER TRP LEU SER LEU LEU VALPL5 T T T C G G A A A ATTC C T AT G G G A G T G G G C C T C A G C C C G T T TC T C C T GGC T C AG T TTA CTAGTPOt. 501PHE ARG LYS I LE PRO MET GLY VAL GLY LEU SER PRO PHE LEU LEU ALA GLN PHETHFGTCA

    HBS 178 PRO PHE VAL GLN TRP PHE VAL GLY LEU SER PRO THR VAL TRP LEU SER VAL I LE TRP METG C C A T T T G T T C A G T G G T T C G T A G G G C T T T C C C C C A C T G T T T G G C T T T C A G T T A T A T G G A T

    POL 521 ALA ILE CYS SER VAL VAL ARGO1 ALA PHE PRO HIS CYS LEU ALA PHE SER TYR MET ASPIHBS 198 MET TRP TYR TRP GLY PRO SER LEU TYR SER ILE LEU SER PRO PHE LEU PRO LEU LEU PRO

    G A T G T G G T A T T G G G G G C C A A G T C T G T A C A G C A T C T T G A G T C C C T T T T T A C C G C T G T T A C CPOL 541 VAL VAL LEU GLY ALA LVS SER VAL GLN HIS LEL_ U SER LEU PHE THR ALA VAL THR

    HES 218 ILE PHE PHE CYS LEU TRP VAL TYR ILEAA T T T T C T T T T G T C T T T G G G T A T A C AT T T AAA C C C T AA C A A A A C AA A G AG A T G G G G T T A C

    POL 561 ASN PHE LEU LEU SER LEU GLY I LE HIS LEU ASN PROMW LYS THR LYS ARG TRP GLY TYR

    T C T C T A A A T T T T A T G G G T T A T G T C A T T G G A T G T T A T G G G T C C T T G C C A C A A G A A C A C A T CPOL 581 SER LEU ASN PHE MET GLY TYR VAL ILE GLY CYS TYR GLY SER LEU PRO GLN GLU HIS ILE

    A T A C A A A A A A T C A A A G A A T G T T T T A G A A A A C T T C C T A T T A A C A G G C C T A T T G A T T G G A A APOL 601 ILE GLN LYS ILE LYS GLU CYS PHEEM LYS LEU PRO ILEj~a ARG PRO ILE _W TRP LYS

    G T A T G T C A A C G A A T T G T G G G T C T T T T G G G T T T T G C T G C C C C T T T T A C A C A A T G T G G T T A TPOL 621 VAL CYS GLN ARG ILE VAL GLY LEU LEU GLY PHE ALA ALA PRO PHE THR GLN CYS GLY TYR

    C C T G C G T T G A T G C C T T T G T A T G C A T G T A T T C A A T C T A A G C A G G C T T T C A C T T T C T C G C C APOL 641 PRO ALA LEU MET PRO LEU TYR ALA CYS I LE GLN SEES GLN ALA PHE THR PHE SER PRO

    A C T T A C A A G G C C T T T C T G T G T A A A C A A T A C C T G A A C C T T T A C C C C G T T G C C C G G C A A C G GPOL 661 THR TYR LYS ALA PHE LEU CYSS GLN TYR LEU ASN LEU TYR PRO VAL ALA ARG GLN ARG

    C C A G G T C T G T G C C A A G T G T T T G C T G A C G C A A C C C C C A C T G G C T G G G G C T T G G T C A T G G G CPOL 681 PRO GLY LEU CYS GLN VAL PHE ALA/_ ALA THR PRO THR GLY TRP GLY LEU VAL MET GLY

    C A T C A G C G C A T G C G T G G A A C C T T T T C G G C T C C T C T G C C G A T C C A T A C T G C G G A A C T C C T APoL 701 HIS GLN ARG MET ARG GLY THR PHE SER ALA PRO LEU PRO ILE HIS THR ALA GLU LEU LEU

    IsG C A G C T T G T T T T G C T C G C A G C A G G T C T G G A G C A A A C A T T A T C G G G A C T G A T A A C T C T G T T

    POL 721 ALA ALA CYS PHE ALA ARG SER ARG SER GLY ALA ASN ILE I LE GLY THR_ FV SER VAL

    G TAAACACTGGATCPOt. 741 VAL LEU SER ARG LYS TYR THR SER PHE PRO TRP LEU LEU GLY CYS ALA ALA ASN TRP ILE

    C T G C G C G G G A C G T C C T T T G T T T A C G T C C C G T C G G C C C T G A A T C C T G C G G A C G A C C C T T C TPOL 761 LEU ARG GLY THR SER PHE VAL TYR VAL PRO SER ALA LEU ASN PRO ALA ASP ASP PRO SER

    C G G G G T C G C T T G G G A C T C T C T C G T C C C C T T C T C C G T C T G C C G T T C C G A C C G A C C A C G G G GPOL 781 ARG GLY ARG LEU GLY LEU SER ARG PRO LEU LEU ARG LEU PRO PHE ARG PRO THR THR GLY

    C G C A C C T C T C T T T A C G C G G A C T C C C C G T C T G T G C C T T C T C A T C T G C C G G A C C G T G T G C A CPOL 801 ARG THR SER LEU TYR ALA ASP SER PRO SER VAL PRO SER HIS LEU PRO ASP ARG VAL HIS

    T T C G C T T C A C C T C T G C A C G T C G C A T G G A G A C C A C C G TG A C G CC C A C C A A A T A T T G C C CPOL 821 PHE ALA SER PRO LEU HIS VALAALA TRP ARGPRO PRO

    residues specified by this additional coding sequence are shown in smaller letters contiguous with HBS antigen and they are numbered back-ward starting from HBS antigen -163 to HBS antigen -1. All chain terminating base triplets are identified by solid boxes placed in their properreading frames; those belonging to the same reading frame as the HBS antigen coding sequence are placed immediately above the base triplets,while a single chain terminator of Pol coding sequence is placed immediately below TGA (bottom row in b). The rest are placed below the Polamino acid sequence, as they belong to the third unused reading frame.

    Dow

    nloa

    ded

    by g

    uest

    on

    June

    29,

    202

    1

  • Proc. NatL. Acad Sci. USA 81 (1984)

    codons are not followed by those starting with A (A--), andthat codons starting with either A or G, (A--) and (G--), arenot preceded by those ending in TA (-TA). Such avoidanceof particular codon combinations by the Pol coding segmentmust have been inherent in the original construction of hu-man hepatitis B viral genome, for if it had not preexisted,natural selection has no means at its disposal to favor thesubsequent creation of the alternative open reading frame forHBS antigen by the Pol coding segment. All that natural se-lection has been doing is maintaining the functional integrityof the preexisting HBS antigen coding sequence togetherwith that of the overlapping Pol coding sequence.

    Internal Repetitiousness Embodied in the Base SequenceThat Specified Pol and HBS Antigen in DifferentReading Frames

    The tendency of one of the two alternative reading frames ofthe 2496-base-long Pol coding sequences shown in Fig. 1 toremain open indeed appears as an inherent and original char-acteristic of this particular base sequence, for the first alter-native reading frame that in part encodes the 226 amino acidresidue long HBS antigen contains altogether only 12 chain-terminating codons per 2496 bases, each identified in Fig. 1by a solid box placed above the base sequence. In sharp con-trast, the second alternative reading frame contains as manyas 38 chain terminators per the same number of bases, eachidentified in Fig. 1 by a solid box placed below the rows ofamino acid residues of Pol. Even if we exclude from consid-eration the 684-base-long segment where the first alternativereading frame encodes HBS antigen (fourth row from thebottom of Fig. la to eighth row of Fig. lb) the difference innumbers of chain terminators remains 11 chain terminatorsfor the first alternative reading frame versus 28 chain termi-nators for the second alternative reading frame. Indeed, inthe case of the first alternative reading frame of ayw subtypeshown in Fig. 1, the coding sequence for the 226 amino acidresidue long HBS antigen is preceded on its 5' side by anadditional 618-base-long open reading frame (Fig. la, rows7-18). Furthermore, due to the appropriate appearance ofthe chain initiator ATG (Fig. la, row 10, far left), a distal80% of this additional open reading frame has the potential tospecify a 163 amino acid residue long polypeptide chain thatis contiguous with HBS antigen.The inherent and original characteristics of the base se-

    quences shown in Fig. 1 that enable it to encode Pol andHBS antigen in two different reading frames are apparentlyderived from its ultimate ancestry, the ancestor being therepeat of a base oligomer or oligomers. In Fig. 2, six separateinstances of still discernible tandem oligomeric repeats aretabulated. These tandem oligomeric repeats are accompa-nied by amino acid residues of HBS antigen they specify fortheir easy identification in Fig. 1. Only those found withinthe 684-base-long segment that encodes both Pol and HBSantigen are tabulated (fourth row from the bottom of Fig. lato eighth row of Fig. lb). In some instances, tandemly ar-ranged copies are translated in the same reading frame.Thus, in HBS antigen the tetrapeptidic sequence Ser-Pro-Thr-Ser recurs in rapid succession, occupying amino acidposition 55-58 and then 61-64, as shown at the top of Fig. 2.In other instances, two consecutive copies of the base oligo-mer are translated in different reading frames. Thus, twopentapeptidic sequences, Pro-Leu-Ile-Pro-Gly and Ser-Ser-Thr-Thr-Ser, of amino acid positions 108-112 and then 113-117 of HBS antigen are encoded by two 16-base-long ho-mologous base sequences tandemly arranged (73% base se-quence homology). Overall, the entire base sequence of Fig.1 appears to have descended from the nanodecameric re-peat, the oligomeric unit sequence being something like C-T-T-C-T-G-G-T-G-C-T-C-C-A-A-C-C-A-G, three consecutive

    Hs (AYW) ANTIGEN 54 GSL SEN PRO T SER 58A G T CC C C A A C C

    60 HIS SER PRO THR SER CYS PRO 66CACTCACCAACC TC TTG TCC T

    6;7 PRO THt --- CYS PRO 70c.AACT - - - IT **

    Hts(AYW) ANTIGEN ILE ILE P1H LEU P1H ILE 1EU LEU 88ATCATC TTCC TCTTCATC C TGCTG

    1 LU ILE P LEU LEU VAL .EU LEU 98' I . A. 1 C I I . T I G II G G I T CI T C. I G

    HS (AYW) ANTIGEN 108 PRO LEU ILE PRO SLYTCC TC TAAT TCCAGGA

    113 SER SE R THR DHR SERI C. C.I fA

    AC A A i i

    A

    HBs (Ayw) ANTIGEN 19 SLY PRO CYSG & *6A i A I 6i

    122 ARG THR CYS METC G G A C C - T G C A T G

    L30 SLY THN CYS METG G AA5CC G C ATG

    * *- * ---

    HBs (AYW) ANTIGEN

    Has (AYW) ANTIGEN

    165 TRP ALA SER ALA ARGG T G G G. CC T C A G C CC G

    184 VAL SLY LEU SER PROG. GTAGSGG C T T TC C C C C

    *-- *- *.-190 VAL TRP LEU SERG+ I T T j 6. C I .A

    211 PRO PME LEUCCC TTTTTA

    214 PRO LEU LEUCTC SC TC TI A

    217 PRO ILE PEAA IIII C

    112

    121

    125

    133

    169

    188

    193

    213

    216

    219

    FIG. 2. Six examples of oligomeric tandem repeats found withinthe 684-base-long segment simultaneously encoding Pol and HBSantigen (fourth row from bottom of Fig. la to eighth row of Fig. lb)are tabulated here. In each example, one unmarked oligomeric unitis considered as the prototype, and homologous bases in othercopies are identified by asterisks. Each oligomeric unit is accompa-nied by corresponding amino acid residues of HBS antigen. Whenprototype and copies are translated in the same reading frame, ho-mologous amino acid residues of copies are also identified by aster-isks.

    copies of it translated in three different reading frames orig-inally giving the nanodecapeptidic Leu-Leu-Val-Leu-Gln-Pro-Ala-Ser-Gly-Ala-Pro-Thr-Ser-Phe-Trp-Cys-Ser-Asn-Glnperiodicity to all three polypeptide chains it encoded. Manytripeptidic segments included in the above nanodecapeptidicunit occur at least once either in a Pol amino acid sequenceor in an HBS antigen sequence, and sometimes in both. Forexample, the Leu-Leu-Val tripeptide occupies amino acidposition 94-96 of HBS antigen and position 436-438 of Pol(first row of Fig. lb). Furthermore, the Leu-Leu-Val-Leu-Gln pentapeptide is seen in position 12-16 of HBS antigen(fourth row from the bottom of Fig. la). When the aminoacid composition of 832 amino acid residue long Pol and 226amino acid residue long HBS antigen of ayw subtype werecompared, the overall similarity between the two was ratherstriking. Furthermore, this similarity increased if the addi-tional open reading frame immediately upstream of HBSantigen coding sequence was considered as a part of HBSantigen coding sequence (Fig. la, HBS -163 to 226). Itshould be pointed out here that the amino acid compositionof the primordial nanodecapeptidic unit proposed above iscompatible with the observed amino acid compositions ofboth Pol and HBS antigen. For example, both Pol and HBSantigen were characterized by the relative paucity of Glu,Lys, and Tyr, and these three residues are not included inthe nanodecapeptidic unit. By contrast, usually uncommonPhe, Cys, Trp were relatively abundant in both Pol and HBSantigen (6.72%, 3.62%, and 4.13%, respectively, for Pol;7.08%, 6.19%, and 5.75%, respectively, for HBS antigen),

    3784 Evolution: Ohno

    Dow

    nloa

    ded

    by g

    uest

    on

    June

    29,

    202

    1

  • Proc. Natl. Acad. Sci. USA 81 (1984) 3785

    PbL (AY) 58 VAL GLY LEU SER PRO PE LEU LEU 51BGTGGGCCTCAGCCCGTTTC TCCTG

    SS (ayw) 1 VAL GLY LEU SER PRO 188A G iIT T C iiiC

    (AW) 2B LE sER PRo PHE Le PRo 214TIGtgTiiCIIITIAiCg

    PR. (AW) 26 TYR CYS LEsu SER LEJ IEVLAL 32IA . T G ccIcTC C T A T CPci (AvW) 412 LE SER 7P WL SER W ASP VAL 419

    TTGTC CTGGTTATCGC TGGATGTG

    H AS)ay) PM sPE W LEu sEE LEa L VAL I7II c I; . I G G. C A.C GT T I A c T A T ~ F~V Pa.. .17 LEU-Hs-1YR4VT-ASP-ASP{- --LEu-LEIJ-A.A-ALA-ALA-SER-s-HS-A-LY-LEU-GlU 195

    N~s(AYw) 201 TRP GLY PRO SER LEU H POL 53 (5%) PHE-SE R-PE- - P 1 [L-L LYS-SR-VAL-GLN s-LE -GL 53 (5i)T G G G G G C C A A G T C T G

    POL (AYW)i78 TR GLYS' S E" 5 HB Pci 728 ARG-SER-GLY-ALAA-ILE-ILE-GLY-THR-P-ASP -SER- SAL-VAL-LD-LYS-1YE 746

    I i i i i T T A C T C I C I A RW P1 68 UlY-Av.-PRD-LYS- .A-IE-LYS-THR-ASP-SN-LY- s8v-vs-PHE-TM-SR-LYS-SER 701

    FIG. 3. (Left) Three examples of segmental amino acid sequence homology between Pol and HBS antigen are shown accompanied by thecorresponding coding bases for each. (Right) Two segments of human hepatitis B virus ayw subtype Pol (Hb Pol) amino acid sequence thatshow reasonable homology with portions of two suggested active sites of RSV Pol (Rous sarcoma virus reverse transcriptase) are shown, eachaccompanied by a corresponding segment of RSV Pol. Amino acid residues that are thought to remain invariant in virus reverse transcriptasesof divergent oncogenic retroviruses are shown in large capital letters (14, 16), as are residues homologous with the above that are found in twosegments of Hb Pol. The tripeptides, Val-Val-Leu, found in both segments of Hb Pol are enclosed in a box.

    and all three are represented once each in the nanodecapep-tidic unit. On the other hand, the fact that the unused secondalternative reading frame had a chain terminator recurring,on the average, every 21.9 codons (38 chain terminators per832 codons) would have been more readily explainable iffour base triplets in the proposed primordial nanodecamericunit, TGG, TGC, CAA, and CAG, that can easily change tochain terminators, TGA, TAA, and TAG, were arranged inthe same reading frame.

    Segmental Homology Between Pol and HBS Antigen and thePutative Polymerase Active Sites of Pol

    Inasmuch as the base sequence of Fig. 1 apparently consistsmainly of variously degenerated copies of the primordialoligomeric unit or units that are translated in various readingframes within both Pol and HBS antigen-coding sequences,as a corollary to the observed recurrence of various oligo-peptides in each, segmental amino acid sequence homologyis expected between Pol and HBS antigen. Three of the moreobvious examples of such segmental amino acid sequencehomology between Pol and HBS antigen are tabulated in Fig.3. At the top of Fig. 3, the octapeptidic sequence occupyingposition 508-515 of Pol finds its homologues in two segmentsof HBS antigen. The pentapeptidic portion Val-Gly-Leu-Ser-Pro is also found in position 184-188 of HBS antigen (Fig.lb, rows 5 and 6), whereas another overlapping pentapepti-dic portion, Leu-Ser-Pro-Phe-Leu, is also found to occupyposition 209-213 of HBS antigen (Fig. lb, rows 5 and 7). Theother pentapeptide, Ser-Trp-Leu-Ser-Leu, also occurs onceeach in Pol (position 413-417) and in HBS antigen (position171-175) as shown in the middle of Fig. 3.Recently, the amino acid sequence homology between the

    certain carboxyl-terminal region of human hepatitis B Poland the certain amino-terminal region of Pol (reverse tran-scriptase) of Rous sarcoma virus (RSV) and other oncogenicretroviruses was revealed (14). More precisely, amino acidresidues 499-596 of human hepatitis B Pol (Fig. lb, rows 4-9) were found to be 22% homologous with residues 141-237ofRSV Pol. It is of interest to note that the octapeptidic Val-Gly-Leu-Ser-Pro-Phe-Leu-Leu of Pol that finds its homo-logues in HBS antigen is included in this putative active sitesegment of Pol (Fig. 3 Left). The conserved core of this puta-tive active site ofhuman hepatitis virus B Pol is Phe-Ser-Tyr-Met-Asp-Asp-Val-Val-Leu-Gly-Ala-Lys-Ser of position536-548, the corresponding sequence in RSV Pol being Leu-His-Tyr-Met-Asp-Asp-Leu-Leu-Leu-Ala-Ala-Ala-Ser, asshown (Fig. 3 Right). On the other hand, more extensivecomparison between various Pols of diverse oncogenic ret-

    roviruses led another group of investigators (16) to place themost evolutionary conserved putative active site not on theamino-terminal region but on the carboxyl-terminal region ofthe reverse transcriptase. The conserved core of this alterna-tive active site in the case of RSV Pol was Ala-Ile-Lys-Thr-Asp-Asn-Gly-Ser-Cys-Phe-Thr-Ser-Lys-Ser, occupying po-sition 689-702 as shown (Fig. 3 Right). Interestingly, furthertoward the carboxyl terminus of human hepatitis B virus Pol,we also find the segment (position 733-746) with the se-quence Ile-Ile-Gly-Thr-Asp-Asn-Ser-Val-Val-Leu-Ser-Arg-Lys-Tyr (Fig. 3 Right; Fig. lb, sixth and fifth rows from thebottom). The homology between these two segments ofRSVPol and human hepatitis B virus Pol is 35.7%. Not surprising-ly, the two putative active sites within human hepatitis Polare also somewhat homologous with each other, sharing theVal-Val-Leu tripeptide, thus again attesting to the inherentand original internal repetitiousness of human hepatitis B vi-rus Pol coding sequence.

    This work was supported by the Jeff, Fred, and Ritchie SmithResearch Fellowship.1. Ohno, S. (1970) Evolution by Gene Duplication (Springer, Hei-

    delberg).2. Ohno, S. & Epplen, J. (1983) Proc. Natl. Acad. Sci. USA 80,

    3391-3395.3. Ohno, S. (1984) Proc. Natl. Acad. Sci. USA 81, 2421-2425.4. Kinoshita, S., Negoro, S., Murayama, M., Bisaria, V. S.,

    Sawada, S. & Ikada, H. (1977) Eur. J. Biochein. 80, 489-495.5. Kinoshita, S., Terada, T., Taniguchi, T., Takene, Y., Masuda,

    S., Matsunaga, N. & Okada, H. (1981) Eur. J. Biochem. 116,547-551.

    6. Okada, H., Negoro, S., Kumura, H. & Nakamura, S. (1983)Nature (London) 306, 203-206.

    7. Galibert, F., Mandart, E., Fitoussi, F., Tiollais, P. & Charnay,P. (1979) Nature (London) 281, 646-650.

    8. Ono, Y., Onda, H., Sasada, R., Igarashi, K., Sugino, Y. &Nishioka, K. (1983) Nucleic Acids Res. 11, 1747-1771.

    9. Landers, T. A., Greenbert, H. B. & Robinson, W. S. (1977) J.Virol. 23, 368-376.

    10. Hruska, J., Clayton, D., Rubenstein, J. & Robinson, W. (1977)J. Virol. 21, 666-672.

    11. Summer, J. & Mason, W. S. (1982) Cell 29, 403-415.12. Tiollais, P., Charnay, P. & Vyas, G. N. (1981) Science 213,

    406-411.13. Varmus, H. E. (1982) Nature (London) 299, 204-205.14. Toh, H., Hayashida, H. & Miyata, T. (1983) Nature (London)

    305, 827-829.15. Valenzuela, P., Gray, P., Quiroga, M., Zaldivar, J., Goodman,

    H. M. & Rutter, W. (1979) Nature (London) 280, 815-819.16. Chiu, I.-M., Callahan, R., Tronick, S. R., Schlom, J. & Aaron-

    son, S. A. (1984) Science 223, 364-370.

    Evolution: Ohno

    Dow

    nloa

    ded

    by g

    uest

    on

    June

    29,

    202

    1