8
The statistics view allows you to review word usage in the entire corpus ( Full corpus statistics ), in individual authors or works ( Author statistics ) or view the usage of a particular lemma ( Lemma statistics ). STATISTICS

STATISTICS - University of California, Irvinestephanus.tlg.uci.edu/helppdf/stats.pdf · The statistics view allows you to review word usage in the entire corpus (Full corpus statistics),

  • Upload
    doandat

  • View
    217

  • Download
    0

Embed Size (px)

Citation preview

  • The statistics view allows you to review word usage in the entire corpus (Full corpus statistics) , in individual authors or works (Author statistics) or view the usage of a particular lemma (Lemma statistics).

    STATISTICS

  • This page inc ludes a summary o f the data inc luded in the TLG cor pus. In for mat ion i s presented in the for ms o f d iag rams and expandable menus.

    In for mat ion inc luded under the expandable menus i s l inked to Text search or the TLG Canon.

    FULL CORPUS STATISTICS

  • This page prov ides s tat i s t i ca l in for mat ion about a spec ific author or work .

    The in for mat ion i s g iven through d iag rams or expandable menus ( l e f t co lumn) .

    Each word in each l i s t i s l inked to Text Search.

    Two icons at the bot tom of each d iag ram a l low you to en large the d iag ram or obta in a l i s t o f the most f requent lemmata in the se lec ted author or work .

    AUTHOR STATISTICS

    Statistical information for Homers Iliad

    Click to go to text searchClick to enlarge Obtain list of 100 most

    frequent lemmata

  • Exclusions

    Unique occurences vs.

    hapaxes

    STATISTICS

    Important notes

    There are certain lemmata which we ignore when looking at the most frequent lemmata in a subcorpus. Stop-words are ignored in N-grams and Statist ics pages.

    Milesian numerals are also ignored in l is ts of statist ics.

    Unique occurrences vs. hapaxes

    It is easy for the TLG search engine to identify lemmata that are unique to a specific author in the corpus. The statist ics display l ists of those lemmata as unique occurrences.

    Unique occurrences of lemmata are known as hapaxes. We do not use the word hapax, because our corpus is not the same as the classical corpus against which hapax was defined. A word is a hapax in Homer i f i t was not used elsewhere in the classical l i terary canon. But the TLG is not l imited to the classical l i terary canon: i t also includes Homeric dictionaries, Homeric scholia, grammarians, and l i terary theorists, al l of whom discussed the words of Homer at length. So while there are numerous Homeric hapaxes by the classical definition, only 14 lemmata in Homer are restricted to Homer in the TLG corpus.

  • O ver-represented and Under-represented L e m m ata

    T h e word o c cu r s more f requen t l y than any o the r con t en t word in P lu ta rch . T h i s doe s n o t t e l l u s much s in c e i s an ex t reme ly f requen t word in any au th or. To go on t o s ay th a t o c cu r s 100 t ime s i n P lu ta rch , and j u s t 7 t ime s i n P indar, i s no t a mean ing fu l compar i s on e i the r, g i ven tha t we have ove r a m i l l i o n wo rd s o f P lu ta rch i n th e T LG, a n d j u s t 2 8 th o u s a n d wo rd s o f P i n d a r.

    T h e m ore in t e re s t i n g que s t i on t o a s k abou t f requen t l em m ata i n an au thor i s , wh i ch a re th e l em m ata tha t the au thor u s e s much more than expec t ed . T hat he lp s u s i den t i f y wh i ch l em m ata a re cha rac t e r i s t i c o f t he au thor. To work ou t wh i ch l em m ata an au thor u s e s more than expec t ed , we l ook a t the f requency o f l em m ata i n the co r pu s ove ra l l . We ex t rapo l a t e how f requen t we expec t th e l em m a to be i n our g i ven au th or, i f h e u s ed i t a s f r equen t l y a s t h e ave rage ac ro s s t he co r pu s . I f t he au thor u s e s t he l emma s i gn ifican t l y more than th e ave rage, we c an t e l l t h a t i t i s ch a rac t e r i s t i c o f t h e au th o r. Fo r ex am p le : t h e l em m a o c cu r s 374 ,000 t ime s i n the TLG cor pu s , wh i ch co r re s ponds t o 3 . 82 t i m e s p e r 1 0 0 0 wo rd s . T h ere a re j u s t ove r a m i l l i o n wo rd s i n P lu ta rch ; s o i f P lu t a rch u s ed th e wo rd w i th th e s am e f req u en c y a s th e TLG ave rage, we wo u ld expec t 3916 in s t an c e s o f t h e word . In s t ead , P lu t a rch u s e s th e word 6578 t im e s ; s o th e wo rd i s ove r re p re s en t ed i n P lu ta rch . By go i n g th ro u gh a l l t h e l em m ata u s ed by P l u ta rch , we c a n i d en t i f y th e l em m ata mos t ove r re p re s en t ed in P lu ta rch in a s en s e, t h e l em m ata tha t a re mos t d i s t i n c t i ve t o P lu ta rch .

    T h e TLG i s qu i t e un even a s a c o r pu s , w i th a w ide va r i e t y o f t ex t s , an d u s age o f wo rd s ch an ged s i gn i fi c an t l y a c ro s s t h e c en tu r i e s . So r a th e r th an a s k wh e th e r a l em m a i s ove r re p re s en t ed in P lu t a rch , c om pared t o th e ave rage ac ro s s t h e en t i re TLG, a more s en s ib l e que s t i on t o a s k i s , wh e th e r i t i s ove r re p re s en t ed compared t o P lu ta rch s c o n t em p o ra r i e s . To d o th i s we l o o k a t h ow f req u en t th e l em m a i s a c ro s s a l l 1 AD au th or s , i n s t ead o f h ow f requen t i t i s a c ro s s a l l o f t h e TLG. For our examp le o f , we find tha t i t o c cur s 17 ,570 t ime s among 1 s t c . AD, wh o accoun t fo r 3 . 4 m i l l i on word s in o th e r word s , i t o c cur s 5 . 26 t ime s pe r 1000 wo rd s . I f P lu t a rch u s ed th e wo rd w i th th e s am e f req u en c y a s h i s ave rage c o n t em p o ra r y, we wo u ld ex p ec t 5 3 7 4 i n s t a n c e s o f t h e wo rd , i n s t ea d o f t h e 6 5 7 8 t i m e s h e ac tu a l l y u s e s . So th e wo rd i s s t i l l ove r re p re s en t ed i n P lu ta rch .

    STATISTICS

    OVER - VS .

    UNDER -REP RES ENTAT ION

  • This page prov ides s tat i s t i ca l in for mat ion about a par t icu lar word.

    The in for mat ion i s g iven through d iag rams or expandable menus ( l e f t co lumn) .

    Click on Geographic Dis t r ibut ion to v iew th i s in for mat ion on the map.

    LEMMA STATISTICS

    Links to the TLG Canon

  • Click on Geographic Dis t r ibut ion to v iew in for mat ion about a spec ific lemma on the map.

    LEMMA STATISTICS

  • Over-represented and under-represented lemmata per century

    Relat ive f requencies of lemma use, rather than raw counts, are part icularly important when we are comparing usage across d i f ferent centur ies. Through accidents of scr ibal preference and cul tural preservat ion, the quant i ty of preserved Greek l i terature var ies widely by century. The cor pus for A.D. 4 i s around 18 mi l l ion words ; the cor pus for 8 B.C. i s 0 .2 mi l l ion words, which i s 90 t imes smal ler. So i t i s meaning less to say that there are more ins tances of in 4 A.D. than B.C. 8 (17,547 vs. 402) ; wi th 90 t imes more words, we would expect any word to occur more of ten in iv A.D. than v i i i B.C.

    One way of looking at th i s i s by working out what the expected count of the word would be per century, i f the word were dis tr ibuted evenly across the TLG. occurs 111,604 t imes in the 98 mi l l ion words of the TLG. Extrapolat ing from that , i t should occur 20,230 t imes in iv A.D. , and 274 t imes in v i i i B.C. There are over 2 ,300 ins tances of les s than we would expect in iv A.D. ; so we can say that i s underrepresented by 2300 ins tances for that century. There are 127 ins tances of more than we would expect in v i i i B.C. ; so we can say that i s overrepresented by 127 ins tances for that century.

    Over- and under- representation

    STATISTICS