Word Frequency for British English

7/25/2019 Word Frequency for British English

http://slidepdf.com/reader/full/word-frequency-for-british-english 1/36

SUBTLEX-UK:
A new and improved word frequency database for British English

Walter J. B. van Heuven¹, Pawel Mandera², Emmanuel Keuleers², Marc Brysbaert²,³

¹ University of Nottingham, UK
² Ghent University, Belgium
³ Swansea University

Keywords: Word frequency, visual word recognition, Zipf scale

Running head: SUBTLEX-UK

Address: Dr. Walter van Heuven
School of Psychology
University of Nottingham
University Park
Nottingham, NG7 2RD
Phone: +44 115 8466363
Fax: +44 115 9515324
Email: walter.vanheuven@nottingham.ac.uk

Abstract

We present word frequencies based on subtitles of British television programs. We show that the SUBTLEX-UK word frequencies explain more of the variance in the lexical decision times of the British Lexicon Project than the word frequencies based on the British National Corpus and the SUBTLEX-US frequencies. In addition to the word form frequencies, we also present measures of contextual diversity, part-of-speech specific word frequencies, word frequencies in children's programs, and word bigram frequencies, giving researchers of British English access to the full range of norms recently made available for other languages. Finally, we introduce a new measure of word frequency, the Zipf scale, which we hope will stop the current misunderstandings of the word frequency effect.

Word frequency arguably is the most important variable in word recognition research (Brysbaert, Buchmeier, Conrad, Jacobs, Bölte, & Böhl, 2011a). Words that are often encountered are processed faster than words that are rarely encountered. Figure 1 shows the course of the word frequency effect. It includes mean standardised reaction times (z-values) for samples of 1000 words going from an average frequency of .06 per million words (a log10 value of -1.2) to an average frequency of nearly 1,000 per million words (a log10 value of nearly 3.0). The reaction times come from the English Lexicon Project (ELP; red circles; Balota, Yap, Cortese, Hutchison, Kessler, Loftis, Neely, Nelson, Simpson, & Treiman, 2007) and the British Lexicon Project (BLP; blue circles; Keuleers, Lacey, Rastle, & Brysbaert, 2012), which contain lexical decision times to over 40 thousand words of American English (ELP) or over 28 thousand monosyllabic and disyllabic words of British English (BLP). The word frequencies come from the British National Corpus (BNC; available at http://www.kilgarriff.co.uk/bnc-readme.html; checked on May 13, 2013), a 100 million word collection of samples of mostly written and some spoken language from a wide range of sources, collected between 1991 and 1994 and designed to represent a wide cross-section of British English at that time. Another database of word frequency norms often used for British English is the CELEX lexical database (Baayen, Piepenbrock, & Gulikers, 1995), based on a corpus of 17.9 million words assembled along the same criteria as the BNC.

- - - - - - - - - - - - - - - - -
Insert Figure 1 about here
- - - - - - - - - - - - - - - - -

Research in American English and other languages has suggested that word frequencies based on film and television subtitles are better predictors of word processing times than word frequencies based on books and other written sources (Brysbaert et al., 2011a; Brysbaert, Keuleers, & New, 2011b; Brysbaert & New, 2009; Cai & Brysbaert, 2010; Cuetos, Glez-Nosti, Barbon, & Brysbaert, 2011; Dimitropoulou, Duñabeitia, Avilés, Corral, & Carreiras, 2010; Ferrand, New, Brysbaert, Keuleers, Bonin, Méot, Augustinova, & Pallier, 2010; Keuleers, Brysbaert, & New, 2010; New, Ferrand, Veronis, & Pallier, 2007). This is an important finding, because the more variance can be explained by word frequency, the fewer other variables are needed to account for word processing times. Brysbaert and Cortese (2011), for example, found that word familiarity did not explain much extra variance in lexical decision times to monosyllabic English words when the SUBTLEX-US subtitle frequency measure was used (Brysbaert & New, 2009) instead of a commonly used, outdated frequency measure based on a small corpus of written sources (Kučera & Francis, 1967).

Although word frequency estimates based on American subtitles can be used (and have been used) in British word recognition research, some precision is lost, because some words have a different spelling (e.g., labor vs. labour) or a different meaning (e.g., biscuits, pants) in the two languages. The divergences between American and British word usage imply that British researchers should limit their research to the words fully shared among the languages if they use American subtitle frequencies. Else, their findings risk overestimating the impact of non-frequency variables, such as age-of-acquisition, word familiarity, word length, or similarity to other words. Suboptimal frequency estimates also increase the risk of stimulus selection errors. This will be the case when words must be selected on the basis of frequency information (e.g., words having different numbers of closely resembling words, so-called orthographic neighbours, with higher frequencies) or when words of different conditions must be matched on frequency (e.g., highly emotional words vs. neutral words).

To address the limitations that researchers working with British English are confronted with, we decided to collect subtitle-based UK word frequency norms. In addition, because we were able to directly capture the subtitles from a variety of television programs, for the first time we also collected subtitle frequencies from channels specifically aimed at children. Below we describe the collection of the data, the summary statistics calculated, and the first validation studies we ran.

Method

Corpus collection. In line with UK regulations, since 2008 the British Broadcasting Corporation (BBC) subtitles all scheduled programs on its main channels, to help the hearing impaired.¹ These subtitles are not broadcast through the main channel, but can be superimposed on the program by those who wish so (e.g., by using Teletext). To have the widest possible range of language input, we collected the words and word pairs of the subtitles from nine channels (BBC1-4, BBC News, BBC Parliament, BBC HD, CBeebies, and CBBC) broadcast over a period of three years (January 2010 - December 2012). Of these channels, BBC1 is the most popular and extensive (aimed at all types of audiences). The other channels have more limited hours. Of further interest is that the CBeebies channel is meant for preschool children (0–6 years) and the CBBC channel for primary school children (6–12 years). This allowed us to compile frequency norms for these groups.

¹ On the basis of anecdotal evidence we can add that these subtitles are also appreciated by viewers with English as second language.

Notwithstanding the provisions relating to ‘fair dealing’ provided under section 29 of the Copyright, Designs and Patents Act 1988, the full textual content of the relevant subtitles was not stored or reproduced for the purpose of this research. A count of individual words and consecutive words was undertaken, obtainable from public transmissions. The method employed does not detract from or otherwise undermine the value of this evaluative work.

Text cleaning. The broadcasts were cleaned semi-automatically for doubles (program repeats) and subtitle-related information not broadcast to the viewers. Also the parts of the subtitles not related to the conversation were eliminated (e.g., the words “silence” or “thunder” to describe the ongoing scene; these are usually presented in uppercase, a different font or colour in the subtitle). After the cleaning we obtained a total of 201.7 million words, coming from 45,099 different broadcasts. This is larger than the other existing subtitle corpora (Brysbaert & New, 2009; Cai & Brysbaert, 2010; Cuetos et al., 2011; Dimitropoulou et al., 2010; Keuleers et al., 2010)², and allowed us to calculate more precise Parts-of-Speech dependent frequencies and word bigrams.

² Brysbaert and New (2009) reported that the word type frequencies themselves show little difference once the corpus contains 30 million words, a finding that was replicated in the present analyses.

Word frequency measures

Word frequency counts. A first decision to be made was what to do with hyphenated words. In British English words are often hyphenated when they function as adjectives. So, a potion that saves lives can be described as “a life-saving potion”. This phrase could be counted as consisting of three word types (a, life-saving, potion) or four word types (a, life, saving, potion). The problem was particularly relevant for the BBC subtitles, because nearly one out of four word types contained a hyphen in the first analysis of the data. The vast majority of these hyphenated entries were of low frequency (less than 100 observations on a total of 200 million words). Because there are no a priori considerations about how to handle this finding (also because there is quite some individual variability in the use of hyphens; Kuperman & Bertram, 2013), we decided to use a pragmatic criterion and looked at which word frequencies correlated most with the 28 thousand lexical decision times of the BLP (Keuleers et al., 2012). As this clearly favoured the dehyphenated word frequencies (a difference in variance explained of 5%), we decided to dehyphenate the data before counting the words.³

³ Dehyphenation also occurs in automatic text parsers, such as CLAWS and the Stanford parser (to be described later). Because the Stanford parser dehyphenates more words than CLAWS, the outcome of this parser outperformed that of CLAWS on the raw corpus, but no longer on the dehyphenated corpus.

The dehyphenated subtitles resulted in a total of nearly 333,000 different word types for a total of about 201.7 million tokens. Of these, 31,368 types were in the CBeebies subtitles with a total of about 5.86 million tokens, and roughly 71,000 types were in the CBBC subtitles with a total of 13,644,165 tokens. Because the vast majority of words observed in a single broadcast were typos and other nonword-like structures (like “aaaarrrrgh”), we decided to take out all entries observed in a single broadcast only. This reduced the number of types to 159,235 with a total token count of 201,335,638 for the complete corpus, 5,848,083 for the CBeebies subcorpus (27,236 types), and about 13.61 million for the CBBC subcorpus (58,691 types).
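The two counting decisions described above (dehyphenating before counting, and discarding word types seen in a single broadcast only) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the tokenisation rule, the function names and the three toy "broadcasts" are assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase, turn hyphens into spaces (dehyphenation), keep letter runs.

    The regular expression is an assumed, simplified tokenisation rule.
    """
    return re.findall(r"[a-z']+", text.lower().replace("-", " "))


def count_words(broadcasts):
    """Return (token frequency, number of broadcasts containing each type)."""
    freq, spread = Counter(), Counter()
    for text in broadcasts:
        tokens = tokenize(text)
        freq.update(tokens)
        spread.update(set(tokens))  # a broadcast counts once per word type
    return freq, spread


def drop_single_broadcast_types(freq, spread):
    """Keep only the word types observed in more than one broadcast."""
    return Counter({w: n for w, n in freq.items() if spread[w] > 1})


# Toy data, invented for illustration only.
broadcasts = [
    "A life-saving potion",
    "The potion saved a life",
    "Aaaarrrrgh, a potion",
]
freq, spread = count_words(broadcasts)
kept = drop_single_broadcast_types(freq, spread)
```

In this toy example "life-saving" contributes to the counts of "life" and "saving", and the one-off entry "aaaarrrrgh" is removed because it occurs in a single broadcast.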

A standardised frequency measure: The Zipf scale. Although the frequency counts are the most versatile measure (as will become clear later, when we calculate all types of derived measures), they have one big disadvantage. The interpretation of the frequency measure depends on the size of the corpus. Therefore, authors have looked for a standardised frequency measure, an index with the same interpretation across all corpora collected.

Thus far, the most popular standardised frequency measure has been frequency per million words (fpmw). It is the frequency measure we made available in our previous work on subtitle frequencies as well. However, we increasingly noticed that this measure leads to an incorrect understanding of the word frequency effect.

Because their corpus contained only 1 million words, the lowest value in the word frequencies made available by Kučera & Francis (1967) was 1 fpmw. This contributed to the assumption that 1 fpmw is the lowest possible frequency. Obviously, this is no longer the case for larger corpora. As it happens, about 80% of the word types in SUBTLEX-UK have a frequency of less than 1 fpmw (i.e., less than 200 occurrences in all broadcasts). Second, as shown in Figure 1, nearly half of the word frequency effect is situated below 1 fpmw and there is very little difference above 10 fpmw. The frequency effect of lexical decision times between .1 fpmw and 1 fpmw is equal to or larger than the effect between 1 fpmw and 10 fpmw. A logarithmic transformation of frequency measures, as is routinely performed, alleviates this problem. However, the logarithms of fpmw become negative for frequencies lower than 1 (as again shown in Figure 1), which uninformed users tend to avoid. Because of these properties, fpmw as a standardized measure puts users on the wrong foot.

To make the word frequency effect easier to understand, one needs a scale with the following properties:

1. It should be a logarithmic scale (e.g., like the decibel scale of sound loudness).
2. It should have relatively few points, without negative values (e.g., like a typical Likert rating scale, from 1 to 7).
3. The middle of the scale should separate the low-frequency words from the high-frequency words.
4. The scale should have a straightforward unit.

Once we know what the scale should look like, it is not so difficult to come up with a good transformation. In particular, when we take the log10 of the frequency per billion words (rather than fpmw), the scale fulfils the first three requirements. To meet the last requirement, we propose to call the new scale the Zipf scale, after the American linguist George Kingsley Zipf (1902–1950) who first thoroughly analysed the regularities of word frequency distribution and formulated a law (Zipf, 1949) which was later named after him. The unit then becomes the Zipf.

The Zipf scale is a logarithmic scale, like the decibel scale of sound intensity, and roughly goes from 1 (very low frequency words) to 6 (very high frequency content words) or 7 (a few function words, pronouns, and verb forms like “have”). The calculation of Zipf values is easy as it equals log10(frequency per billion words) or log10(frequency per million words) + 3. So, a Zipf value of 1 corresponds to words with frequencies of 1 per 100 million words, a Zipf value of 2 corresponds to words with frequencies of 1 per 10 million words, a Zipf value of 3 corresponds to words with frequencies of 1 per million words, and so on.

Table 1 summarises the information. It also helps to clear one more misunderstanding about word frequencies among psycholinguists, namely that words with frequencies below 1 fpmw are too uncommon to be known. There are hundreds of derived and inflected word forms and even lemmas with frequencies of lower than .1 fpmw that are perfectly known, as can be seen in Table 1. Content words rarely have a Zipf value higher than 6, so that for most practical research purposes, the Zipf scale will be a scale from 1 to 6 with the tipping point from low-frequency to high-frequency between 3 and 4.

- - - - - - - - - - - - - - - - -
Insert Table 1 about here
- - - - - - - - - - - - - - - - -
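The correspondence between fpmw and Zipf values described above (Zipf = log10(fpmw) + 3) can be written as a pair of small helper functions. The function names are ours and purely illustrative; no smoothing for unobserved words is applied here.

```python
import math


def zipf_from_fpmw(fpmw):
    """Zipf value of a word with a given frequency per million words."""
    return math.log10(fpmw) + 3.0


def fpmw_from_zipf(zipf):
    """Inverse transformation: frequency per million words for a Zipf value."""
    return 10 ** (zipf - 3.0)


# One occurrence per 100 million, per 10 million and per million words.
low_end = [zipf_from_fpmw(f) for f in (0.01, 0.1, 1.0)]
```

A word with 1 fpmw sits at Zipf 3, and each step of one Zipf unit corresponds to a tenfold change in frequency.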

One more addition that is of interest for the Zipf scale is the possibility to include words with frequency counts of 0 (i.e., words not observed in the corpus). Although these words are less common in large corpora, they are by no means absent. Such words pose a problem for the Zipf scale as a result of the logarithmic transformation (given that the logarithm of 0 is minus infinity). In a recent review Diependaele and Brysbaert (2013) concluded that the best way to deal with 0 word frequencies is the Laplace transformation. Rather than working with the raw frequency counts, one works with the frequency counts + 1. This means that all frequency values are (slightly) elevated. The proper application of the algorithm also implies that the theoretical size of the corpus is a little larger than the actual size, because one is leaving room for N unobserved word types with frequency 1. N is the number of word types in the frequency list. So, for the full corpus the Laplace transformation assumes that there are 159,235 unobserved word types extra in the language, all with a frequency of 1.

In practice, the following equation is needed to calculate the Zipf values on the basis of the frequency counts of the total corpus:

$$\text{Zipf} = \log_{10}\left(\frac{\text{frequency\_count} + 1}{201.336 + 0.159}\right) + 3.0$$

The values in the denominator are the size of the corpus in millions plus the number of word types in millions. Specifically, the Zipf value of an unobserved word type will be:

$$\text{Zipf} = \log_{10}\left(\frac{0 + 1}{201.336 + 0.159}\right) + 3.0 = 0.696$$

The Zipf value of a word type observed once in the complete corpus will be .997; that of a word observed 10 times will be 1.737, and so on.

To calculate the Zipf values for the CBeebies corpus, we have to use the following equation:

$$\text{Zipf}_{\text{CBeebies}} = \log_{10}\left(\frac{\text{frequency\_count}_{\text{CBeebies}} + 1}{5.848 + 0.027}\right) + 3.0$$

For the CBBC subcorpus the equation is

$$\text{Zipf}_{\text{CBBC}} = \log_{10}\left(\frac{\text{frequency\_count}_{\text{CBBC}} + 1}{13.612 + 0.059}\right) + 3.0$$

Specifically, this means that words with a 0 frequency in the CBeebies corpus get a Zipf value of 2.231; those with a 0 frequency in the CBBC corpus get a Zipf value of 1.864. The higher values for unobserved word types are due to the smaller sizes of the corpora and also mean that one should be sensible in their use. There is no point in blindly using these values for all missing words in the lists, as one assumes that the missing words are known to preschoolers (CBeebies) or primary school children (CBBC). As we will see below, this may be one reason why the childhood frequencies are not correlating very well with the lexical decision times of the British Lexicon Project when calculated across all words.
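The Laplace-corrected Zipf values for the three corpora can be reproduced with a few lines of code. The sketch below uses the corpus sizes and type counts (in millions) from the equations in the text; the dictionary keys and the function name are illustrative assumptions.

```python
import math

# (corpus size in millions of tokens, number of word types in millions),
# taken from the denominators of the equations in the text.
CORPORA = {
    "all": (201.336, 0.159),
    "cbeebies": (5.848, 0.027),
    "cbbc": (13.612, 0.059),
}


def zipf_laplace(count, corpus="all"):
    """Zipf value of a raw frequency count with the Laplace (+1) correction."""
    tokens_m, types_m = CORPORA[corpus]
    return math.log10((count + 1) / (tokens_m + types_m)) + 3.0


# Zipf values assigned to words that were not observed at all.
unobserved = {name: round(zipf_laplace(0, name), 3) for name in CORPORA}
```

The `unobserved` dictionary holds the values that a word absent from each list would receive, which makes it easy to see how much higher the floor is in the two smaller children's subcorpora.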

To give readers a better feeling for the Zipf scale, Table 2 tabulates the summary statistics of the Zipf values used in two classic word frequency studies in British English (Monsell, Doyle, & Haggard, 1989; Morrison & Ellis, 1995). Two interesting observations can be made. First, the standard deviations of the Zipf values are similar for the high and the low frequency words (as they should be), whereas for fpmw the standard deviations are considerably larger in the conditions with high frequency words than in the conditions with low frequency words. Second, we see that in both studies the low frequency words had Zipf values above 3, because the researchers derived their frequency estimates from the Kučera and Francis list and considered 1 fpmw as the lower end of the frequency range. With the availability of more refined word frequency measures, we hope that in the future more use will be made of words with Zipf values below 3. As Figure 1 indicates, this is a sensible thing to do, as in this range the word frequency effect is at its strongest. Furthermore, about 80% of the word types in SUBTLEX-US have Zipf values below 3 (i.e., below 1 fpmw). So, there is much more choice at the low end of the distribution than at the high end. In our current estimate, low-frequency words ideally have a mean Zipf value at (or below) 2.5 and high-frequency words have a mean Zipf value of 4.5.

- - - - - - - - - - - - - - - - -
Insert Table 2 about here
- - - - - - - - - - - - - - - - -

Contextual diversity. Adelman, Brown, and Quesada (2006; see also Adelman & Brown, 2008; Perea, Soares, & Comesaña, 2013; Yap, Tan, Pexman, & Hargreaves, 2011) argued that not so much the frequency of occurrence of a word matters, but the number of contexts in which the word appears. Words only encountered in a small number of contexts (say, a word with a frequency of 100 occurring in one or two television episodes) will be more difficult to process than equally frequent words encountered in a variety of contexts (e.g., a word with a frequency count of 100 used in 80 different broadcasts). A good proxy for contextual diversity (CD) is the number of television programs/films (or the percentage of programs/films) in which the word appears. Brysbaert and New (2009) indeed observed that log(CD) explained up to 4% of variance more in lexical decision times than log(frequency). Part of the advantage was methodological, however. Two factors were involved. First, the effect of log(CD) on RTs is more linear than the effect of log(frequency), which becomes flat for high frequency words, as can be seen in Figure 1. When non-linear regression analysis was used, the difference between CD and frequency became smaller than 2%. Another part of the difference was due to the fact that some words occurred with very high frequency in a few films because they were the names of main characters (e.g., archer, bay, brown). The CD statistic is less influenced by these instances than the frequency statistic.
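The difference between frequency and contextual diversity can be made concrete with a short sketch that counts, for each word, the number and the percentage of broadcasts in which it appears. The function name and the toy broadcasts are invented for illustration and are not taken from the database.

```python
from collections import Counter


def contextual_diversity(broadcast_tokens):
    """Map each word to (number of broadcasts, percentage of broadcasts).

    `broadcast_tokens` is a list with one list of tokens per broadcast.
    """
    n_broadcasts = len(broadcast_tokens)
    cd = Counter()
    for tokens in broadcast_tokens:
        cd.update(set(tokens))  # repeated uses within a broadcast count once
    return {w: (c, 100.0 * c / n_broadcasts) for w, c in cd.items()}


# Toy data: "archer" is frequent but confined to a single broadcast.
broadcasts = [
    ["archer", "archer", "archer", "said"],
    ["she", "said", "yes"],
    ["he", "said", "no"],
    ["no", "no"],
]
cd = contextual_diversity(broadcasts)
```

Here "archer" and "said" both occur three times, but "archer" has a CD of one broadcast whereas "said" has a CD of three, which is the kind of character-name effect mentioned above.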

Still, the CD measure seems to have added value. Therefore, we provide this information for the different corpora we used (full corpus, CBeebies, CBBC). The values are available both as the total number of television programs in which the word occurred, and the percentage of television programs in which the word was encountered. As indicated above, the total number of broadcasts in the complete corpus was 45,099. The number of broadcasts in CBBC was 4,848, and the number in CBeebies was very similar.⁴

Part-of-Speech (PoS) dependent frequencies. For many purposes it is good to know what roles words play in sentences and the relative frequencies of these roles (Brysbaert, New, & Keuleers, 2012). This enables researchers interested in nouns, for instance, to limit their stimulus materials to words that are always (or mostly) used as nouns. It also allows researchers to know whether an inflected word is used more often as an adjective (e.g., appalling) or as a verb (e.g., played). This is important information to decide which words to include in rating studies (e.g., Kuperman, Stadthagen-Gonzalez, & Brysbaert, 2012).

PoS frequencies can only be obtained after the corpus has been parsed (i.e., the sentences broken down into their constituent parts) and tagged (i.e., the words given their correct part-of-speech in the sentence). For a long time this was virtually impossible given the amount of work involved. However, the development of automatic PoS taggers has made it possible to get a reasonably good (though not perfect) outcome in reasonable time and at an affordable price. For a long time, the CLAWS tagger developed at the University of Lancaster was the gold standard (available at http://ucrel.lancs.ac.uk/claws, checked on May 1, 2013). It was used for the BNC corpus and we also used it for our SUBTLEX-US corpus (Brysbaert et al., 2012). However, in recent years the Stanford tagger (initial version: Toutanova, Klein, Manning, & Singer, 2003; latest update available at http://www-nlp.stanford.edu/software/lex-parser.shtml, checked on May 1, 2013) has become a worthy competitor. As it happens, the outcome of the first analyses with the Stanford tagger correlated more with the BLP word processing times than the outcome of the CLAWS tagger. As indicated in footnote 3, this was due to the fact that the Stanford tagger is more consistent in dehyphenating words than CLAWS. When the subtitles were cleared of hyphens before running the taggers, both gave comparable output.

⁴ The reason why these numbers are very similar is that both channels have a similar rotation of programs with repeats after a rather short period of time.

Another advantage of the Stanford software⁵ is that it gives the most likely lemma associated with an inflected form. The lemmatisation is based on an algorithm developed by Minnen, Carroll, and Pearce (2001). It works on two main principles. First, it looks up whether a word form is present in the dictionary. If so, then the associated lemma can be read out. If a word is lacking, the most likely lemma is allocated on the basis of rules and pattern comparisons (e.g., the most likely lemma of the stimulus “martialisations”, identified as a noun, is “martialisation”; and the most likely lemma of the stimulus “Martialis”, identified as a name, is “Martialis”). As discussed at greater length in Brysbaert et al. (2012), the outcome of these algorithms is not 100% correct⁶ and, hence, should always be checked by the user, certainly for low frequency words. However, they are a big step forward (with accuracy estimates of 97% and higher) and, therefore, are provided in our database. More precisely, we give information about the most frequent PoS associated with each word type, the frequency of this PoS and the lemma associated with it, next to all the parts-of-speech associated with the word type and their respective frequencies. Because of the lemmatisation and because the output was as good as that of CLAWS, the data presented in the SUBTLEX-UK database are based on the Stanford parser and tagger. Figure 2 gives an example of the output. All frequencies are given as raw frequency counts based on the entire corpus, because this value is the most informative to calculate derived statistics from (e.g., the percentage use as the dominant PoS).

⁵ A disadvantage of the Stanford tagger is that in its default mode it Americanizes the spellings of the words. So, one must be careful to change this when one is working with British spellings.

⁶ A notorious example is “horsefly”, which both CLAWS and Stanford parse as an adverb (arguably because the word is not in the program's lexicon, so that too much reliance is put on the end letters –ly). Ironically, Stanford does correctly classify “horseflies” as a noun associated with the lemma “horsefly” (presumably because the end letters –lies are more likely to be associated with plural nouns than with other parts-of-speech).

- - - - - - - - - - - - - - - - -
Insert Figure 2 about here
- - - - - - - - - - - - - - - - -
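As an illustration of a derived statistic that the raw PoS counts make possible, the percentage use as the dominant PoS can be computed as follows. The function name, the tag labels and the counts are invented for the example and do not reflect the actual column names or values of the database.

```python
def dominant_pos(pos_counts):
    """Return (dominant tag, its raw count, its percentage of all uses).

    `pos_counts` maps each part-of-speech tag of one word type to its
    raw frequency count.
    """
    total = sum(pos_counts.values())
    tag, count = max(pos_counts.items(), key=lambda item: item[1])
    return tag, count, 100.0 * count / total


# Hypothetical counts for a single word type.
example = {"verb": 900, "noun": 90, "adjective": 10}
result = dominant_pos(example)
```

A researcher interested in words that are (almost) always used as one part-of-speech could then select on the percentage returned as the third element.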

Bigram frequencies. Because extra information can be obtained from word combinations (Arnon & Snider, 2010; Baayen, Milin, Filipovic Durdevic, Hendrix, & Marelli, 2011; Siyanova-Chanturia, Conklin, & van Heuven, 2011), we also collected word bigram frequencies in the entire corpus (i.e., the frequency with which word pairs were observed). This resulted in over 1.5 million lines of consecutive word pairs observed in the corpus. For each pair we give information about the number of times it was observed, the symbols written between the words (space, punctuation mark, hyphen, ...) and their respective frequencies. This makes it possible for everyone to calculate interesting additional metrics. For instance, it allowed us to add the hyphenated words with a frequency count of more than 100 (fpmw > .5)⁷ to the database. It also allowed us to warn researchers when a compound word is more likely to be written as two separate words than as a single word (for instance, the word “makeup” is observed 308 times in the subtitles (Zipf = 3.18), but the spellings “make-up” and “make up” have a combined frequency of 8,998, making “makeup” a bad choice for a low frequency word).

⁷ These frequencies were not subtracted from the frequencies of the individual words, under the assumption that the component words of a hyphenated word get co-activated upon seeing the hyphenated word.

Correlations with lexical decision measures

Given the ease with which word frequencies can be collected nowadays, it is important to check whether a new frequency measure adds something extra to the existing ones. On the basis of previous research, we can expect this to be the case given the superiority of subtitle-based frequency estimates, but still it is good to test this explicitly, also to make sure no calculation errors have been made. The most interesting dataset is the BLP (Keuleers et al., 2012), which provides lexical decision reaction times and accuracy measures of British students for over 28 thousand monosyllabic and disyllabic words. The main competitors to the SUBTLEX-UK word frequencies are the BNC frequencies, the CELEX frequencies, and the SUBTLEX-US frequencies. Words not observed in a corpus were assigned a frequency of 0 and log frequencies were the Zipf values (with Laplace transformation). The Laplace transformation was also used for the CD measure.

Table 3 shows the results for the accuracy data. As expected the SUBTLEX-UK frequencies outperform the other measures, more so for the CD measure than for the Zipf measure. Because of the large number of observations, the differences are all highly significant. For instance, the t-value of the Hotelling-Williams test (Steiger, 1980)⁸ of the difference in correlation with SUBTLEX-UK (Zipf) and BNC (Zipf) equals 16.8 (df = 28,282, p < .001). In terms of percentage variance explained, the difference is nearly 3%, which is high given that many variables explain less than 1% of variance, once the effects of word frequency, word length and similarity to other words are partialled out (Brysbaert & Cortese, 2011; Brysbaert et al., 2011a; Kuperman et al., 2012).

⁸ An easy introduction to the test and an Excel file to calculate the exact values is available on the website http://crr.ugent.be/archives/546.
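For readers who prefer code to a spreadsheet, the test for two dependent correlations sharing one variable can be sketched as below. The formula is the Williams t statistic as given by Steiger (1980), with df = N - 3; the function name is ours and the correlations in the example are made up, not the values of Table 3.

```python
import math


def williams_t(r12, r13, r23, n):
    """Williams' t for H0: r12 == r13, where variable 1 is the shared one.

    r12, r13: correlations of the criterion (e.g., accuracy) with the two
              competing predictors; r23: correlation between the predictors;
    n: number of observations. Returns (t, df) with df = n - 3.
    """
    det = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    r_bar = (r12 + r13) / 2
    numerator = (n - 1) * (1 + r23)
    denominator = 2 * ((n - 1) / (n - 3)) * det + r_bar**2 * (1 - r23) ** 3
    return (r12 - r13) * math.sqrt(numerator / denominator), n - 3


# Hypothetical correlations; n is chosen so that df equals 28,282.
t_value, df = williams_t(0.65, 0.60, 0.80, 28285)
```

With samples of this size even a small difference between two correlations yields a large t-value, which is why the text stresses the difference in variance explained rather than the significance level alone.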

- - - - - - - - - - - - - - - - -
Insert Table 3 about here
- - - - - - - - - - - - - - - - -

Interestingly, the correlations with the childhood frequencies are much lower, in particular the correlation with the CBeebies frequencies (preschool children). Two reasons for this are the smaller sizes of the corpora (including the many missing words not known to children but given rather high Zipf estimates) and the fact that the overall SUBTLEX-UK frequencies include the subtitles from CBeebies and CBBC television programs (almost 10% of the total SUBTLEX-UK).

Table 4 shows the correlations for the reaction times (RTs) to the words. Because RTs are only interesting when the words are known, we set the percentage accuracy criterion to 66% (leaving more than 20,500 words). Very much the same picture appears, with superior performance for the SUBTLEX-UK measures (CD slightly more so than Zipf).

- - - - - - - - - - - - - - - - -
Insert Table 4 about here
- - - - - - - - - - - - - - - - -

To make sure that the higher correlations between SUBTLEX-UK and the BLP measures than
between SUBTLEX-US and BLP were due to language congruency and not to the better
quality of SUBTLEX-UK overall, we ran similar analyses of the ELP data, which were collected
on American students. As can be seen in Table 5, the difference between SUBTLEX-UK and
SUBTLEX-US indeed has to do with differences in word use between the two languages
rather than with the inherent qualities of the frequency lists. Whereas the SUBTLEX-UK
frequencies are better for the British BLP data (see Tables 3 and 4), the SUBTLEX-US data are
better for the American ELP data (Table 5).

- - - - - - - - - - - - - - - - -

Insert Table 5 about here

- - - - - - - - - - - - - - - - -

Correlations with the Children's Printed Word Database (CPWD)

The best existing British database of word frequencies for children is the Children's Printed
Word Database (CPWD; available at http://www.essex.ac.uk/psychology/cpwd; checked on
May 21, 2013). It includes the frequencies with which 12,193 different word types are
observed in 1011 books (995,927 tokens) for 5-9 year old children in the UK (Masterson,
Stuart, Dixon, & Lovejoy, 2010). We could download data for 9659 word types from the
database, 9125 of which were also in the SUBTLEX-UK list (the ones not in the list were
mainly genitive forms, hyphenated forms, and numbers). Table 6 gives the correlations
between log CPWD frequencies and various SUBTLEX-UK frequencies for the 9125 shared
word types. As can be seen, the correlations are reasonably high, in particular with the
CBeebies word frequencies. The Hotelling-Williams test indicated significant differences
between the CBeebies frequencies and the other frequencies (e.g., difference between
CBeebies and CBBC, t(9122) = 15.6, p < .001). This confirms that the SUBTLEX-UK children
frequencies are an interesting addition to the CPWD frequencies and can be used to study
frequency trajectories from childhood to adulthood⁹ (Lété & Bonin, 2013).

- - - - - - - - - - - - - - - - -

Insert Table 6 about here

- - - - - - - - - - - - - - - - -

Discussion

In this paper we presented a new database of word frequencies for British English, based on
television subtitles. On the basis of our previous research, we expected that these
frequencies would better predict word processing performance than word frequencies
based on written sources (in particular, the British National Corpus). This indeed turned out
to be the case when we tried to predict the lexical decision times and accuracies of the
British Lexicon Project (Tables 3 and 4). The British subtitle frequencies also predicted the
BLP data better than the American subtitle frequencies, but they were inferior in
accounting for the ELP data, in line with the observation that word usage is not completely
the same in British and American English. The extra variance accounted for amounted to 3-5%,
which is considerable given that many variables explain less than 1% of the variance once
the effects of word frequency, length, and similarity to other words are partialed out
(Brysbaert & Cortese, 2011; Brysbaert et al., 2011a; Kuperman et al., 2012).

While analysing the findings, we were once again struck by how misleading the standardised
word frequency measure fpmw (frequency per million words) is for understanding the word
frequency effect. Therefore, we proposed an alternative, the Zipf scale, which is better
suited to the use of word frequencies in psychological research. This scale goes from slightly
less than 1 to slightly more than 7 and can easily be interpreted as follows: Values of 3 and
less are low-frequency words, values of 4 or more are high-frequency words. Words not in
SUBTLEX-UK get a Zipf value of .696 when the frequencies are based on the complete
corpus, 1.864 when the CBBC frequencies are used, and 2.231 when the CBeebies
frequencies are used. The differences in minimal values are caused by the differences in
corpus size and agree with the fact that missing words of interest in CBeebies or CBBC are
likely to be more familiar than words not found in the entire corpus.

⁹  SUBTLEX-UK frequencies not including childhood frequencies can easily be obtained by subtracting the
CBeebies and CBBC frequency counts from the total frequency counts.
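
The anchor points of the Zipf scale in Table 1 (a Zipf value of 3 for 1 fpmw, 4 for 10 fpmw, and so on) imply that the Zipf value equals the base-10 logarithm of fpmw plus 3. The Python sketch below only illustrates that relationship and the interpretation rule given in the text. It ignores any correction applied to words with a count of zero, and the function names as well as the label "intermediate" for values between 3 and 4 are assumptions made for the example.

```python
import math


def zipf_from_fpmw(fpmw):
    """Zipf value implied by Table 1: log10(frequency per million words) + 3."""
    return math.log10(fpmw) + 3


def fpmw_from_zipf(zipf):
    """Inverse mapping: frequency per million words for a given Zipf value."""
    return 10 ** (zipf - 3)


def frequency_band(zipf):
    """Interpretation rule from the text: 3 or less is low, 4 or more is high.

    The label for values in between is an assumption of this sketch.
    """
    if zipf <= 3:
        return "low"
    if zipf >= 4:
        return "high"
    return "intermediate"


for fpmw in (0.01, 1, 100, 10000):
    zipf = zipf_from_fpmw(fpmw)
    print(fpmw, round(zipf, 2), frequency_band(zipf))
```

Working on the Zipf scale keeps low-frequency words (fpmw below 1) on positive values, which is one practical reason to prefer it over a plain log10(fpmw).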

In addition to the word frequencies, the new database offers other information, which will
allow British researchers to do cutting-edge investigations. These are:

-  Part-of-Speech related frequencies, which make it possible for researchers to better
   control their stimulus materials,

-  A measure of contextual diversity (CD), which is particularly interesting to predict
   which words will be known and which not (compare Tables 3 and 4),

-  Word frequencies in materials aimed at very young (preschool) and young (primary
   school) children,

-  Information about word bigrams.

Availability


The SUBTLEX-UK data are available in three easy to use files. The first one (SUBTLEX-UK_all)
is a 332,988 x 15 matrix containing information of all word types (including numbers)
encountered in the dehyphenated subtitles. The 15 columns give information about:

-  The spelling of the word type (Spelling),

-  The number of times the word has been counted in all subtitles (Freq),

-  The number of times the word started with a capital (CapitFreq),

-  The percentage of broadcasts containing the word type in all subtitles (CD),

-  The number of broadcasts containing the word in all subtitles (CDCount),

-  The most frequent part-of-speech of the word (DomPoS),

-  The number of times this dominant PoS was observed (DomPosFreq),

-  The lemma associated with the dominant PoS (DomLemmaPos),

-  The number of times this lemma was observed in all subtitles (DomLemmaPosFreq),

-  The summed frequencies of all the times this lemma was observed irrespective of the
   PoS (DomLemmaPosTotalFreq),

-  All parts-of-speech taken by the word type (AllPos),

-  The respective frequencies of these PoS (AllPosFreq),

-  And the associated lemma information (AllLemmaPos, AllLemmaPosFreq,
   AllLemmaPosTotalFreq).
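
As an illustration of how the Spelling, Freq and CapitFreq columns can be combined, the Python sketch below computes the share of a word type's occurrences that started with a capital, which may help to flag likely proper names. It is a sketch under assumptions: the tab-separated layout, the three-row excerpt with invented counts, the helper name `capital_share` and the 0.9 threshold are all hypothetical and are not taken from the distributed file.

```python
import csv
import io

# Hypothetical excerpt with invented counts; the real file layout may differ.
SAMPLE = (
    "Spelling\tFreq\tCapitFreq\n"
    "london\t1000\t990\n"
    "muffin\t200\t10\n"
    "the\t5000\t500\n"
)


def capital_share(rows):
    """Map each spelling to the share of its occurrences that started with a capital."""
    return {row["Spelling"]: int(row["CapitFreq"]) / int(row["Freq"]) for row in rows}


rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
shares = capital_share(rows)

# Assumed rule of thumb: entries capitalised more than 90% of the time are probably names.
likely_names = [word for word, share in shares.items() if share > 0.9]
print(likely_names)
```

Replacing `io.StringIO(SAMPLE)` with an opened file handle would apply the same logic to a full file, provided its delimiter and header names match the assumptions above.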

The second file (SUBTLEX-UK) contains more information about the 160,022 word types
(159,235 single words and 787 hyphenated words) which are observed in more than one
broadcast and which only contain letter information (i.e., no digits or non-alphanumerical
symbols). This file is the file most psycholinguistic researchers will want to use. It has 27
columns, containing:

-  The word type,

-  The frequency counts in all subtitles, the CBeebies subtitles, the CBBC subtitles, and
   the British National corpus,

-  The Zipf values associated with the various frequencies,

-  The CD counts and percentages in the three SUBTLEX corpora,

-  The dominant PoS, its associated lemma, and their frequencies,

-  All the PoS and frequencies of the word,

-  The frequency of the word starting with a capital,

-  Whether the lowercase spelling of the word type was accepted by a UK word spell
   checker (UK), a US word spell checker (US), both spell checkers (UKUS), or none ()¹⁰.
   This is an interesting column when words must be selected and one wants to avoid
   the inclusion of names or other uninteresting entries.

-  Whether the entry contains a hyphen (cf. the 787 added entries with hyphens),

-  Whether the entry has another homophonic entry. This is interesting to find
   homophones, but also to make sure selected low frequency words do not have a
   higher frequency spelling alternative.

-  Whether or not the word type has been encountered as a bigram in the subtitles,

-  The frequency of the bigram (summed across all types of intervening symbols, in
   particular blank spaces, punctuation marks, and hyphens).
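
To make the intended use of these columns concrete, the Python sketch below selects low-frequency candidate stimuli that pass the UK spell check and derives counts without the childhood subcorpora by subtracting the CBeebies and CBBC counts from the total. All field names (`word`, `total`, `cbeebies`, `cbbc`, `zipf`, `spell`), the example counts, and the use of an empty string for entries accepted by neither spell checker are assumptions for illustration; the actual column headers should be read from the file.

```python
# Hypothetical entries with invented counts; field names are not the file's real headers.
ENTRIES = [
    {"word": "dumpling", "total": 210, "cbeebies": 12, "cbbc": 20, "zipf": 3.0, "spell": "UKUS"},
    {"word": "finalise", "total": 170, "cbeebies": 0, "cbbc": 2, "zipf": 2.9, "spell": "UK"},
    {"word": "color", "total": 90, "cbeebies": 1, "cbbc": 1, "zipf": 2.6, "spell": "US"},
    {"word": "period", "total": 21000, "cbeebies": 40, "cbbc": 300, "zipf": 5.0, "spell": "UKUS"},
    {"word": "xyzzy", "total": 5, "cbeebies": 0, "cbbc": 0, "zipf": 1.4, "spell": ""},
]


def adult_count(entry):
    """Total count minus the CBeebies and CBBC counts."""
    return entry["total"] - entry["cbeebies"] - entry["cbbc"]


def low_frequency_uk_words(entries, max_zipf=3.0):
    """Words accepted by the UK spell checker with a Zipf value of max_zipf or less."""
    return [
        entry["word"]
        for entry in entries
        if entry["spell"] in {"UK", "UKUS"} and entry["zipf"] <= max_zipf
    ]


selected = low_frequency_uk_words(ENTRIES)
print(selected)
print({entry["word"]: adult_count(entry) for entry in ENTRIES if entry["word"] in selected})
```

The same filter could be extended with the hyphen and homophone columns, for instance to exclude low-frequency words that have a higher-frequency spelling alternative.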

Finally, the third file (SUBTLEX-UK_bigrams) contains information about word pairs. Because
this file has nearly 2 million lines of information, it cannot be made available as an Excel file
(although we have such a file with all entries observed 12 times or more). Each line contains
information about word 1 and word 2, the frequency of the combination, the CD count of
the combination, and which symbols were found between the two words, with which
frequencies. This is important information when researchers want to include transition
probabilities in their investigations, or when expressions (e.g., object names, particle verbs)
consist of two words.

¹⁰ The speller was the MS Office 2007 spellchecker, augmented with a list of lemmas one of the authors (MB) is
compiling.
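
A transition probability can be derived from a bigram count and the count of its first word. The Python sketch below shows the forward transition probability, P(word 2 | word 1) = frequency(word 1 word 2) / frequency(word 1). The dictionaries, the counts in them, and the function name are invented for the example and do not come from the SUBTLEX-UK files.

```python
# Invented counts for illustration only.
WORD_FREQ = {"fish": 400, "and": 90000, "chips": 250}
BIGRAM_FREQ = {("fish", "and"): 100, ("and", "chips"): 180}


def forward_transition_probability(word1, word2, word_freq, bigram_freq):
    """P(word2 | word1): bigram frequency divided by the frequency of the first word.

    Returns 0.0 for unseen bigrams and raises KeyError when word1 is unknown.
    """
    first_count = word_freq[word1]
    if first_count <= 0:
        raise ValueError("the frequency of the first word must be positive")
    return bigram_freq.get((word1, word2), 0) / first_count


p_and_after_fish = forward_transition_probability("fish", "and", WORD_FREQ, BIGRAM_FREQ)
p_chips_after_and = forward_transition_probability("and", "chips", WORD_FREQ, BIGRAM_FREQ)
print(p_and_after_fish, p_chips_after_and)
```

Dividing by the frequency of the second word instead would give the backward transition probability; which of the two is appropriate depends on the research question.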

The files are available as supplementary materials to the present article. They can also be
downloaded from our websites (http://crr.ugent.be, or
http://www.psychology.nottingham.ac.uk/subtlex-uk), where we in addition intend to
make them available as online consultable internet databases.


References

Adelman, J. S., & Brown, G. D. (2008). Modeling lexical decision: The form of frequency and
diversity effects. Psychological Review, 115(1), 214-227.

Adelman, J. S., Brown, G. D. A., & Quesada, J. F. (2006). Contextual diversity, not word
frequency, determines word naming and lexical decision times. Psychological Science,
17, 814-823.

Arnon, I., & Snider, N. (2010). More than words: Frequency effects for multi-word phrases.
Journal of Memory and Language, 62(1), 67-82.

Baayen, R. H., Milin, P., Filipović Đurđević, D., Hendrix, P., & Marelli, M. (2011). An
amorphous model for morphological processing in visual comprehension based on
naive discriminative learning. Psychological Review, 118, 438-482.

Baayen, R. H., Piepenbrock, R., & Gulikers, L. (1995). The CELEX lexical database [CD-ROM].
Philadelphia: University of Pennsylvania, Linguistic Data Consortium.

Balota, D.A., Yap, M.J., Cortese, M.J., Hutchison, K.A., Kessler, B., Loftis, B., Neely, J.H.,
Nelson, D.L., Simpson, G.B., & Treiman, R. (2007). The English Lexicon Project.
Behavior Research Methods, 39, 445-459.

Brysbaert, M., Buchmeier, M., Conrad, M., Jacobs, A.M., Bölte, J., & Böhl, A. (2011a). The
word frequency effect: A review of recent developments and implications for the
choice of frequency estimates in German. Experimental Psychology, 58, 412-424.

Brysbaert, M., & Cortese, M.J. (2011). Do the effects of subjective frequency and age of
acquisition survive better word frequency norms? Quarterly Journal of Experimental
Psychology, 64, 545-559.


Brysbaert, M., & Diependaele, K. (2013). Dealing with zero word frequencies: A review of the
existing rules of thumb and a suggestion for an evidence-based choice. Behavior
Research Methods, 45, 422-430.

Brysbaert, M., Keuleers, E., & New, B. (2011b). Assessing the usefulness of Google Books'
word frequencies for psycholinguistic research on word processing. Frontiers in
Psychology, 2:27.

Brysbaert, M., & New, B. (2009). Moving beyond Kucera and Francis: A critical evaluation of
current word frequency norms and the introduction of a new and improved word
frequency measure for American English. Behavior Research Methods, 41, 977-990.

Brysbaert, M., New, B., & Keuleers, E. (2012). Adding Part-of-Speech information to the
SUBTLEX-US word frequencies. Behavior Research Methods, 44, 991-997.

Cai, Q., & Brysbaert, M. (2010). SUBTLEX-CH: Chinese word and character frequencies based
on film subtitles. PLoS ONE, 5, e10729.

Cuetos, F., Glez-Nosti, M., Barbón, A., & Brysbaert, M. (2011). SUBTLEX-ESP: Spanish word
frequencies based on film subtitles. Psicologica, 32, 133-143.

Dimitropoulou, M., Duñabeitia, J. A., Avilés, A., Corral, J., & Carreiras, M. (2010). Subtitle-
based word frequencies as the best estimate of reading behavior: The case of Greek.
Frontiers in Psychology, 1:218, 1-12.

Ferrand, L., New, B., Brysbaert, M., Keuleers, E., Bonin, P., Méot, A., Augustinova, M., &
Pallier, C. (2010). The French Lexicon Project: Lexical decision data for 38,840 French
words and 38,840 pseudowords. Behavior Research Methods, 42, 488-496.

Keuleers, E., Brysbaert, M., & New, B. (2010). SUBTLEX-NL: A new frequency measure for
Dutch words based on film subtitles. Behavior Research Methods, 42, 643-650.


Keuleers, E., Lacey, P., Rastle, K., & Brysbaert, M. (2012). The British Lexicon Project: Lexical
decision data for 28,730 monosyllabic and disyllabic English words. Behavior
Research Methods, 44, 287-304.

Kučera, H., & Francis, W. (1967). Computational analysis of present-day American English.
Providence, RI: Brown University Press.

Kuperman, V., & Bertram, R. (2013). Moving spaces: Spelling alternation in English noun-
noun compounds. Language and Cognitive Processes, (ahead-of-print), 1-28.

Kuperman, V., Stadthagen-Gonzalez, H., & Brysbaert, M. (2012). Age-of-acquisition ratings
for 30 thousand English words. Behavior Research Methods, 44, 978-990.

Lété, B., & Bonin, P. (2013). Does frequency trajectory influence word identification? A cross-
task comparison. The Quarterly Journal of Experimental Psychology, 66(5), 973-1000.

Masterson, J., Stuart, M., Dixon, M., & Lovejoy, S. (2010). Children's printed word database:
Continuities and changes over time in children's early reading vocabulary. British
Journal of Psychology, 101(2), 221-242.

Minnen, G., Carroll, J., & Pearce, D. (2001). Applied morphological processing of English.
Natural Language Engineering, 7(3), 207-223.

Monsell, S., Doyle, M.C., & Haggard, P.N. (1989). Effects of frequency on visual word
recognition tasks - Where are they? Journal of Experimental Psychology: General,
118, 43-71.

Morrison, C. M., & Ellis, A. W. (1995). Roles of word frequency and age of acquisition in word
naming and lexical decision. Journal of Experimental Psychology: Learning, Memory,
and Cognition, 21(1), 116-133.

New, B., Brysbaert, M., Veronis, J., & Pallier, C. (2007). The use of film subtitles to estimate
word frequencies. Applied Psycholinguistics, 28, 661-677.


Perea, M., Soares, A. P., & Comesaña, M. (2013). Contextual diversity is a main determinant
of word identification times in young readers. Journal of Experimental Child
Psychology. (ahead of print publication)

Siyanova-Chanturia, A., Conklin, K., & van Heuven, W. J. B. (2011). Seeing a Phrase "Time and
Again" Matters: The Role of Phrasal Frequency in the Processing of Multiword
Sequences. Journal of Experimental Psychology-Learning Memory and Cognition,
37(3), 776-784.

Steiger, J.H. (1980). Tests for comparing elements of a correlation matrix. Psychological
Bulletin, 87, 245-251.

Toutanova, K., Klein, D., Manning, C. D., & Singer, Y. (2003). Feature-rich part-of-speech
tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of
the North American Chapter of the Association for Computational Linguistics on
Human Language Technology-Volume 1 (pp. 173-180). Association for Computational
Linguistics.

Yap, M. J., Tan, S. E., Pexman, P. M., & Hargreaves, I. S. (2011). Is more always better? Effects
of semantic richness on lexical decision, speeded pronunciation, and semantic
classification. Psychonomic Bulletin & Review, 18(4), 742-750.

Zipf, G.K. (1949). Human Behavior and the Principle of Least Effort. Cambridge,
Massachusetts: Addison-Wesley.


Figure 1: The word frequency effect. Mean standardized lexical decision times (z-scores) for
samples of 1000 words as a function of log10 word frequency per million words. The red
circles represent data from the English Lexicon Project (Balota et al., 2007); the blue circles
data from the British Lexicon Project (Keuleers et al., 2012). Word frequencies are based on
the 100 million words British National Corpus (available at http://www.natcorp.ox.ac.uk).
Source: Keuleers et al., 2012, Figure 4.


Figure 2: Screenshot of the PoS analysis. For each word type (in the column "Spelling"), the
most frequent PoS is determined, the associated lemma, the number of times this PoS is
observed in all SUBTLEX-UK subtitles, the total frequency of the lemma in the subtitles, all
parts-of-speech associated with the word type, and the frequencies of these parts-of-speech
in all subtitles. From this figure, we see that according to the Stanford tagger the word type
"finalise" is used mostly (164 times) as a verb (associated with the lemma "finalise"), but also
occasionally (6 times) as a noun. The total frequency of the verb lemma "finalise" (which also
includes the frequencies of the word types "finalises", "finalised", and "finalising") is 466.


Table 1: The Zipf scale of word frequency

The Zipf scale is a word frequency scale going from 1 to 7. Words with Zipf values of 3 or
lower are low-frequency words; words with Zipf values of 4 and higher are high-frequency
words. Examples are based on the SUBTLEX-UK word frequencies.

Zipf value   fpmw      Examples

1            .01       antifungal, bioengineering, farsighted, harelip, proofread
2            .1        airstream, doorkeeper, neckwear, outsized, sunshade
3            1         beanstalk, cornerstone, dumpling, insatiable, perpetrator
4            10        dirt, fantasy, muffin, offensive, transition, widespread
5            100       basically, bedroom, drive, issues, period, spot, worse
6            1,000     day, great, other, should, something, work, years
7            10,000    and, for, have, I, on, the, this, that, you


Table 2: Frequencies used in two classical studies of the word frequency effect, both when
expressed as frequency per million words and as Zipf values. Means and standard deviations
(between brackets). Frequencies based on SUBTLEX-UK.

                                          Fpmw              Zipf

Monsell et al. (1989, Experiments 1-2)
  Low frequency words (N = 48)            2.12 (2.22)       3.15 (.39)
  Medium frequency words (N = 48)         15.40 (10.81)     4.09 (.29)
  High frequency words (N = 48)           84.65 (62.66)     4.8 (.40)

Morrison & Ellis (1995)
  Low frequency words (N = 24)            6.52 (4.61)       3.66 (.44)
  High frequency words (N = 24)           166.03 (168.4)    5.07 (.3)
  Early acquired words (N = 24)           33.49 (34.8)      4.34 (.44)
  Late acquired words (N = 24)            9.91 (16.5)       3.63 (.55)


Table 3: Correlations between the various frequency measures and the BLP accuracy data (N =
28,285). The upper part shows the correlations. The lower part shows the percentages of
variance accounted for by non-linear regression analyses (lm-procedure in R, restricted cubic
splines with 4 knots).

                 SUBTLEX-UK   SUBTLEX-UK_CD   SUBTLEX-US   BNC     Celex   CBeebies   CBBC

Accuracy         .600         .628            .55          .564    .553    .390       .535
SUBTLEX-UK                    .992            .881         .898    .858    .724       .88
SUBTLEX-UK_CD                                 .877         .904    .866    .702       .86
SUBTLEX-US                                                 .830    .830    .705       .851
BNC                                                                .92     .633       .89
Celex                                                                      .642       .8
CBeebies                                                                              .821

Percentage of variance accounted for by non-linear regression analysis (splines, rcs function
in R with 4 knots)

SUBTLEX-UK (Zipf)          40.4%
SUBTLEX-UK (log(CD+1))     47.1%
SUBTLEX-US (Zipf)          35.7%
BNC (Zipf)                 35.9%
Celex (Zipf)               34.6%


Table 4: Correlations between the various frequency measures and the BLP RT data (N =
20,55). The upper part shows the correlations. The lower part shows the percentages of
variance accounted for by non-linear regression analyses (lm-procedure in R, restricted cubic
splines with 4 knots).

                 SUBTLEX-UK   SUBTLEX-UK_CD   SUBTLEX-US   BNC     Celex   CBeebies   CBBC

RT               -.664        -.674           -.645        -.638   -.624   -.535      -.642
SUBTLEX-UK                    .991            .885         .900    .862    .2         .893
SUBTLEX-UK_CD                                 .88          .906    .869    .701       .880
SUBTLEX-US                                                 .822    .828    .698       .84
BNC                                                                .93     .611       .1
Celex                                                                      .626       .62
CBeebies                                                                              .81

Percentage of variance accounted for by non-linear regression analysis (splines, rcs function
in R with 4 knots)

SUBTLEX-UK (Zipf)          46.1%
SUBTLEX-UK (log(CD+1))     47.1%
SUBTLEX-US (Zipf)          43.3%
BNC (Zipf)                 42.2%
Celex (Zipf)               40.7%


Table 5: Percentages of variance accounted for by the various frequency measures in the ELP
data.

                      Accuracy_LDT    RT_LDT         RT_nam
                      (N = 40,468)    (N = 33,99)    (N = 33,99)

SUBTLEX-US (Zipf)     20.5%           36.7%          26.0%
SUBTLEX-US (CD)       22.3%           37.2%          26.1%
SUBTLEX-UK (Zipf)     19.0%           34.8%          24.2%
SUBTLEX-UK (CD)       20.5%           34.8%          24.2%


Table 6: Correlations of the SUBTLEX-UK frequencies with the CPWD word frequencies (all
values log transformed after Laplace transformation; N = 9,125 word types shared between
both lists).

                      SUBTLEX-UK (Zipf)   CBeebies (Zipf)   CBBC (Zipf)

CPWD                  .664                .756              .690
SUBTLEX-UK (Zipf)                         .734              .925
CBeebies (Zipf)                                             .803