Crosswords and Information Theory

    Peter Andreasen

December 17, 2000

Abstract

A gentle introduction to the wonders of the information theoretical concept of entropy through elementary calculation of the number of crosswords.

What is a crossword, really?

Most people have solved a crossword puzzle or played Scrabble. The existence of word-games like those is not to be taken for granted, though. As we are going to see, the existence of crosswords is entirely at the mercy of the underlying language. In fact, there is a connection between the information theoretic concept "entropy" and the possibility of creating crossword puzzles!

Before we proceed, we need to get a few definitions straight. First, what is a crossword? Let us take a look at one:

    g e m □
    a r e □
    m a t h
    e □ e □

What we see are rows and columns of words (single letters are accepted as words) separated by white squares.[1]

Now, the words in a crossword need not be English as in the example above. We might want to create a Danish crossword or we might even want to have the columns and rows be quotes from Shakespeare's sonnets. To be able to handle such complex rules for the creation of crosswords we make the following definition.

[1] It is in the white squares you will normally find the hints needed to solve the puzzle, and in most crosswords the topmost row and the leftmost column are filled with these hints. For simplicity we will make no assumptions about the placement of the white squares.

Definition 1. A language L is a set of sequences of letters from an alphabet A (say, the letters 'a' through 'z' and the symbol '□'). A crossword of size n is a matrix with the dimensions $n \times n$ where all of the rows are sequences (of length n) from L and all of the columns are sequences (of length n) from L.

So if we want to make a really sophisticated crossword, we may let L be all the possible quotes from Shakespeare. In that case we should use an alphabet which included the letters as well as space and the various punctuation symbols. If we wanted to make a 'classic' crossword, we would have L be equal to any sequence you can make by taking words from a dictionary and gluing them together with one or more □'s in between. In this case the alphabet A would just be the letters and the special symbol □.
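Definition 1 and the 'classic' construction are easy to express in code. The following Python sketch is only an illustration (the tiny word list, the use of '.' for the blank square, and the function names are our own choices, not taken from anywhere else): a row or column is valid when every maximal run of letters between blanks is a dictionary word, and a grid is a crossword when all rows and all columns are valid.

    # Sketch of Definition 1 for the 'classic' language.
    BLANK = "."                                   # stands in for the blank (white) square
    WORDS = {"i", "e", "h", "gem", "are", "era", "game", "mete", "math"}  # toy dictionary

    def is_valid_sequence(seq):
        """True if every maximal run of letters in seq is a dictionary word."""
        return all(run in WORDS for run in seq.split(BLANK) if run)

    def is_crossword(grid):
        """True if all rows and all columns of the square grid are sequences from L."""
        rows = ["".join(r) for r in grid]
        cols = ["".join(c) for c in zip(*grid)]
        return all(is_valid_sequence(s) for s in rows + cols)

    grid = [list("gem."), list("are."), list("math"), list("e.e.")]
    print(is_crossword(grid))  # True for the example grid shown above

Single letters such as 'e' and 'h' are included in the toy word list because single letters are accepted as words here.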

How many are there?

It is obvious that very small crosswords are easily constructed, especially crosswords which have only one row or one column. It is also easy to create a few very big but very dull crosswords: if you keep alternating the rows between 'I□I□I□…' and '□I□I□I…', you certainly get a valid (and as big as you like) 'classic English' crossword. So we want to not only consider the existence of big crosswords, but also check whether there are many different ones. We are going to calculate the number of big crosswords now.

Assume we have chosen an alphabet A and a language L over which the crosswords must be made. We use the notation $|A|$ for the number of letters and symbols in the alphabet. Let us introduce the following number as well,

$$\gamma_L(n) = \text{number of sequences from } L \text{ of length } n.$$


So for constructing a square crossword of size $n \times n$ over the language L, there are $\gamma_L(n)$ possible choices for the first row. We will now use a small trick and for a moment employ a bit of probability theory: if we picked an absolutely random sequence of n letters from A, what is the chance that we got a 'valid' row, that is, a sequence from L? The answer is

$$\frac{\gamma_L(n)}{|A|^n}$$

because there are $\gamma_L(n)$ valid sequences and $|A|^n$ possible sequences. An example: suppose we wanted to create a normal English crossword. In my dictionary, there is 1 word (namely 'I') of length 1 and there are 49 words of length 2. Valid sequences of length 2 are '□□', '□I', 'I□', and then the 49 words of length 2. A total of 52 sequences, that is, $\gamma_L(2) = 52$. The total number of possible sequences of length 2 is $|A|^2 = 27 \cdot 27 = 729$ (note that even though we only have 26 letters, the size of A is 27, because we need the symbol □ as well). And thus the probability of getting a valid sequence of length 2 would be $52/729 \approx 0.07$, in this example.
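The arithmetic of this little example is easy to check in a couple of lines of Python (the dictionary counts are the ones quoted above, not a real word list):

    # 1 one-letter word ('I') and 49 two-letter words, as counted in the text.
    alphabet_size = 27            # 26 letters plus the blank square
    gamma_2 = 3 + 49              # two blanks, blank+I, I+blank, and the 49 two-letter words
    total_2 = alphabet_size ** 2  # all length-2 sequences over A
    print(gamma_2, total_2, gamma_2 / total_2)   # 52 729 0.0713...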

However, that was the probability of just one valid row. What about the rest? The probability of all n rows being valid equals the above probability multiplied with itself n times,[2]

$$\left(\frac{\gamma_L(n)}{|A|^n}\right)^n = \frac{\gamma_L(n)^n}{|A|^{n^2}}.$$

Now for the columns the situation is identical. And because the columns are as high as the rows are wide, the result is the same: the probability of all n columns being valid (that is, from L) equals

$$\frac{\gamma_L(n)^n}{|A|^{n^2}}.$$

Now we may calculate the probability of a randomly selected matrix of $n \times n$ letters from A being in fact a crossword: we want both its rows and its columns to be valid, so we multiply:

$$\left(\frac{\gamma_L(n)^n}{|A|^{n^2}}\right)^2 = \frac{\gamma_L(n)^{2n}}{|A|^{2n^2}}.$$

[2] This is basic probability theory. It is comparable to when we say that the probability of a coin landing heads up equals $\frac{1}{2}$ and then proceed to calculate the probability of two heads in a row as $\frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}$. We multiply the probabilities when we want the probability of both events.

We now return to our original question: how many (big) crosswords are there? Well, we know the probability of a randomly selected matrix of $n \times n$ letters being a crossword, and there are a total of $|A|^{n \cdot n} = |A|^{n^2}$ possible $n \times n$ matrices, so we may write[3]

$$N_n = |A|^{n^2} \cdot \frac{\gamma_L(n)^{2n}}{|A|^{2n^2}} = \frac{\gamma_L(n)^{2n}}{|A|^{n^2}}.$$

This makes $N_n$ our symbol for the number of crosswords of size $n \times n$.
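With the toy numbers from the example above we can already see what this formula predicts for tiny crosswords (a quick sketch of the probabilistic estimate derived here, not an exact enumeration; the counts $\gamma_L(2) = 52$ and $|A| = 27$ are the ones assumed earlier):

    # N_n = gamma_L(n)^(2n) / |A|^(n^2), evaluated for n = 2.
    gamma_2, A, n = 52, 27, 2
    N_2 = gamma_2 ** (2 * n) / A ** (n * n)
    print(N_2)   # roughly 13.8: about a dozen distinct 2x2 'classic English' crosswords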

Explosive numbers

To get to the core of the matter, we need to do a bit of mathematical wizardry, so now is the time to wear your pointed hat! First, we apply the logarithm[4] to $N_n$:

$$\log N_n = 2n \log \gamma_L(n) - n^2 \log|A| = 2n^2 \left(\frac{\log \gamma_L(n)}{n} - \frac{\log|A|}{2}\right) = 2n^2 \log|A| \left(\frac{\log_{|A|} \gamma_L(n)}{n} - \frac{1}{2}\right).$$

The special symbol $\log_{|A|}$ is simply the logarithm to the base $|A|$, that is, $|A|^{\log_{|A|} x} = x$. Recall that we are interested in the number $N_n$ when n grows large. In the expression above, the first factor,

$$2n^2 \log|A|,$$

just grows towards infinity as n does the same. The second fraction,

$$\alpha_n = \frac{\log_{|A|} \gamma_L(n)}{n},$$

is more interesting (so we name it $\alpha_n$). The value of $\gamma_L(n)$ must be between 0 and $|A|^n$ (that should be clear from the definition of $\gamma_L(n)$). So (assuming $\gamma_L(n) > 0$) we see that $\log_{|A|} \gamma_L(n)$ is between 0 and n. Thus, when n grows large, the value of $\alpha_n$ stays between 0 and 1. Let us assume that $\alpha_n$ in fact converges to some number $\alpha_L$ between 0 and 1. We may now conclude that if $\alpha_L > \frac{1}{2}$ the number $N_n$ grows beyond all bounds as n does, while if $\alpha_L < \frac{1}{2}$ it shrinks towards zero, so that big crosswords essentially do not exist.

[3] This is another application of basic probability theory: the number of valid crosswords is calculated as the probability of a random matrix being valid times the total number of possible matrices.

[4] Recall that taking the logarithm of a product yields a sum ($\log ab = \log a + \log b$), the logarithm of a fraction yields a difference ($\log a/b = \log a - \log b$) and the logarithm of a power turns into a product ($\log a^b = b \log a$).

There is nothing special about two dimensions; we might as well consider d-dimensional ($d \geq 3$) crosswords.

The probability of one dimension (think: row) of the crossword being valid equals

$$\left(\frac{\gamma_L(n)}{|A|^n}\right)^{n^{d-1}} = \frac{\gamma_L(n)^{n^{d-1}}}{|A|^{n^d}}.$$

This is almost the same result as before, but note the exponent $n^{d-1}$. In the case $d = 3$, where we might imagine the crossword as a cube made up of 'sticks' of sequences from L, the exponent corresponds to the fact that in each dimension there are $n^2$ sticks. The probability of all dimensions (think: rows and columns) being valid equals

$$\left(\frac{\gamma_L(n)^{n^{d-1}}}{|A|^{n^d}}\right)^d = \frac{\gamma_L(n)^{d n^{d-1}}}{|A|^{d n^d}}.$$

Again, this should come as no shock. The total number of possible crosswords (think: any matrix) is multiplied with the probability and we find:

$$N_n^{(d)} = |A|^{n^d} \cdot \frac{\gamma_L(n)^{d n^{d-1}}}{|A|^{d n^d}} = \frac{\gamma_L(n)^{d n^{d-1}}}{|A|^{n^d (d-1)}}.$$

Applying the logarithm yields

$$\log N_n^{(d)} = d n^{d-1} \log \gamma_L(n) - n^d (d-1) \log|A|$$

and reorganizing the terms,

$$\log N_n^{(d)} = d n^d \log|A| \left(\frac{\log_{|A|} \gamma_L(n)}{n} - \frac{d-1}{d}\right). \qquad (1)$$

The second fraction in the above expression is recognized from before. We recall that the number $\alpha_L$ is used to denote the limiting value of the fraction as n becomes very big. We find that if, say, $d = 3$, the value of $\alpha_L$ must be at least $\frac{2}{3}$ if we want to have many, big crosswords. As the dimension of the crosswords grows, the language L must have a larger and larger $\alpha_L$-value to sustain the notion of many crosswords.
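In other words, formula (1) turns into a simple sign test: the number of d-dimensional crosswords explodes or collapses according to whether $\alpha_L$ exceeds $\frac{d-1}{d}$. A tiny sketch (assuming the limiting value $\alpha_L$ is already known for the language at hand):

    def many_big_crosswords(alpha, d):
        """Sign of the bracket in formula (1): True means log N grows without bound."""
        return alpha > (d - 1) / d

    for d in (2, 3, 4):
        print(d, (d - 1) / d, many_big_crosswords(0.7, d))
    # thresholds 0.5, 0.667, 0.75: a language with alpha = 0.7 supports d = 2 and 3, but not 4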

It seems like $\alpha_L$ expresses something fundamental about the language L. So information theorists have a name for that value:

Definition 2. Let L be a language. The entropy of L is defined as

$$\tilde{H}(L) = \lim_{n \to \infty} \frac{\log_{|A|} \gamma_L(n)}{n}.$$

We recognize the entropy as the same thing as we know as $\alpha_L$. The little symbol above $\tilde{H}$ is there to remind us that this is a special kind of entropy: the theory leading up to this definition is not as concise and rigid as many information theorists would want. But we should not feel nothing has been accomplished: our entropy captures some very deep aspects of the concept.
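To get a feel for the definition, we can estimate the fraction inside the limit for the 'classic' language of the earlier example. The Python sketch below uses only the toy counts assumed before (1 one-letter word and 49 two-letter words; a real dictionary has words of every length) together with a simple recursion for $\gamma_L(n)$ of our own making: a valid row either ends with a blank square, or ends with a word that either fills the whole row or is preceded by a blank.

    from math import log

    A = 27                 # 26 letters plus the blank square
    W = {1: 1, 2: 49}      # W[k] = number of words of length k (toy counts from the text)

    def gamma(n, memo={0: 1}):
        """Number of valid 'classic' rows of length n."""
        if n not in memo:
            total = gamma(n - 1)                        # row ends with a blank square
            total += W.get(n, 0)                        # row is a single word of length n
            for k, count in W.items():
                if k < n:                               # row ends with a word of length k,
                    total += count * gamma(n - k - 1)   # preceded by a blank square
            memo[n] = total
        return memo[n]

    for n in (2, 5, 10, 20):
        alpha_n = log(gamma(n), A) / n
        print(n, gamma(n), round(alpha_n, 3))   # gamma(2) = 52, as found in the text

For this impoverished toy dictionary the fraction drifts down towards roughly 0.43, below the critical value $\frac{1}{2}$, so a dictionary containing only one- and two-letter words would not sustain many big two-dimensional crosswords.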

We may wonder what happens if, say, $\tilde{H}(L) = \frac{1}{4}$. What kind of crosswords are possible? Why, crosswords of dimension d up to $\frac{4}{3}$ of course! If $d = \frac{4}{3}$ we have $(d-1)/d = \frac{1}{4}$. How to visualize a crossword in 1.333 dimensions is probably better left as an exercise to the reader!

A recapitulation

Let us briefly examine what we have learnt so far: we introduced the concept of a language, which is nothing but a set of sequences of letters. We have then made a satisfactory definition of what a crossword over a language is. Using elementary combinatorics and probability theory we have calculated the number of valid crosswords of size $n \times n$ (or, in the case of other dimensions, size $n^d$). This number depends on the size of the alphabet, $|A|$, as well as the special function $\gamma_L(n)$. We then observed that there are essentially two different cases: in the first case ($\alpha_L < \frac{d-1}{d}$) this number shrinks towards zero as the size grows; in the second case ($\alpha_L > \frac{d-1}{d}$) the same number becomes infinitely big as the size grows.

This calls for a reformulation of our initial question: while we opened this paper asking about the existence of crosswords, we are now tempted to ask: "Given a language L, what is the greatest dimension d for which there are many (big) crosswords?". This move encourages us to consider non-integer values of d, and thus we have definitely left the realm of ordinary crossword puzzles, and maybe the real world as well! The answer to the new question is related to the entropy, as we have just seen. In fact, combining the definition of entropy and formula (1) shows

$$\tilde{H}(L) = \frac{d_0 - 1}{d_0} \qquad \text{and} \qquad d_0 = \frac{1}{1 - \tilde{H}(L)},$$

where $d_0$ is exactly the largest dimension where it is possible to create (many) crosswords over L.
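The relation between entropy and the maximal dimension is easy to tabulate (a small sketch; the sample entropy values are just illustrations):

    def max_dimension(H):
        """d0 = 1 / (1 - H~(L)), the largest dimension with many big crosswords."""
        return float("inf") if H >= 1 else 1 / (1 - H)

    for H in (0.25, 0.5, 2/3, 1.0):
        print(H, max_dimension(H))   # 4/3, 2, 3, and infinity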

Note how $d_0$ may be arbitrarily big, even infinite if the entropy equals 1. How is that for a crossword puzzle! Actually, if $\tilde{H}(L) = 1$ it is quite trivial to create crossword puzzles (in any dimension). An example of such a language L is the one made up of every integer. The alphabet A is just the digits, and $\gamma_L(n) = |A|^n = 10^n$ (because any sequence of length n which is made up from digits is a valid number), so clearly $\tilde{H}(L) = 1$.

We have arrived at the concept of entropy by a quite unusual method. Aside from (hopefully) some pedagogical advantages there are other reasons for picking this approach: we now have an entropy concept defined on any language or, which is the same, any set of sequences made up of letters from A. This is not true for the traditional entropy, which is introduced via the concept of information sources (which are also known as stochastic processes, and are based on a quite technical probability theoretic framework). In addition, while our entropy, in its current form, does not handle random languages (e.g. the language of all possible sequences of 0's and 1's created by flipping a coin), it is possible to refine our definitions to cover these important cases as well, and indeed to generalize the probability theoretic entropy.

Entropy

Something should probably be said about why entropy is so central to information theory. But where to start and where to end! We will look at only two aspects, one of a somewhat philosophical nature and the other of a very practical nature.

Entropy is often said to be a measure of how 'complex' or even 'chaotic' things are. This corresponds nicely with the observations given above: a language made up of every possible integer is devoid of any form or structure. Anything goes. It is impossible to distinguish between a sequence from the language and a sequence of completely random digits. This language, as explained above, has entropy 1. On the other hand, a language made up of sequences of only one letter, say 'a', is completely structured. No room for choices. The function $\gamma_L(n)$ is constantly 1 regardless of the value of n. This corresponds to the case where $\tilde{H}(L) = 0$.

The other use of entropy which we will touch upon is data compression. We mention the following theorem in sketch form:

Theorem 1 (Shannon). Let L be a language over A. There exists an encoding such that any sequence $x \in L$ of length n may be encoded into a sequence of no more than $n(\tilde{H}(L) + \varepsilon)$ letters. This holds for any positive number $\varepsilon$, however small, provided the length n of x is large enough.

This formulation only captures the essence of Shannon's theorem. What is important is the order in which things happen: first we choose the value $\varepsilon$ as small as we want it. This determines how "close" to the entropy we want our encoding to be. Then the theorem tells us that there exists a number N and a code so that any sequence $x \in L$ which is at least N letters long can be encoded into just $|x|(\tilde{H}(L) + \varepsilon)$ letters. So if the entropy of L is $\frac{1}{2}$ we can compress long sequences from L by a factor 2.
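One informal way to see why such codes can exist, in the counting spirit of this paper (an illustration of ours, not the theorem's actual proof), is to encode a sequence simply by its index in a list of all length-n members of L; writing that index down takes about $\log_{|A|} \gamma_L(n) = n \alpha_n$ letters. The sketch below does this for a made-up language, binary strings with no two adjacent 1's, whose entropy is $\log_2$ of the golden ratio, about 0.69.

    from math import log

    def members(n):
        """All binary strings of length n with no two adjacent 1's."""
        seqs = [""]
        for _ in range(n):
            seqs = [s + b for s in seqs for b in "01"
                    if not (b == "1" and s.endswith("1"))]
        return seqs

    n = 20
    L_n = members(n)                              # gamma_L(n) sequences, a Fibonacci number
    bits_per_letter = log(len(L_n), 2) / n
    print(len(L_n), round(bits_per_letter, 3))    # 17711 and about 0.71 bits per letter

Encoding a 20-letter sequence by its position in this list needs only about 0.71 bits per letter instead of 1, and the rate keeps creeping down towards the entropy as n grows.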

This concludes our tour. The connection between the complexity of a language and the ability to create crosswords may not come as a surprise. But that this connection leads directly to entropy, the cornerstone of information theory, is, at the very least, rather neat.

Notes

This section contains some notes about the history of the results. It is probably most interesting to readers already familiar with the concepts in this paper. The idea of linking crossword puzzles and entropy is, in fact, as old as [Shannon, 1948], from which we quote the last paragraph of section 7:

    The redundancy of a language is related to the existence of crossword puzzles. If the redundancy is zero any sequence of letters is a reasonable text in the language and any two-dimensional array of letters forms a crossword puzzle. If the redundancy is too high the language imposes too many constraints for large crossword puzzles to be possible. A more detailed analysis shows that if we assume the constraints imposed by the language are of a rather chaotic and random nature, large crossword puzzles are just possible when the redundancy is 50%. If the redundancy is 33%, three-dimensional crossword puzzles should be possible, etc.

For a more detailed discussion as well as a bit of history on the result see the last part of [Immink et al., 1998]. The entropy $\tilde{H}$ introduced in this paper is related to the Hartley entropy and Hausdorff dimensions of 'nice' subsets of $A^\infty$ (considered as subsets of [0, 1]). Or, if one considers arbitrary subsets of $A^\infty$, the entropy $\tilde{H}$ might be interpreted as a form of the box counting dimension, see e.g. [Falconer, 1990]. The connection between entropy and Hausdorff dimension is described in [Billingsley, 1965] and interesting results in this direction can be found in [Ryabko, 1986].

References

Billingsley, P. Ergodic Theory and Information. John Wiley & Sons, 1965.

Falconer, K. Fractal Geometry: Mathematical Foundations and Applications. John Wiley & Sons, 1990.

Immink, K. A. S., P. H. Siegel and J. K. Wolf. Codes for digital recorders. IEEE Trans. Inform. Theory, 44(6):2260–2299, 1998.

Ryabko, B. Y. Noiseless coding of combinatorial sources, Hausdorff dimension, and Kolmogorov complexity. Problems of Inform. Trans., 22(3):170–179, 1986.

Shannon, C. E. A mathematical theory of communication. Technical report, Bell System, 1948.
