‘Ideal learning’ of natural language: Positive results about learning from positive evidence



Journal of Mathematical Psychology 51 (2007) 135–163

Nick Chater (a), Paul Vitányi (b)

(a) Department of Psychology, University College, London WC1E 6BT, UK
(b) Centrum voor Wiskunde en Informatica, Amsterdam, The Netherlands

Received 25 November 2005; received in revised form 5 September 2006
Available online December 2006

Corresponding author. Fax: +447436 4276. E-mail address: n.chater@ucl.ac.uk (N. Chater).

Abstract

Gold's [1967. Language identification in the limit. Information and Control, 16, 447–474] celebrated work on learning in the limit has been taken, by many cognitive scientists, to have powerful negative implications for the learnability of language from positive data (i.e., from mere exposure to linguistic input). This provides one, of several, lines of argument that language acquisition must draw on other sources of information, including innate constraints on learning. We consider an ideal learner that applies a Simplicity Principle to the problem of language acquisition. The Simplicity Principle chooses the hypothesis that provides the briefest representation of the available data; here, the data are the linguistic input to the child. The Simplicity Principle allows learning from positive evidence alone, given quite weak assumptions, in apparent contrast to results on language learnability in the limit (e.g., Gold, 1967). These results provide a framework for reconsidering the learnability of various aspects of natural language from positive evidence, which has been at the center of theoretical debate in research on language acquisition and linguistics.

© 2006 Elsevier Inc. All rights reserved.
0022-2496/$ - see front matter

Keywords: Learnability; Language acquisition; Algorithmic complexity; Kolmogorov; Identification in the limit; Formal languages

1. Introduction

Language acquisition involves the rapid mastery of linguistic structure of astonishing complexity based on an input that appears noisy and partial. How can such an impoverished stimulus support such impressive learning? One influential line of argument is that it cannot: this ‘poverty of the stimulus’ argument (Chomsky, 1980) is typically used to argue that language acquisition is guided by innate knowledge of language, often termed ‘universal grammar,’ that the child brings to bear on the learning problem (e.g., Chomsky, 1965, 1980; Hoekstra & Kooij, 1988). This type of argument for universal grammar is of central importance for the study of human language and language acquisition (e.g., Crain & Lillo-Martin, 1999; Hornstein & Lightfoot, 1981).

How can the poverty of the stimulus argument be assessed? At an abstract level, a natural approach is to attempt to somehow define an ‘ideal’ language learner, which lacks universal grammar, but that can make the best use of the linguistic evidence that the child is given. If it were possible to show that this ideal learner is unable to learn language from the specific linguistic data available to the child, then we might reasonably conclude that some innate information must be available. Indeed, a second step, although not one we will consider in this paper, would be to attempt to prove that the ideal language learner, when provided with some appropriate innate information, is then able to learn language successfully from data of the sort available to the child.

Clearly, the task of constructing such an ideal language learner is a formidable one. We might reasonably suspect that the project of finding an optimal way of learning language is inherently open-ended; and our present understanding both of the mechanisms of human learning, and the computational and mathematical theory of learning, is sufficiently undeveloped that it is clear that the project is, in full generality, well beyond the scope of current research.


How is it, then, that many researchers are already convinced that, whatever form such an ideal learner might take, there is not enough information in the child's language input to support language acquisition, without recourse to universal grammar? Two types of argument have been proposed. The first considers the problem of language acquisition to be in principle problematic. It is argued that there is a logical problem of language acquisition, essentially because the child has access only to positive linguistic data (Baker & McCarthy, 1981; Hornstein & Lightfoot, 1981). Despite some controversy, it is now widely assumed that negative linguistic data is not critical in child language acquisition. Children acquire language even though they receive little or no feedback indicating that particular utterances are ungrammatical; and even where they do receive such feedback, they seem unwilling to use it (e.g., Brown & Hanlon, 1970). But learning from positive data alone seems logically problematic, because there appears to be no available data to allow the child to recover from overgeneralization. If this line of argument is correct, then whatever ideal learner we might describe, it will inevitably face these logical problems; and hence it will be unable to learn from positive evidence alone.

The second, and closely related, type of argument focuses on patterns of acquisition of particular types of linguistic construction and argues that these specific constructions cannot be learned from positive data only (e.g., Baker, 1979; Chomsky, 1980). This type of argument is sometimes labeled Baker's paradox, to which we return below. The nature of this type of argument is necessarily informal: various possible mechanisms for learning the problematic construction of interest are considered, and rejected as unviable.

We shall present mathematical results concerning an ideal learner, with relevance to both issues. In particular, we shall show that, in a specific probabilistic sense, language is learnable, given enough positive data, given only extremely mild computability restrictions on the nature of the language concerned. Thus, the apparent logical problem of language acquisition must be illusory, because a specific ideal learner can demonstrably learn language. Our arguments also, a fortiori, address the construction-specific version of the poverty of the stimulus argument. If language as a whole can be learned from positive data, then any specific linguistic construction can be learned from positive data, despite intuitions to the contrary. Indeed, we shall see that there is a general mechanism for such learning, one that has frequently been described, though sometimes dismissed, in discussions of poverty of the stimulus arguments (e.g., Pinker, 1979, 1984). That is, the absence of particular linguistic constructions in the positive data can serve as evidence that these structures are not allowed in the language. This point is discussed further below in our discussion of what we call the overgeneralization theorem, which shows how the ideal learner is able to eliminate over-general models of the language.

The results presented here should not, though, be viewed as showing that language is learnable by children from positive data. This is for two reasons. First, results concerning an ideal learner merely show that the information required to support learning is present in principle. They do not, of course, show that the child has the learning machinery required to extract it. Indeed, the ideal learner we consider here is able to make calculations that are known to be uncomputable, and it is typically assumed that the brain is limited to the realm of the computable. Thus an interesting open question for future research concerns learnability results that can be proved with a more restricted ideal learner. The second reason that this work does not show that the child can learn language from positive data is that the results we describe are asymptotic: that is, we allow that the child can have access to as much positive data as required. In practice, though, children learn specific linguistic constructions having heard specific amounts of positive linguistic data, with a specific degree of incompleteness, errorfulness, and so on. The formal results that we describe here do not directly address the question of the speed of learning. Nonetheless, this is typically also true of both logical and construction-specific poverty of the stimulus arguments. Both types of argument typically suggest that, however much positive data is provided, learning will not be successful. The results presented here therefore address these arguments, and raise the question of how to provide specific bounds on the amount of positive data that is required by the learner to learn specific linguistic phenomena.

The formal results in this paper, then, aim to sharpen the discussion of poverty of the stimulus arguments, rather than to resolve the issue one way or the other. According to the results we present, there is enough information in positive linguistic data for language to be learnable, in principle, in a probabilistic sense, given sufficient linguistic data. A challenge for future work on the poverty of the stimulus argument is to sharpen existing arguments, and current formal results, to address the question of what increasingly realistic learners might be able to acquire from increasingly realistic models of the amount and properties of the linguistic input available to the child. In particular, it is interesting to ask whether it is possible to scale down the methods that we describe here, to explore the question of whether there is sufficient information in the linguistic input available to the child to acquire specific linguistic phenomena that have been viewed as posing particularly difficult problems for the language learner. We hope this work will feed into the current debate in linguistics and psychology concerning the scope and validity of poverty of the stimulus arguments (e.g., Akhtar, Callanan, Pullum, & Scholz, 2004; Fodor & Crowther, 2002; Legate & Yang, 2002; Lidz, Waxman, & Freedman, 2003; Perfors, Tenenbaum, & Regier, 2006; Pullum & Scholz, 2002; Regier & Gahl, 2004; Tomasello, 2004).
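To make concrete the earlier point that the absence of a construction can count as evidence against it, the following is a minimal sketch in code-length terms; it is not the overgeneralization theorem itself. It assumes a two-part code (bits for the grammar plus bits for the data), sentences drawn uniformly from whatever a grammar allows, and observed data that all fall within the restrictive grammar. The function names and all numbers (100 and 150 bits, 1024 and 512 sentences) are hypothetical values chosen for illustration.

```python
import math


def total_code_length(grammar_bits: float, language_size: int, n_sentences: int) -> float:
    """Two-part code length in bits: grammar cost plus cost of the observed data.

    Assumption (not from the paper): every sentence the grammar allows is
    equally likely, so each observed sentence costs log2(language_size) bits.
    """
    return grammar_bits + n_sentences * math.log2(language_size)


def preferred_grammar(n_sentences: int) -> str:
    """Return the grammar with the shorter total code after n_sentences.

    Assumes every observed sentence is allowed by both grammars.
    """
    # Hypothetical over-general grammar: cheap to state, allows 1024 sentences.
    overgeneral = total_code_length(100, 1024, n_sentences)
    # Hypothetical restrictive grammar: costlier to state, allows only 512.
    restrictive = total_code_length(150, 512, n_sentences)
    return "restrictive" if restrictive < overgeneral else "overgeneral"


if __name__ == "__main__":
    for n in (10, 50, 100):
        print(n, preferred_grammar(n))
```

Under these assumed numbers each sentence is one bit cheaper to encode under the restrictive grammar, so its extra 50 bits of grammar cost are repaid once more than 50 sentences have been observed. The over-general grammar is never contradicted by the data; it simply becomes the longer description.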

The ideal learner that we analyse is based on a Simplicity Principle. Roughly, the idea is that the learner postulates the underlying structure in the linguistic input that provides the simplest, i.e., briefest, description of that linguistic input. We require that the description can actually be used to reconstruct the original linguistic input using some computable process; thus, the goal of the ideal learner is to find the shortest computer program that encodes the linguistic input. The general idea that cognition is a search for simplicity has a long history in psychology (Koffka, 1962/1935; Mach, 1959/1886), and has been widely discussed, in a relatively informal way, in the field of language and language learning (e.g., Fodor & Crain, 1987). We describe a formal theory of inductive reasoning by simplicity, based on the branch of mathematics known as Kolmogorov complexity theory (Li & Vitányi, 1997). Kolmogorov complexity was developed independently by Solomonoff (1964), Kolmogorov (1965) and Chaitin (1969). Solomonoff's primary motivation in developing the theory was to provide a formal model of learning by simplicity. Kolmogorov complexity and derivatives from it have been widely used in mathematics (e.g., Chaitin, 1987), physics (Zurek, 1991), computer science (e.g., Paul, Seiferas, & Simon, 1981), artificial intelligence (Quinlan & Rivest, 1989), and statistics (Rissanen, 1987, 1989; Wallace & Freeman, 1987). This mathematical framework provides a concrete and well-understood specification of what it means to learn by choosing the simplest explanation, and provides a way of precisely defining the Simplicity Principle for cognitive science (Chater, 1996, 1997, 1999; Chater & Vitányi, 2002). Moreover, the Simplicity Principle has been used practically in a wide range of models of language processing and structure (e.g., Brent & Cartwright, 1996; Dowman, 2000; Ellison, 1992; Goldsmith, 2001; Onnis, Roberts & Chater, 2002; Wolff, 1988).
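Kolmogorov complexity is uncomputable, so any runnable illustration of "briefest description" has to substitute something weaker. The sketch below uses the length of a zlib-compressed string as a crude, computable upper-bound proxy for the length of the shortest program that reproduces the input. This substitution, the helper name `description_length`, and the toy strings are assumptions made for illustration; they are not the learner analysed in the paper.

```python
import random
import zlib


def description_length(text: str) -> int:
    """Bytes used by zlib at its maximum compression level.

    A rough stand-in (an assumption, not the paper's definition) for the
    length of the shortest program that reproduces `text`.
    """
    return len(zlib.compress(text.encode("utf-8"), 9))


# Toy "linguistic input" with obvious regularities.
regular = "the dog sees the cat. the cat sees the dog. " * 20

# Same length, but characters drawn at random, so there is little
# structure for a short description to exploit.
rng = random.Random(0)  # fixed seed so the example is repeatable
alphabet = "abcdefghijklmnopqrstuvwxyz ."
scrambled = "".join(rng.choice(alphabet) for _ in range(len(regular)))

if __name__ == "__main__":
    print("regular:  ", len(regular), "chars ->", description_length(regular), "bytes")
    print("scrambled:", len(scrambled), "chars ->", description_length(scrambled), "bytes")
```

The regular string admits a much shorter description than the scrambled one of equal length, which is the sense in which a simplicity-driven learner would treat the first as having discoverable structure. A general-purpose compressor only bounds the true shortest description from above, so it can miss regularities that a shorter program would capture.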
This framework will prove to be useful in considering the amount of information available about the language that is inherent in positive evidence alone.

The first substantive section of this paper, ‘Ideal language learning by simplicity’, outlines the framework for ideal language learning. Roughly, as we have said, the learner finds the shortest computer program that can reconstruct the linguistic data that has so far been encountered; it then makes predictions about future material based on what that computer program would produce next. The third section, ‘The prediction theorem and ideal language learning’, presents a remarkable mathematical result, due to Solomonoff (1978), that shows that this learning method is indeed, in a certain sense, ideal. This method learns to make accurate predictions (with high probability) about the language input, given mild computability constraints on the processes generating the linguistic data. The subsequent two sections present and prove new mathematical results.

In ‘The ideal learning of grammaticality judgments’, we show how Solomonoff's prediction theorem can be used to show that the ideal learner can, in a probabilistic sense, learn to make arbitrarily good grammaticality judgments. This result is particularly important, given that grammaticality judgments are the primary data of modern linguistic theory, and they are frequently thought to embody information that cannot readily be extracted from corpora of language. Intuitively, the idea is that sentences which are predicted with non-zero probability are judged to be grammatical; sentences that have zero probability are judged to be ungrammatical. Note that this result does not allow for the errorful character of linguistic input, although extensions to this case may be feasible.

In ‘The ideal learning of language production’, we show that prediction can also allow the ideal language learner to produce language that is, with high probability, indistinguishable from the language that it has heard. Intuitively, the idea is that the ability to predict what others might say can be recruited to determine what the speaker should say. Of course, language production is much more than this: in particular, it requires the ability not merely to continue conversations plausibly, but to say things that reflect one's particular beliefs and goals. Nonetheless, the result that an ideal language learner can continue conversations plausibly is non-trivial. It requires, among other things, the ability to produce language that respects the full range of phonological, grammatical, pragmatic, and other regularities governing natural language.

Finally, in ‘The poverty of the stimulus reconsidered’, we relate these results to the logical version of the poverty of the stimulus (relating these results to Gold's (1967) results, and later work...
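The prediction-based account of grammaticality judgments and production outlined above can be sketched with a computable toy. The code below replaces the ideal learner with a Bayesian mixture over just two hand-written hypotheses, each weighted by 2 to the power of minus its description length. The hypotheses, their bit costs, the dative-alternation example sentences, and the judgment threshold are all hypothetical. Because a finite mixture rarely assigns exactly zero probability, the sketch uses a small threshold where the text speaks of zero versus non-zero probability.

```python
import random

# Hypothetical toy sentences; BAD is an over-general form that is never heard.
GOOD = ["gave the book to Mary", "gave Mary the book", "donated the book to Mary"]
BAD = "donated Mary the book"

# Hypothetical hypotheses: (description length in bits, distribution over sentences).
HYPOTHESES = {
    "restrictive": (12, {s: 1 / 3 for s in GOOD}),
    "overgeneral": (10, {s: 1 / 4 for s in GOOD + [BAD]}),
}


def posterior(data):
    """Posterior over hypotheses: prior 2**-bits times the likelihood of data.

    Assumes at least one hypothesis gives the data non-zero probability.
    """
    weights = {}
    for name, (bits, dist) in HYPOTHESES.items():
        weight = 2.0 ** -bits
        for sentence in data:
            weight *= dist.get(sentence, 0.0)
        weights[name] = weight
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def predictive(sentence, data):
    """Probability the mixture assigns to hearing `sentence` next."""
    post = posterior(data)
    return sum(post[name] * HYPOTHESES[name][1].get(sentence, 0.0) for name in post)


def is_grammatical(sentence, data, threshold=1e-3):
    """Judge by predicted probability; the threshold is an assumption."""
    return predictive(sentence, data) > threshold


def produce(data, rng):
    """Production as sampling from the learner's own predictions."""
    sentences = GOOD + [BAD]
    weights = [predictive(s, data) for s in sentences]
    return rng.choices(sentences, weights=weights, k=1)[0]


if __name__ == "__main__":
    heard = GOOD * 10  # 30 sentences of positive evidence; BAD never occurs
    print(is_grammatical(BAD, []), is_grammatical(BAD, heard))
    print(produce(heard, random.Random(0)))
```

With no data the cheaper over-general hypothesis dominates and the unheard form is accepted; after thirty sentences of positive evidence its predicted probability falls below the threshold, and sampled productions almost always come from the forms actually heard. The same predictive distribution drives both the judgment and the production step, which mirrors the way the text derives both abilities from prediction.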

