Deep Misconceptions and the Myth of Data-Driven NLU
On Putting Logical Semantics Back to Work
Walid S. Saba


Page 1

On Putting Logical Semantics Back to Work

Walid S. Saba

Deep Misconceptions and the Myth of Data-Driven Language Understanding

Page 2

IMMANUEL KANT

Everything in nature, in the inanimate as well as in the animate world, happens according to some rules, though we do not always know them.

I reject the contention that an important theoretical difference exists between formal and natural languages

RICHARD MONTAGUE

One can assume a theory of the world that is isomorphic to the way we talk about it… in this case, semantics becomes very nearly trivial

JERRY HOBBS

Page 3

Early efforts to find theoretically elegant formal models for various linguistic phenomena did not result in any noticeable progress, despite nearly three decades of intensive research (the late 1950s through the late 1980s). As the various formal (and, in most cases, mere symbol-manipulation) systems seemed to reach a deadlock, disillusionment with the brittle logical approach to language processing grew, and a number of researchers and practitioners in natural language processing (NLP) started to abandon theoretical elegance in favor of attaining some quick results using empirical (data-driven) approaches.

All of this seemed natural and expected. In the absence of theoretically elegant models that could explain a number of NL phenomena, it was quite reasonable to find researchers shifting their efforts to finding practical solutions for urgent problems using empirical methods. By the mid-1990s, a data-driven statistical revolution that had already been brewing took the field of NLP by storm, pushing aside efforts rooted in over 200 years of work in logic, metaphysics, grammar, and formal semantics.

We believe, however, that this trend has overstepped the noble cause of using empirical methods to find reasonably working solutions for practical problems. In fact, the data-driven approach to NLP is now believed by many to be a plausible approach to building systems that can truly understand ordinary spoken language. This is not only a misguided trend but a very damaging development that will hinder significant progress in the field. In this regard, we hope this study will help start a sane, and overdue, semantic (counter-)revolution.

Copyright © 2017 WALID S. SABA

a spectre is haunting NLP

February 7, 2017

Page 4

AI software can now make AI software?

What!?

Page 5

now back to reality

Notwithstanding achievements in data-centric tasks (e.g., image and speech recognition, or numerically specifiable, finite-space problems such as the game of Go), statistical and other data-driven models (e.g., neural networks) cannot model human language comprehension, because these models cannot explain, model, or account for very important phenomena in ordinary spoken language, such as:

• Non-Observable (thus Non-Learnable) Information
• Intensionality and Compositionality
• Inferential Capacity


Page 6

Criticism of the statistical data-driven approach to language understanding is very often automatically associated with the Chomskyan school of linguistics. At best, this is a misinformed judgement (and in many cases it is an ill-informed one). There is a long history of work in logical semantics (a tradition that forms the background to the proposals we will make here) that has very little to do, if anything at all, with Chomskyan linguistics.

Notwithstanding Chomsky’s (in our opinion valid) Poverty of the Stimulus (POS) argument, an argument that clearly supports the claim of some kind of innate linguistic ability, we believe that Chomskyans put too much emphasis on syntax and grammar (which, ironically, made their theory vulnerable to criticism from the statistical and data-driven school). Instead, we think that syntax and grammar are just the external artifacts used to express internal, logically coherent, semantic thoughts that are compositionally and productively (i.e., recursively) constructed, something perhaps analogous to Jerry Fodor’s Language of Thought (LOT).

Here we should also mention that we agree somewhat with M. C. Corballis (‘The Recursive Mind’) that it is thought that brought about the external tool we call language, and not the other way around.


what this study is not about

some initialclarifications

Chomsky’s Poverty of the Stimulus argument

Jerry Fodor’s Language of Thought hypothesis

Page 7

what this study is not about

some initialclarifications

Another association that criticism of the statistical and data-driven approaches to NLU often conjures up is that of building large knowledge bases with brittle rule-based inference engines. This is perhaps the biggest misunderstanding, held not only by many in the statistical and data-driven camp, but also by previously over-enthusiastic knowledge engineers who mistakenly believed at one point that all that was required to crack the NLU problem was to keep adding more knowledge and more rules. We do not subscribe to such theories either.

In fact, regarding the above, we agree with an observation once made by the late John McCarthy (at IJCAI 1995) that building ad-hoc systems by simply adding more knowledge and more rules will result in systems that we do not even understand. Ockham's Razor, as well as observing the linguistic skills of 5-year-olds, should tell us that the conceptual structures needed in language understanding should not, in principle, involve all that complexity.

As will become apparent later in this study, the conceptual structures that speakers of ordinary spoken language have access to are not as massive and overwhelming as is commonly believed. Instead, it will be shown that the key is in the nature of that conceptual structure and the computational processes involved.

Page 8

FINALLY, our concern here is with introducing a plausible model for natural language understanding (NLU). If your concern is natural language processing (NLP), as used, for example, in applications such as these:

word-sense disambiguation (WSD); entity extraction/named-entity recognition (NER); spam filtering, categorization, classification; semantic/topic-based search; word co-occurrence/concept clustering; sentiment analysis; topic identification; automated tagging; document clustering; summarization; etc.

then it is best if we part ways at this point, since this is not at all our concern here. There are many NLP and text-processing systems that already do a reasonable job on such data-level tasks. In fact, I am part of a team that developed a semantic technology that does an excellent job on almost all of the above, but that system (and similar systems) is light-years away from doing anything remotely related to what can be called natural language understanding (NLU), which is our concern here.


what this study is not about

some initialclarifications

An online demo of a semantic engine: semantic search, summarization, categorization, word-sense disambiguation, entity extraction and key topic identification

Page 9


WE WILL ARGUE THAT purely data-driven extensional models that ignore intensionality, compositionality, and the inferential capacities of natural language are inappropriate, even when the relevant data is available, since higher-level reasoning (the kind that is needed in NLU) requires intensional reasoning beyond simple data values.

WE WILL ARGUE THAT many language phenomena are not learnable from data because (i) in most situations what is to be learned is not even observable in the data (or is not explicitly stated but is implicitly assumed as ‘shared knowledge’ by a language community); or (ii) in many situations there is no statistical significance in the data, as the relevant probabilities are all equal.

WE WILL ARGUE THAT the most plausible explanation for a number of phenomena in natural language is rooted in logical semantics, ontology, and the computational notions of polymorphism, type unification, and type casting; and we will do this by proposing solutions to a number of challenging and well-known problems in language understanding.

what this study is about

some initialclarifications

Page 10

We will propose a plausible model rooted in logical semantics, ontology, and the computational notions of polymorphism, type casting, and type unification. Our proposal provides a plausible framework for modelling various phenomena in natural language, specifically phenomena that require reasoning beyond the surface structure (the external data). To give a hint of the kind of reasoning we have in mind, consider the following sentences:

(1) a. Jon enjoyed the movie
    b. Jon enjoyed watching the movie

(2) a. A small leather suitcase was found unattended
    b. A leather small suitcase was found unattended

(3) a. The ham sandwich wants another beer
    b. The person eating the ham sandwich wants another beer

(4) a. Dr. Spok told Jon he should soon be done with writing the thesis
    b. Dr. Spok told Jon he should soon be done with reading the thesis

Our model will explain why (1a) is understood by all speakers of ordinary language as (1b); why speakers of multiple languages find (2a) more natural than (2b); why we all understand (3a) as (3b); and why we effortlessly resolve ‘he’ in (4a) to Jon and ‘he’ in (4b) to Dr. Spok. Before we do so, however, we will discuss some serious flaws in proposing a statistical, data-driven approach to NLU.

more specifically ...

some initialclarifications

Page 11

language learning from data?

what if the relevant information is not even in the data?

Page 12

One of the most obvious challenges to statistical and data-driven NLU comes from situations where there does not seem to be any statistical significance in the observed data that can help in making the right inferences. As an example, consider the sentences in (1) and (2).

(1) The trophy did not fit in the brown suitcase because it was too
    a. big
    b. small

(2) Dr. Spok told Jon that he should soon be done
    a. writing his thesis
    b. reading his thesis

For a speaker of ordinary language, the decision as to what ‘it’ in (1) and ‘he’ in (2) refer to is immediately obvious, even for a 5-year-old. A statistical, data-driven approach, on the other hand, would be helpless in making such decisions, since the only difference between the sentence pairs in (1) and (2) are words that co-occur with equal probabilities (antonyms and opposites, such as big/small, night/day, hot/cold, read/write, open/close, etc., have been shown to co-occur in text with equal frequency). Clearly, then, references such as those in (1) and (2) must be resolved using information that is not (directly) in the data.

probabilities are all equal

it's not even in the data

Example (1) is taken from Levesque, H. J., Davis, E., and Morgenstern, L. (2012), The Winograd Schema Challenge, Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning (AAAI Press)

Page 13

In the absence of any statistical significance in the data, we have suggested above that references such as those in sentences (1) and (2) are resolved by relying on other information that is not (directly) in the data.

It might still be suggested, however, that a learning algorithm could create statistical significance between (1a) and (1b) if, for example, the probabilities of some composites in the sentence (as opposed to the atomic units) are considered. What this would essentially require is creating a composite feature for every possible relation. In (1), we would need at least the following:

trophy-fit-in-suitcase-small
trophy-fit-in-suitcase-big
trophy-not-fit-in-suitcase-small
trophy-not-fit-in-suitcase-big

Note here that data-driven approaches also do not admit the existence of a type hierarchy (or any knowledge structure, for that matter). As such, there is nothing that says that a Trophy and a Radio are both subtypes of Artifact, or that a Purse and a Suitcase are both subtypes of some Container, to which the ‘fit’ relation applies similarly; thus other features (e.g., radio-fit-in-purse-small) would also be needed to learn how to resolve the reference ‘it’ in (1).

probabilities are all equal

it's not even in the data

Page 14

Again, in the absence of a type-hierarchy (or some other source of information) considering composite features (phrases) instead of individual units (words) to capture statistical significance leads us to something like this:

trophy-fit-in-suitcase-small
trophy-fit-in-suitcase-big
trophy-not-fit-in-suitcase-small
trophy-not-fit-in-suitcase-big
radio-fit-in-purse-small
radio-fit-in-purse-big
radio-not-fit-in-purse-small
radio-not-fit-in-purse-big
etc.

Although the point can be made with the above, the story in reality is much worse, as there are more ‘nodes’ that must be combined into these features to capture statistical significance. For example, if ‘because’ were changed to ‘although’ in (1b), then ‘it’ would suddenly refer to the trophy. The question, then, is how many such features would eventually be needed, if every meaningful sentence requires a handful of composite features to capture all statistical correlations. Fodor and Pylyshyn (1988) hint that the number is on the order of the number of seconds in the history of the universe, citing an experiment conducted by the psycholinguist George Miller.
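To make the combinatorics concrete, here is a minimal sketch (in Python) that enumerates such composite features for a toy vocabulary; the vocabularies and the feature format are made up for illustration:

    from itertools import product

    # Hypothetical vocabularies; a real corpus would have many thousands of each.
    objects     = ["trophy", "radio", "laptop"]
    polarity    = ["fit", "not-fit"]
    containers  = ["suitcase", "purse", "box"]
    properties  = ["big", "small"]
    connectives = ["because", "although"]

    # One composite feature per combination, e.g. "trophy-not-fit-in-suitcase-small-because".
    features = ["-".join((o, p, "in", c, a, k))
                for o, p, c, a, k in product(objects, polarity, containers,
                                             properties, connectives)]

    # The count grows multiplicatively with every vocabulary dimension added.
    print(len(features))  # 3 * 2 * 3 * 2 * 2 = 72

Even in this toy setting, adding one more dimension (or one more word per dimension) multiplies, rather than adds to, the number of features required.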

probabilities are all equal

it's not even in the data

Fodor, J. A., and Pylyshyn, Z. (1988). Connectionism and cognitive architecture: A critical analysis. Cognition, 28: 3–71.

Page 15

Incidentally, in the absence of any external knowledge structures, the combinatorially implausible explosion in the number of features needed by a statistical data-driven (i.e., bottom-up) learner would also afflict a top-down learner, one that learns by being told (or by instruction). Specifically, a top-down learner would ask for some number n of clarifications per sentence, requiring therefore a total of n^m clarifications for a paragraph with m sentences. The reader can now easily work out how many clarifications would be required for a top-down learner to understand just a small paragraph [1].

The point here is that whether the learner tries to discover what is missing bottom-up (from the data) or top-down (by being told), the infinity lurking in language (due to the recursive productivity of thoughts) makes learning the various language phenomena from data alone computationally implausible.
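As a back-of-the-envelope illustration (the numbers n and m are made up), the growth is easy to compute:

    # Hypothetical numbers: n clarification points per sentence, m sentences.
    n, m = 5, 6

    # Because preferences in one sentence are revised in the context of the
    # others (see note [1] below), the clarifications multiply, not add.
    print(n ** m)   # 15625 clarifications for a single six-sentence paragraph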

a top-down explanation

it's not even in the data

[1] The reason a top-down learner would need (n × n), as opposed to (n + n), clarifications for two consecutive sentences where each requires n is that the preferences of one sentence are subject to revision in the context of the previous and/or the following sentence. This is so because, linguistically, it is paragraphs, not sentences, that are the smallest linguistic units that can be fully interpreted on their own and should not (in theory) require any additional text to be fully understood. See The Semantics of Paragraphs (Zadrozny & Jensen, 1991) for an excellent treatment of the subject.

Zadrozny, W. and Jensen, K. (1991), Semantics of Paragraphs, Computational Linguistics 17 (2)

Page 16

Our argument against statistical data-driven approaches to NLU is not meant to dismiss the role of statistical/probabilistic reasoning in language understanding. That would, of course, be unwise. Our argument is about which probabilities are relevant in language understanding. Consider, for example, the following:

(1) The town councilors refused to give the demonstrators a permit because they advocated violence and anarchy.

(2) A young teenager fired several shots at a policeman. Eyewitnesses say he immediately fled away.

While the most likely reading for (1) has they referring to the demonstrators, one can imagine a scenario where a group of anarchist town councilors refused to give the demonstrators a permit specifically to incite violence and anarchy. Similarly, while the most likely reading for (2) is the one where ‘he’ refers to the young teenager, one can imagine a scenario where a slightly wounded policeman fled away to escape further injuries.

Obviously such scenarios are rare, and thus, in the absence of other information, the pragmatically more probable reading wins out with speakers of ordinary language. What is important to note here is that the likelihoods we are speaking of are a function of pragmatics, and have nothing to do with anything observed in the data.

pragmatic probabilities

it's not even in the data

Page 17

To summarize this argument, consider the table below. At the data level, references can be resolved during syntactic analysis using simple NUMBER or GENDER data. At the information level, the resolution requires semantic (type) information, for example that corporations, and not mobile phones, settle a case out of court. Note also that at this level not all the possibilities remain available once the type constraints are applied. It is exactly at the pragmatic level that probabilistic/statistical reasoning factors in, since at this level the referents are all possible, yet some are more probable than others (e.g., it is more likely that the one who fled is the one who fired the shots).

pragmatic probabilities

it's not even in the data

data level: REFERENCES RESOLVED BY SYNTAX
John informed Mary that he passed the exam.
John told Steve and Diane that they were invited to the party.

information level: REFERENCES RESOLVED BY SEMANTICS
Apple and Samsung released new mobile phones this week.
a. they both have very similar designs.
b. they also seem to be ready to settle their lawsuits out of court.

knowledge level: REFERENCES RESOLVED BY PRAGMATICS
A young teenager fired several shots at a policeman. Eyewitnesses say he immediately fled away.

intentional level: REFERENCES CANNOT BE RESOLVED (intention not clear)
John told Bill that he has been nominated to head the committee.

Page 18

The above discussion concerning the level at which references can be resolved (syntax/data level vs. semantic/information level vs. pragmatics/knowledge level) is related to a recently proposed alternative to the Turing Test, named the Winograd Schema (WS) Challenge (Levesque, Davis, and Morgenstern, 2012). For example, the following is considered to be a good WS:

The town councilors refused to give the demonstrators a permit because they <word> violence

<special-word>: feared
<alternate-word>: advocated

<answer-1>: the town councilors
<answer-2>: the demonstrators

The point of the above schema is that, while all else remains the same, the referent of ‘they’ changes when the special word is replaced by the alternate word. The above is also considered a good schema because a system must bring background (commonsense) knowledge to bear in order to resolve the reference; no statistical data obtained from corpus analysis can do the trick.

pragmatic probabilities

it's not even in the data

Levesque, H. J., Davis, E., and Morgenstern, L. (2012), The Winograd Schema Challenge, Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning (AAAI Press)

Page 19

Elsewhere (Saba, forthcoming) we argue that a good measure of a Winograd Schema is in fact related to the semantic distance to, and thus the effort required in obtaining, the relevant information needed to resolve the reference. In this regard, the syntax/data level represents level 0, where the information required is available in the words of the text itself (e.g., word properties such as gender, number, animate/inanimate). Level 1 would then correspond to situations where properties of the words (such as selectional restrictions) suffice, as in this example

Jon and Mary wrote two books, and apparently they
a. were translated into five languages.
b. plan to write another.

where selectional restrictions such as these would be sufficient:

Plan(x :: human, y :: activity)
Translate(x :: human, y :: textualObj)

That is, while the agent of a planning activity must be a human (and not a book), the object of a translation must be an object of type textualObj (and not a human). In this continuum, what we call thinking or understanding clearly happens beyond these two levels, where the information required to resolve the reference lies quite a distance from the text itself, and where bringing it to bear requires complex chains of inference.

pragmatic probabilities

it's not even in the data

Saba, W. S. (forthcoming), Levels of Understanding: Situating Winograd Schemas in the Data-Information-Knowledge Spectrum

Page 20

The above discussion of bringing to bear relevant information that is not available in the words themselves in order to resolve certain references is compounded by the fact that we often do not even explicitly state all the implicitly assumed words. That is, in most cases we have what we can call the missing text: text which is not explicitly stated but is often assumed as shared knowledge among a community of language users. Consider for example the sentences in (1):

(1) a. Don’t worry, Simon is a rock.
    b. The truck in front of us is annoying me.
    c. Carlos likes to play bridge.
    d. Mary enjoyed the apple pie.
    e. Jon owns a house on every street in the village.

Clearly, speakers of ordinary English understand the above as

(2) a. Don’t worry, Simon is [as solid as] a rock.
    b. The [person driving the] truck in front of us is annoying me.
    c. Carlos likes to play [the game] bridge.
    d. Mary enjoyed [eating] the apple pie.
    e. Jon owns a [different] house on every street in the village.

Since such sentences are quite common and are not at all exotic, farfetched, or contrived, any computational model for NLU must clearly somehow ‘uncover’ this [missing text] for a proper understanding of what is being said.

analyzing missing text?

it's not even in the data

Page 21

Again, let us consider the sentences below, where there is some [missing text] that is not explicitly stated in everyday discourse, but is often implicitly assumed:

(1) Don’t worry, Simon is [as solid as] a rock.
(2) The [person sitting at the] corner table wants a beer.
(3) Carlos likes to play [the game] bridge.
(4) Mary enjoyed [watching] the movie.
(5) Mark is an alleged [but not yet a proven] criminal.
(6) Jon burned [the book] Das Kapital after he read it[s content].
(7) Carl owns a [different] house on every main street in the village.
(8) Jimi is an amazing blues[-playing] guitarist.

Although the above seem to have a common denominator, namely some missing text that is often implicitly assumed, it is somewhat surprising that, in looking at the literature, one finds that the missing-text phenomenon has been studied quite independently and under different labels, such as

metaphor (1), metonymy (2), lexical ambiguity (3), ellipsis (4), intensional adjectives (5), co-predication (6), quantifier scope ambiguity (7), and compound nominals (8).

analyzing missing text?

it's not even in the data

Page 22

Perhaps chief among the “it’s not even in the data” phenomena is that of Adjective-Ordering Restrictions (AORs), a phenomenon that can be illustrated by the examples below:

(1) a. Carlos is a polite young man
    b. #Carlos is a young polite man

(2) a. A small brown suitcase was found unattended
    b. #A brown small suitcase was found unattended

The readings in (1a) and (2a) are clearly preferred by speakers of ordinary spoken language over the readings in (1b) and (2b), although there are no explicit rules that speakers of ordinary language seem to be following. What makes the AOR phenomenon even more intriguing is the fact that these preferences are made consistently across multiple languages.

First of all, this phenomenon presents a paradigmatic challenge to the statistical and data-driven story of language learning, as it does not seem that speakers come to have these preferences by observing and analyzing data. Furthermore, there does not seem to be any pattern in the observed data suggesting which adjectives should precede or follow other adjectives. For example, while ‘small’ is preferred to precede ‘brown’ in (2), in (3) ‘small’ is no longer preferred as the first adjective:

(3) A beautiful small suitcase was found unattended

innate preferences?

it's not even in the data

Page 23

The most crucial challenge to data-driven NLU as it relates to adjective-ordering restrictions is to explain how ‘beautiful’ in (4a) could be describing Olga’s dancing as well as Olga as a person, while this reading is not available in (4b):

(4) a. Olga is a tall beautiful dancer
    b. Olga is a beautiful tall dancer

We will see later why ‘beautiful’ in (4b) can no longer modify Olga’s dancing (which is an abstract entity of type activity). For now we want to note, however, that while various investigations on large corpora have not yielded any plausible explanation of what governs these adjective-ordering restrictions, we argue that even if some patterns were to be discovered, the more important question is: what is behind this phenomenon, i.e., what is it that makes us have these ordering preferences, and across multiple languages?

In our opinion, what is behind this phenomenon must be much deeper than the outside (observable) data of any language. In fact, we believe that a plausible account of this phenomenon must shed some light on the conceptual structures and processes that are operating in language. As stated above, a plausible explanation for this puzzle, one that is rooted in ontology, polymorphism, type unification, and type casting, will be suggested later in this study.

innate preferences?

it's not even in the data

See our explanation of the phenomenon of adjective-ordering restrictions (paper presented at KI-2008)

Page 24

data is (in the end) just data, no matter how big

extensions and intensions

Page 25

What do we mean when we write an equality like this?

(1) (A ∧ (B ∨ C)) = (A ∧ B) ∨ (A ∧ C)

Clearly, as objects (e.g., as logical circuits) the two expressions in (1) are not the same. For example, a logical circuit corresponding to the expression on the left-hand side has only two gates, while a circuit for the other expression would have three.
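The point can be checked mechanically; in the following minimal sketch (a toy Python illustration, with the gate counts stated rather than computed from an actual circuit representation), the two sides of (1) agree on every input yet differ as objects:

    from itertools import product

    # The two sides of (1) as functions: extensionally equal on every input.
    lhs = lambda a, b, c: a and (b or c)
    rhs = lambda a, b, c: (a and b) or (a and c)
    assert all(lhs(*v) == rhs(*v) for v in product([False, True], repeat=3))

    # As objects (circuits), however, they are not the same: gate counts differ.
    lhs_gates, rhs_gates = 2, 3   # one AND + one OR, versus two ANDs + one OR
    print(lhs_gates == rhs_gates)  # False: equal in value, not the same object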

It would seem, then, that at some level equality in data only is not enough, and that saying two objects are the same is different from saying they are equal (in their data value). In some contexts, as will be seen shortly, these differences are crucial.

What is most relevant to our discussion here is that data-driven approaches deal with data only; that is, equality in that paradigm is equality of one attribute, namely the final value. Thus, if it turns out that equality of data alone is not enough in high-level reasoning (e.g., in NLU), then data-driven approaches to NLU would, again, clearly be inappropriate.

Let us therefore take a closer look at the equality most of us know, and at the related notions of intensions and extensions, notions that some of the most penetrating minds in mathematical logic have studied for nearly two centuries.


data and intensions

Page 26

Our grade school teachers once told us that

√256 = 16

but is this true? Can we always equate, and replace, the data value 16 by the data value √256? Let’s see…

data and intensions

Page 27

Here’s a snapshot of some reality:

Mary taught her little brother that 7 + 9 = 16

Now if we blindly follow what our grade school teachers told us, namely that √256 = 16, we should be able to replace 16 by √256 without any problem. But if we do that in the above, we would be able to alter reality and come up with

Mary taught her little brother that 7 + 9 = √256

which is not true at all. What happened? Were we taught the wrong thing when we were told that √256 = 16? Not exactly, but we were also not told the whole story. I guess our grade school teachers did not know we would end up working in AI and NLU. If they did, they would have told us that extensional (data-only) equality is not sufficient in high-level reasoning, and that if it is equated with sameness at that level it can easily lead to false conclusions.

data and intensions

Page 28

The four objects below are in fact equal, including √256 and 16, but in regard to one attribute only, namely their data value. As objects, however, they are not the same, as they differ in many other attributes, for example in the number of operators and the number of operands. Note, however, that the attributes value, no-of-operators, and no-of-operands are still not enough to establish true intensional equality between these objects, as demonstrated by the objects (a) and (b). At a minimum, true (intensional) equality between these objects would require the equality of at least four attributes: value, no-of-operators, no-of-operands, and syntax-tree.

equality and sameness

[Figure: four expressions with the same data value; the objects (a) and (b) agree on value, no-of-operators, and no-of-operands, yet differ in their syntax trees]

In many domains where the only relevant attribute is the data (value), working with extensional (data) equality only might be enough. In tasks that require high-level reasoning, such as NLU, however, this will lead to contradictions and false conclusions, as the example of Mary and her little brother clearly demonstrates.

data and intensions

Page 29


As an aside: reducing the equality of objects to the equality of one extensional attribute, namely the data value, is what is behind the so-called adversarial examples in deep neural networks, where small perturbations in an image (of a kind that would not lead the human eye to a different classification) cause the network to classify the image in a completely different category. The same is true in the opposite case, where a completely meaningless image (a blob of pixels) is classified with high certainty as a real-life object. That is, behind both of these phenomena is something similar to the fact that √256 is not always (and in all contexts) equal to 9 + 7, although certain calculations involving these data values might produce the same output value (bottom line: extensional, data-only equality is not enough in high-level reasoning).

data and intensions

Intriguing properties of neural networks

Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images

Page 30

Beyond grade school, we were told in high school that two functions, f and g, are equal (are the same) if for every input they produce the same output. In notation, this was expressed as

(∀x)(f(x) = g(x)) ⇒ (f = g)

that is, if for any input x, f and g produce the same output value, then f and g are the same function.

But this is not entirely true; or, again, our high school teachers did not tell us the whole truth: if two functions are equal whenever they agree on their input-output pairings, then MergeSort and InsertionSort would be the same objects, since for any sequence

MergeSort(sequence) = InsertionSort(sequence)

But computer scientists know that although their external values are always the same (that is, they are extensionally equal), MergeSort and InsertionSort are not the same objects, as they differ in many other (and very important) attributes, for example in their space and time complexity.

yet another example
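The contrast can be sketched in code (textbook implementations, included only to make the extensional/intensional point concrete):

    import random

    def insertion_sort(seq):
        """Same input-output behavior as merge_sort, but O(n^2) comparisons."""
        out = list(seq)
        for i in range(1, len(out)):
            j = i
            while j > 0 and out[j - 1] > out[j]:
                out[j - 1], out[j] = out[j], out[j - 1]
                j -= 1
        return out

    def merge_sort(seq):
        """Same input-output behavior as insertion_sort, O(n log n) comparisons."""
        if len(seq) <= 1:
            return list(seq)
        mid = len(seq) // 2
        left, right = merge_sort(seq[:mid]), merge_sort(seq[mid:])
        merged, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i]); i += 1
            else:
                merged.append(right[j]); j += 1
        return merged + left[i:] + right[j:]

    # Extensionally equal: the same output for every input ...
    data = [random.randint(0, 99) for _ in range(50)]
    assert merge_sort(data) == insertion_sort(data)
    # ... yet not the same object: they differ in time (and space) complexity.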

data and intensions

Page 31

data and reasoning

Here we consider an example where working with extensions (data values) only, and ignoring intensions, can easily lead to absurd conclusions. Consider the fact that

the teacher of Alexander the Great = Aristotle

together with the very meaningful sentence

It might be possible that Aristotle was not the teacher of Alexander the Great

Notice now that if we simply replace ‘the teacher of Alexander the Great’ with a value that is only extensionally equal to it, we get an absurdity out of a very meaningful sentence:

It might be possible that Aristotle was not Aristotle

data and intensions

Page 32

Let us now consider examples illustrating why intensionality cannot be ignored in natural language understanding. Suppose we have a question-answering system that was to return the names of:

(1) all the tall presidents of the United States
(2) all the former presidents of the United States

A simple method for answering (1) would be to get two sets, the set of names of all tall people, and the set of names of all presidents of the United States, and simply return the intersection as the result.

What about the query in (2), however? Clearly we cannot do the same, because we cannot, as in the case of tall, represent former by a set (an extension) of all former things. If we did, then Ronald Reagan, for example, would have been a ‘former president’ even while serving his term as president, because he would have been in both sets: the set of presidents, and the set of ‘former things’ (as he was also, at that time, a former actor).

The point here is that, unlike tall, which is an extensional adjective that can semantically be represented by a set (the set of all tall things), former is an intensional adjective that logically operates on a concept, returning a subset of that concept as a result.
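A small sketch of the difference (the sets and names below are hypothetical): tall can be handled by set intersection over the data, while former must operate on the history of the concept itself:

    # Hypothetical extensions (data) at some moment during Reagan's presidency.
    tall_things = {"Lincoln", "Reagan"}
    presidents  = {"Reagan"}

    # 'tall presidents' is a simple intersection of two sets:
    print(tall_things & presidents)          # {'Reagan'} - fine

    # A naive 'set of former things' goes wrong:
    former_things = {"Reagan"}               # he is a former *actor* ...
    print(former_things & presidents)        # {'Reagan'} - wrongly a 'former president'!

    # 'former' must instead operate on the concept: it needs to know who held
    # the property before, relative to who holds it now.
    past_presidents    = {"Carter", "Reagan"}
    current_presidents = {"Reagan"}
    def former(past, current):
        """Intensional operator: in the concept's past but not its present."""
        return past - current
    print(former(past_presidents, current_presidents))   # {'Carter'}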

data and intensions

data, intensions and reasoning

Page 33

Let us elaborate on this subject some more. The following is a plausible meaning for (1) and (2) above:

(1) tall presidents of the United States
    ⇒ { x | is-president-of-the-us(x) ∧ is-tall(x) }

(2) former presidents of the United States
    ⇒ { x | is-president-of-the-us(x) ∧ F(x, president) }

What the above says is: (1) ‘tall presidents of the United States’ refers to any x that is in the set of presidents and also in the set of tall things; and (2) ‘former presidents of the United States’ refers to any x that is in the set of presidents and of which some F is also true. Clearly, what F does with an x is something to the effect of making sure that x was, at some point in time, but is not now, a president. The point here is that, unlike is-tall(x), F is not a set and has no extensional value, but is a logical expression that takes a concept and applies some condition, returning a subset of the original concept.

All of this is not available in data-driven NLU, where both ‘tall’ and ‘former’ are adjectives that equally modify nouns, which, as we have seen, can result in contradictions when executed on real data.

data and intensions

data, intensions and reasoning

Page 34

One misguided attempt at salvaging the data-only solution would be to maintain a set for the compound ‘former presidents’.

This escape attempt is doomed, however, since composite sets for ‘previous president’, ‘former senator’, ‘former governor’, ‘previous governor’, etc. would then also have to be added and maintained. In fact, insisting on a data-only solution for intensional adjectives would essentially mean maintaining a set for every construction of the form [Adj1 Adj2 Noun], [Adj1 Adj2 Noun1 Noun2], …, where any adjective Adji is an intensional adjective.

This is exactly the same situation we encountered previously (pages 12–15), where composite features for every possible relation were needed to resolve references in a data-driven model. In both cases, such alternatives are neither computationally nor psychologically plausible.

data and intensions

data, intensions and reasoning

Page 35

Another major problem with data-driven/statistical approaches to NLU is their complete denial of compositionality, i.e., of computing the meaning of larger linguistic units as a function of the meanings of their constituents. To illustrate, consider the sentences below.

(1) Jon bought an old copy of Das Kapital.
(2) Jon read an old copy of Das Kapital.

Although (1) and (2) refer to the same object, namely a book entitled ‘Das Kapital’, the reference in (1) is to a physical object that can be bought (and thus sold, burned, etc.), while in (2) the reference is to the content and ideas in that book. Thus, ‘Das Kapital’ may refer to different features or properties of the book depending on the context, where the context could extend over several sentences. For example, consider (3):

(3) Jon read Das Kapital. He then burned it because he did not agree with anything it espouses.

In (3), we are (at the same time) using ‘Das Kapital’ to refer to an abstract object (namely, the content of Das Kapital) when Jon reads it and disagrees with its content, and to a physical object that can be burned. We will see later on how a strongly-typed system can discover all the potential types of object that ‘Das Kapital’ can refer to (a physical object that can be burned, an abstract object that can be read and disagreed with, etc.).

data and intensions

compositionality

Page 36

In natural language we can speak of anything we can conceive or imagine, existent or non-existent. We can thus speak of and refer to an event that did not exist, as in

(1) John cancelled the trip. It was planned for next Saturday.

In (1), we are speaking about and referring to an event (a trip) that did not actually happen, and thus a trip that never existed. We can also refer to or speak of objects that do not exist, as in

(2) John painted a yellow bear.

In (2), what is ‘yellow’ is not an actual bear, but a depiction of some object, namely a bear. Reference to abstract and non-existent objects can be quite involved, especially in mixed contexts where the initial reference is to an object that does not necessarily exist, but whose existence is implied by subsequent context. For example, consider the following:

(3) John’s book proposal was not well received. But it later became a bestseller when it was published.

In (3), the reference is initially to a book proposal, which does not imply the existence of the book, although the subsequent context implies the concrete existence of a book. Such inferences cannot be made with a simple analysis of the external data.

data and intensions

yellow bears?

Page 37

Data-driven approaches typically ignore functional words (prepositions, quantifiers, etc.), and for a good reason: the probabilities of these words are equal in all contexts! But such words cannot be ignored, as they are what logically glues the various components of a sentence into a coherent whole. Consider, for example, the determiner ‘a’, the smallest word in English, in the following sentences:

(1) A paper on genetics was published by every student of Dr. Miller
(2) A paper on genetics was referenced by every student of Dr. Miller

While ‘a paper on genetics’ may refer to a single, specific paper in (2), this is not likely in (1), where ‘a’ is most likely under the scope of ‘every’. That is, the most likely meaning of (1) is the one implied by

(3) Every student of Dr. Miller published a paper on genetics

Resolving such quantifier-scope ambiguities is clearly beyond data-driven approaches, and is a function of pragmatic world knowledge (e.g., while it is possible for several students to refer to a single paper, it is not likely that all of Dr. Miller’s students published the same paper…).
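The two readings can be made explicit in the logical notation adopted later in this study (the predicate names published/referenced and the type labels are our own shorthand):

(1') (∀s :: student)(∃p :: paper)(published(s, p))   [a possibly different paper per student]
(2') (∃p :: paper)(∀s :: student)(referenced(s, p))  [one specific paper, for all students]

Pragmatic knowledge favors the narrow-scope reading (1') for sentence (1), while leaving the wide-scope reading (2') available for sentence (2).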

We shall later on see how a strongly-typed ontology of commonsense concepts can be used to make such inferences.

data and intensions

functional words

Page 38

We have (hopefully) demonstrated that purely quantitative (statistical, data-driven) approaches are not plausible models for natural language understanding, for two main reasons:

1. The relevant information is often not even present in the data, or in many cases there is no statistical significance in the data from which to make the proper inferences. Attempts to remedy this lead to a combinatorial explosion in the number of features that would have to be assumed, which renders these attempts computationally implausible.

2. It was shown that even when the data is available, reasoning with data only and ignoring intensions and logical definitions can easily lead to absurdities and contradictions.

While statistical and data-driven models may not be appropriate for high-level reasoning tasks in language understanding, we believe that these models have a lot to offer in some linguistic and data-centric tasks. Chief among these are part-of-speech (POS) tagging, statistical parsing, and analyzing linguistic corpus data to ‘enable’ and automate some of the tasks needed in building a system that can truly understand ordinary spoken language.

data-driven NLU?

so where are we now?

Page 39

putting logical semantics

back to work

ontological vs. logical concepts

Page 40

We will start with our proposal by first introducing the general framework, and we will do so gradually. The material presented from here on assumes some exposure to logic, although we will try to simplify our presentation as much as possible.

One of the major features in our framework is the crucial idea of distinguishing between what can be called ontological concepts, or first-intension concepts, as Cocchiarella (2001) calls them, and logical concepts (or, second intension concepts). The difference between these two types of concepts can be illustrated by the following examples:

(1) R2: heavy(x :: physical)
    R3: hungry(x :: animal)
    R4: articulate(x :: human)
    R5: make(x :: human, y :: artifact)
    R6: imminent(x :: event)
    R7: beautiful(x :: entity)

What the above says is: heavy is a property that can be said of any object x of type physical; hungry is said of objects of type animal; articulate applies to objects of type human; the make relation can hold between an object of type human and an object of type artifact; imminent is said of objects of type event; and, finally, beautiful can be said of any entity.

the framework

ontological vs. logical concepts

Nino B. Cocchiarella (2001), Logic and Ontology, Axiomathes 12: 117–150

Page 41

It is also assumed that the types associated with the predicates in (1), e.g. artifact, event, human, entity, etc., exist in a subsumption hierarchy, as shown in the fragment hierarchy below, where the dotted arrows indicate the existence of intermediate types. The fact that an object of type human is ultimately an object of type entity is expressed as human ⊑ entity. Furthermore, a property such as heavy can be said of objects of type human and of objects of type artifact, since human ⊑ physical, artifact ⊑ physical, and heavy(x :: physical).

the framework

ontological vs. logical concepts

[Figure: a fragment of the type hierarchy over the types entity, physical, abstract, event, animal, artifact, and human, with the properties R1–R6 attached to the types they apply to, and dotted arrows indicating intermediate types]

Page 42

As mentioned earlier, a strongly-typed ontology is assumed throughout this study. Usually this conjures up thoughts of a massive amount of knowledge that has to be hand-coded and engineered by experts. That is not at all what we are assuming here. In fact, the ontological structure we are assuming (and will discuss later on) is not massive at all, since most everyday concepts are actually just instances of basic ontological types.

For example, there is nothing meaningful (i.e., sensible, regardless of whether it is true or false) that we can say about a ‘racing car’ that we cannot say about a car. Thus, as far as language understanding is concerned, the ontological type car belongs to the ontology, and ‘racing car’ is just an instance concept. Similarly, most everyday concepts turn out to be just instances of basic ontological types. This is related to a comment J. Fodor once made, something to the effect that “to be a concept is to be locked to a word in the language”. It is also in line with Fred Sommers’ idea of applicability in his proposal about The Tree of Language, and with Gottlob Frege’s idea of how a word gets its meaning, namely from all the different ways it can be used in language. That is, what we can say about concepts in ordinary language tells us what these concepts are and, subsequently, what structure lies behind them.

We will discuss the details of the ontology later on. For now, we will simply assume that this ontological structure exists.

the framework

about the ontological structure

Page 43

In our framework we assume a Platonic universe that includes everything we can talk about in ordinary discourse, including abstract objects such as events, states, properties, etc. These ontological concepts exist as types in a strongly-typed ontology, and the logical concepts are all the properties of, or the relations that can hold between, these ontological concepts. In addition to logical and ontological concepts there are also proper nouns, which are the names of objects; objects that can be of any type. We use the notation

(∃₁Sheba :: thing)( … )

to state that there is a unique object named Sheba, an object that is of type thing, such that ( … ). Let us now consider the interpretation of the simple sentence ‘Sheba is a thief’, where ⟦s⟧ stands for ‘the meaning of the expression s’, ⇒ is used to mean ‘is interpreted as’, and thief(x :: human) is assumed; that is, it is assumed that the property thief applies to objects that must be of type human:

(2) ⟦Sheba is a thief⟧ ⇒ (∃₁Sheba :: thing)(thief(Sheba :: human))

Thus ‘Sheba is a thief’ is interpreted as follows: there is some unique object named Sheba, an object that is initially assumed to be a thing, such that the property thief is true of Sheba.

the framework

ontological vs. logical concepts


Page 44

Note that in (2), repeated below, Sheba is now associated with more than one type in a single scope: initially unknown, and thus assumed to be an object of type thing, Sheba was later assigned the type human when described by the property (or when in the context of being a) thief.

(2) ⟦Sheba is a thief⟧ ⇒ (∃₁Sheba :: thing)(thief(Sheba :: human))

In these situations a type unification must occur, and this is done as follows,

(Sheba :: (thing • human)) → (Sheba :: human)

where (s • t) denotes a type unification between the types s and t, and where → stands for ‘unifies to’. Note that the unification of thing and human resulted in human since human ⊑ thing; that is, since an object of type human is ultimately an object of type thing. The final interpretation of ‘Sheba is a thief’ is now the following:

(2) ⟦Sheba is a thief⟧ ⇒ (∃₁Sheba :: human)(thief(Sheba))

In the final analysis, therefore, ‘Sheba is a thief’ is interpreted as: there is a unique object named Sheba, an object that (we now know) must be of type human, and that object is a thief.

the framework

type unification – the basics
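To make the mechanics concrete, here is a minimal sketch of such a unification procedure in Python (the hierarchy fragment, the function names, and the treatment of failure are our own illustrative assumptions, not the study’s implementation):

    # A fragment of the subsumption hierarchy: each type points to its parent.
    parent = {"human": "animal", "cat": "animal", "animal": "physical",
              "artifact": "physical", "physical": "entity",
              "abstract": "entity", "entity": "thing"}

    def ancestors(t):
        """The chain of types from t up to the top type (inclusive)."""
        chain = [t]
        while chain[-1] in parent:
            chain.append(parent[chain[-1]])
        return chain

    def unify(s, t):
        """(s • t): the more specific of the two types if one subsumes the other."""
        if s in ancestors(t):
            return t              # s subsumes t, so t is the more specific type
        if t in ancestors(s):
            return s
        return None               # type clash: something else (e.g. type casting) must step in

    print(unify("thing", "human"))                    # human, as in (Sheba :: (thing • human))
    print(unify(unify("physical", "entity"), "cat"))  # cat, as in (c :: ((physical • entity) • cat))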

Page 45

Although we have interpreted a very simple sentence, we have already seen the power of embedding ontological types (types that exist in a strongly-typed hierarchy) into the powerful machinery of logical semantics. Specifically, it was the type constraint on the property thief(x :: human), namely that it applies to objects that must be of type human, that allowed us to discover the fact that Sheba must be a human. Admittedly, this is a very trivial ‘discovery’ in a very simple context. However, the utility of type unification and the hidden information it can uncover will become more apparent as we move on to more involved contexts.

Let us now assume black(x :: physical), i.e. that the property black can be said of any object of type physical, and own(x :: human, y :: entity), i.e. that objects of type human can own any object of type entity. Consider now the following:

(3) ⟦Sara owns a black cat⟧ ⇒ (∃₁Sara :: thing)(∃c :: cat)(black(c :: physical) ∧ own(Sara :: human, c :: entity))

Thus ‘Sara owns a black cat’ is interpreted as follows: there is a unique thing named Sara, and some object c of type cat, such that c is black (and thus here it must be of type physical), and Sara owns c, where in this context Sara must be an object of type human and c an object of type entity.

the framework

type unification – the basics

Page 46

The interpretation of (3) is repeated below:

(3) ⟦Sara owns a black cat⟧ ⇒ (∃₁Sara :: thing)(∃c :: cat)(black(c :: physical) ∧ own(Sara :: human, c :: entity))

Note now that, depending on the context in which they are mentioned, Sara is assigned two types and the object c is assigned three types. For example, initially an object of type cat, c is assigned the type entity when considered an object that can be owned, and the type physical when described by the property black. The type unifications that must occur in this situation are therefore the following (see figure 1 below):

(Sara :: (thing • human)) → (Sara :: human)
(c :: ((physical • entity) • cat)) → (c :: (physical • cat)) → (c :: cat)

The final interpretation of ‘Sara owns a black cat’ is therefore given by:

(3) ⟦Sara owns a black cat⟧ ⇒ (∃₁Sara :: human)(∃c :: cat)(black(c) ∧ own(Sara, c))

That is, there is a unique object named Sara, who is of type human, and some cat c, where c is black and Sara owns c.

the framework

type unification – the basics

Page 47

the framework

[Figure 1: the type assignments in sentence (3): (Sara :: thing) from the proper noun, (Sara :: human, c :: entity) from owns, (c :: physical) from black, and (c :: cat) from the noun cat, with the unification orders labeled a, b, and c]

Figure 1. All the type unifications required for sentence (3) above. Note that there is a single type unification required for Sara. For the object c, on the other hand, there are 2 type unifications (between a total of 3 types). Since type unifications are commutative and associative, they can occur in any order:

(a) the type unification ((cat • entity) • physical) → cat
(b) the type unification ((physical • cat) • entity) → cat
(c) the type unification ((physical • entity) • cat) → cat

Page 48

As mentioned in our introduction, in our framework ontological concepts include abstract objects such as states, processes, events, properties, etc. Let us now consider one of these categories, namely activities. In our framework, a concept such as dancer(x) is true of some x according to the following:

(∀x :: human)(dancer(x) ≡ (∃d :: activity)(dancing(d) ∧ agent(d, x)))

That is, any object x of type human is a dancer iff there is some object d of type activity such that d is a dancing activity and x is the agent of d. Note that, according to the above, there are at least two objects that are part of the meaning of 'dancer': some object x of type human, and some dancing activity d. Thus, in saying 'beautiful dancer', for example, one could be using 'beautiful' to describe the dancer, or the dancing activity itself. Consider now the interpretation below, assuming that beautiful(x :: entity); that is, assuming beautiful is a property that can be said of any entity:

(4) Sara is a beautiful dancer ⇒
    (∃!Sara :: thing)(∃a :: activity)(dancing(a) ∧ agent(a :: activity, Sara :: human)
        ∧ (beautiful(a :: entity) ∨ beautiful(Sara :: entity)))


Thus 'Sara is a beautiful dancer' is interpreted as follows: there's a unique object named Sara, and some activity a, such that a is a dancing activity, and Sara is the agent of a (and as such must be an object of type human), and either the dancing is beautiful, or Sara is (or, of course, both). Note now that there are a number of type unifications that must occur:

(Sara :: ((thing • human) • entity))
    → (Sara :: (human • entity))
    → (Sara :: human)

(a :: (activity • entity))
    → (a :: activity)

After all is said and done, the interpretation of (4) is the following:

(4) Sara is a beautiful dancer ⇒
    (∃!Sara :: human)(∃a :: activity)(dancing(a) ∧ agent(a, Sara)
        ∧ (beautiful(a) ∨ beautiful(Sara)))

Finally, 'Sara is a beautiful dancer' is interpreted as: there's a unique object named Sara, an object that is of type human, and some dancing activity a, where Sara is the agent of that activity, and either Sara is beautiful, or her dancing is (or, of course, both). Note that the ambiguity of what 'beautiful' is describing is still represented in our final interpretation.

the paradox of the ravens
a brief detour


Before we continue with our proposal, we would like to illustrate the utility of separating concepts into logical and ontological concepts. We will do this here by proposing a solution to the so-called Paradox of the Ravens. Introduced in the 1940's by the logician (and one-time assistant of Rudolf Carnap) Carl Gustav Hempel, the Paradox of the Ravens (also known as Hempel's Paradox, or the Paradox of Confirmation) has continued to occupy logicians, statisticians, and philosophers of science to this day. The paradox arises when one considers what constitutes evidence for a statement (or hypothesis). To illustrate the paradox, consider the following:

(H1) All ravens are black

(H2) All non-black things are not ravens

That is, we have the hypothesis H1 that ‘All ravens are black’. This hypothesis, however, is logically equivalent to the hypothesis H2 that ‘All non-black things are not ravens’, as shown below:

(1) (∀x)(raven(x) ⊃ black(x))

(2) (∀x)(¬black(x) ⊃ ¬raven(x))

(1) and (2) are logically equivalent; thus, any evidence/observation that confirms H1 must also confirm H2, and vice versa. While it sounds reasonable that observing black ravens should confirm H1, observing a white ball or a red sofa, which does confirm H2, also confirms the logically equivalent hypothesis that all ravens are black, and this does not sound plausible.


H1: All ravens are black
H2: All non-black things are not ravens

Observing black ravens confirms hypothesis H1, namely that ‘All ravens are black’.

Observing non-black objects that are not ravens confirms hypothesis H2 (that all non-black things are not ravens). But H2 is logically equivalent to H1, leaving us with the unpleasant conclusion that observing red apples, blue suede shoes, or brown briefcases, confirms the hypothesis that ‘All ravens are black’.


Many solutions have been proposed to the Paradox of the Ravens, ranging from accepting the paradox (that observing red apples and other non-black non-ravens does confirm the hypothesis 'All ravens are black') to proposals in the Bayesian tradition that try to measure the 'degree' of confirmation.

The Bayesian proposals essentially amount to saying that observing a red apple does confirm the hypothesis 'All ravens are black', but it does so very minimally, and certainly much less than the observation of a black raven confirms 'All ravens are black'.

Clearly, this is not a satisfactory solution since observing a red flower should not contribute at all to the confirmation of ‘All ravens are black’. Worse, in the Bayesian analysis, the observation of black but non-raven objects actually negatively confirms (or disconfirms) the hypothesis that ‘All ravens are black’.


One logician who stands out in suggesting an explanation for the Paradox of the Ravens is W. V. Quine, who suggested (in 'Natural Kinds') that there is no paradox in the first place, since universal statements of the form 'All Fs are Gs' can only be confirmed for what he called natural kinds, and 'non-black things' and 'non-ravens' are not natural kinds. Basically, for Quine, members of a natural kind must share most of their properties, and there is hardly anything similar between all 'non-black things', or all non-ravens.

While statistical/Bayesian and other logical proposals have still not suggested a reasonable explanation for the Ravens Paradox, we believe the line of thought Quine was pursuing is the most appropriate. However, Quine's natural kinds were not well-defined. In fact, what Quine was probably alluding to is that there is a difference between what we have called here logical concepts and ontological concepts.


W. V. Quine (1969), 'Natural Kinds', in Ontological Relativity and Other Essays, Columbia University Press, pp. 114–138.


The so-called Paradox of the Ravens exists simply because of mistakenly representing both ontological and logical concepts by predicates, although, ontologically, these two types of concepts are quite different. First, let us discuss some predicates and how we usually represent them in first-order logic. Consider the following:

(1) black(x)

(2) imminent(x)

(3) sympathetic(x)

(4) hungry(x)

(5) dog(x)

(6) guitar(x)

Suppose now that we would like to add types to our variables; that is, we would like our logical expressions to be, in computer programming terminology, strongly-typed. Suppose, further, that we would also like our predicates to be polymorphic; that is, they apply to objects of a certain type and to all of its subtypes, so that if a predicate applies to objects of type vehicle, then it applies to all subtypes of vehicle (e.g., car, truck, bus, …). Given this, what are the appropriate types that one might associate with the variables of the predicates above? Here are some possible type assignments:

(1) black(x :: physical)

(2) imminent(x :: event)

(3) sympathetic(x :: human)

(4) hungry(x :: animal)


What the above suggests is that, ignoring metaphor for the moment, the predicate black applies to objects that are of type physical. In other words, black is meaningless (or nonsensical) when applied to (or said of) objects that are not of type physical. Similarly, the above says that imminent is said of objects that are of type event (and, of course, all its subtypes, so we can say 'an imminent trip', an 'imminent meeting', an 'imminent election', etc.). In the same vein, the above says that sympathetic is said of objects that must be of type human, and that hungry applies to objects of type animal. But what about the predicates in (5) and (6)? What are the most appropriate types that can be associated with the variables in the predicates dog(x) and guitar(x); that is, of what types of objects can these predicates be meaningfully said? The only plausible answer seems to be the following:

(5) dog(x :: dog)

(6) guitar(x :: guitar)

(5) and (6) are obvious tautologies since, for example, the predicate dog applied to an object of type dog is always true. Clearly, then, (5) and (6) are quite different from the predicates in (1) through (4): while the predicates in (1) through (4) are logical concepts, dog and guitar are not predicates/logical concepts, but ontological concepts that correspond to types in a strongly-typed ontology. With this background, let us now go back to the so-called Paradox of the Ravens.
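This distinction can be made operational. In a typed representation, each predicate carries the type its argument must have; when that constraint names the predicate itself, applying the predicate is a tautology, which is exactly what flags it as an ontological concept (a type) rather than a logical one. A rough sketch, where the constraint table is an illustrative assumption:

# Logical vs. ontological concepts (an illustrative sketch). Each
# predicate is paired with the type its argument must have; when that
# type is the predicate's own name (dog(x :: dog)), applying the
# 'predicate' is a tautology, so it is really a type in the ontology.
TYPE_CONSTRAINT = {
    "black": "physical",
    "imminent": "event",
    "sympathetic": "human",
    "hungry": "animal",
    "dog": "dog",
    "guitar": "guitar",
}

def is_ontological(concept):
    """True when the predicate is trivially true of its own type, i.e., it is a type."""
    return TYPE_CONSTRAINT[concept] == concept

assert not is_ontological("black")   # logical concept: a property of physical objects
assert is_ontological("dog")         # ontological concept: a node in the hierarchy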


The apparent paradox in the so-called Paradox of the Ravens was simply due to a logical representation that treated logical and ontological concepts the same, namely by treating both as predicates in first-order logic. We will now show that starting with the traditional first-order logic representation and associating types with variables of all predicates leads to the proper logical representation, where no paradox of the ravens arises.

All ravens are black ⇒
1. (∀x)(raven(x) ⊃ black(x))
2. (∀x)(raven(x :: raven) ⊃ black(x :: physical))
3. (∀x)(raven(x :: raven) ⊃ black(x :: (physical • raven)))
4. (∀x)(raven(x :: raven) ⊃ black(x :: raven))
5. (∀x)(true ⊃ black(x :: raven))
6. (∀x :: raven)(true ⊃ black(x))
7. (∀x :: raven)(black(x))

What we have in (1) is a straightforward translation of 'All ravens are black' into first-order logic. In (2), the most appropriate types are associated with every variable of every predicate. In (3), a type unification is performed between raven and physical, the result of which is shown in (4). In (5), raven(x :: raven) is replaced by true, since the predicate raven applied to objects of type raven is always true. In (6) the type of x is associated once and for all with the quantifier taking the widest scope. Finally, (7) is a simplification of (6), since (true ⊃ p) ≡ p.


Let us now do the same for the equivalent hypothesis, ‘All non-black things are not ravens’:

All non-black things are not ravens ⇒
1. (∀x)(¬black(x) ⊃ ¬raven(x))
2. (∀x)(¬black(x :: physical) ⊃ ¬raven(x :: raven))
3. (∀x)(¬black(x :: (physical • raven)) ⊃ ¬raven(x :: raven))
4. (∀x)(¬black(x :: raven) ⊃ ¬raven(x :: raven))
5. (∀x)(¬black(x :: raven) ⊃ ¬true)
6. (∀x)(¬black(x :: raven) ⊃ false)
7. (∀x :: raven)(¬black(x) ⊃ false)
8. (∀x :: raven)(black(x) ∨ false)
9. (∀x :: raven)(black(x))

Again, in (1) we have a straightforward translation of 'All non-black things are not ravens' into first-order logic. In (2), the most appropriate types are associated with every variable of every predicate. In (3), a type unification is performed between raven and physical, the result of which is shown in (4). In (5), raven(x :: raven) is replaced by true, since the predicate raven applied to objects of type raven is always true, and in (6) ¬true is replaced by false. In (7) the type of x is associated once and for all with the quantifier taking the widest scope. Finally, in (8) and (9) simplifications are made using (p ⊃ q) ≡ (¬p ∨ q) and (p ∨ false) ≡ p.


What the above shows is that properly identifying the difference between ontological and logical concepts results in one and the same logical formula for H1 and H2:

All ravens are black ⇒ (∀x :: raven)(black(x))

All non-black things are not ravens ⇒ (∀x :: raven)(black(x))

In both cases, therefore, we have one and the same hypothesis, one that is equally confirmed by the same objects, namely ravens, which must be black. Note also that the final representation of 'All ravens are black' does not assume existential import; in other words, and unlike the traditional first-order logic representation, our (∀x :: raven)(black(x)) is not vacuously true when there are no ravens!
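As a sanity check, the two derivations can be reproduced mechanically with the same toy unification operator used earlier; the mini-hierarchy below is, as before, an illustrative assumption rather than the actual ontology:

# Reducing both raven hypotheses mechanically (an illustrative sketch;
# the mini-hierarchy is an assumption). raven is an ontological concept
# (a type); black is a logical concept constrained to physical objects.
PARENT = {"physical": "entity", "raven": "physical"}

def supers(t):
    out = [t]
    while t in PARENT:
        t = PARENT[t]
        out.append(t)
    return out

def unify(s, t):
    """(s • t): the more specific type if one subsumes the other, else None."""
    return s if t in supers(s) else (t if s in supers(t) else None)

# In H1, raven(x :: raven) forces x :: raven, and black's constraint
# then unifies with it: (physical • raven) → raven. In H2 the same
# unification occurs in the antecedent, ¬black(x :: (physical • raven)).
assert unify("raven", "physical") == "raven"

# Since raven(x :: raven) is a tautology (true), both hypotheses then
# simplify to (∀x :: raven)(black(x)): only observations of ravens bear
# on either; red apples and white balls never enter the quantifier's domain.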

Our intention behind this short detour into the Paradox of the Ravens was to illustrate the utility of our representation, namely the separation of concepts into ontological concepts, which are types in a strongly-typed ontological structure, and logical concepts which are the properties of and the relations that hold between these ontological types.

We will now continue with describing our proposal.

putting logical semantics back to work
type unifications and resolving ambiguities

the framework
failed type unifications

Thus far our type unifications have always succeeded. In some cases, however, a type unification between two types s and t could fail, and we write this as

(s • t) → ⊥

Let us see where this might occur and what it would result in. Consider the interpretation of 'Sara is a tall dancer', where we assume tall(x :: physical); that is, we are assuming that tall is a property that applies to objects that must be of type physical.

(5) Sara is a tall dancer ⇒
    (∃!Sara :: thing)(∃a :: activity)(dancing(a) ∧ agent(a :: activity, Sara :: human)
        ∧ (tall(a :: physical) ∨ tall(Sara :: physical)))

The type unifications needed for Sara are quite simple, since human ⊑ physical ⊑ thing:

(Sara :: ((thing • human) • physical))
    → (Sara :: (human • physical))
    → (Sara :: human)

The type unification needed for the activity a, however, is not as straightforward. Before we continue, let us plug in the type unification of Sara to see where we’re at.


(5) Sara is a tall dancer ⇒
    (∃!Sara :: human)(∃a :: activity)(dancing(a) ∧ agent(a, Sara)
        ∧ (tall(a :: (physical • activity)) ∨ tall(Sara)))

Note that the only term with an outstanding type unification, tall(a :: (physical • activity)), requires a unification that will not succeed, as there is no object that can be both a physical object and an activity. That is,

tall(a :: (physical • activity))
    → tall(a :: ⊥)
    → false

Since (p ∨ false) ≡ p, putting all of this into (5) now yields the following:

(5) Sara is a tall dancer ⇒
    (∃!Sara :: human)(∃a :: activity)(dancing(a) ∧ agent(a, Sara) ∧ tall(Sara))

Therefore, unlike the situation in (4), where 'beautiful' in 'Sara is a beautiful dancer' was allowed to describe Sara and/or her dancing, in (5), due to a failed type unification, 'tall' is not ambiguous but describes Sara exclusively.
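The contrast between (4) and (5) can be rendered as a small computation: unify the adjective's type constraint with the type of each candidate (the dancer and the dancing activity) and keep only the readings that survive. A sketch, with illustrative type assignments:

# Which readings of an adjective survive type unification?
# An illustrative sketch; constraints and hierarchy are assumptions.
PARENT = {"physical": "entity", "abstract": "entity",
          "human": "physical", "activity": "abstract"}

def supers(t):
    out = [t]
    while t in PARENT:
        t = PARENT[t]
        out.append(t)
    return out

def unify(s, t):
    return s if t in supers(s) else (t if s in supers(t) else None)

def surviving_readings(adj_constraint, candidates):
    """Keep the candidates whose type unifies with the adjective's constraint."""
    return [name for name, typ in candidates
            if unify(typ, adj_constraint) is not None]

candidates = [("Sara", "human"), ("dancing", "activity")]
# beautiful(x :: entity): both readings survive, so the ambiguity remains.
assert surviving_readings("entity", candidates) == ["Sara", "dancing"]
# tall(x :: physical): (physical • activity) fails, so only Sara remains.
assert surviving_readings("physical", candidates) == ["Sara"]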

the framework
types & lexical disambiguation

Thus far we have only had single types associated with our objects. However, types are associated with senses (meanings), and not with words. Since words are in reality ambiguous, in most cases the initial type assignment will not be a single type but a set of types, where the various type unifications will in the end reduce the set of 'acceptable' types to, ideally, one. To illustrate, let us consider the following two sentences:

(7) The party was cancelled.

(8) The party nominated Jon.

According to WordNet, ‘party’ has five meanings (although two of those are just shades of the same meaning). The meanings of ‘party’ include ‘political party’, ‘social gathering/group’, and the meaning of a ‘social event’. Thus, the interpretation of (7) would initially start as follows:

(7) The party was cancelled ⇒
    (∃!p :: {politicalParty, socialGroup, socialEvent, …})(∃a :: activity)
        (cancellation(a) ∧ object(a, p :: event))

That is, there is some unique object p, an object that could be a politicalParty, a socialGroup, or a socialEvent, and some cancellation activity, where p is the object of this cancellation, and as such, it must be an object of type event. The following are the type unifications that are needed:


(p :: {(politicalParty • event), (socialGroup • event), (socialEvent • event), …})
    → (p :: {⊥, ⊥, socialEvent})
    → (p :: socialEvent)

Notice that due to type unifications two initially assumed types of p (politicalParty and socialGroup) were eliminated. Thus, the meaning of 'party' that makes sense in this context is that of a social event. Similarly, the interpretation of (8) is the following (see the relevant fragment hierarchy in figure 2 below):

(8) The party nominated Jon ⇒
    (∃!p :: {politicalParty, socialGroup, socialEvent, …})(∃!Jon :: human)(∃a :: activity)
        (nomination(a) ∧ agent(a, p :: {human, socialGroup}) ∧ object(a, Jon :: human))

Notice that the agent of a nomination can be an individual or some social group. Thus, there are two sets of types for p that must be unified, and the total number of type unifications is the number of elements in the Cartesian product of these two sets:

(p :: {politicalParty, socialGroup, socialEvent} • {human, socialGroup})
    → (p :: {{(politicalParty • human), (politicalParty • socialGroup)},
             {(socialGroup • human), (socialGroup • socialGroup)},
             {(socialEvent • human), (socialEvent • socialGroup)}, …})


[Figure 2 (diagram). A fragment of the ontological structure, with nodes including entity, abstract, physical, event, socialEvent, recreationalEvent, socialGathering, socialGroup, politicalParty, and party.]


The six different type unifications that are possible for p are then performed as follows:

(p :: {{(politicalParty • human), (politicalParty • socialGroup)},
       {(socialGroup • human), (socialGroup • socialGroup)},
       {(socialEvent • human), (socialEvent • socialGroup)}, …})
    → (p :: {{⊥, politicalParty}, {⊥, socialGroup}, {⊥, ⊥}, …})
    → (p :: {{politicalParty}, {socialGroup}, {}, …})

At this point an attempt is made at uniting the various remaining subsets by type unification (as opposed to equality), where a type belongs to the union of two sets if it is the result (s • t) of unifying some type s from one set with some type t from the other. With this, the type unifications needed for p proceed as follows:

(p :: {(politicalParty • socialGroup)})
    → (p :: {politicalParty})
    → (p :: politicalParty)

In the final analysis, therefore, the meaning of 'party' that makes sense in the context of 'The party nominated Jon' is that of a 'political party'.
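The disambiguation of (7) and (8) can likewise be phrased as set-level unification: unify every candidate type for 'party' with every type the context permits, and discard the failures. A sketch, where the hierarchy and the sense inventory are illustrative assumptions:

# Set-level type unification for lexical disambiguation (a sketch;
# hierarchy and sense inventory are illustrative assumptions).
PARENT = {"physical": "entity", "abstract": "entity",
          "human": "physical", "event": "abstract",
          "socialEvent": "event", "socialGroup": "abstract",
          "politicalParty": "socialGroup"}

def supers(t):
    out = [t]
    while t in PARENT:
        t = PARENT[t]
        out.append(t)
    return out

def unify(s, t):
    return s if t in supers(s) else (t if s in supers(t) else None)

def disambiguate(senses, constraints):
    """Unify every sense with every contextual constraint; keep the successes."""
    results = {unify(s, c) for s in senses for c in constraints}
    return results - {None}

party = {"politicalParty", "socialGroup", "socialEvent"}
# (7) 'The party was cancelled': the object of a cancellation is an event.
assert disambiguate(party, {"event"}) == {"socialEvent"}
# (8) 'The party nominated Jon': the agent of a nomination is a human or a
# socialGroup; the survivors then unify with each other to politicalParty.
surviving = disambiguate(party, {"human", "socialGroup"})
assert surviving == {"politicalParty", "socialGroup"}
assert unify("politicalParty", "socialGroup") == "politicalParty"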

the framework
salient properties/relations

Consider the fragment hierarchy shown in figure 3 below. While humans make (as physical artifacts)¹, ride (as vehicles), and drive (as road vehicles) trucks, the most salient relationship between a human and a truck is drive. Let us consider the sentence in (9), which involves what is known in linguistics as metonymy (a figure of speech in which an object, attribute, or some adjunct is used as a substitute for another object that it is associated with, or is somehow related to).

(9) The truck in front is obnoxious ⇒
    (∃!t :: truck)(∃r :: region)(front(r :: region)
        ∧ in(t :: (physical • truck), r :: region)
        ∧ obnoxious(t :: (human • truck)))

Note now that the object t, which is of type truck, must be unified with the type human in the context of being obnoxious. However, neither human ⊑ truck nor truck ⊑ human holds. In a situation like this, a search is made for the most salient relationship between these two types. In this case it is the relation drive(human, *), where * stands for truck (figure 3 below).


¹ Humans also make abstract artifacts (e.g., we make music, we make algorithms, we make plans, we make decisions, etc.); however, that 'make' is different from physical construction. That is, there are many ways of making, depending on the (ontological) kind of thing being made, and thus the different senses of make, such as create, invent, conceive, plan, design, etc. This subject is beyond the scope of the current presentation.


[Figure 3 (diagram). A fragment of the ontological structure: entity subsumes abstract and physical; physical subsumes human and physicalArtifact; physicalArtifact subsumes vehicle, which subsumes roadVehicle, which in turn subsumes truck. The relations make(human, *), ride(human, *), and drive(human, *) are attached at physicalArtifact, vehicle, and roadVehicle respectively, where '*' points to the type the relation is attached to. Thus, while humans make artifacts, humans ride vehicles, and they drive road vehicles, including, of course, their subtypes, such as truck.]


Introducing the salient relation drive(human, *) essentially means introducing a new object, namely some object of type human, into the context (recall the 'missing text' discussion!). The situation now looks like this:

(9) The truck in front is obnoxious ⇒
    (∃!t :: truck)(∃!x :: human)(∃r :: region)
        (front(r :: region)
        ∧ in(t :: (physical • truck), r :: region)
        ∧ obnoxious(x :: human)
        ∧ drive(x :: human, t :: (roadVehicle • truck)))

where

(t :: (roadVehicle • truck)) → (t :: truck)
(t :: (physical • truck)) → (t :: truck)

In the final analysis, therefore, ‘The truck in front is obnoxious’ was interpreted as ‘The [person driving the] truck in front is obnoxious’:

(9) The truck in front is obnoxious ⇒
    (∃!t :: truck)(∃!x :: human)(∃r :: region)
        (front(r) ∧ in(t, r) ∧ obnoxious(x) ∧ drive(x, t))

the road ahead
ontological semantics: contents

We trust that the above introduction to and brief description of our proposal provides sufficient background to present the proposal in more detail. The rest of this study will describe how our proposal tackles some of the most challenging phenomena in natural language, namely:

1. Word-Sense Disambiguation
2. Adjective-Ordering Restrictions
3. Metonymy
4. Abstract Objects (states, processes, events, activities, etc.)
5. Abstract and Concrete Existence
6. Co-predication
7. Reference Resolution
8. Quantifier Scope Resolution
9. Prepositional Phrase Attachments
10. Intensional Contexts
11. Compound Nominals
12. The Ontological Structure
13. An Implementation Roadmap

In the last two sections (sections 12 and 13) we discuss the strongly-typed ontological structure that we assumed throughout and suggest the appropriate way to implement our proposal.

putting logical semantics back to work
ontologic and language understanding

the proposal
word-sense disambiguation

We will now start by detailing how our proposal can account for a number of phenomena in natural language. We will do so starting with one of the basic (and first) challenges that an NLU system will face, namely that of word-sense (or lexical) disambiguation. Recall the sentences in (10) and (11), discussed earlier as (7) and (8):

(10) The party was cancelled.

(11) The party nominated Jon.

In these sentences the ambiguity we considered was that of 'party', which could refer to a social event, as in (10), or to a political organization, as in (11). Since 'party' corresponds to an ontological concept (i.e., a type; see figure 2), the lexical ambiguity of 'party' thus translated into a set of types, a set that subsequent type unifications reduced to a singleton (i.e., to a single type).

(10) (p :: {(politicalParty • event), (socialGroup • event), (socialEvent • event), …})
         → (p :: {⊥, ⊥, socialEvent})
         → (p :: socialEvent)

(11) (p :: {(politicalParty • socialGroup), …})
         → (p :: politicalParty)

Below we will see, however, that lexical ambiguity often translates, in addition to an ambiguity of types (i.e., an ambiguity in ontological concepts), into an ambiguity in logical concepts; i.e., an ambiguity in the properties of, and the relations between, ontological concepts.


Let us now look at situations where lexical ambiguities translate into ambiguities in both logical and ontological concepts. Consider the sentences in (10) and (11):

(10) Melinda ran for twenty minutes.

(11) The program ran for twenty minutes.

First of all, there is a clear ambiguity in the meaning of 'program', as it could refer to a computer program (i.e., a process), or to a program of some event, among other meanings. Second, it is clear that the running of Melinda in (10) is different from the running of the program in (11). Let us consider the simpler of these two cases, namely the ambiguity in (10), assuming that there are (at least) two kinds of running activities: one whose agent is a (legged) animal, and one whose agent is a process:

(10) Melinda ran for twenty minutes ⇒
     (∃!Melinda :: human)(∃!m :: minute)(∃a :: activity)
         ((running(a) ∧ agent(a, Melinda :: (process • human)) ∧ duration(m, 20))
         ∨ (running(a) ∧ agent(a, Melinda :: (leggedAnimal • human)) ∧ duration(m, 20)))

What the above says is the following: there’s a unique object named Melinda, some twenty minutes that Melinda ran, and either a running activity of some human, or the running of some process.



Thus, while the ambiguity of an ontological concept translates into a set of possible types, the ambiguity of a logical concept (such as running) translates into a disjunction of predicates. In the above we have two type unifications that will proceed as follows, assuming that human ⊑ leggedAnimal:

agent(a, Melinda :: (process • human))
    → agent(a, Melinda :: ⊥)
    → false

agent(a, Melinda :: (leggedAnimal • human))
    → agent(a, Melinda :: human)

Note now that

((running(a) ∧ false ∧ duration(m, 20))
    ∨ (running(a) ∧ agent(a, Melinda :: human) ∧ duration(m, 20)))
≡ (false ∨ (running(a) ∧ agent(a, Melinda :: human) ∧ duration(m, 20)))
≡ (running(a) ∧ agent(a, Melinda :: human) ∧ duration(m, 20))
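This pruning can be mirrored in the same style as the earlier sketches: each sense of 'ran' contributes a disjunct with its own agent-type constraint, and any disjunct whose constraint fails to unify with the agent's type is dropped. A sketch, with an assumed two-sense inventory for 'ran':

# Pruning a disjunction of predicate senses by agent-type unification
# (an illustrative sketch; hierarchy and sense inventory are assumptions).
PARENT = {"physical": "entity", "abstract": "entity",
          "leggedAnimal": "physical", "human": "leggedAnimal",
          "process": "abstract"}

def supers(t):
    out = [t]
    while t in PARENT:
        t = PARENT[t]
        out.append(t)
    return out

def unify(s, t):
    return s if t in supers(s) else (t if s in supers(t) else None)

# Two senses of 'ran', distinguished by the type of their agent.
RUN_SENSES = {"run-as-locomotion": "leggedAnimal",
              "run-as-execution": "process"}

def surviving_senses(agent_type):
    """Drop every disjunct whose agent constraint fails to unify."""
    return [sense for sense, constraint in RUN_SENSES.items()
            if unify(agent_type, constraint) is not None]

# 'Melinda ran...': Melinda :: human, and (process • human) fails,
# so only the locomotion sense of 'ran' survives.
assert surviving_senses("human") == ["run-as-locomotion"]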


The final interpretation of 'Melinda ran for twenty minutes' is the following:

(10) Melinda ran for twenty minutes ⇒
     (∃!Melinda :: human)(∃!m :: minute)(∃a :: activity)
         (running(a) ∧ agent(a, Melinda) ∧ duration(m, 20))

That is, there's a unique (human) object named Melinda, an object that is the agent of a twenty-minute running activity, a running that is performed by humans. Let us now consider the interpretation of (11), which would proceed as follows:

(11) The program ran for twenty minutes ⇒
     (∃!p :: {eventProgram, computerProgram})(∃!m :: minute)(∃a :: activity)
         ((running(a) ∧ agent(a, p :: process) ∧ duration(m, 20))
         ∨ (running(a) ∧ agent(a, p :: leggedAnimal) ∧ duration(m, 20)))

Thus, in addition to the ambiguity of what logical concept we are referring to by 'ran', we also have the ontological ambiguity of which type of 'program' we are referring to. In this case we hope that the type unifications will reduce the set of types associated with 'program' to a single type, and will also eliminate one of the disjuncts of the running activity.


The following are the type unifications that are required in (11):

(∃!p :: {(eventProgram • process), (computerProgram • process),
         (eventProgram • leggedAnimal), (computerProgram • leggedAnimal)})
    → (∃!p :: {⊥, computerProgram, ⊥, ⊥})
    → (∃!p :: computerProgram)

// the running activity whose agent is a process
agent(a, p :: (process • computerProgram)) → agent(a, p :: computerProgram)
agent(a, p :: (process • eventProgram)) → agent(a, p :: ⊥) → false

// the running activity whose agent is a legged animal
agent(a, p :: (leggedAnimal • computerProgram)) → agent(a, p :: ⊥) → false
agent(a, p :: (leggedAnimal • eventProgram)) → agent(a, p :: ⊥) → false

Note now that the only type left for 'program' is computerProgram (a process), and the only 'running' activity left is the running of a process.
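In (11) both dimensions interact: the candidate types of 'program' are crossed with the agent constraints of the two senses of 'ran', and every failing pair is discarded. A sketch continuing the same illustrative assumptions:

# Crossing an ambiguous noun's candidate types with an ambiguous verb's
# agent constraints (an illustrative sketch; all names are assumptions).
PARENT = {"physical": "entity", "abstract": "entity",
          "leggedAnimal": "physical", "process": "abstract",
          "computerProgram": "process", "eventProgram": "abstract"}

def supers(t):
    out = [t]
    while t in PARENT:
        t = PARENT[t]
        out.append(t)
    return out

def unify(s, t):
    return s if t in supers(s) else (t if s in supers(t) else None)

program_senses = {"eventProgram", "computerProgram"}
run_agent_constraints = {"run-as-execution": "process",
                         "run-as-locomotion": "leggedAnimal"}

# Keep every (noun sense, verb sense) pair whose unification succeeds.
viable = {(p, r)
          for p in program_senses
          for r, constraint in run_agent_constraints.items()
          if unify(p, constraint) is not None}

# Of the four combinations, only computerProgram with the
# process-execution sense of 'ran' survives.
assert viable == {("computerProgram", "run-as-execution")}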

the proposal
adjective-ordering restrictions

We have briefly discussed above the notion of adjective-ordering restrictions (AORs). This phenomenon can be illustrated by the sentences below, where (1a) is considered more natural to say than (1b), and similarly for (2a) and (2b).

(1) a. A small leather suitcase was found unattended
    b. A leather small suitcase was found unattended

(2) a. John is a pleasant young boy
    b. John is a young pleasant boy

In this section we will explain this phenomenon and how it is related to the notions of type unification, polymorphism, and type casting. This explanation will also cover when certain adjectives can be ambiguous in what they might be describing, and when this ambiguity is not allowed. For example, while 'beautiful' in (3) may be describing Olga as a person, or her dancing (or both), this possibility is not available in (4).

(3) Olga is a tall beautiful dancer
(4) Olga is a beautiful tall dancer

the proposal

adjective-ordering restrictions

See our explanation of the phenomenon of adjective-ordering restrictions (paper presented at KI-2008).


To be continued …
