Reading to Learn Q4 Review Peter Clark John Thompson Phil Harrison Bill Murray


Agenda
• Introduction
• Recap – The Story So Far
• The “Knowledge Gap”
  – Overview
  – Characterization and analysis
  – Quantification
• Two Case Studies
  – AP chemistry
  – Grade-school biology
• Dimensions of Difficulty
• Principles for an Extensible KB
• Knowledge Mining
• Summary: Findings, Products, and Recommendations

SRI-Boeing’s Reading to Learn Seedling
• Goal:
  – Study issues in learning through reading by working with a reduced version of the problem, namely working with controlled, rather than unrestricted, natural language. The NLP task is factored into two:
    • full NL → CL
    • CL → logic (this project)
• Rationale:
  – By sidestepping some of the shallow linguistic issues of full NLP, we can focus on deeper issues
  – Methods for full NL → CL can be studied separately

SRI-Boeing’s Reading to Learn Seedling
• Approach:
  – Rewrite 5 pages of chemistry text into our controlled language, CPL
  – Extend and use our CPL interpreter to generate logic
  – Integrate this new knowledge with an existing chemistry knowledge base (from the Halo Pilot), which has the new knowledge surgically deleted from it
  – Report on the problems encountered and solutions developed

This Seedling in Mobius

[Diagram: the Mobius components (Knowledge Integration, Introspection, Natural Language Processing, Test Generation), with this seedling marked.]


Recap: October 2005

• Tutorial on the 5 pages of chemistry text
  – Acid-base reactions, proton transfer
• Where is that knowledge in the text?
  – Wanted: Clear, declarative statements
  – Got: obscure/missing/complex/indirect
• Where is that knowledge in the Halo KB?
  – Wanted: Modular, constructed from general pieces
  – Got:
    • Buried in procedures and code
    • Very hard to ablate or extend
  – Suggestions for a better KB structure

(every Compute-Conjugate-Acid has
  (input ((a Chemical with (plays ((a Base-Role))))))
  (parent_formula ((the term of (the nested-atomic-chemical-formula of
                     (the has-basic-structural-unit of (the input of Self))))))
  (target-unit ((if (the parent_formula of Self)
                 then (:set (#'(LAMBDA () (GET-CONJUGATE-ACID-ATOMIC-FORMULA-BACK
                                 (KM0 '(|the| |parent_formula| |of| |Self|)))))))))
  (output
    ((if (oneof (the input of Self) where (It isa H2O-Substance))
      then (a H3O-Plus-Substance)
      else ((forall (allof2 (the target-unit of Self)
                      where ((not (It2 = (the parent_formula of Self)))))
              (the output of (a Identify-Chemical with (input ((a Chemical with
                (has-basic-structural-unit ((the output of
                  (a Identify-Chemical-Entity with (input ((a Chemical-Entity with
                    (nested-atomic-chemical-formula ((a Chemical-Formula with (term (It)))))))))))))))))))))))

?

“An acid = a base + a proton”

(every Acid-Role has
  (intensity (
    (a Intensity-Value with
      (value (
        (:pair ;; Case statement for Acids.
          (if ((the played-by of Self) isa Ionic-Compound-Substance)
           then (if (((the played-by of Self) isa HCl-Substance) or
                     ((the played-by of Self) isa HBr-Substance) or
                     ((the played-by of Self) isa HI-Substance) or
                     ((the played-by of Self) isa HClO3-Substance) or
                     ((the played-by of Self) isa HClO4-Substance) or
                     ((the played-by of Self) isa H2SO4-Substance) or
                     ((the played-by of Self) isa HNO3-Substance))
                 then *strong
                 else (if (((the played-by of Self) isa H3PO4-Substance) or
                           ((the played-by of Self) isa HF-Substance) or
                           ((the played-by of Self) isa HC2H3O2-Substance) or
                           ((the played-by of Self) isa H2CO3-Substance) or
                           …

Relative strengths of different acids

Recap: March 2006

• Two CPL versions: (i) close to text, (ii) close to inference
  – Predictable performance
• Discussion of “bridging the gap”

Manually bridging the “gap”:

IF there is an equation of a reaction
AND a first chemical entity has a chemical formula
AND a second chemical entity has a second chemical formula
AND the first chemical formula is part of the left side of the equation
…..
THEN the direction of the reaction is right
AND the equilibrium side of the reaction is right.

Inference-Supporting CPL: Predictable Performance

Task | Halo KB | CPL
Conjugate pairs | Giant KM procedure for formula manipulation | Lookup table
Relative strengths | Qualitative absolute strengths (strong/weak/negligible) + qualitative comparison | Relative strength assertions
Labelling acid/bases in a reaction | Giant KM procedure for reaction manipulation | if-then rule using conjugate pairs
Computing direction of the reaction | KM rule | if-then rule

Slide annotations: “More general”; “≈ (equivalent)”.

Questions and Tasks from Last Time
• Analysis of “the gap”
  – What is the nature of the gap?
  – Can we characterize it?
  – Can we quantify it?
• AP chemistry vs. grade-school biology
  – How does the gap look in different texts? Domains?
  – What are the fundamental problems?
  – How severe are they?
  – How might they be overcome?
• Case Studies
  – Given
    • text/naïve CPL formulation A
    • inference-capable target B
  – What knowledge is needed to get from A to B?
  – How much can be pump-primed, how much bootstrapped?

I: Understanding Language

[Diagram: the Mobius components (Knowledge Integration, Introspection, Natural Language Processing, Test Generation), with this seedling marked.]

Natural and Controlled Languages
• Where is Reading to Learn/Mobius’s Achilles’ heel?
  – Schubert: “Dealing with real natural language”
  – Not (just) the grammatical complexity
  – It is the imprecision, messiness, incompleteness, and erroneous nature of real language
• Two styles of CPL usage:
  (i) As a declarative rule language
  (ii) As grammatically simpler real language
• Worked with both within this Seedling
  (i) does inference, but is far from the original text
  (ii) is close to the text, but barely supports inference

(i) CPL as a declarative rule language

“IF a first chemical is stronger than a second chemical
AND the second chemical is stronger than a third chemical
THEN the first chemical is stronger than the third chemical.”

“IF there is an equation of a reaction
AND a first chemical entity has a chemical formula
AND a second chemical entity has a second chemical formula
AND the first chemical formula is part of the left side of the equation
AND the second chemical formula is part of the right side of the equation
AND the first chemical entity is playing a base role
AND the second chemical entity is playing a base role
AND the first chemical entity is stronger than the second chemical entity
THEN the direction of the reaction is right
AND the equilibrium side of the reaction is right.”
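As a rough illustration of what these two rules ask of a reasoner, they can be sketched in Python. This is only a sketch under stated assumptions: the function names, the pair-based fact format, and the sample base-strength facts are invented for the example and are not part of CPL or the Halo KB.

```python
from itertools import product


def transitive_closure(stronger):
    """Rule 1: IF a > b AND b > c THEN a > c, applied until nothing new is derived."""
    closure = set(stronger)
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot so the set can grow inside the loop.
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure


def reaction_direction(left_base, right_base, stronger):
    """Rule 2: if the base on the left side is stronger than the base on
    the right side, the direction of the reaction is right."""
    if (left_base, right_base) in transitive_closure(stronger):
        return "right"
    return None  # the rule does not fire, so it concludes nothing


# Assumed sample facts: (x, y) means "x is a stronger base than y".
base_strength = {("OH-", "NH3"), ("NH3", "H2O"), ("H2O", "Cl-")}

# HCl + H2O -> H3O+ + Cl-: H2O is the base on the left, Cl- the base on the right.
direction = reaction_direction("H2O", "Cl-", base_strength)
```

The rule-style CPL maps onto this almost clause by clause, which is why its performance is predictable; the cost is that nothing in the original textbook prose looks like these rules.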

(ii) CPL as grammatically simpler real language

Acids have a sour taste.

Acids cause some dyes to change color.

Bases have a bitter taste.

Bases have a slippery feel.

All acids contain hydrogen.

37 percent of the mass of concentrated hydrochloric acid is HCl.

The concentration of HCl in concentrated hydrochloric acid is 12 M.

HCl reacts with NH3 without an aqueous solution.

The reaction transfers a proton from an HCl molecule to an NH3 molecule.

The "HX" in Equation 16.6 donates a proton.

The donating leaves behind an X-minus ion.

The X-minus ion plays a Bronsted-Lowry base in the reverse reaction.

The H2O molecule in Equation 16.6 accepts a proton.

The accepting produces an H3O-plus ion.

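Two Paths from Language to Logic…

[Diagram: two paths starting from Real Text. One goes through Declarative CPL rules to an Inference-supporting Representation; the other goes through Real(istic) CPL Text to a Literal/messy logic representation. “The Knowledge Gap” separates the two representations.]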

“Israel’s Problem”

Assume a perfect algorithm for English to (literal-like) logic. Are you done?

[Diagram elements: Real Text, Real(istic) CPL Text, Declarative CPL rules, Literal/messy logic representation, Inference-supporting Representation, “The Knowledge Gap”.]


An Analysis of the Gap

• What is the nature of the gap?
• Can we characterize it?
• Can we quantify it?
• How does the gap look in different texts? Domains?
• What are the fundamental problems?
• How severe are they?
• How might they be overcome?

Analysis
• Looked at these phenomena in two sets of text
  – 5 target pages of AP chemistry
  – 5 pages of grade-school level biology
    • from the Web, about the heart and its function
• Categorization of main causes
• Loose quantification of their frequency

9 Fundamental Causes of the Gap

1. Many idiomatic words/phrases, each requiring a theory

2. Some knowledge is taught by example

3. Much important knowledge is conveyed by diagrams and tables

4. Generic sentences are ubiquitous

5. Some text teaches problem-solving knowledge

6. Discourse context is important (need sentence context)

7. Many sentences pose major representational challenges

8. Math/Algebraic models are extremely challenging

9. Text is full of ambiguity, metaphor and metonymy/loosespeak

1. Idiomatic/special-purpose words/phrases
• Many words/phrases require special interpretation
  – Breadth requirement is very challenging!
    • 70% in chem, 40% in bio
  – Chemistry
    • “The reaction favors transfer of…”
    • “From the earliest days of experimental chemistry…”
    • “The ion, however, more closely represents reality”
    • “When we closely examine the reaction…”
    • “According to their definition…”
  – Biology
    • “This is important for the cells to do their work.”
    • “On its way back to the heart…”
    • “The right-side pumps stale blood…”
    • “to smaller and smaller branched tubes…”

2. Examples
• Examples play a key role in human teaching
• How important are these for a machine?
  – Consolidation, verification, disambiguation?
• 35% chem, <5% bio

3. Diagrams and Tables

“Teaching” how to compute conjugate acid/base pairs

Relative strengths of acids

• 10% in chemistry, but key ones! Incidental in bio.
• Show-stopper for some needed knowledge

4. Generics
• Reference to a collection rather than an individual object
• Ubiquitous! 90% chemistry, 95% biology
  – Chemistry
    • “Acids cause certain dyes to change color”
    • “Acids have a sour taste”
    • “A substance that is …. is called amphoteric”
  – Biology
    • “The blood leaving the aorta is full of oxygen”
    • “Veins have thin walls”
    • “The heart pumps blood to your lungs”

Why are generics hard?
• Quantification
  • “Acids contain hydrogen.”
• Fuzzy quantifiers
  • “An H3O+ ion sometimes reacts with three H2O molecules”
• Presuppositions
  • “HCl dissolves in water.”
  • “Acids cause some dyes to change color”
  • “Acid irritates the skin”
• Need background knowledge!
  • “IF an acid touches some skin THEN that skin is irritated”
    or more generally
  • “IF acid + skin are related in a way where irritation may plausibly occur… THEN it will occur.”

5. Needing/Teaching Problem-Solving Knowledge

• Problem-solving knowledge
  – Chemistry (20%), biology (<5%)

• Worse, is often not even explicit in the text, e.g.:

6. Discourse Context
• Can we take sentences in isolation? (“bag of lines”)
• Obstacles:
  – Pronoun resolution (30% chem, 50% bio)
  – Context: unqualified compound nouns (most)
    • “Every [Bronsted-Lowry] acid has a conjugate [BL] base”
    • “The [human] heart…The [human] arteries…”
  – Other dependencies (15% chem, <5% bio)
    • “Therefore, HX is the Bronsted-Lowry acid”
    • “The other conjugate acids are HS-, PH3 and CO32-”

Discourse Context (Biology)

• Sentences stand on their own more often, e.g.,

7. Major Representational Challenges
• Hard to quantify: ~70% chem, ~40% bio
• Potentiality:
  – “an acid is a substance (molecule or ion) that can donate a proton to another substance. Likewise, a base is a substance that can accept a proton.”
• Conveying a proof:
• Imprecision and comparatives:
  – “About 37% by mass”
  – “Interacts strongly”
  – “The aorta is the largest artery in the body”

8. Math/Algebraic models

• ~65% of chemistry sentences use or manipulate formulae
  • “NaOH dissociates into Na+ and OH- ions.”
  • “An H+ ion is simply a proton with no electrons”
  • “HX and X- differ only in the presence of a proton”
• Challenges
  – Relating the symbol system to the real world
  – Defining and applying operations on the symbol system
  – Relating those operations to the real world

Math/Algebraic models (cont)
• Minimal in grade-school biology
  – nearest is rates and measures
    • “the heart contracts 70 times a minute”
    • “The plasma is 95% water and the other 5% of dissolved substances”
    • “In an adult’s body there is 10.6 pints of blood”

Loosespeak (metaphor, metonymy, etc)
• Biology: metaphor more common
  • “Your heart’s job is to pump blood”
  • “Blood delivers oxygen…On the return trip, the blood picks up waste products”

Loosespeak (metaphor, metonymy, etc)
• Analysis by Univ Texas at Austin (chemistry)
  – Loosespeak is everywhere!

Relative Frequency of Phenomena

[Bar chart: frequency (scale 0 to 100) of idioms, generics, representation, algebra, loosespeak, discourse, problem-solving, examples, and diagrams, for AP Chemistry and Grade-School Biology.]

Relative frequency of phenomena

[Charts for AP Chemistry (5 pages) and Grade-school biology (5 pages), with categories: idioms, examples, diagrams, generics, problem-solving, discourse, representation, math, loosespeak.]


Case Study 1 of the Gap: AP Level Chemistry

Original English: [textbook passage shown on the slide]

CPL (like):
Some acids are better proton donors than other acids.
Some bases are better proton acceptors than other bases.
The conjugate base of a strong proton donor is a weak proton acceptor.
The conjugate acid of a strong proton acceptor is a weak proton donor.
A stronger acid has a weaker conjugate base.
A stronger base has a weaker conjugate acid.
A stronger acid is a better proton donor.
A stronger base is a better proton acceptor.

How do we bridge the gap?

From Original English to CPL - 1

• Resolve “others” to mean “other acids/bases”
• Use “likewise” to guide a parallel construction
• Need to represent “some,” “other,” “better”
• Assumes a scale of ability to donate/accept

From Original English to CPL – 2a

• Need to interpret “If we do X, we find that Y” as a mental exercise that draws a conclusion

• Need to have a concept of an ordering based on some ability (to donate a proton)

• Resolve “their ability” back to types of acids
• Resolve quantification – one proton per instance of acid molecule

From Original English to CPL – 2b

• Here “a substance” means an acid molecule
• Need to handle jumps between substance-level and molecule-level references
• Need to interpret “the more readily an A does B, the less readily a C does D”
• Need a model of two qualitative scales of ability, with an inverse relationship
• Resolve “its conjugate base” back to the acid

From Original English to CPL – 3

• “Similarly” is a cue for a parallel construction
• Other issues are the same as in the previous sentence (inverse qualitative scales)

From Original English to CPL – 4a

• “In other words” is a cue for another view of the same knowledge in the previous sentence

• “the more readily an acid gives up a proton” = “the stronger an acid”

• Related qualitative scales again
• “the stronger an acid” is special syntax

From Original English to CPL – 4b

• Semicolon here denotes parallel constructions
• This is also another view of the same knowledge in the previous sentence
• “the more readily a base accepts a proton” = “the stronger a base”
• Inverse qualitative scales again

Overall Interpretation (sketch)

[Diagram. Nodes: “Acid readily gives up a proton”, “Acid strength”, “Conjugate base readily accepts a proton”, “Conjugate base strength”. Link labels: “In other words”, inverse (twice), parallel (twice).]

“Similarly”: replace acid with base, replace conjugate base with conjugate acid

From Original English to Inference-Supporting Logic: Knowledge Requirements
• Discourse Knowledge:
  – Pragmatic knowledge for pronoun resolution
  – Ability to recognize and match parallel constructions
    • E.g., with cue words
    • Both within and across sentences
  – Ability to recognize a mental exercise (“if we do …”)
• Domain Knowledge:
  – Models of qualitative scales and relationships between two scales
  – Knowledge to handle substance/molecule metonymy
  – Models of abilities & give/receive
  – World knowledge to help resolve quantification
    • e.g., one proton per molecule makes most sense
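The “two qualitative scales with an inverse relationship” requirement can be made concrete with a very small Python sketch. It assumes, purely for illustration, that a scale is an ordered list and that conjugate pairs are given as a lookup table; the function name and the sample acids are not taken from the project.

```python
def conjugate_base_order(acids_strongest_first, conjugate_base_of):
    """Inverse qualitative scales: the stronger the acid, the weaker its
    conjugate base, so the base ordering is the acid ordering reversed."""
    return [conjugate_base_of[acid] for acid in reversed(acids_strongest_first)]


# Assumed sample data: acids listed from strongest to weakest.
acids = ["HCl", "HF", "H2O"]
conjugate_base_of = {"HCl": "Cl-", "HF": "F-", "H2O": "OH-"}

# Conjugate bases from strongest to weakest.
bases = conjugate_base_order(acids, conjugate_base_of)
```

The code itself is trivial; the hard part identified in the case study is recognizing from phrases such as “the more readily … the less readily” that such a pair of scales is being described at all.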

Case Study 2 of the Gap: Grade-School Biology

Grade-school Biology

• Searched the Web, found 4 simple texts about the human heart and its function

• They are much simpler than our college chemistry text, but still exhibit lots of interpretation issues

• Only a few sentences from each text happened to be in pure CPL syntax

• By the time science is taught in school, the students are beyond the Dick & Jane reading level

Grade-school Biology Syntax - 1

• Pronouns are everywhere
  – “Your heart is divided into two sides.” [anyone’s heart]
• Dependent clauses are common
  – “As blood begins to circulate, it leaves the heart …”
  – “… fresh oxygen that we have inhaled …”
• Conjunctions appear between various expressions
  – “… the vessels and the muscles that help and control …”
  – “Lizards don’t have hair or feathers … and can’t sweat …”
• Comparatives are common
  – “The tubes that more gently drain back to the heart …”
• Approximations are common
  – “… some 70 or so times a minute at rest …”

Grade-school Biology Syntax - 2
• Negatives are sometimes used
  – “They do not work on their own, but together as a team.”
• Phrases often modify other terms
  – “The blood leaving the aorta is full of oxygen.”
  – “On its way back to the heart, the blood travels …”
• Infinitives are sometimes used
  – “This is important for the cells … to do their work.”
• Parenthetical expressions are sometimes inserted
  – “… the carbon dioxide (a waste product) is removed ...”
  – “… times a minute – more if you are exercising – and …”

Grade-school Biology Syntax - 3

• Rhetorical questions to the reader
  – “Did you know that your heart is the strongest muscle?”
• Modals are sometimes used
  – “… so that your body can get rid of them.”
  – “… your blood vessels could circle the globe 2 ½ times!”
• Phrases about what something is called
  – “… a colorless liquid called plasma.”
• Omitted words
  – “… the other two [cavities] are called ventricles …”
• Adverbs, complex phrases, and other minor issues

Grade-school Biology Semantics

• Analyzed sample grade-school biology texts about the heart and circulation

• What commonsense knowledge is needed to correctly understand the text?
  – What pump-primed models would be needed?
  – What underlying knowledge could come from bootstrapping?
    • As from tuple extraction from general texts

• Rhetorical question – skip “Did you know that”

• “your heart” = a person’s heart (anatomy context)

• “strongest muscle” [in same body] (anatomy context)

• Build in pragmatics of reading for an anatomy context

• Knowledge: basic anatomy (bootstrapped)

• “divided into” = partitioned (word sense for anatomy)

• “two sides” = two compartments (anatomy/container)

• Knowledge:

• Container/compartments model (pump-primed)

• “right side” = [of the heart] (model of left/right parts)

• “pumps blood” = continuous process (anatomy)

• “to your lungs” could mean it fills up the lungs!

• what is “it”? – right side, or blood, or lungs?

• “picks up” = metaphor for absorbs (anatomy context)

• Knowledge:

• Containers, pumps, liquids (pump-primed)

• “left side” = [of the heart]
• “oxygen-soaked blood” – but a liquid is already wet
  – Would like a model of blood cells, soaked in oxygen (fluid)
  – Not provided here, so just assume blood absorbed oxygen
  – Resolves previous sentence: pronoun “it” = blood
• Knowledge:
  • model of left/right parts (pump-primed)
  • “out” - liquid flow in & out of containers (pump-primed)

• “They” = the two sides of the heart (difficult)

• Rely on discourse pragmatics

• Knowledge:

•“work on their own” vs. “together as a team”

• Doing something alone vs. cooperating in an effort

• “The body’s blood” = all its blood as a single blob

• Knowledge:

•“circulated through” - model of closed fluid circulation

•“1,000 times per day”- model of repeated events per time period

• “five and six thousand” = 5 ≤ x ≤ 6,000?
• Use pragmatics to get: 5,000 ≤ x ≤ 6,000

• “pumped each day” -- by which side? Or both sides?

• Could pose question: How much blood does a body contain? – 5 to 6 quarts (inference needed)

• Knowledge:

• Fluid flow, iteration, time periods

• “your fist” -- interesting object, involves a pose

• Knowledge:

•“about the same size as” – model of comparative sizes

Summary of Biology Semantics
• Pragmatics for an anatomy context
• Pump-primed models:
  – Container & compartments & left/right parts
  – Continuously repeated biological events
  – Pumps & liquids & closed circulation
  – Working together vs. alone
  – Body parts in poses & comparative sizes
• Bootstrapped models:
  – Basic anatomy
• Some difficult pronoun resolutions

Grade-School Biology Conclusions

• Lots of pump-primed knowledge needed

• Bootstrapped knowledge can help

• Even grade-school texts have significant challenges

• Pragmatics need to be built in to NLP engine

• Is still substantially easier than AP chemistry!


Dimensions of Difficulty

[Chart: axes are Complexity of Knowledge and Educational Level of Text (Elementary, Grade-school, College); plotted items are Grade-school biology and AP Chemistry.]

Two Dimensions of Difficulty
Dimension 1: Domain
• Chemistry (hardest)
  – Algebraic manipulation, chaining, procedures
  – Not so much “common sense”
• Physics
  – Map situations onto a few equations
• Biology (easiest)
  – Memorize and compare structures and functions

Two Dimensions of Difficulty
Dimension 2: Educational Level
• College level (hardest):
  – Sophisticated writing styles
  – Often includes mathematical abstractions
  – Attempts to challenge the student
  – Problem-solving
• Grade-school level (easier):
  – Simpler sentence structures
  – Teaches common world knowledge
  – No/little mathematics
  – Learning basic facts


II: Integrating Knowledge

[Diagram: the Mobius components (Knowledge Integration, Introspection, Natural Language Processing, Test Generation), with this seedling marked.]

Knowledge Integration: Principles for an Extensible KB

• The Halo KB was not easily extensible
• What should it have looked like?

Five Principles for an Extensible KB
1. Need Metonymy-Tolerant Representations

The precision that logic requires of our written representations is a fundamental barrier to robustness.

IF “the acid on the left” is stronger than “the acid on the right”
THEN the reaction direction is “to the right”

“the acid denoted by the formula on the left side of the equation of the reaction”

• Alternative:
  – Preserve metonymy in the KB
  – Have it resolved at reasoning time

1. Metonymy-Tolerant Representations (cont)

(every Compare-Relative-Strengths-of-Acids has
  (output ((if (((the1 of (the value of (the intensity of
                   (the Acid-Role plays of (the first of (the input of Self)))))) = *strong)
                and
                ((the1 of (the value of (the intensity of
                   (the Acid-Role plays of (the second of (the input of Self)))))) /= *strong))
            then (the first of (the input of Self)))))

If we had a metonymy-tolerant reasoner, we could instead write…

(every Compare-Relative-Strengths-of-Acids has
  (output ((if ((the intensity of (the first of (the Chemicals)) = *strong)
                and
                ((the intensity of (the second of (the Chemicals)) /= *strong)
            then (the strongest of (the Chemicals)) = (the first of (the Chemicals)))))
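One way to picture “resolving metonymy at reasoning time” is a slot lookup that, when the slot is missing on the object named, follows a small set of coercion links until it finds the slot. The Python below is a hypothetical sketch of that idea, not how KM works: the dictionary-based KB, the `resolve` function, the `plays` link, and the frame names are all assumptions made for illustration.

```python
from collections import deque


def resolve(kb, obj, slot, links=("plays",), max_hops=3):
    """Look up `slot` on `obj`; if absent, follow coercion links
    (breadth-first, up to max_hops) and try the linked frames."""
    queue = deque([(obj, 0)])
    seen = {obj}
    while queue:
        current, hops = queue.popleft()
        frame = kb.get(current, {})
        if slot in frame:
            return frame[slot]
        if hops < max_hops:
            for link in links:
                nxt = frame.get(link)
                if nxt is not None and nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, hops + 1))
    return None  # no interpretation found


# Assumed toy KB: the intensity lives on the role, not on the chemical.
kb = {
    "HCl": {"plays": "HCl-acid-role"},
    "HCl-acid-role": {"intensity": "*strong"},
}

# "the intensity of HCl" is loose speak for the intensity of the role HCl plays.
intensity = resolve(kb, "HCl", "intensity")
```

With a lookup of this kind, the shorter rule can say “the intensity of the chemical” and leave the path through the role to the reasoner.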

2. Need to Separate Declarative and Procedural Knowledge

Procedural (Conjugate-Acid calculation):
  input: a Base-Chemical
  output: convert Chemical → Molecule → Formula, append “H”, then → Molecule’ → Acid-Chemical

Declarative:
  Acid-Chemical = Base-Chemical + H
  + constraint reasoner to solve constraints
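The contrast can be sketched in Python: the relation “Acid-Chemical = Base-Chemical + H” is stated once, and a very small stand-in for a constraint reasoner uses it in either direction. The `Species` type, the helper names, and the list of known species are assumptions for this sketch; a real constraint reasoner would be far more general.

```python
from collections import Counter
from typing import NamedTuple


class Species(NamedTuple):
    name: str
    atoms: tuple  # sorted (element, count) pairs
    charge: int


def species(name, charge=0, **atoms):
    return Species(name, tuple(sorted(atoms.items())), charge)


def is_conjugate_pair(acid, base):
    """Declarative relation: Acid-Chemical = Base-Chemical + H (one proton)."""
    expected = Counter(dict(base.atoms))
    expected["H"] += 1
    return Counter(dict(acid.atoms)) == expected and acid.charge == base.charge + 1


def solve(known, acid=None, base=None):
    """Fill in whichever side of the relation is missing by searching known species."""
    if acid is not None:
        return [b for b in known if is_conjugate_pair(acid, b)]
    return [a for a in known if is_conjugate_pair(a, base)]


H2O = species("H2O", H=2, O=1)
H3O = species("H3O+", 1, H=3, O=1)
OH = species("OH-", -1, H=1, O=1)
known = [H2O, H3O, OH]

acids_of_water = solve(known, base=H2O)  # conjugate acid of H2O
bases_of_water = solve(known, acid=H2O)  # conjugate base of H2O
```

Note that the same relation answers both “what is the conjugate acid?” and “what is the conjugate base?”, whereas the procedural version only runs from base to acid.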

2. Need to Separate Declarative and Procedural Knowledge (cont)

“Every acid has a conjugate base, formed by removing a proton from the acid. ... Similarly, every base has associated with it a conjugate acid, formed by adding a proton to the base.”

Acid-Chemical = Base-Chemical + H

The English text often doesn’t help…

3. Syntactic Organization Matters!
• Elaboration tolerance:
  – Add/modify knowledge (semantics) by (only) adding formulae (syntactics)

(every Acid-Role has
  (intensity (
    (a Intensity-Value with
      (value (
        (:pair ;; Case statement for Acids.
          (if ((the played-by of Self) isa Ionic-Compound-Substance)
           then (if (((the played-by of Self) isa HCl-Substance) or
                     ((the played-by of Self) isa HBr-Substance) or
                     ((the played-by of Self) isa HI-Substance) or
                     ((the played-by of Self) isa HClO3-Substance) or
                     ((the played-by of Self) isa HClO4-Substance) or
                     ((the played-by of Self) isa H2SO4-Substance) or
                     ((the played-by of Self) isa HNO3-Substance))
                 then *strong
                 else
                 …

Not elaboration-tolerant

3. Syntactic Organization Matters!

• Better….

intensity(HCl-Substance, *strong)
intensity(HBr-Substance, *strong)
intensity(HI-Substance, *strong)
intensity(HClO3-Substance, *strong)
intensity(HClO4-Substance, *strong)
intensity(H2SO4-Substance, *strong)
intensity(HNO3-Substance, *strong)
…
intensity(HF-Substance, *weak)
intensity(HC2H3O2-Substance, *weak)
intensity(H2CO3-Substance, *weak)
…

Elaboration-tolerant
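The same point can be shown in Python with a flat fact table in place of a nested case statement. This is a sketch only: the dictionary, the `acid_intensity` helper, and the choice of H3PO4 as the newly added fact are assumptions for illustration.

```python
# Each fact is independent, mirroring intensity(X, value) above.
INTENSITY = {
    "HCl-Substance": "*strong",
    "HBr-Substance": "*strong",
    "HI-Substance": "*strong",
    "HClO3-Substance": "*strong",
    "HClO4-Substance": "*strong",
    "H2SO4-Substance": "*strong",
    "HNO3-Substance": "*strong",
    "HF-Substance": "*weak",
    "HC2H3O2-Substance": "*weak",
    "H2CO3-Substance": "*weak",
}


def acid_intensity(substance):
    """Return the recorded intensity, or None when nothing is known."""
    return INTENSITY.get(substance)


# Elaboration: new knowledge is added by adding one fact.
# No existing entry or control flow has to be edited.
INTENSITY["H3PO4-Substance"] = "*weak"
```

Ablating knowledge is equally local: deleting one entry removes exactly one fact, which is what the seedling needed when surgically removing knowledge from the Halo KB.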


4. Use a linguistically motivated ontology

• Key: mapping from English words/phrases to concepts

• Good: Words and concepts match easily:

– HCl-Substance ↔ “HCl”

• Less good: Linguistic concepts are missing

– *strong/*weak/*negligible ↔ “HCl is stronger than H2O”

• Even worse: Different conceptual view in the KB

– Direction of equilibrium:
  • Attached to reaction, not equation, in KB

5. Need Error-Tolerant Reasoning

• KM can go belly-up with a contradiction
• Rather, need to detect and correct contradictions
  – Detect:
    • explore (ruminate), not just myopic backchaining
    • richer background knowledge
  – Correct:
    • reasoner supports suspension of assumptions/rules (TMS?)
    • search mechanism to control this
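A toy version of “detect, then correct by suspending assumptions” might look like the following Python. It is a brute-force stand-in, not a truth maintenance system: the fact format, the function names, and the sample assumptions are all hypothetical.

```python
from itertools import combinations


def contradictions(facts):
    """Detect: return slots that are given two different values."""
    seen = {}
    clashes = []
    for slot, value in facts:
        if slot in seen and seen[slot] != value:
            clashes.append(slot)
        seen.setdefault(slot, value)
    return clashes


def smallest_suspension(assumptions):
    """Correct: search for a smallest set of assumptions whose suspension
    leaves the remaining facts free of contradictions."""
    names = list(assumptions)
    for size in range(len(names) + 1):
        for suspended in combinations(names, size):
            active = [fact
                      for name in names if name not in suspended
                      for fact in assumptions[name]]
            if not contradictions(active):
                return set(suspended)


# Assumed example: a sentence read from text clashes with an existing KB rule.
assumptions = {
    "text-sentence": [(("HF", "intensity"), "*weak")],
    "kb-rule": [(("HF", "intensity"), "*strong")],
    "other-fact": [(("HCl", "intensity"), "*strong")],
}

suspended = smallest_suspension(assumptions)
```

The exhaustive search is exponential in the number of assumptions, which is why the slide asks for a search mechanism to control it.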


Knowledge Mining

Schubert’s Conjecture:

There is a largely untapped source of general knowledge in texts, lying at a level beneath the explicit assertional content, and which can be harnessed.

“The camouflaged helicopter landed near the embassy.”
  → helicopters can land
  → helicopters can be camouflaged

Our attempt: “lightweight” LFs generated from Reuters

LF forms:
  (S subject verb object (prep noun) (prep noun) …)
  (NN noun … noun)
  (AN adj noun)

Knowledge Mining

Newswire Article:

HUTCHINSON SEES HIGHER PAYOUT. HONG KONG. Mar 2. Li said Hong Kong’s property market remains strong while its economy is performing better than forecast. Hong Kong Electric reorganized and will spin off its non-electricity related activities. Hongkong Electric shareholders will receive one share in the new subsidiary for every owned share in the sold company. Li said the decision to spin off …

Implicit, tacit knowledge:

• Shareholders may receive shares.
• Companies may be sold.
• Shares may be owned.

Knowledge Mining – our attempt

;; Atoms can combine
(S "atom" "combine")

;; For example, combustion reactions are redox reactions because elemental oxygen is converted to compounds of oxygen (Section 3.2).
(S "reaction" "be" "reaction")
(S-ADJ "oxygen" "converted" ("to" "compound"))
(AN "elemental" "oxygen")

;; Plan: Metals react with acids to form salts and gas.
(S "metal" "react" (PP "with" "acid"))

;; Extensive oxidation can lead to the failure of metal machinery parts or the deterioration of metal structures.
(S "oxidation" "lead" (PP "to" "failure"))
(S "oxidation" "lead" (PP "to" "deterioration"))
(AN "extensive" "oxidation")

Fragment of the raw data (Brown & Lemay)
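To show how tuples like these relate to the “X may Y” style of tacit knowledge, here is a small Python sketch that renders two of the tuple forms as possibility statements. The tuple encoding as Python tuples, the `to_statement` function, and the decision to skip other forms are assumptions for illustration; this is not the project's extraction code.

```python
def to_statement(lf):
    """Render a lightweight LF tuple as a possibility statement.

    Handles only two forms:
      ("S", subject, verb, [object], [("PP", prep, noun) ...])
      ("AN", adjective, noun)
    Other forms return None.
    """
    kind = lf[0]
    if kind == "S":
        subject, verb, *rest = lf[1:]
        words = [subject, "may", verb]
        for item in rest:
            if isinstance(item, tuple):
                words.extend(item[1:])  # drop the "PP" tag, keep prep and noun
            else:
                words.append(item)      # a plain object noun
        return " ".join(words)
    if kind == "AN":
        adjective, noun = lf[1], lf[2]
        return f"{noun} may be {adjective}"
    return None


tuples = [
    ("S", "atom", "combine"),
    ("S", "metal", "react", ("PP", "with", "acid")),
    ("AN", "extensive", "oxidation"),
]
statements = [to_statement(t) for t in tuples]
```

Counting how often each tuple recurs across a corpus would be a natural next step for separating general knowledge from one-off statements.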


Summary: Overall Findings and Products
• CPL: two formulations
  – "naive CPL": 275 sentences
  – rule-language CPL: ~15 complex rules
• CPL language interpretation algorithm
• Understanding Language
  – Characterization and quantification of the main challenges
  – Detailed case studies on the five pages
• Integrating Knowledge
  – Characterization of the main challenges
  – Set of principles for overcoming them
  – Study and algorithms for some of them
• Bridging the Gap: Useful conceptual framework
• Text Mining
  – 2 tuple databases: 15k chemistry, 25k biology

Summary: Recommendations for Mobius

• Significant work needed on
  – math/symbol manipulation
  – handling generics
  – idiomatic words/phrases
  – loosespeak
• Cycle, not just bottom-up/top-down!
• Discourse structure needs to be taken seriously
  – Not just individual sentences
• Need some radical KB changes
  – Extensible units of knowledge, not intertwined structures
  – Error-tolerant/robust reasoning