AD-A092 500   DARTMOUTH COLL HANOVER N H DEPT OF MATHEMATICS   F/G 5/2
NATURAL LANGUAGE DATA BASE QUERY.(U)
OCT 80   L R HARRIS   N00014-75-C-0514
UNCLASSIFIED


FINAL REPORT

Submitted to the Office of Naval Research
for a grant in support of research entitled

Natural Language Data Base Query

Larry R. Harris
Principal Investigator

Dartmouth College
Hanover, NH 03755


Security Classification

DOCUMENT CONTROL DATA - R&D
(Security classification of title, body of abstract and indexing annotation must be entered when the overall report is classified)

1. ORIGINATING ACTIVITY (Corporate author): DARTMOUTH COLLEGE, HANOVER, NH 03755
2a. REPORT SECURITY CLASSIFICATION: UNCLASSIFIED
2b. GROUP:
3. REPORT TITLE: FINAL REPORT SUBMITTED TO THE OFFICE OF NAVAL RESEARCH FOR A GRANT IN SUPPORT OF RESEARCH ENTITLED NATURAL LANGUAGE DATA BASE QUERY
4. DESCRIPTIVE NOTES (Type of report and inclusive dates):
5. AUTHOR(S) (Last name, first name, initial): HARRIS, LARRY R.
6. REPORT DATE: 1 OCT 1980
7a. TOTAL NO. OF PAGES: 27
7b. NO. OF REFS:
8a. CONTRACT OR GRANT NO.: N00014-75-C-0514
8b. PROJECT NO.: NR049-344
9a. ORIGINATOR'S REPORT NUMBER(S):
9b. OTHER REPORT NO(S) (Any other numbers that may be assigned this report):
10. AVAILABILITY/LIMITATION NOTICES:
11. SUPPLEMENTARY NOTES:
12. SPONSORING MILITARY ACTIVITY: OFFICE OF NAVAL RESEARCH

13. ABSTRACT

This final report is intended to be a summary of the research on Natural Language Data Base Query performed at Dartmouth College supported by the Office of Naval Research since 1973. It has been the goal of this research to determine a minimal set of techniques sufficient to provide a practical natural language capability for data base query. This report summarizes the basic requirements for such a capability and suggests techniques for meeting these requirements. As such, this report is, in effect, a specification of the minimal functionality for a practical natural language data base query capability.

DD FORM 1473   UNCLASSIFIED


Security Classification

14. KEY WORDS

NATURAL LANGUAGE
DATA BASE QUERY
PARSING


FINAL REPORT

Submitted to the Office of Naval Research for a grant in support
of research entitled Natural Language Data Base Query.

Larry R. Harris
Principal Investigator

Dartmouth College
Hanover, NH 03755

Abstract

This final report is intended to be a summary of the research on Natural Language Data Base Query performed at Dartmouth College supported by the Office of Naval Research since 1973. It has been the goal of this research to determine a minimal set of techniques sufficient to provide a practical natural language capability for data base query. This report summarizes the basic requirements for such a capability and suggests techniques for meeting these requirements. As such, this report is, in effect, a specification of the minimal functionality for a practical natural language data base query capability.


Introduction

When research under this contract began in 1973, the state of the art in practical natural language data base query was essentially non-existent. All of the then existing research systems (and many of today's systems) were semantically limited to a single domain of discourse. The primary requisite of a "practical" query capability is that it be "application independent". Achieving this application independence requires a fundamental commitment throughout the design of the system. It is encouraging to see that the research community as a whole has started moving in this direction.

This report consists of a summary description of the minimal requirements for a practical query capability. The techniques developed to meet these requirements are also described. There are four major components of operation that make up the processing cycle of a request. These components are the lexical analyzer (the scanner), the syntactic analyzer (the parser), the data base structure analyzer (the navigator) and the data processing routines.


The Lexical Analyzer -- The Scanner

The basic function of the scanner is to determine what the individual words, or tokens, are. The scanner breaks the input stream into a sequence of tokens. A modification of the finite automaton scanner used in compilers is sufficient for this task. The modifications are required to deal with phrases and recognition of special numeric tokens. Phrases, such as "Vice-President" or "New York", must be recognized as single tokens even though they contain a space. Numerical strings in a form such as month/day/year must be recognized as a single token (representing a date), whereas 1/3 must be recognized as three tokens representing "one divided by three".

Scanners with these capabilities are commonplace among AI natural language systems. The most common pitfall is to embed spelling detection, or worse yet, spelling correction within the scanner. Both spelling correction and spelling detection require advance knowledge of all words that can be employed by users. This is prohibitive in a domain independent approach, since this set of words must clearly contain all the words in the data base. Therefore the scanner must accept words about which it knows nothing -- which could be, of course, potentially misspelled words. The detection of a spelling error is best made as a last resort, when a sentence fails to be understood by the system. This approach satisfies the domain independence criterion as well as allowing valid requests with no spelling errors that pertain to data values not actually in the data base to be handled properly.
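The scanner requirements above can be illustrated with a minimal sketch. It is not the report's implementation: the phrase list, the token class names and the date pattern are assumptions made for illustration, and a regular expression stands in for the modified finite automaton.

```python
import re

# Hypothetical phrase list; a real system would take phrases from its dictionary.
PHRASES = ["vice-president", "new york"]

# Alternatives are tried in order: dates before plain numbers,
# multi-word phrases before single words.
TOKEN_RE = re.compile(
    r"(?P<DATE>\d{1,2}/\d{1,2}/\d{2,4})"
    r"|(?P<PHRASE>" + "|".join(re.escape(p) for p in PHRASES) + r")\b"
    r"|(?P<NUMBER>\d+(?:\.\d+)?)"
    r"|(?P<WORD>[A-Za-z][A-Za-z'-]*)"
    r"|(?P<OP>[-+*/()?,.])",
    re.IGNORECASE,
)

def scan(text):
    """Return (kind, lexeme) pairs. Unknown words are accepted as WORD;
    no spelling check is attempted at this stage."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]
```

With these assumed patterns, `scan("1/3")` yields three tokens while a string such as `10/01/80` (an illustrative date, not one taken from the report) yields a single DATE token. A word the scanner has never seen simply comes back as a WORD.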

The Syntactic Analyzer -- The Parser

The parser is the heart of the natural language component of the system. Its role is to syntactically relate the components of the request. This process drives the construction of the semantic structures that represent the meaning of the request. There are several computing paradigms for natural language parsers. In the early part of our research, we successfully employed a top-down context-free parser. Later we switched to an Augmented Transition Network parser (ATN). We feel there are several advantages to the ATN approach that make it far more convenient to use, although successful context-free parsers could be built as well. Recently several new parsing schemes have been reported in the literature, so that it may well be the case that the ATN technology is now dated. However, it seems clear that the choice of parsing algorithm is less important than the mechanism by which the syntax controls the generation of the required semantic structures.


Ambiguity

The definitive problem of this type is that of ambiguity. Several distinct semantic representations must be generated from a single input string. The difficulty here is not how to do it, but how to limit the generation of too many interpretations. Many researchers have chosen to limit the parser to generating only one interpretation and stopping. This approach makes the decision of which parse to pursue a very critical one. It also makes dealing with truly ambiguous requests very difficult.

Our recommendation is to solve the problem from the other end of the spectrum by non-deterministically generating all possible interpretations. This transforms the issue from one of trying to decide on a relative basis which of two partial parses looks more promising, to one of trying to decide on an absolute basis which of two complete parses is more meaningful. There is also a problem of efficiency in coping with the potentially exponential number of interpretations. It should be clear that decisions made on an absolute basis after the parse should be more accurate than decisions made on a relative basis during the parse. The effects of the remaining portion of the input have had a chance to impact the decision in the former case, but not in the latter. However, the problem of exponential growth must be dealt with very carefully. Fortunately it seems that, at least for natural language query, this can be dealt with by tuning the parser to reduce the non-determinism. Fortunately the length of the input is usually very small (less than 20 tokens) and the number of decisions is "reasonably" small.
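The strategy of generating every interpretation first and judging only complete ones afterwards can be sketched as follows. The lexicon, the reading labels and the single acceptance rule are all hypothetical, and a Cartesian product over word readings stands in for a real non-deterministic parser.

```python
from itertools import product

# Hypothetical lexicon: each token maps to every reading it could have.
READINGS = {
    "boston": [("city", "Boston"), ("name", "Boston")],
    "sales": [("field", "sales")],
    "in": [("prep", "in")],
}

def interpretations(tokens):
    """Enumerate every complete interpretation (one reading per token).
    The count is the product of the reading counts, hence exponential
    in the worst case, but inputs are short."""
    choices = [READINGS.get(t, [("unknown", t)]) for t in tokens]
    return [list(combo) for combo in product(*choices)]

def meaningful(interp):
    """Absolute test applied to a complete interpretation.
    Assumed rule for this sketch: 'in' must be followed by a city."""
    for (kind, _), nxt in zip(interp, interp[1:]):
        if kind == "prep" and nxt[0] != "city":
            return False
    return True

def parse(tokens):
    return [i for i in interpretations(tokens) if meaningful(i)]
```

For the tokens `sales in boston` this produces two candidate interpretations and keeps only the one that reads "boston" as a city. The decision uses the whole input rather than a guess made partway through the parse.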

Binding Values to Fields

One type of ambiguity that arises frequently in data base queries is that of choosing the field to which a given value is related. Most research systems solve this problem with dictionary definitions. This, of course, is clearly a violation of domain independence since it requires enumerating all unique data base values in the lexicon. Our approach has been to dynamically determine this from indices maintained by the DBMS. In addition to this, we allow three levels of strength in defining data values in the dictionary. They can be tightly bound, weakly bound or unbound. This allows for sufficient generality at the same time it permits definitions that could limit the non-determinism.
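A minimal sketch of index-driven binding follows. The report does not spell out what the three binding strengths mean operationally, so the semantics used here are an assumption: tight means only the dictionary's field, weak means the dictionary's field first with index matches as alternatives, and unbound means index matches only. The index contents and dictionary entries are hypothetical.

```python
# Hypothetical stand-in for indices maintained by the DBMS: field -> values.
DBMS_INDEX = {
    "city": {"boston", "hanover"},
    "last_name": {"boston", "harris"},
}

# Hypothetical dictionary entries: value -> (binding strength, field).
DICTIONARY = {
    "hanover": ("tight", "city"),
    "harris": ("weak", "last_name"),
}

def candidate_fields(value):
    """Return the fields a value may bind to, most preferred first."""
    value = value.lower()
    from_index = sorted(f for f, vals in DBMS_INDEX.items() if value in vals)
    strength, field = DICTIONARY.get(value, ("unbound", None))
    if strength == "tight":
        return [field]                      # assumed: no alternatives considered
    if strength == "weak":
        return [field] + [f for f in from_index if f != field]
    return from_index                       # unbound: determined dynamically
```

An unbound value such as "Boston" yields two candidate fields and therefore two interpretations for the parser to carry forward. A tightly bound value yields exactly one, which is how such definitions could limit the non-determinism.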


High-Level Semantic Entities

Another important component in the relating of syntax to semantics is the ability to deal with entities that themselves imply a significant semantic structure. This involves both complex definitions in the dictionary, as well as the ability to deal with these definitions in the parser. Examples of this would be terms like "bachelor" or "profit margin". The first specifies a complex description whereas the second specifies a formula for calculating profit margin from other entities that are more directly available. It is of critical importance for a natural language system to be able to deal with such terms directly, rather than forcing the user to continually define them.
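The two kinds of definition can be sketched as dictionary entries that carry either a predicate or a formula. The field names and the particular definitions (an unmarried male, and revenue minus cost over revenue) are assumptions for illustration, not taken from the report.

```python
# Hypothetical dictionary: a term maps to a complex description (predicate)
# or to a formula over fields that are stored directly.
DEFINITIONS = {
    "bachelor": ("predicate",
                 lambda r: r["sex"] == "M" and r["marital_status"] == "single"),
    "profit margin": ("formula",
                      lambda r: (r["revenue"] - r["cost"]) / r["revenue"]),
}

def select(records, term):
    """Apply a descriptive term as a restriction on records."""
    kind, test = DEFINITIONS[term]
    if kind != "predicate":
        raise ValueError(f"{term!r} is not a description")
    return [r for r in records if test(r)]

def compute(record, term):
    """Evaluate a calculated term for one record."""
    kind, formula = DEFINITIONS[term]
    if kind != "formula":
        raise ValueError(f"{term!r} is not a formula")
    return formula(record)
```

Once such entries exist, a request can mention "profit margin" as if it were a stored field, and the user never has to restate the definition.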

Another type of word that implies a substantive body of semantics is the pronoun. The distinction here is that the "meaning" of the pronoun is not to be found in the dictionary but in the context of the dialog. In addition, there may be some ambiguity that arises in determining what the pronoun refers to. A wide variety of solutions appears in the literature for this problem. Our approach is to maintain two context registers containing the semantic structures of previous requests. One is the previous request, the other is the request that the previous request may have referred to. In addition, intra-sentential pronoun references are also allowed.

We have found this approach to be sufficient for handling the vast majority of pronoun references in data base queries, including the difficult sequence "What is the maximum salary in Missouri?" "Who earns it?". Contrary to popular belief, the pronoun "it" does not just refer to the answer of the first request. If it did, the second request would generate all people earning that salary, even those outside Missouri.

Ambiguous pronoun references are dealt with in the same way as other ambiguities. All of the possible referents are considered and non-deterministic interpretations are created for each possibility. These interpretations are then evaluated on an absolute basis later, along with other interpretations created by other types of ambiguity. We have found a query optimizer to be invaluable in weeding out undesired interpretations created by improper pronoun references. The optimizer detects the logical contradictions that often get created in this way.
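The two context registers can be sketched as below. The `Request` structure, the register names and the way a referent is merged into a follow-up request are hypothetical. The point shown is that the pronoun inherits the referent's whole semantic structure, including its restrictions, rather than only its answer.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Request:
    """Hypothetical semantic structure: what is asked for, under what limits."""
    target: str
    constraints: dict = field(default_factory=dict)

@dataclass
class Context:
    previous: Optional[Request] = None            # register 1
    previous_referent: Optional[Request] = None   # register 2

    def candidates(self):
        """Every possible referent; one interpretation is built per candidate."""
        return [r for r in (self.previous, self.previous_referent)
                if r is not None]

    def advance(self, request, referent=None):
        """Shift the registers once a request has been processed."""
        self.previous, self.previous_referent = request, referent

def resolve(request, referent):
    """Build the follow-up request with the referent's restrictions kept."""
    merged = dict(referent.constraints)
    merged.update(request.constraints)
    merged["equals"] = referent.target
    return Request(request.target, merged)
```

For the Missouri sequence, resolving "Who earns it?" against the previous request keeps the `state` restriction, so people outside Missouri who happen to earn the same salary are not returned.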


Another formalism that tends to creep into natural language queries is that of arithmetic expressions. We have found the recursive descent approach for parsing arithmetic expressions to be completely consistent with the ATN paradigm. However, other informal forms of arithmetic expressions occur in natural language queries. For example, "How [...]?" Here we see [...] a pronoun breaking up the standard operator-operand format. Refinements to the basic recursive descent approach are required to deal with these problems.

We have found several heuristics necessary in determining the proper response for a given query. These are primarily due to the informal phrasing of queries that occurs in natural language. Often users do not explicitly ask to be given information that they quite obviously need to interpret the answers and quite rightfully expect the system to provide. For example, the request "Print the salary of Smith and Lawler" implies the printing of name even though a literal interpretation would not print it.
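One way such a heuristic might look is sketched below. The rule used here, which adds a restricting field to the output only when the request names more than one value for it, is an assumption made for illustration and is not stated in the report.

```python
def display_fields(requested, restricted):
    """Return the fields to print for a request.

    requested:  fields the user explicitly asked for, e.g. ["salary"]
    restricted: mapping of field -> values the request restricts on
    """
    fields = list(requested)
    for name, values in restricted.items():
        # Assumed rule: with several matching values the user needs the
        # identifying field to tell the answers apart.
        if len(values) > 1 and name not in fields:
            fields.insert(0, name)
    return fields
```

For "Print the salary of Smith and Lawler" this returns `["name", "salary"]`, while a request about Smith alone prints just the salary.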


"ri t: .,)it L fr in LIri I L, i n-ii 1- ; r: -iI- P- L r ~ , H

imo-ih1Ip for tho use;r LOto t, ii -zjt l'' 'n ithihjl

'DiniltrI', niY1-rri 2Vul'*r~ ~~F it' I ri Ipar r-i.Y'r -illyI

i, ~n,; P c; t i a ,r o in L (r pr i, L t oo 1 i t. ,ij tI v1'.'. i -.rr Y i Io

ii iln?" i I11:7Lrito- this , oint . I t Lhii sys-L.? .;Pre -L) rp[rh- n i aI lv

i 11L c it "'.ii11 a i n if i i r It i (',I tO r int ni', i f t .id I Ir ,nr

t1it 1 to th1e re q 1 w -;t. (: Mr I rY, t 1e (Ir ij;t i q 'IZ t h- 3 T en I o%, PrIl to

1. toc t s mch -,it' ti n -mfl il'v to fn lt nr-iinc whan L U , rr r r' pnv,

iikoiih! im. ITo -ictiwitiofl -f ';ijch hmiri-,tic~i mt t ItIl!:Lely/ hp

.,verhe-iring s;ilC( LI'.e -w~nr lo,;(., the ahil i ty to pr, ci -e! v control

LI~e resporv fn tlv)'~ -i titonm; where it Is imlport ntfr h

t. o;t. X, -I C 1.c 1 ! i- t i t i; t llI


The Navigator

After the query is parsed, the semantic structure must be projected onto the database schema, resulting in a strategy for extracting the desired information from the database. The difficulty of the navigation problem can range from trivial to arbitrarily complex depending on the data model employed by the DBMS and how well the given data base is organized.

For single flat file organizations, navigation is at its simplest. But even for single flat files, difficulties can arise if the file was created by flattening out a hierarchy or a network. In these cases, the notion of a record corresponding to a real world entity is lost. Hence the navigator must take advantage of the fact that the file was originally non-flat to construct the proper means of access into the file.

For the more structured data models such as relational, network and hierarchical, the navigation process is of critical importance. Determining the proper entry point into the structure as well as the proper linking relationships can critically affect the contents of the final response, not to mention the impact on efficiency. In this regard, the relational model's greatest asset is its closed form expression of a query. This makes it at least possible to express navigation in a high-level language like SEQUEL. For hierarchies and networks no such closed form expression is used by the DBMS and therefore it is impossible to express navigational choices without resorting to a procedural representation. This is the essence of the reason why no good high-level query languages, even of a formal nature, exist for network and hierarchical DBMS's. Some intermediate level representation is clearly needed here.
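To make "closed form expression" concrete, the sketch below states an entry point and a linking relationship in a single declarative query. It uses SQL, the descendant of SEQUEL, through Python's standard `sqlite3` module. The schema and data are hypothetical and are not drawn from the report.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept (dept_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE emp  (emp_id INTEGER PRIMARY KEY, name TEXT,
                       salary INTEGER, dept_id INTEGER REFERENCES dept);
    INSERT INTO dept VALUES (1, 'Math'), (2, 'Physics');
    INSERT INTO emp  VALUES (1, 'Smith', 30000, 1), (2, 'Lawler', 28000, 2);
""")

# The entry point (dept) and the linking relationship (dept_id) both appear
# in one expression; no record-at-a-time procedure has to be written.
QUERY = """
    SELECT emp.name, emp.salary
    FROM dept JOIN emp ON emp.dept_id = dept.dept_id
    WHERE dept.name = ?
"""
rows = conn.execute(QUERY, ("Math",)).fetchall()
```

A network or hierarchical DBMS would need the same choice written out as a sequence of procedural steps, which is the gap an intermediate representation would have to fill.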

Even for the relational systems we have not been uniformly satisfied with SEQUEL as an intermediate language. The navigational linkage is specified in an unduly intricate fashion, and the functionality is incomplete. This latter point is indicative of the fact that much of SEQUEL's power is misdirected in terms of the needs of a naive end user of a natural language system, at least in terms of our experience. Whereas SEQUEL provides no assistance in answering many of the difficult natural language requests we encounter, its power in intricate cyclic navigation is too subtle to be useful in a natural language setting.


Navigational Optimization

Another aspect of navigation is query optimization. The navigational choices made have a profound effect on the efficiency of generating the answer. But even when dealing with a single relation (or a single flat file database), there is a significant amount of query optimization that can be done. For example, questions asking for the maximum, the minimum, or unique listings can be answered directly from the DBMS indices if they are available. These optimizations can change a several minute response into an instantaneous response. These kinds of optimizations require knowledge of how the data is going to be processed after retrieval as well as knowledge of and access to the DBMS indices.
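The index shortcut can be sketched with a sorted list standing in for a DBMS index. The `SortedIndex` class and its methods are hypothetical, since real DBMS index interfaces differ.

```python
class SortedIndex:
    """Hypothetical stand-in for a DBMS index: sorted (key, record_no) pairs."""

    def __init__(self, records, field):
        self.entries = sorted((r[field], i) for i, r in enumerate(records))

    def minimum(self):
        return self.entries[0][0]

    def maximum(self):
        return self.entries[-1][0]

    def unique_values(self):
        out = []
        for key, _ in self.entries:
            if not out or out[-1] != key:
                out.append(key)
        return out

def answer_max(records, field, index=None):
    """Answer 'what is the maximum ...' from the index when one exists."""
    if index is not None:
        return index.maximum()               # no data records are read
    return max(r[field] for r in records)    # otherwise a full file pass
```

Both paths give the same answer. The difference is that the indexed path never touches the data records, which is where the change from minutes to an instantaneous response would come from.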

Knowledge of the DBMS indices is also a critical factor in navigation because it determines how long the DBMS will take to respond. Clearly we want to optimize the work done by the DBMS and prefer to generate requests that make use of indices or hash coding rather than file pass searches. But, given the variety of ways in which DBMSs behave on mixtures of keyed and non-keyed searches, it is clear that the interface must have its own ability to perform the functions of searching and sorting. There is a nice meshing of these functions that makes it possible to avoid any restrictions in this area and at the same time gives the interface control of the situation so that expensive requests can be trapped. In general, once the DBMS gets control, the user must wait until the DBMS has finished processing the request, which may be quite a while. By selectively sharing some of the searching and sorting work, it is possible to maintain control and warn the user when things get expensive.

This flexibility is achieved only by added complexity. Not only must the interface have the searching and sorting functionality but it must also be prepared to represent the partitioned workload and, of course, compute the partition that will effect the greatest efficiency. With some DBMSs, this capability is only an efficiency option. With other DBMSs that do not support non-keyed searching or sorting at all, this capability becomes critical in terms of being able to answer the request at all.
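A minimal sketch of a partitioned workload follows, assuming a request that is a conjunction of equality conditions. The `budget` parameter and the trapping rule are hypothetical, and a list comprehension stands in for the keyed DBMS retrieval.

```python
def partition(conditions, keyed_fields):
    """Split a conjunctive request into DBMS work and interface work."""
    for_dbms = {f: v for f, v in conditions.items() if f in keyed_fields}
    for_interface = {f: v for f, v in conditions.items()
                     if f not in keyed_fields}
    return for_dbms, for_interface

def run(records, conditions, keyed_fields, budget):
    """Apply keyed conditions first, then scan the rest in the interface.

    Raises RuntimeError when the interface-side scan would exceed `budget`
    records, so the user can be warned before the expensive step starts.
    """
    for_dbms, for_interface = partition(conditions, keyed_fields)
    # Stand-in for a keyed retrieval performed by the DBMS.
    candidates = [r for r in records
                  if all(r[f] == v for f, v in for_dbms.items())]
    if for_interface and len(candidates) > budget:
        raise RuntimeError(f"request would scan {len(candidates)} records")
    return [r for r in candidates
            if all(r[f] == v for f, v in for_interface.items())]
```

Because the non-keyed part runs inside the interface, the interface keeps control and can trap an expensive request. It also means a request can still be answered when the DBMS offers no non-keyed search at all.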

Security

Another aspect of the navigation problem is how security is taken into account. It is clear that for naive end users of a natural language query system, security from unauthorized access is a critical issue. Our original hope in this regard was that we could merely rely on the DBMS subschema to provide the necessary security. Unfortunately, we found the granularity of the subschema security to be too large, i.e. access is granted on a field-by-field basis. This is fine for application programs but too restrictive for data base query. We have proposed that security also be defined on a record-by-record basis so that a user might have access to certain fields only for a specific set of records.

The implication of all of this on navigation is that the navigational choices made by the system must be a function of what data is available to the current user. For some users, without access to all relations, this may require less direct paths than would otherwise be required. It is up to the navigator to find the best path to relate all the necessary data without violating any of the security constraints along the way.
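Path selection under access constraints can be sketched as a breadth-first search over a schema graph that skips relations the current user may not read. The graph below is hypothetical, and the sketch covers only relation-level access. The field-level and record-level rules discussed above are not modelled.

```python
from collections import deque

# Hypothetical schema graph: relation -> directly linked relations.
LINKS = {
    "student": ["enrollment", "advising"],
    "enrollment": ["student", "course"],
    "advising": ["student", "professor"],
    "course": ["enrollment", "professor"],
    "professor": ["course", "advising"],
}

def find_path(start, goal, permitted):
    """Shortest chain of relations from start to goal that uses only
    relations in `permitted`; None when no such chain exists."""
    if start not in permitted or goal not in permitted:
        return None
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in LINKS[path[-1]]:
            if nxt in permitted and nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None
```

A user with full access is routed through `advising`, whereas a user without access to that relation gets the longer route through `enrollment` and `course`. That longer route is the "less direct path" case.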

The final issue related to navigation is one that is currently unresolved. This issue is the mechanism by which the parser communicates to the navigator the explicit "relationship" information that the user included in the request. In general, the navigator must be prepared to work in the absence of such information, making use of predefined "natural paths". However, in those cases in which the user wishes to override these predefined paths by explicitly mentioning another relationship, the system must be prepared to act accordingly. For example, in a database of professors and students related by both a "teaches" and an "advises" relationship, the request "Who are Professor Harris' students?" is different in a navigational sense from "Who are Professor Harris' advisees?" or "Who does Professor Harris advise?" It is clear that for this simple case the use of the word "advises" or "advisees" indicates to the navigator which relationship to employ. But in more complex cases where the same two relations must be joined more than once in a request, it is not clear that all such relationships should be controlled by the explicit use of one relationship word. Of course if not all such relationship choices are impacted, then we must decide which ones are and which ones are not, presumably on the basis of the original syntax. However, at this point, it is not clear that English syntax could (or even should) provide this kind of information. This remains an open issue at this point in time.

The Data Processing Routines

Af ter a qIia-ry is ~ pro)y 1-rt~ in cli i t~r 1

seParch ing or sort iryj i perfotn Ly ti in L i jtri Ij r

eventually arrive :tt a C; .1 t e L C' i I- f '' I. 1, i i 1l- t '- " w rs the

* u~lser' . requejst. Thiq cia La still r'[ -oroV id th

user with thp inforrmation ornu lori. 'I m in, the 4rsi-

way . OIn the onti h-an,1, Lii; kin, .f KL fri p )cc- im ~ 'r'

conlionpla cp--c-,i i ut i nc !-,th cot ii I f m-a tt Ii -1rn nc , et c. ")n t,)

ot!-ier hand , tI is mob tiIe wot~ I I(I i Ii ' )' Ie (-'-j -1 o r C rrv 1riV-n i~t

a n 3r i trary q n.' ip ienc ofI rcuco-' :i ni- . I I t i i.; cejt heoni s th

ititona tjIc proqrvifn i ncj problem. )t ifnlte r-ro -ii - (12 01 r~ro

capat1i-'0lity is solu.-iht.

The placement of functions such as minimum, maximum, etc. can often be in either the DBMS or in the interface. This represents another example of how the workload can be shared. Similarly, the data processing routines must interact intimately with other aspects of the interface that maintain the pronoun context. This is true because it is not until the end of processing the data that we really know all the contexts to which subsequent pronouns might refer. It is certainly conceivable that some processes will restrict even further the set of records displayed for the user on the basis of some arbitrary predicate.


From the user's point of view, a pronoun may refer only
to the set of records actually printed. Since the interface has no
knowledge of what the arbitrary predicate is, it would be
impossible for it to generate a high-level representation of what
subsequent pronouns may refer to. The implications of this are
quite profound. Since the system no longer has a high-level
representation of the pronoun referent, it becomes difficult to
echo the meaning to the user or to even talk about its
interpretation in clarification dialog.

On the positive side, the low-level representation of a
pronoun reference that is necessitated by all this can speed up
all pronoun references. This is particularly noticeable when the
result of a non-keyed search is later referenced by a pronoun. If only
a high-level pronoun representation is maintained, then the lengthy
search must be recomputed. If both a high-level and a low-level
pronoun representation are maintained, then the system can describe
the query to the user with the high-level representation and
directly access the records without searching by using the low-
level representation. This has the effect of building an index for
an arbitrary set on the fly and using it to speed up subsequent
accesses to the same set.
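
The dual representation described above can be sketched roughly as follows. This is only an illustration in Python, not the system's actual implementation: the names PronounReferent, run_query and resolve, and the sample employee records, are all hypothetical. The referent keeps a description (the high-level form, usable for echoing the query to the user) alongside the positions of the records actually selected (the low-level form, usable as an index built on the fly).

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


@dataclass
class PronounReferent:
    """Hypothetical container for what a later pronoun may refer to."""
    description: str       # high-level form: echoed back to the user
    record_ids: List[int]  # low-level form: positions of the selected records


def run_query(table: List[Record], description: str,
              predicate: Callable[[Record], bool]) -> PronounReferent:
    # The non-keyed search: every record must be examined once.
    ids = [i for i, rec in enumerate(table) if predicate(rec)]
    return PronounReferent(description, ids)


def resolve(table: List[Record], referent: PronounReferent) -> List[Record]:
    # A later pronoun reference reuses the saved positions, so the
    # lengthy search is not recomputed.
    return [table[i] for i in referent.record_ids]


# Illustrative data only.
employees = [
    {"name": "Adams", "region": "East", "salary": 52000},
    {"name": "Baker", "region": "West", "salary": 31000},
    {"name": "Clark", "region": "East", "salary": 47000},
]

them = run_query(employees, "employees earning over 40000",
                 lambda r: r["salary"] > 40000)
names = [r["name"] for r in resolve(employees, them)]
```

A follow-up request such as "print their names" would then go through resolve rather than repeating the scan, while the description field remains available for clarification dialog.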


With regard to the data processing routines themselves, there
is an interesting relationship to syntactic quantification. There
is a direct semantic connection between the category specification
employed by the processes and certain types of quantification.

Consider the request "How many salesmen are over 100 percent of
quota in each region?" The quantification "in each region"
semantically defines the categories to be employed by the counting
process. This gives an interesting simplified representation for
this kind of quantification.
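
As a rough illustration of this connection, the quantifier can be passed to the counting process simply as the name of the field that defines the categories. The sketch below is in Python and is an assumption-laden example rather than the report's own routine: the function count_by_category, the field names region and pct_of_target, and the sample records are all hypothetical.

```python
from collections import Counter
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


def count_by_category(records: List[Record],
                      predicate: Callable[[Record], bool],
                      category: str) -> Counter:
    # "in each <category>" becomes the grouping key of the counting
    # process; the rest of the request becomes the selection predicate.
    return Counter(rec[category] for rec in records if predicate(rec))


# Illustrative data only.
salesmen = [
    {"name": "Adams", "region": "East", "pct_of_target": 120},
    {"name": "Baker", "region": "East", "pct_of_target": 95},
    {"name": "Clark", "region": "West", "pct_of_target": 110},
    {"name": "Davis", "region": "West", "pct_of_target": 130},
    {"name": "Evans", "region": "North", "pct_of_target": 80},
]

counts = count_by_category(salesmen,
                           lambda r: r["pct_of_target"] > 100,
                           "region")
```

Under this reading no separate quantifier structure is needed: the phrase "in each region" contributes only the string "region", and a region with no qualifying salesmen simply has a count of zero.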


Conclusion

We have summarized the results of the ONR supported research
on natural language database query. It is interesting to note the
change in expectations about database query that has taken place
during the life of this research contract. At the outset, people
regarded practical natural language systems as a futuristic
notion: something that would not be available for 10-20 years. The
current atmosphere is one in which a few real world applications
are just making it into actual production.

It is fair to say that the issues considered most important
at the outset of this research (the natural language analysis) are
no longer the limiting factor. As is evident from the discussion
given in this report, the navigational and processing functions
provide the most fertile ground for future research. As such, the
problem of providing practical natural language access to
databases is no longer to be considered a primarily natural
language analysis problem, but also a theoretical database
problem, with overtones of automatic programming. For this reason,
it is not likely that more sophisticated parsing techniques will
impact current capabilities as much as more general AI research
related to database semantics is likely to do.