Cognition Technical Overview © 2007 Cognition Technologies, Inc.
Technical Overview of Cognition's Semantic NLP (as Applied to Search)
Kathleen Dahlgren, Ph.D.
Growth of the Internet, the proliferation of email as the preferred method of information
exchange, and the creation of huge stores of digitized text have opened the gateway to a
deluge of information that is often difficult to navigate and search. Users are awash in data,
unable to glean relevant information.
To address this need and pain point, Cognition Technologies, Inc. has introduced Cognition's
Semantic NLP (Natural Language Processing). This evolutionary software uses state-of-the-art
linguistic technology to easily and precisely find on-target information on the Internet or in
large libraries of digitized text. Users pose queries in plain English, and Cognition's Semantic
NLP interprets their meaning, responding with more precise results than is possible with
traditional search technologies (e.g. pattern matching, concept search, etc.). Cognition's
Semantic NLP produces results which are both highly relevant to the user and very complete.
This relevancy and completeness (precision and recall) is much higher than is possible
with traditional Search technologies, no matter how the user query is worded.
(Please refer to the glossary in the Appendix while reading this Technical Overview.)
I. Search Technology
Most search engines in use today are frustrating to use because they yield a large quantity of
irrelevant information. Paradoxically, they also fail to retrieve significant amounts of relevant
information. Current search technologies only work well when the user knows exactly how the
information in the target documents is worded, and forms a search query with sufficiently fine
granularity to yield a manageable amount of information. It is impossible to know how to word
a query in advance (it would require the user to know the answer to the query as the query was
constructed), so typically users spend a lot of time browsing irrelevant information,
constructing ever more complex Boolean queries with only marginal success, or face
the frustration of finding nothing at all.
Cognition's Semantic NLP is substantially more precise and exhaustive in its ability to search a
dataset, as indicated by commonly used Precision/Recall tests. Precision is a measure of
retrieval accuracy, calculated by dividing the total number of relevant retrievals by the number
of all retrievals generated by the search. Recall is a measure of the extent to which relevant
material in the total document base is found. It is calculated by dividing the number of relevant
retrievals by the total number of potentially relevant retrievals in the document base. One
comparative benchmark for measuring Precision and Recall is a government-sponsored
competition known as TREC, which is intended to test Search technologies. In 2006, the best
performer in bioinformatics had 16% precision and 26% recall. Cognition has performed a
number of internal head-to-head comparisons of Cognition's Semantic NLP with other
well-known Search technologies on the same databases with the same queries. Document sets
searchable by all the Search technologies were not available, so the head-to-head comparisons
were performed on different document sets, but in each case Cognition's Semantic NLP and the
competitor Search engine were searching the identical documents with identical queries. Fifty
queries considered likely to be entered by users were formed, searches were performed, and
result relevancy was judged by members of the Cognition Technologies staff. Precision/Recall
results were then compared.
Recall for Cognition's Semantic NLP and other Search engines was measured by taking all the
relevant retrievals found as a baseline of 100% and comparing the Completeness of each Search
engine to that baseline. There may have been other retrievals missed, but none was observed.
Note that Cognition's Semantic NLP far outperforms the competitors in both precision and
recall.
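The two measures, and the pooled baseline used for recall, can be illustrated with a short Python sketch. This is only an illustration of the definitions given above, not Cognition's test harness: the function names, document IDs, and relevance judgments are all hypothetical.

```python
def precision(retrieved, relevant):
    """Relevant retrievals divided by all retrievals."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


def pooled_recall(retrieved, relevant_pool):
    """Relevant retrievals divided by the pooled baseline of relevant documents."""
    retrieved, relevant_pool = set(retrieved), set(relevant_pool)
    return len(retrieved & relevant_pool) / len(relevant_pool) if relevant_pool else 0.0


# Hypothetical results of two engines for one query.
engine_a = ["d1", "d2", "d3", "d4"]
engine_b = ["d2", "d5"]
judged_relevant = {"d1", "d2", "d5"}  # made-up human relevance judgments

# Baseline of 100%: every relevant retrieval found by any engine.
pool = (set(engine_a) | set(engine_b)) & judged_relevant

print(precision(engine_a, judged_relevant), pooled_recall(engine_a, pool))
print(precision(engine_b, judged_relevant), pooled_recall(engine_b, pool))
```

Because the baseline contains only documents that some engine actually found, this kind of recall is relative to the pool and can overstate true recall if relevant documents were missed by every engine.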
Search Engine              Document Base                           Precision   Recall
Google                     globalissues.com (blog)                 12%         21%
Cognition's Semantic NLP   globalissues.com (blog)                 91%         90%
dtSearch                   Microsoft Anti-Trust Case emails        24%         19%
Cognition's Semantic NLP   Microsoft Anti-Trust Case emails        96%         95%
Autonomy                   NewYorkLife.com (corporate Web site)    1%          40%
Cognition's Semantic NLP   NewYorkLife.com (corporate Web site)    92%         87%
II. Cognition's Semantic NLP Linguistic Technology
Cognition's Semantic NLP searches on meanings, not patterns; therefore, its results are very
precise. The user poses queries in plain English, and Cognition's Semantic NLP determines
what the words in the query mean in the context of the query. If you ask "How can I buy stock
on the market?", Cognition's Semantic NLP determines that "stock" means "share" or
"security". It searches only on that meaning of "stock" and doesn't retrieve information about
stocking shelves, cattle or flowers. If you ask "How can I stock the shelves of my market?", it
retrieves information about merchandising, and doesn't retrieve information about shares in
companies. Cognition's Semantic NLP returns information with over 90% Relevancy, reducing
the user's need to ponder large numbers of irrelevant retrievals found with other Search
technologies.
Simultaneously, Cognition's Semantic NLP overcomes the problem of information underload,
i.e. not finding anything at all because of differences in wording. It finds information regardless
of the way a concept is worded in the target documents. If you ask "Fatal fumes in the
workplace?", Cognition's Semantic NLP finds documents that talk about "gas", "vapor",
"steam", etc. It is important to note that some of these words have ambiguous meanings (e.g.
"fume" can mean a vapor or it can mean an aromatic wine), but Cognition's Semantic NLP
doesn't retrieve irrelevant information triggered by those words used in a different meaning
than in the query. For example, when searching on "fume", it retrieves "gas" meaning vapor,
but not "gas" meaning gasoline. The result is that Cognition's Semantic NLP retrieves 5 to 7
times more relevant information than other Search technologies, as measured in head-to-head
comparisons with other Search engines, while maintaining over 90% Relevancy.
Another source of greater Completeness is Cognition's Semantic NLP taxonomy, which enables
Cognition's Semantic NLP to search on specific information when queried on more general
information. As an example, if the user searches on "money", Cognition's Semantic NLP will
find information about "dollar", "pound" and "yen", etc. Cognition's Semantic NLP taxonomy
covers 506,000 concepts, and is thus very complete. The customer doesn't have to build a
taxonomy from scratch, as with some technologies.
Cognition's Semantic NLP Architecture
The components of linguistic processing in Cognition include a reader, a phrase parser, a
morphological component, a word meaning interpreter, a dictionary, a taxonomy and a
meaning thesaurus. The dictionary and taxonomy are used by the phrase parser, morphology
and word meaning interpreter. The meaning thesaurus is used to find alternate wordings
during search.
[Figure: architecture diagram with the components Reader, Morphology, Phrase Recognizer, Word Meaning Interpreter, Search Engine, and Indexer, and the databases Dictionary, Taxonomy, and Meaning Thesaurus.]
1. The reader. The reader reads the text or query, locates the words, and looks them up in the
dictionary. This component guards against false hits when one word is part of another
word, as in "part" and "party", or "loss" and "floss".
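The difference between matching letter strings and matching whole words can be shown in a few lines of Python. This is a minimal sketch of the idea only; the tokenizer and function names are illustrative assumptions and say nothing about how the Cognition reader is actually implemented.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens (letters only, for simplicity)."""
    return re.findall(r"[a-z]+", text.lower())


def substring_hit(term, text):
    """Letter-pattern match: true even when the term is part of another word."""
    return term.lower() in text.lower()


def whole_word_hit(term, text):
    """Word-level match: true only when the term occurs as a complete token."""
    return term.lower() in tokenize(text)


text = "She bought dental floss for the party."
for term in ("loss", "part", "party"):
    print(term, substring_hit(term, text), whole_word_hit(term, text))
```

Only "party" is a whole-word hit in this sentence; "loss" and "part" match as substrings alone, which is the kind of false hit the reader is described as preventing.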
2. The morphological component. The morphological component isolates word stems from
prefixes and suffixes, enabling Cognition's Semantic NLP to recognize many millions of word
forms (actually, an indefinite number). Some words take various forms according to
morphological rules. Some of these are enumerated below:
a. Nouns with irregular plural morphology, such as "mouse"/"mice" and "tooth"/"teeth".
b. Nouns with regular changes in the plural, such as "baby"/"babies".
c. Regular verb inflections, such as "raze"/"razed"/"razing" and "ship"/"shipped"/"shipping".
d. Verbs with irregular past tense forms, such as "catch"/"caught"/"catching".
e. Regular derived forms:
   Affix(es)        Stem          Derived form
   inter-           communicate   intercommunicate
   -tion            communicate   communication
   inter- + -tion   communicate   intercommunication
   -ize             actual        actualize
   re-              marry         remarry
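A toy version of stem isolation is sketched below in Python. It is an assumption-laden illustration, not Cognition's morphology: the stem list, the irregular-form table, the prefix list, and the suffix rules are all invented for the example, and a candidate is accepted only if it appears in the small stem lexicon.

```python
# Toy lexicon and rules, invented for this sketch.
STEMS = {"mouse", "tooth", "baby", "raze", "ship", "catch", "communicate", "marry"}
IRREGULAR = {"mice": "mouse", "teeth": "tooth", "caught": "catch"}
PREFIXES = ["inter", "re"]
# (suffix to strip, text to put back) pairs, tried in order.
SUFFIX_RULES = [("ies", "y"), ("ion", "e"), ("ing", ""), ("ing", "e"),
                ("ed", ""), ("ed", "e"), ("s", "")]


def candidates(form):
    """Yield the form itself and every stem candidate the suffix rules allow."""
    yield form
    for suffix, replacement in SUFFIX_RULES:
        if form.endswith(suffix):
            base = form[:-len(suffix)] + replacement
            yield base
            # Undo consonant doubling, e.g. "shipp" -> "ship".
            if len(base) >= 2 and base[-1] == base[-2]:
                yield base[:-1]


def analyze(form):
    """Return (prefix, stem) if the form reduces to a known stem, else None."""
    form = form.lower()
    if form in IRREGULAR:
        return ("", IRREGULAR[form])
    for prefix in [""] + PREFIXES:
        if form.startswith(prefix):
            for cand in candidates(form[len(prefix):]):
                if cand in STEMS:
                    return (prefix, cand)
    return None


for word in ("mice", "babies", "shipped", "razing", "intercommunication", "remarry"):
    print(word, analyze(word))
```

Checking candidates against the lexicon is what keeps a rule such as "strip -s" from turning "floss" into a spurious stem.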
3. The Phrase Recognizer. The phrase recognizer combines words into phrases for more
accurate interpretation. There are several different types of phrases it handles.
a. Names. The phrase parser recognizes that certain patterns are personal names, company
names or places, whether or not those names are represented in the lexicon. The result is
the ability to map from one form to another, including regular short forms. Examples are:
   "Mr. Savi Samdi": short form "Mr. Samdi"; Samdi is a human male
   "Lake Su": short form "Su"; is a lake
   "The XYZ Corporation": short form "XYZ Corp." or "XYZ"; is a company
b. Dates. The phrase recognizer sees all date variations, and maps one to the other, as in
"December 1, 1992", "12/1/92", "12-1-92", "Dec. 1, '92".
c. Compounds. The phrase recognizer interprets lexical phrases such as "movie set", "heart
attack", "net revenue", etc. Such phrases can consist of many words. There are currently
over 191,000 compound phrases in the lexicon.
d. Acronyms. The phrase recognizer maps the long form of acronyms to the short form,
noting that "Securities and Exchange Commission" is the long form of "SEC".
e. Idioms. The phrase recognizer pieces together idioms, including morphological variants
of them (which are linked to other word meanings in the meaning thesaurus). Some examples
are:
   "kick the bucket", "kicked the bucket" (die, sense 1)
   "let the cat out of the bag", "letting the cat out of the bag" (disclose, sense 2)
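The date mapping described under the phrase recognizer can be sketched with Python's standard library: each surface form is parsed into one canonical date value, so that any variant can be matched against any other. The list of formats and the exact spellings of the example strings are assumptions made for this sketch, and month names are parsed according to the current locale (English is assumed).

```python
from datetime import datetime

# Assumed surface formats; a production recognizer would cover many more.
DATE_FORMATS = ["%B %d, %Y", "%m/%d/%y", "%m-%d-%y", "%b. %d, '%y"]


def normalize_date(text):
    """Return a datetime.date for a recognized date string, or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


variants = ["December 1, 1992", "12/1/92", "12-1-92", "Dec. 1, '92"]
for v in variants:
    print(v, "->", normalize_date(v))
```

Indexing the canonical value instead of the literal string is one simple way to let a query in one date format find documents written in another.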
4. Word Meaning Interpreter. The word meaning interpreter uses context and structure to
determine the meanings of words in context. Several databases are consulted to determine
word meanings. For example, the word "check" in "check up" is interpreted as part of a phrase
with "up" with a specific meaning, while "check" in "pay with a check" is interpreted as an
individual word meaning a promissory note because of the context with "pay". The lexicon
contains 17,000 ambiguous words.
5. Word Sense Selection Database. This database encodes trigger information and statistical
occurrence metrics used to contribute to contextual word sense selection. There are over 4
million such meaning contexts.
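A much-simplified picture of trigger-based sense selection is sketched below in Python. The sense labels, trigger words, and weights are invented for the example and stand in for the trigger information and occurrence metrics mentioned above; the actual database contents and scoring method are not described in this overview.

```python
# Hypothetical triggers with made-up weights for each sense of "stock".
SENSE_TRIGGERS = {
    "stock": {
        "share or security": {"buy": 2, "sell": 2, "market": 1, "price": 1},
        "supply with merchandise": {"shelves": 3, "store": 1, "inventory": 2},
        "livestock": {"cattle": 3, "yard": 1, "herd": 2},
    },
}


def select_sense(word, context_words):
    """Pick the sense whose triggers carry the most weight in the context."""
    context = {w.lower().strip("?.,") for w in context_words}
    scores = {
        sense: sum(weight for trigger, weight in triggers.items() if trigger in context)
        for sense, triggers in SENSE_TRIGGERS[word].items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


q1 = "How can I buy stock on the market?".split()
q2 = "How can I stock the shelves of my market?".split()
print(select_sense("stock", q1))
print(select_sense("stock", q2))
```

In the second query "market" still lends some weight to the financial sense, but the stronger "shelves" trigger outweighs it, which is why weighted evidence is more useful here than a simple count of matching words.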
6. Dictionary. In the dictionary, each meaning of each word is defined, and given morphological,
syntactic, taxonomic and semantic features. This information enables the software to select
word meanings, recognize various forms of a given word, and parse sentences.
Cognition's Semantic NLP lexicon is quite broad, including 506,000 forms, over 536,000 concepts,
and millions of word forms. It has entries for almost every common word of English, and tens of
thousands of proper names, phrases and acronyms. In combination with the morphology
component, it recognizes millions of word forms. It has vocabulary in many domains including
law, health and medicine, biology, genomics, finance, terrorism, recreation, household, human
resources, encyclopedia articles, nuclear energy, software technical notes, newspaper articles,
government regulations, telecommunications, human factors engineering, and military.
7. Taxonomy. Cognition's Semantic NLP taxonomy classifies all objects and events in an
inheritance hierarchy. An abbreviated piece of the taxonomy is shown below:
bovid_mammal
    antelope_mammal
        gnu, antelope, gazelle
    bovine_mammal
        cow, bull
    ovine_mammal
        domestic_sheep, ovine, Ovis poli
            sheep, lamb, ewe, Ovis poli
There are approximately 7,000 unique nonterminal nodes and 536,000 unique leaves or
word senses.
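Searching from the general to the particular amounts to expanding a query concept into all of its descendants in the hierarchy. The Python sketch below shows that expansion on a tiny hand-built tree; the node names come from examples in this overview, but the dictionary representation, the exact parent and child assignments, and the function name are assumptions for illustration.

```python
# Tiny illustrative hierarchy: parent concept -> child concepts.
CHILDREN = {
    "money": ["dollar", "pound", "yen"],
    "bovid_mammal": ["antelope_mammal", "bovine_mammal", "ovine_mammal"],
    "antelope_mammal": ["gnu", "gazelle"],
    "bovine_mammal": ["cow", "bull"],
}


def expand(concept):
    """Return the concept followed by all of its descendants, depth first."""
    result = [concept]
    for child in CHILDREN.get(concept, []):
        result.extend(expand(child))
    return result


print(expand("money"))
print(expand("bovid_mammal"))
```

A query on "money" would then be run against every concept in the expanded list, while a query on a leaf such as "yen" expands only to itself.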
8. Meaning Thesaurus. The meaning thesaurus maps meanings to each other, forming
meaning classes. Unlike ordinary thesauri, the mapping is from word meaning to word
meaning (or phrase). The meanings within a grouping are judged to be loosely synonymous.
The synonymy is "loose" in the sense that related meanings may have different parts of
speech, but they evoke the same ideas in the mind of the reader. For example, the
following concepts are in one thesaural grouping:
   bank9, column2, file3, line3, queue1, rank4, row1, tier1, alignment1, align1
The digits indicate the specific senses or meanings of the words. In this example,
"bank9" means a set of similar things, "column2" means a line of similar things, "file3"
means a lined-up group of things, and so on.
The word "support" illustrates that a given word in different meanings may belong to a
number of different meaning classes or thesaural groups, as shown below:
   support1, abet1, assist1, bail3, sponsor1, benefit1, benefit2, assistance1
   support2, attest1, back7, affirm3, establish2, prove1
   support3, bear3, carry9, fortify1, shore2
   support4, helpdesk1, hotline1, service3, serve3, help2
Phrase relations are also indicated in the concept thesaurus:
   kick the bucket, die1, expire4
   SEC, Securities and Exchange Commission
There are currently over 76,000 concept thesaural groups.
This database is a primary source of Cognition's Semantic NLP unique combination of
Relevancy and Completeness. Cognition's Semantic NLP not only knows all the different
ways of saying things for full Completeness, it also knows which senses of the words should
be counted as equivalents for high Relevancy. In fact, Cognition's Semantic NLP word
meaning interpreter has 94% Relevancy. It is a commonplace in search to say that high
recall comes at the expense of high precision. Since Cognition's Semantic NLP
disambiguates words, and maps meaning to meaning in the thesaurus, the thesaurus
improves recall without lowering precision.
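The reason a sense-keyed thesaurus can raise recall without hurting precision is visible in a small Python sketch: expansion starts from one disambiguated sense, so only that sense's group is added to the query. The sense identifiers are taken from the "support" example above (abbreviated), while the data structure and function name are assumptions for illustration.

```python
# Abbreviated thesaural groups, keyed by word sense rather than by word.
GROUPS = [
    {"support1", "abet1", "assist1", "sponsor1"},
    {"support2", "attest1", "affirm3", "prove1"},
    {"support3", "bear3", "carry9", "fortify1"},
    {"support4", "helpdesk1", "hotline1", "help2"},
]


def equivalents(sense_id):
    """Return every sense grouped with the given sense (including itself)."""
    found = set()
    for group in GROUPS:
        if sense_id in group:
            found |= group
    return found


# Once "support" has been disambiguated to sense 4 (customer help),
# only help-desk meanings are added to the search.
print(sorted(equivalents("support4")))
```

A word-keyed thesaurus would instead have to add all four groups for "support", pulling in "prove", "carry", and "abet" regardless of what the user meant.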
9. Synographs. Synographs are alternate spellings for entries in the dictionary, such as
"cookie" and "cooky". This database allows for recognition of alternate spellings and
common misspellings. There are currently over 12,000 synograph entries.
III. Lexicography Tools
The semantic databases have been developed with over 100 person-years of lexicography work.
A number of tools have been developed to assist lexicographers in the development of the
semantic databases.
1. Lexicon Tool
Entries in the dictionary include single words, multiple-word phrases, and the nodes of the
ontology. Each entry has one or more senses (meanings) with syntactic and semantic
features.
Each word sense has several components, as follows:
a. plain English definition
This is what is displayed to the user in the search interface if the user chooses to view the
search concepts.
b. ontological attachment(s)
Every sense is attached to one or more nodes in the ontology. For example, the
first sense of "dog" is attached to "pet_node" and "canine_mammal".
c. syntactic features
There are currently about 1,250 unique syntactic features. Each sense has 2 or more
features associated with it. The syntactic features include main category features (noun,
verb, etc.), morphological features for classifying the different forms of a word (e.g.
"wind", "winding", "wound"), and subcategory features (such as intransitive for verbs).
For example, the first sense of "dog" has the category feature "noun", a morphological
feature indicating how to pluralize it, and no subcategory features.
d. semantic features
Each sense may have semantic features, such as "domain" features (used to prefer a
sense in a particular domain) and selectional restrictions (for use with a parser).
Selectional features help guide word sense disambiguation. For example, the verb
"charge" in the meaning "indict" requires a sentient object and a crime as the oblique
object, whereas "charge" in the meaning "electrify" requires an electrical device as object
and a form of energy as oblique object.
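The four components of a word sense can be pictured as a small record type. The Python sketch below is only a reading aid: the class and field names, the placeholder definition of "dog", the feature label "regular_plural", and the ontology nodes given for "charge" are hypothetical, and the real database schema is not described in this overview.

```python
from dataclasses import dataclass, field


@dataclass
class WordSense:
    definition: str                # plain English definition shown to the user
    ontology_nodes: list           # one or more ontological attachments
    syntactic_features: list       # category, morphology, subcategory features
    semantic_features: dict = field(default_factory=dict)  # domain, selectional restrictions


@dataclass
class LexiconEntry:
    headword: str
    senses: list = field(default_factory=list)


dog = LexiconEntry("dog", [
    WordSense("a domesticated canine kept as a pet",  # placeholder wording
              ["pet_node", "canine_mammal"],
              ["noun", "regular_plural"]),
])

charge = LexiconEntry("charge", [
    WordSense("indict", ["legal_act"], ["verb"],
              {"object": "sentient", "oblique_object": "crime"}),
    WordSense("electrify", ["energy_transfer"], ["verb"],
              {"object": "electrical_device", "oblique_object": "energy"}),
])


def senses_allowing(entry, object_type, oblique_type):
    """Return the senses whose selectional restrictions fit the given arguments."""
    return [s for s in entry.senses
            if s.semantic_features.get("object") == object_type
            and s.semantic_features.get("oblique_object") == oblique_type]


print([s.definition for s in senses_allowing(charge, "sentient", "crime")])
```

Filtering senses by the types of a verb's arguments is one way selectional restrictions can narrow "charge" to a single meaning in a sentence such as "charge the suspect with fraud".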
The tool runs as a server and guarantees that no word is edited by more than one
lexicographer at a time, for integrity of the database. It also makes changes to words
available to all lexicographers immediately after they have been saved into the database.
2. Meaning Thesaurus Tool
The concept thesaurus tool enables the lexicographer to look up words in the lexicon, view the
meanings, select a meaning, and view all of the concept groups that the meaning is a member of.
The lexicographer may add to the concept groups, delete from them, and create new ones. This
can be done in parallel with more than one concept group at a time.
The tool runs as a server and guarantees that no concept group is edited by more than one
lexicographer at a time, for integrity of the database. It also makes upgrades available to all
lexicographers immediately after they have been saved into the database.
IV. Customer Tools
A number of tools have been developed to enable customers to index their documents,
customize their specific jargon such as product names, and search their documents. Cognition's
Semantic NLP employs client-server communication for optimal ease of use and efficiency on
large document bases. A standalone, non-client-server version of Cognition's Semantic NLP is
also available. At the core of the system is the Cognition's Semantic NLP Server, which can be
configured for indexing, searching, or both.
1. Indexing Tool
The Cognition Indexer GUI is the primary interface used to create and index a Cognition's
Semantic NLP Project. A Cognition's Semantic NLP Project is simply a list of documents (in
any of the formats Cognition's Semantic NLP handles), together with a set of parameters to
be used when they are indexed and searched. It is straightforward and easy to use, though
as with any software, good results will depend on training (or reference to the user's guide)
and practice.
The following screenshot shows the main window of the Indexer GUI.
To get started, the user employs the Cognition Indexer GUI to create a list of the documents
that the user wants to search. The following screenshot shows the Indexer window used for
selecting files for indexing from a file system.
The Indexer is then used to send an indexing request to the Cognition's Semantic NLP
Service. The documents to be indexed can be in HTML, plain text, MS Word, WordPerfect,
RTF, or PDF formats, and can be included from the local network or from the Web via the
Cognition Spider.
Following its initial creation, an index may be updated as often as needed. Updating the
index does not entail reindexing the entire document set, but only those documents for
which a change is indicated. The Cognition Indexer GUI can automatically detect when
documents have changed, and mark them for reindexing. Documents can also be added to
or removed from the document set by the user. In addition, a daemon may be invoked to
automatically detect changes and update the index using the command-line version of the
Cognition Indexer.
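Incremental updating of this kind can be sketched in Python as a comparison between what was indexed last time and what exists now. The overview does not say how the Indexer detects changes; content hashing is assumed here purely for illustration, and the function names and file names are hypothetical.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Content hash used to tell whether a document changed (an assumption)."""
    return hashlib.sha256(content).hexdigest()


def plan_update(indexed, current):
    """Compare the previous index state with the current document set.

    indexed: {path: fingerprint recorded at last indexing}
    current: {path: current file content as bytes}
    Returns (paths to index or reindex, paths to remove from the index).
    """
    to_index = sorted(p for p, data in current.items()
                      if indexed.get(p) != fingerprint(data))
    to_remove = sorted(p for p in indexed if p not in current)
    return to_index, to_remove


indexed = {
    "a.txt": fingerprint(b"old text"),
    "b.txt": fingerprint(b"unchanged"),
    "c.txt": fingerprint(b"deleted later"),
}
current = {"a.txt": b"new text", "b.txt": b"unchanged", "d.txt": b"brand new"}

print(plan_update(indexed, current))
```

Only the changed and added files are scheduled for indexing, so the cost of an update follows the size of the change, not the size of the document base.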
2. Sample Scripts
Once the index has been created, the user employs a Web browser (e.g. Internet Explorer
or Netscape Communicator) to search it. The Cognition's Semantic NLP package includes
sample Web pages for this purpose. These sample pages may be used without
modification, or they may be customized by the user to obtain the desired appearance and
functionality. Sample scripts for ASP, Python, Java, PHP and Perl are available.
3. Automatic Dictionary Expansion
Cognition's Semantic NLP dictionary can be expanded to include large numbers of customer
vocabulary words automatically. Any file of terms and words used in the customer's
business, along with the categories of the terms, can be merged automatically with
Cognition's Semantic NLP dictionary. Recently, Cognition expanded its vocabulary in
medicine and molecular biology by over 130,000 words using semi-automated
techniques.
4. Customer Dictionary Expansion
The customer may add words in ontological classes if desired. The customer may force
Cognition's Semantic NLP into the desired meaning of a word, and the customer may force
Cognition's Semantic NLP to consider a word a last name, or not a last name, as desired.
V. Product Features of Cognition's Semantic NLP
1. Relevance ranking. Cognition's Semantic NLP returns a list of retrievals containing
documents in which the query concepts were found together in a sentence. Those with
exact word matches to the query terms come first. The next group is documents in which
there were exact matches to some query terms in the body of the document, but other
query terms only matched conceptually. Which terms matched exactly is indicated. The last
group is documents in which there were conceptual matches to the query terms. Within
each group, documents are listed according to the number of sentences in which all query
terms were found.
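The three-group ordering described in this item can be expressed as a two-part sort key. The Python sketch below is one reading of that description, not the product's ranking code: the `Hit` record, its field names, and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    doc_id: str
    exact_terms: int        # query terms matched by their exact wording
    conceptual_terms: int   # query terms matched only by meaning
    sentence_count: int     # sentences in which all query terms were found


def tier(hit):
    """0 = all terms exact, 1 = some exact and some conceptual, 2 = all conceptual."""
    if hit.conceptual_terms == 0:
        return 0
    if hit.exact_terms > 0:
        return 1
    return 2


def rank(hits):
    """Order by group first, then by number of matching sentences (most first)."""
    return sorted(hits, key=lambda h: (tier(h), -h.sentence_count))


hits = [
    Hit("c", exact_terms=0, conceptual_terms=2, sentence_count=5),
    Hit("a", exact_terms=2, conceptual_terms=0, sentence_count=1),
    Hit("b", exact_terms=1, conceptual_terms=1, sentence_count=3),
    Hit("d", exact_terms=2, conceptual_terms=0, sentence_count=4),
]
print([h.doc_id for h in rank(hits)])
```

Note that document "c" has the most matching sentences yet still ranks last, because group membership takes priority over sentence count.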
2. Spelling correction. In the Search interface, unrecognized words are noted and the user is
given a list of alternative spellings to select from. The user may also leave the word as is.
Cognition's Semantic NLP does search on words that are unrecognized if the user so desires.
3. Specific retrievals highlighted. When the user clicks on the Highlighted Text link
corresponding to a retrieved document, Cognition's Semantic NLP highlights the relevant
section. Additional relevant retrievals within a document are indicated by a pointing hand
figure at the end of a highlighted section. If there is no pointing hand figure at the end of a
highlighted section, it tells the user that there are no additional relevant results. In other
words, it behaves like a clipping service, which is most useful with larger documents.
4. Specific words highlighted. When the user clicks on the Highlighted Text link
corresponding to a retrieved document, Cognition's Semantic NLP also color-codes the
specific words which matched the query within the relevant highlighted section. As the user
hovers the cursor over a matched word, the corresponding query term is indicated.
Sometimes a given word may correspond to more than one query term. In this case the
word is highlighted according to the first query term matched, but the hover-over text
indicates all matched terms.
5. Linguistic Boolean Search. Cognition's Semantic NLP searches can be formed using fully
recursive Boolean expressions with AND, OR, WITH and AND NOT operators. The
expressions connected with the Boolean operators are interpreted for meaning. See the
help function at http://medline.cognition.com or http://wikipedia.cognition.com for
complete instructions and examples of use for linguistic Booleans.
6. Fuzzy Search. Cognition's Semantic NLP searches can be formed using wildcards and
fuzzy operators (e.g., "/Liebowicz/n" matches names that sound like "Liebowicz", and
[...]i matches words that start with "dr" and end with "g", regardless of case), such that
proper names and other words can be matched approximately. See the help function at
http://medline.cognition.com or http://wikipedia.cognition.com for complete instructions
and examples of use for fuzzy search.
7. Review Tool. As part of its ASP solution, Cognition's Semantic NLP has a review tool for
reviewing and categorizing document sets. Features include the ability to create project
categories and classify documents into those categories, to limit searches within categories,
to browse document sets without querying and to export archives of categorized
documents. When combined with a relational database, the features provided allow the
user to create and modify database tables, display document-related data from tables as
part of query results and restrict searches based on table column values. Also included are
scripts that allow the user to take notes and have the notes indexed and available to be
searched in tandem with the related document set.
8. Formats. Cognition's Semantic NLP indexes documents in HTML, XML, OCR'd text and plain
ASCII text. Documents in Word, PowerPoint, RTF, or WordPerfect are converted to HTML
before being indexed. Some engineering may be required for XML. Documents in PDF are
converted to plain text before being indexed. The user may view retrievals in documents
converted to HTML or plain text with the specific retrieved sections highlighted, but may
also choose to view the original file without highlighting.
9. Customer and meta tags. Cognition's Semantic NLP can search in customer-specified tags, if
desired, with some engineering assistance from Cognition Technologies.
10. Indexing local directories. The Cognition Indexer interface permits the user to select
individual files or whole directories for indexing.
11. Spider. Cognition's Indexer GUI includes an interface for creating a list of Web files to
index. The user enters a URL from which to start, a desired depth to crawl, and parameters
to include or exclude particular URLs.
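The three spider parameters (start URL, crawl depth, include/exclude rules) map naturally onto a breadth-first traversal. The Python sketch below runs over an in-memory link table so that it needs no network access; the function signature, the substring-based include/exclude rules, and the example URLs are assumptions for illustration and do not describe the Cognition Spider's actual options.

```python
from collections import deque


def crawl(start_url, max_depth, get_links, include=None, exclude=()):
    """Breadth-first list of URLs reachable within max_depth links of start_url.

    get_links(url) returns the URLs linked from a page. A URL is skipped if it
    contains any `exclude` substring, or if `include` is given and none of its
    substrings match.
    """
    def allowed(url):
        if any(pattern in url for pattern in exclude):
            return False
        return include is None or any(pattern in url for pattern in include)

    seen = {start_url}
    queue = deque([(start_url, 0)])
    found = []
    while queue:
        url, depth = queue.popleft()
        found.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the requested depth
        for link in get_links(url):
            if link not in seen and allowed(link):
                seen.add(link)
                queue.append((link, depth + 1))
    return found


# Made-up site structure standing in for fetched pages.
LINKS = {
    "http://example.com/": ["http://example.com/a", "http://example.com/private/x",
                            "http://example.com/b"],
    "http://example.com/a": ["http://example.com/a/deep"],
    "http://example.com/a/deep": ["http://example.com/a/deeper"],
}

print(crawl("http://example.com/", 2, lambda u: LINKS.get(u, []), exclude=("/private/",)))
```

In a real crawler `get_links` would fetch and parse each page; keeping it as a parameter makes the traversal logic easy to test in isolation.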
12. Authentication. The spider can be directed to use passwords or cookies to enter sites that
require authentication, so that the user can index these sites.
13. Languages. Cognition's Semantic NLP searches in any language handled by Unicode.
14. Search and retrieval pages. Customizable search and retrieval ASP pages are provided for
the user. The user can make the search and retrieval pages look any way desired.
15. Partial updating. The user may select any number of files to reindex, rather than having to
reindex an entire document base when individual documents are added or changed.
16. Automatic updating. A console-level indexing command is provided so that system
administrators can automatically update new files or changed files on a regular basis,
initiated by the computer clock.
17. Load balancing. The Cognition's Semantic NLP indexing interface automatically distributes
document indexing across as many servers as the administrator selects.
18. Brokering. Administrative tools enable the system administrator to manually control
indexing and searching load, and membership access to document bases. The tools send
queries to servers in response to load and allocate databases to specified servers.
Customer-specific criteria may involve user parameters such as subscription
membership.
19. Categorization. The interface queries users for categories and saves whole retrieval lists or
individual files into the categories. New categories can be created on the fly. Subsequent
searches can be restricted to categories.
20. User-defined ontology. Users can add ontological classes by editing a standard file. In this
way users can define search into classes unknown to Cognition's Semantic NLP, such as
company widget names or phrases. For example, Sony could add a category "video recorder"
with specific video recorder names as category members. Using this new category, an end user
could ask "video recorder" and retrieve the specific video recorder names mentioned in
indexed documents.
21. User control of names. Users can force preference for name or non-name interpretations
of words like "Bush" and "Stone", which can either be names or common words.
VI. Comparison with Other Search Engines
The low Relevancy/Completeness performance of most Search engines is due to the use of
pattern-matching technology. Pattern matching matches strings of letters in a query to strings
of letters within the document set, ignoring context and meaning. If you search on "check"
meaning a promissory note, a pattern-matching search engine only searches for instances of
the letter pattern "check", regardless of the meaning of the word in context. Your results include
"check" meaning "see whether", "postponement", and "hold back", as well as "promissory
note".
Pattern-matching technology results in information overload because of word ambiguity, so that the
best this technology can offer is typically no better than 33% Relevancy. It is important to note that
word ambiguity is most prominent in the most frequently used words of a language. So the problem
is concentrated among the very words which occur most often in queries and documents.
Also, most Search engines don't require that your search terms be near each other in a
document, so if you search on "pay with a check", they will retrieve documents in which
"pay" is found in the last sentence, and "check" in the first sentence. Some engines don't even
respect word boundaries, retrieving texts that contain one of your Search terms as part of
another word. If you search on "loss", they will return documents with "gloss" and "floss".
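The proximity problem can be made concrete with a short Python comparison of document-level and sentence-level matching. It is a simplified sketch: the sentence splitter, the word tokenizer, and the sample texts are assumptions, and real engines use more careful segmentation.

```python
import re


def sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def words(text):
    """Set of lowercase word tokens in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def document_level_match(terms, text):
    """True if every term occurs somewhere in the document."""
    return set(terms) <= words(text)


def sentence_level_match(terms, text):
    """True only if every term occurs together in at least one sentence."""
    return any(set(terms) <= words(s) for s in sentences(text))


scattered = "Check the oil before a long trip. Tolls are extra. You will pay at the gate."
together = "You can pay with a check at the counter."
query = {"pay", "check"}

print(document_level_match(query, scattered), sentence_level_match(query, scattered))
print(document_level_match(query, together), sentence_level_match(query, together))
```

The first text satisfies a document-level match even though "pay" and "check" are unrelated there; requiring the terms to share a sentence filters it out while keeping the second text.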
Pattern-matching technologies also miss relevant information. They only find material with the
exact words of the search. If you search on "profit" meaning "net revenue", they don't find
documents containing "net revenue" or "income". If you search on "SEC", they don't find
documents containing "Securities and Exchange Commission". Conversely, if you search on
"Securities and Exchange Commission", they don't find documents containing "SEC". If you search
on "vehicle", they only find that word, not types of vehicles such as "car", "truck" and "plane". This is
why such technology has vastly inferior Completeness to the linguistic technology employed by
Cognition. Typically, pattern-matching technologies retrieve no more than 20% of the information
Cognition's Semantic NLP retrieves.
Some Search engines have added statistical information along with pattern-matching search, but the
end result is still the same: too much irrelevant information and too little relevant information.
Verity, Yahoo
These engines employ pattern-matching technology. They produce over-retrieval because they
do not disambiguate words and they lack phrases. If you ask "How do I check on my stocks?",
they retrieve "Check your stock in the cattle yard". If you ask about "math", they retrieve
"aftermath". Cognition's Semantic NLP has little over-retrieval because it disambiguates words
and has over 100,000 phrases. For the query "How do I check on my stocks", Cognition's
Semantic NLP interprets "check" to mean "review" or "update", NOT "note for payment", and
"stocks" to mean shares in companies, NOT bovine animals. Pattern matchers under-retrieve
due to lack of synonyms. If you ask about "pay raises", they won't find "salary increases".
Cognition's Semantic NLP has little or no under-retrieval because of its synonyms and
taxonomy. If you ask about "pay raises", it figures out the meanings of "pay" and "raise", and
then retrieves "salary increases", "wage hikes", etc. Pattern matchers also under-retrieve due to
lack of taxonomy. If you ask about "vehicles", they do not retrieve "car", "ship" or "plane".
Cognition's Semantic NLP has taxonomy, and thus can reason from the general to the
particular.
Google also employs pattern-matching technology, but with a statistical boost. It never
retrieves your information unless the query is worded the same way as the target document.
However, it tracks popularity and places the most popular Web sites first. These tend to be the
sites other users want to look at, so it seems more on target than Search engines without the
popularity measure. Cognition Technologies conducted a comparison of Cognition's Semantic
NLP vs. Google on the www.globalissues.org site, a world political site. There were 50 queries
in the test.
Examples of Google search issues that are resolved by Cognition:
Search query                         Google                                   Cognition Semantic NLP
treasonous behavior                  0 retrievals                             2 relevant, 0 irrelevant
                                     (Google doesn't know synonyms)
casualties of natural disasters      0 relevant, 107 irrelevant               7 relevant, 2 irrelevant
tidal wave                           No reply                                 8 relevant, 0 irrelevant
                                     (Google doesn't know that a tsunami
                                     is a tidal wave)
heating up the globe and             0 relevant, 4 irrelevant                 1 relevant, 0 irrelevant
biodiversity
turmoil in the Middle East and       No reply                                 4 relevant, 0 irrelevant
economic downturns
Autonomy
Autonomy also employs a pattern-matching search engine with some statistical enhancements
using Bayesian inference. In testing Autonomy on the New York Life site, Cognition
did not see the effects of the statistical reasoning. If working, such technology would
recognize topics of documents based upon the probability of occurrence of individual words.
These probabilities are calculated by associating a hand-assigned document topic with the
words that occur in the document. For example, a text that has been assigned the topic
"baseball" will have large numbers of occurrences of the words "ball", "bat", "hit", "field",
"strike", etc. Thus, upon searching for "baseball" in documents that have not been assigned
topics by hand, documents with high numbers of those associated words will be recognized as
being about baseball. This type of technology only increases retrieval for those topics that
have been identified in advance as of interest to many users, and for which additional
processing of trial documents has been performed. Thus it only works well in a small number of
cases, and to get even these cases to work is labor-intensive.
In practice, as Cognition tested it, Autonomy has two problems. First, it doesn't disambiguate words and
at the same time it adds many search terms with statistics, so its retrieval Relevancy is only 0.5%.
Secondly, it has poor Completeness because it has no paraphrasing or thesaurus. It misses 60% of
the relevant material that Cognition's Semantic NLP finds on the same site (www.nylife.com).
VIII. Cognition's Semantic NLP Specifications
Deployment:
In order to deploy Cognition, the user installs the Cognition's Semantic NLP program and the
Cognition's Semantic NLP Web component for Search. After installation is complete, the
Cognition's Semantic NLP service (for Windows) or daemon (for Linux) is started automatically.
This Cognition's Semantic NLP Server functions for indexing, search or both, as set by the user
in the server administration interface.
The first step is to index the target documents, which can be either on the local network or on the
Web. If the documents are local, the user selects which documents to index from the file system. If
the documents are on the Web, the user employs the Cognition Spider to obtain a list of documents.
Once the project has been created, the documents can be indexed.
The second step is the Search itself. At least one Cognition's Semantic NLP Server must be running
with Search enabled. The user creates a script page (ASP and Perl are both supported) as in the
sample provided with the program, and then visits that page in a browser (such as IE or Netscape).
From that page, the user can search.
Platforms (operating systems):
Windows (32-bit) (Windows 2000, Windows XP, Windows 2003)
Unix, Solaris, Red Hat, Linux, FreeBSD, CentOS
RAM: 2 GB preferred, 1 GB minimum
Processor: Minimum 3.0 GHz preferred. Multi-core Hyperthreaded preferred.
Disk: SCSI/SAS drives preferred. Cognition's Semantic NLP requires 140 MB of disk space plus
space for concept indices, which range from one to two times the size of the original texts.
Search Interface: Any HTML browser such as Thunderbird or Internet Explorer, accessing a
script which sends requests to the Cognition's Semantic NLP Server. ASP, Perl, Python, and
PHP 4 are all supported.
API: A C++ API to the Cognition's Semantic NLP system is available. An API to the ASP
Component is provided in the documentation that comes with the software.
Speed: Cognition's Semantic NLP builds Concept Indices at about 1 hour per GB, depending on
the machine, the configuration, and the text itself. It can handle an unlimited number of
queries simultaneously using queuing technology and a sufficient number of computers.
Vocabulary: Cognition's Semantic NLP knows most English areas of interest or domains, except
company-specific terminology, such as product names (which can be learned).
Technical Support: Support is available from 9am to 6pm Pacific time. Extended support with a
3-hour minimum response time can be arranged. With a service contract, Cognition's Semantic
NLP will regularly monitor performance and make system adjustments to further enhance
Relevancy and provide optimum quality control.
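The disk and speed figures in the specifications above can be turned into a rough sizing estimate. The sketch below is only back-of-the-envelope arithmetic under stated assumptions: the program footprint is read as 140 megabytes, 1 GB is taken as 1024 MB, indexing is assumed to scale linearly at about 1 hour per GB, and the space for the original texts themselves is not counted. The function and constant names are hypothetical.

```python
PROGRAM_MB = 140                 # program footprint from the Disk specification
HOURS_PER_GB = 1.0               # "about 1 hour per GB" from the Speed specification
INDEX_RATIO_RANGE = (1.0, 2.0)   # concept indices are 1x to 2x the original text size


def estimate_resources(corpus_gb):
    """Rough indexing time and disk space for a corpus of corpus_gb gigabytes."""
    program_gb = PROGRAM_MB / 1024  # assumption: 1 GB = 1024 MB
    low_ratio, high_ratio = INDEX_RATIO_RANGE
    return {
        "index_hours": corpus_gb * HOURS_PER_GB,
        "disk_gb_min": program_gb + corpus_gb * low_ratio,
        "disk_gb_max": program_gb + corpus_gb * high_ratio,
    }
```

For a 10 GB corpus this gives roughly 10 hours of indexing and between about 10.1 GB and 20.1 GB of disk, with the actual figures depending on the machine, configuration, and text, as the specification notes.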
IX. Conclusion
Cognition's Semantic NLP, when applied to Search technology, returns the most relevant and
complete results in the industry by employing linguistic techniques and huge semantic
databases. Pattern matching and statistically-based technologies lack the knowledge of
language that enables Cognition's Semantic NLP to outperform them so dramatically. Because
of its vast knowledge of English, little or no customization is required. Search functions like
those users are accustomed to are included, such as Boolean search (with conceptual Booleans)
and fuzzy search. It comes with an easy-to-use indexer, and sample scripts for browser
searching. Cognition's Semantic NLP has an API for embedding it in other software platforms.
After a short time using Cognition Semantic NLP's patented technology, users are loath to go
back to pattern matchers.
Appendix: Glossary
ambiguous word: an ambiguous word has more than one meaning. "strike" is ambiguous
because it can mean "to hit", "to ignite", "to walk out of a job", "a good pitch in baseball", or
other meanings. An unambiguous word has only one meaning, as in "deer".
Bayesian: In search, a technique for classifying documents that uses the statistical theorem of
Bayes. In brief, this theorem explains what the probability of one word is, given another. So if
you've seen "bat" in a document, the probability of seeing "ball" is higher than the probability
of seeing "rock".
concept thesaurus: A traditional thesaurus lists all the synonyms of words in meaningful
groupings, e.g.: "strike hit beat" as one group, "strike walkout protest" as another group. A
concept thesaurus lists MEANINGS of words in groups, so if the first meaning of "strike" means
"to hit or beat", then one thesaural group is "strike1 hit3 beat2" (assuming that the third
meaning of "hit" and second meaning of "beat" mean the same thing). If the second meaning
of "strike" means "walkout or protest", then another, independent, thesaural group would be
"strike2 walkout1 protest2". Thus a concept thesaurus maps meaning to meaning, while a
traditional thesaurus maps word to word, with no way of deciding which instance of a given
word should be considered synonymous with which other words.
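The meaning-to-meaning mapping in the concept thesaurus entry can be pictured as groups of (word, sense number) pairs. The sketch below is an illustrative data structure only, using the "strike" groups from the definition; it is an assumption about one possible representation, not a description of how Cognition stores its thesaurus.

```python
# Each group holds word senses that share one meaning, written as
# (word, sense_number), e.g. "strike1 hit3 beat2" from the glossary entry.
CONCEPT_GROUPS = [
    {("strike", 1), ("hit", 3), ("beat", 2)},
    {("strike", 2), ("walkout", 1), ("protest", 2)},
]


def concept_synonyms(word, sense):
    """Return the other word senses that share a meaning with (word, sense)."""
    for group in CONCEPT_GROUPS:
        if (word, sense) in group:
            return sorted(group - {(word, sense)})
    return []  # this sense is not listed in any group
```

Looking up `concept_synonyms("strike", 1)` yields only the "hit or beat" senses, while `concept_synonyms("strike", 2)` yields only the "walkout or protest" senses, which is the separation a word-to-word thesaurus cannot make.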
compound: a word that is formed from two or more identifiable words, e.g. "blackbird",
"cookbook".
concept: a word meaning (see word meaning).
computational: the property of being an action carried out by a computer.
derived form: a word that is derived through morphological rules, such as "rerun" or
"derivation".
domain: a subject area of human activity which has a subvocabulary of its own, such as law or
medicine. Also, "domain" refers to a basic part of a URL address which breaks down into three
pieces, "site.domain.suffix", as in "www.cognition.com".
fine-granularity query: a fine-grained query is one which is looking for detailed information.
For example, a fine-granularity query would be "What is the cure for leaf wilt on fuchsias?". In
contrast, a coarse-granularity query (or general query) would be "How can I buy a car in Los
Angeles?"
GUI: graphical user interface, such as the basic interface for Windows.
idiom: a fixed distinctive expression whose meaning cannot be deduced from the combined
meanings of its actual words, such as "kick the bucket".
incremental indexing: the ability to add a new document index to an existing document base
index. This ability is an advance over many available indexers, which require users to
completely reindex an entire document base when new documents are added to it.
index: a computational representation of the location of words or concepts in a document
base. This can include details such as word and sentence positions, or not.
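The index and incremental indexing entries can be illustrated with a small positional inverted index. This is a generic word-level sketch with hypothetical class and method names; it indexes plain words rather than concepts and does not represent Cognition's Concept Indices.

```python
from collections import defaultdict


class PositionalIndex:
    """Maps each word to the documents and word positions where it occurs."""

    def __init__(self):
        # word -> {doc_id: [positions]}
        self.postings = defaultdict(dict)

    def add_document(self, doc_id, text):
        """Incremental indexing: add one document without rebuilding the index."""
        for position, word in enumerate(text.lower().split()):
            self.postings[word].setdefault(doc_id, []).append(position)

    def lookup(self, word):
        """Return {doc_id: [positions]} for a word, or {} if it never occurs."""
        return self.postings.get(word.lower(), {})
```

Documents are added one at a time with `add_document`, so a new document extends the existing index instead of forcing a full reindex of the document base.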
latent semantic indexing: A statistical technique for classifying documents based on counting
co-occurrences of words. Documents that share many words are considered to have similar
content and are classed together.
linguistic algorithms: rules of language applied as computer algorithms. For example, given a
word that pluralizes like "bat", add an "s" to make it plural, or remove an "s" to find the base
form or stem.
linguistics: the study of human language including elements, structure, rules and history.
morphological feature: a property of a word that indicates how it changes form in specific
syntactic situations. For example, the fact that the past tense of "catch" is "caught" is
expressed in the dictionary with a morphological feature (see morphology).
morphology: the rules of word formation. Such rules dictate that the past tense of "cook" is
"cooked", while the past tense of "catch" is "caught", and that the plural of "bat" is "bats",
while the plural of "mouse" is "mice". Morphology also controls the derivation of words from
other words, such as the change from "derive" to "derivation", or from "run" to "rerun".
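The regular and irregular forms in the morphology and linguistic algorithms entries can be sketched as a default rule plus an exception table, which plays the role of the "morphological feature" stored in the dictionary. The tables and function names below are illustrative assumptions covering only the glossary's own examples; the naive suffix rules would mishandle many real words (for instance, "bus" would lose its "s").

```python
# Exception tables standing in for morphological features in a dictionary.
IRREGULAR_PLURALS = {"mouse": "mice"}
IRREGULAR_PAST = {"catch": "caught"}
IRREGULAR_STEMS = {"mice": "mouse", "caught": "catch", "ran": "run"}


def pluralize(noun):
    """Regular rule: add "s"; irregular nouns come from the exception table."""
    return IRREGULAR_PLURALS.get(noun, noun + "s")


def past_tense(verb):
    """Regular rule: add "ed"; irregular verbs come from the exception table."""
    return IRREGULAR_PAST.get(verb, verb + "ed")


def stem(word):
    """Undo the regular rules above, checking irregular forms first."""
    if word in IRREGULAR_STEMS:
        return IRREGULAR_STEMS[word]
    if word.endswith("ed") and len(word) > 4:
        return word[:-2]
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word
```

With these tables, `pluralize("bat")` gives "bats" and `pluralize("mouse")` gives "mice", while `stem("bats")` and `stem("caught")` recover "bat" and "catch".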
overload: retrieval of many documents that are not relevant in response to a query; poor
relevancy; poor precision.
parsing: combining the words in a sentence into a structure which elucidates or shows the
syntactic role of each word in the sentence. For example, in "John loves pizza", the parser
creates a structure showing that "John" is the subject and "loves pizza" is the predicate, and
that within the predicate, "love" is the verb, and "pizza" is the direct object of the verb.
pattern match (or string pattern matching): In search, deciding to retrieve a document based
on an exact match of strings in the query and document. For a query "what is the form of the
law?", a pattern-matching search engine would match documents containing phrases like
"conforms to the law", "...inform him that the lawn", "the laws of nature dictate that
water forms crystals at a temperature of".
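The false matches in the pattern match entry are easy to reproduce with plain substring tests. In the sketch below, the reduction of the query to the terms "form" and "law" (dropping common words) is an assumption, and both function names are hypothetical; the whole-word variant is included only for contrast.

```python
import re

QUERY_TERMS = ["form", "law"]  # assumed content terms of "what is the form of the law?"


def substring_match(terms, document):
    """Pattern matching: every term must appear somewhere as a raw string."""
    text = document.lower()
    return all(term.lower() in text for term in terms)


def whole_word_match(terms, document):
    """Stricter variant: every term must appear as a complete word."""
    text = document.lower()
    return all(
        re.search(r"\b" + re.escape(term.lower()) + r"\b", text) for term in terms
    )
```

`substring_match` accepts "conforms to the law" and "inform him that the lawn" because "form" and "law" occur inside longer words. `whole_word_match` rejects both, but it still knows nothing about which meaning of "form" the user intended.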
phrase: a meaningful sequence of words. "bok choy" is a phrase, while "the and" is not.
phrase parser: a computer program that reads in text and determines which subsequences of
sentences are phrases.
precision/recall: This is the same thing as relevancy/completeness. Precision is the percentage
of a set of retrievals that is relevant. If there are ten retrievals, and 9 of them are relevant to
the query, then precision is 90%. Recall is the percentage of the relevant documents in the
target database that is retrieved in response to a query. If there are 10 documents that exist in
the target database that would be relevant to a query, and the software retrieves 9 relevant
documents, then recall is 90%.
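The two percentages defined above can be computed directly from sets of document identifiers. This is a minimal sketch with hypothetical function names; returning 0.0 when a set is empty is a convention chosen here, since the glossary does not define that case.

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant (relevancy)."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not retrieved:
        return 0.0  # convention chosen for this sketch
    return len(retrieved & relevant) / len(retrieved)


def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved (completeness)."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not relevant:
        return 0.0  # convention chosen for this sketch
    return len(retrieved & relevant) / len(relevant)
```

With ten retrievals of which nine are relevant, `precision` returns 0.9, matching the 90% example; with ten relevant documents of which nine are retrieved, `recall` also returns 0.9.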
reasoning from the general to the particular: reasoning in a taxonomic tree from higher to
lower nodes. For example, reasoning in the taxonomy example under the definition of
"taxonomy", from "vehicle" to "car", or "vehicle" to "boat" (see taxonomy).
relevant: In search, document content that addresses or is similar to the content of a query;
apt; on-target.
relevancy/completeness: This is the same thing as precision/recall. Relevancy is the
percentage of a set of retrievals that is relevant. If there are ten retrievals in response to a
query, and 9 of them are relevant, then relevancy is 90%. Completeness is the percentage of
the relevant documents in the target database that is retrieved in response to a query. If there
are 10 documents that exist in the target database that would be relevant to a query, and the
software retrieves 9 relevant documents, then completeness is 90%.
semantics: In linguistics, the rules for determining meaning of words, sentences and
discourses. The meanings of individual words are typically listed in a lexicon or dictionary in a
computational linguistic system.
sense: An individual meaning of an ambiguous word. "strike" meaning "to hit or beat" is one
sense of "strike".
spider: a computer program that searches the internet by connecting from one URL to the
next via the links in it.
string: a sequence or pattern of letters. A string can be a word or not. "XXX" is a string, "cat"
is a string.
synograph: an alternate spelling of a word, such as "center" or "centre".
syntactic feature: a property of a word that indicates how it functions in grammar. For
example, the fact that a word is a noun is a syntactic feature. The fact that a verb requires a
direct object is a syntactic feature. A verb which requires an object, used without one, forms
ungrammatical sentences, as in "The girl wants."
underload: retrieval of very few, if any, relevant documents in response to a query; poor
completeness; poor recall.
taxonomy: a hierarchy of ISA relationships between concepts that form a tree, with the top of
the tree the most general concept. For example, a car is a motor vehicle in the vehicle
taxonomy shown at the end of this glossary.
thesaural enhancer: A type of search engine that uses standard, non-conceptual thesaural
groups to enhance query terms. If the query contains the word "strike", the thesaural enhancer
adds terms from all the thesaural groups "strike" is a member of. So it would search on "hit,
beat", but also "walkout, protest", "ignite", etc. Note that many of the synonyms are
ambiguous here. The thesaural enhancer doesn't distinguish meanings, and doesn't know
which meaning of "hit" or "beat" is relevant.
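The over-expansion described in the thesaural enhancer entry can be seen in a few lines. The groups below are taken from the "strike" example; the data layout and function name are assumptions made for illustration, not any vendor's implementation.

```python
# Word-to-word (non-conceptual) groups: no sense numbers are recorded.
THESAURAL_GROUPS = [
    {"strike", "hit", "beat"},
    {"strike", "walkout", "protest"},
    {"strike", "ignite"},
]


def enhance_query(terms):
    """Add every word from every group that contains a query term."""
    expanded = set(terms)
    for term in terms:
        for group in THESAURAL_GROUPS:
            if term in group:
                expanded |= group
    return sorted(expanded)
```

A query on "strike" expands to all of "beat", "hit", "ignite", "protest", and "walkout" at once, regardless of which meaning the user had in mind, which is the source of the irrelevant retrievals.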
word meaning: an individual sense of a word. A sense is a description or mental picture of the
objects referred to by the word sense. For example, the word "bank" can either mean "a place
where money is stored", or it can mean "the side of a river". The meaning of the sense "where
money is stored" is a description of a typical bank with tellers, counters, little windows to the
tellers, a guard, a vault, etc.
word stem: the base form of a word with no morphological rules applied. "bat" is a base form,
"bats" is not. "run" is a base form, "ran" is not and "rerun" is not.
Vehicle taxonomy:
vehicle
    motor vehicle
        car
        truck
    water vehicle
        boat
        ship
    space vehicle
        rocket
        spaceship
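The vehicle tree above, together with the "reasoning from the general to the particular" entry, can be sketched as a child-to-parent table. The grouping of leaves under each branch follows the most natural reading of the diagram and is an assumption, as are the function names.

```python
# ISA links: each concept points to its more general parent.
PARENT = {
    "motor vehicle": "vehicle",
    "water vehicle": "vehicle",
    "space vehicle": "vehicle",
    "car": "motor vehicle",
    "truck": "motor vehicle",
    "boat": "water vehicle",
    "ship": "water vehicle",
    "rocket": "space vehicle",
    "spaceship": "space vehicle",
}


def is_a(concept, ancestor):
    """True if concept equals ancestor or lies below it in the tree."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = PARENT.get(concept)  # None once the root has been passed
    return False


def descendants(concept):
    """Reason from the general to the particular: list all lower nodes."""
    result = []
    for child, parent in PARENT.items():
        if parent == concept:
            result.append(child)
            result.extend(descendants(child))
    return result
```

Here `is_a("car", "vehicle")` is true while `is_a("car", "water vehicle")` is false, and `descendants("water vehicle")` lists "boat" and "ship".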