60
A Common Concept Description A Common Concept Description of Natural Language Texts as of Natural Language Texts as a Foundation of Semantic a Foundation of Semantic Computing Computing on the Web on the Web Mitsuru Ishizuka Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Creative Informatics & Dept. of Info. and Communication Eng. Dept. of Info. and Communication Eng. School of Information Science and School of Information Science and Technology Technology

Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

  • Upload
    tauret

  • View
    35

  • Download
    1

Embed Size (px)

DESCRIPTION

A Common Concept Description of Natural Language Texts as a Foundation of Semantic Computing on the Web. Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng. School of Information Science and Technology. Semantic Computing Initiative. - PowerPoint PPT Presentation

Citation preview

Page 1: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

A Common Concept DescriptionA Common Concept Descriptionof Natural Language Texts as of Natural Language Texts as

a Foundation of Semantic Computinga Foundation of Semantic Computingon the Webon the Web

Mitsuru IshizukaMitsuru Ishizuka Dept. of Creative Informatics &Dept. of Creative Informatics &

Dept. of Info. and Communication Eng.Dept. of Info. and Communication Eng.School of Information Science and TechnologySchool of Information Science and Technology

Page 2: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

2

Semantic Computing InitiativeSemantic Computing Initiative

The aims of CDL are 1) to realize machine understandability of Web text contents, and 2) to overcome language barrier on the Web.

lay a foundation that allows computers to understand the semantic meaning of Web contents so that they can perform semantic computing on the Web.

Page 3: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

3

Major Differences from Semantic WebMajor Differences from Semantic Web

Semantic WebSemantic Web

Target of representation: Meta-data extracted from Web contents.

Domain-dependent ontologies (which cause the difficulty of wide inter-boundary usage)

RDF / OWL (description logic is hard for ordinary people to understand)

Semantic Computing Semantic Computing InitiativeInitiative Target of representation:

Semantic concepts expressed in texts.

Universal vocabulary (+ additional specific vocabulary in a domain if necessary), and pre-defined relation set.

CDL.nl (richer than RDF)

Tim Berners-Lee says that: “the Data Web” is more adequate rather than “the Semantic Web”. (2007)

Main body:Institute of Semantic Computing (ISeC) Institute of Semantic Computing (ISeC) in Japanin JapanInt’l Standardization Activity: W3C Common Web Language(CWL)-XG W3C Common Web Language(CWL)-XG

Page 4: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

4

Incubator Group Activity at W3CIncubator Group Activity at W3Cfrom Oct. 2006 to March 2008from Oct. 2006 to March 2008

Page 5: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

5

22ndnd Incubator Group at W3C Incubator Group at W3C from May 2008from May 2008

Page 6: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

6

CDLs and Semantic WebCDLs and Semantic Web

Tim Berners-Lee(2007): The Semantic Web The Data Web (more adequate)

Page 7: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

7

Another BroaderAnother Broader View of View of CDL DevelopmentCDL Development In 1960s – 1970s The foundation on the common representation and

manipulation (retrieval) of Data. Database

In 2000s – 2010s The foundation of the common representation and

manipulation of Semantic Information. Common Concept Base

It is preferable that this is language independent; in other words, Computer Esperanto Language which is understandable by computers.

Page 8: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

8

Functions and Supporting StandardsFunctions and Supporting Standards

Page 9: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

9

From Machine TranslationFrom Machine Translation

Pivot method

Transfer method

UNL UNL (Universal(UniversalNetworking Language)Networking Language)

CDLCDL (Concept(ConceptDescription Language)Description Language)

ComputerComputerEsperantoEsperantoLanguageLanguage

could be

English Japanese Chinese

PivotPivotLanguageLanguage

Minimal sufficient relations have been chosen to represent the surface-level concept meaning of texts.

Page 10: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

10

UNL (Universal Networking Language)UNL (Universal Networking Language) The development started in1997

at the United Nations Univ. (Tokyo). The chief scientist has been Dr. Hiroshi Uchida. It is now continuously developed under the UNDL foundation.

The purpose is to let people in the world exchange and share textural info. on the Web beyond language barrier.

The design is based on the results of Machine Translation (especially, Pivot method) and Electric Dictionaries.

There have been activities wrt English, Japanese, Chinese, Spanish, French, Arabic, etc.

Page 11: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

11

The defining method of one unique The defining method of one unique sense of a word in sense of a word in UW UW (( Patent of UN UniPatent of UN Univ.v. ))

Defining categoryswallow(icl>bird) the bird

“One swallow does not make a summer”swallow(icl>action) the action of swallowing

“at one swallow”swallow(icl>quantity) the quantity

“take a swallow of water”

Defining possible case relationsspring(agt>thing,obj>wood) bending or dividing somethingspring(agt>thing,obj>mine)) blasting somethingspring(agt>thing,obj>person, escaping (from) prison src>prison))spring(agt>thing,gol>place) jumping up

“to spring up”spring(agt>thing,gol>thing) jumping on

“to spring on”spring(obj>liquid) gushing out

“to spring out”

Page 12: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

12

UWUW ((Universal WordsUniversal Words)) in UNL in UNLUniversal Worduw{(equ>Universal Word)}adjective concept{(icl>uw)}    uw(aoj>thing{,and>uw,ben>thing,cao>thing,cnt>uw,cob>thing,con>uw,coo>uw,dur>period,man>    how,obj>thing,or>uw(aoj>thing),plc>thing,plf>thing,plt>thing,rsn>uw(aoj>thing),rsn>do,icl>adjective concept})

Achaean({icl>uw(}aoj>thing{)})Afghan({icl>uw(}aoj>thing{)})African({icl>uw(}aoj>thing{)})African-American({icl>uw(}aoj>thing{)})Ainu({icl>uw(}aoj>thing{)})Alaskan({icl>uw(}aoj>thing{)})Albanian({icl>uw(}aoj>thing{)})Aleutian({icl>uw(}aoj>thing{)})Alexandrian({icl>uw(}aoj>thing{)})Algerian({icl>uw(}aoj>thing{)})Altaic({icl>uw(}aoj>thing{)})American({icl>uw(}aoj>thing{)})Anglian({icl>uw(}aoj>thing{)})Anglo-American({icl>uw(}aoj>thing{)})Anglo-Catholic({icl>uw(}aoj>thing{)})Anglo-French({icl>uw(}aoj>thing{)})Anglo-Indian({icl>uw(}aoj>thing{)})Anglo-Irish({icl>uw(}aoj>thing{)})Anglo-Norman({icl>uw(}aoj>thing{)})Arab({icl>uw(}aoj>thing{)})Arab-Israeli({icl>uw(}aoj>thing{)})Arabian({icl>uw(}aoj>thing{)})Arabic({icl>uw(}aoj>thing{)})

40,000 lexicons are open to public.

The full vocabulary includes 200,000 lexicons as of 2007.

Page 13: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

13

CDLsCDLs CDL.coreCDL.core

defines the basic format.

CDL.nlCDL.nl    (or Common Web Language)(or Common Web Language) describes every possible concept expressed in natural language text

in any languages. It provides a concept description scheme wrt word, phase, sentence and documents in any languages, based on the CDL concept model.

A basic vocabulary set including relation vocabulary is given.

CDL.jpn, CDL.eng, CDL.chi, CDL.spa, etc.CDL.jpn, CDL.eng, CDL.chi, CDL.spa, etc.

Articulate Japanese Articulate Japanese (( 明晰日本語明晰日本語 )) defines a style of Japanese sentence expressions which suppress

ambiguity referring to CDL.nl

Page 14: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

14

CDL.coreCDL.core Basic Description ElementsBasic Description Elements :: EntityEntity       NodeNode

Elementary Entity Composite Entity       Hyper-Hyper-

NodeNode

RelationRelation                 LinkLink

Attribute-Value pairsAttribute-Value pairs can be added to the entity (and the relation).

These are quasi-relations; if the value takes a complex entity, then we can treat it as a relation.

Page 15: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

15

Representation with CDL.nlRepresentation with CDL.nl <John reported to Alice that he bought a computer yesterday.> {#A01 Event tmp=‘past’; {#B01 Event tmp=‘past’; {#b01 buy;} {#b02 computer ral=‘def’;} {#b03 yesterday;} [#b01 agt John] [#b01 obj #b02] [#b01 tim #b03] } {#John John;} {#Alice Alice;} {#a01 report;} [#a01 agt #John] [#a01 gol #Alice] [#a01 obj #B01] }

report#a01John#

Alice#

buy#b01computer#b02 ral=‘def’

yesterday#b03

Event#A01tmp=‘past’ Event#B01

tmp=‘past’

golobj

agtagt

tim

obj

Page 16: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

16

Semantic Role Labels in PropBankSemantic Role Labels in PropBank

Arg0 (prototypical agent) Arg1 (prototypical patient) Arg2 (indirect object/benefactive/instrument/attribute/end state) Arg3 (start point/benefactive/instrument/attribute) Arg4 (end point) Arg5 ( ) TMP (time) LOC (location) DIR (direction) MNR (manner) PRP (purpose) CAU (cause) MOD (modal verb) NEG (negative marker) ADV (general-purpose modifier) DIS (discourse particle and clause) PRD (secondary predication)

The focus is on Predicate-Argument Structure.

These are defined wrt each word sense.

Ex) buy:: Arg0: buyer Arg1: thing bought Arg2: seller (bought-from) Arg3: price paid Arg4: benefactive (bought-for)

This set is not sufficient for representing every concept expressed in natural language texts. It cannot be used for every language due to its language (English) dependency.

Page 17: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

17

CDL.nl Relations (1)CDL.nl Relations (1) RELATIONRELATION

ElementalReration  要素関係 FunctionalRelation  機能的観点 [AgentRelation  主体関係 ]

agt ( agent :動作主) cag ( co-agent :並行動作主) aoj ( thing with attribute :属性主) cao ( co-thing with attribute :並行属性主) ptn ( partner :相手)

[PatientRelation  被行為体関係 ] obj ( affected thing :対象) cob ( affected co-thing :並行対象)  opl ( affected place :場所対象)  ben ( beneficiary :受益者)

[InstrumentRelation  道具関係 ] ins ( instrument :道具) mat (material :材料 ) met ( method or means :方法)

[PlaceRelation  場所関係 ] plc ( place :場所) plf ( initial place :起点) plt ( final place :終点) vip (intermediate place, via place :場

所経由)[StateRelation  状態関係 ]

sta (state :状態 ) src ( source, initial state :始状態) gol ( goal, final state :終状態) vis ( intermediate place or state :経

由)[TimeRelation  時間関係 ]

tim ( time :時間) tmf ( initial time :始時間) tmt ( final time :終時間) dur ( duration :期間)

Page 18: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

18

CDL.nl Relations (2)CDL.nl Relations (2)[SceneRelation 場面関係 ]

scn (scene :場面 ) vic ( via scene :場面経由 )

[CausalRelation 原因関係 ] con ( condition :条件) pur ( purpose or objective :目的) rsn ( reason :理由)

[OrderRelation 順序関係 ] coo ( co-occurrence :同起) seq ( sequence :先行)

[MannerRelation  様態関係 ] mal (qualitative manner :質的仕方) mat (quantitative manner :量的仕方) bas ( basis for expressing a standar

d :基準)[ModificationRelation 限定関係 ]

mod (modification :限定) qua (quantity :量的限定)

pos ( possessor :所有者) cnt ( content, namely :内容) nam ( name :名前) per ( proportion, rate or distribution :

単位) fmt (range/from to :範囲) frm (origin :起源点) to (destination :目的点)

[LogicalRelation 論理関係 ] and ( conjunction :連言的) or (disjunction, alternative :選言的) not (complement :補集合)

[ConceptRelation 概念関係 ] equ (equivalent :同義) icl (included / a kind of :上位) tof (type-of :具体化) pof ( part-of :部分)

Page 19: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

19

CDL.nl Relations (3)CDL.nl Relations (3) InterThingOrInterEventRelation  間事

物・間事象関係

[ConnectingRelation 接続関係 ] cau (causal :順接) adv (adversative :逆接) adt (aditive :添加) cot (contrastive :対比) par (parallel :同列) att (attached :補足)

[RefferingRelation 参照関係 ] rfi (reffered by identically :同一参

照 ) rfp (reffered by partially :部分参

照) rfw (reffered by wholly :包含参照)

AttensionRelation 注目関係 ent (entry :主概念) foc (focus :焦点) qfo (question focus :質問焦

点) tpc (topic :トピック) com (comment :コメント)

Page 20: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

20

Arg0 (prototypical agent) agt (agent), cag (co-agent), aoj (thing with attribute), cao (co-thing with attribute) Arg1 (prototypical patient) obj (affected thing), cob (affected co-thing) Arg2 (indirect object/benefactive/instrument/attribute/end state) ---, ben (beneficiary), ins (instrument), mat (material), met (method or means), sta (state), gol (goal, final state) Arg3 (start point/benefactive/instrument/attribute) plf (initial place), ben (beneficiary), ins (instrument), mat (material), met (method or means), sta (state) Arg4 (end point) plt (final place), to (destination) TMP (time) tim (time), tmf (initial time), tmf (final time), dur (duration) LOC (location) plc (place) DIR (direction) to (destination) MNR (manner) mal (qualitative manner), mat (quantitative manner) PRP (purpose) pur (purpose or objective) CAU (cause) rsn (reason) MOD (modal verb) an attribute in CDL.nl NEG (negative marker) an attribute in CDL.nl ADV (general-purpose modifier) mod (modification), qua (quantity), pos (possessor), cnt (content), nam (name), per (proportion, rate or distribution), fmt (range/from to), frm (origine) DIS (discourse particle and clause) [inter-sentence relation] PRD (secondary predication) [unique in English]

Rough Correspondence between Rough Correspondence between Semantic Relations of PropBand and Semantic Relations of PropBand and CDL.nl CDL.nl (1)(1)

Page 21: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

21

Rough Correspondence between Semantic Rough Correspondence between Semantic Relations of PropBand and CDL.nl Relations of PropBand and CDL.nl (2)(2)

Other CDL.nl Relations [AgentRelation]

ptn (partner): an indispensable non-focused initiator of an action

Ex) He competes with John. Mary collaborates with him.

[PatientRelation]opl (affected place): a place in focus affected by an

event. Ex) He cut the paper in middle in the room.

[PlaceRelation]vip (intermediate place, via place)

[StateRelation]src (source, initial state)vis (instermediate place or state)

[SceneRelation]scn(scene): a scene where an event occurs, or state is

true, or a thing exists. A scene is different from plc in that plc is the real

place something happens, whereas scn is an abstract or metaphorical world.

Ex) He won a prize in a contest. He played in the movie.

vic (via scene)

CDL.nl’s Relations other than Predicate-Argument Relations

[OrderRelation]coo (co-occurrence) seq (sequence)

[LogicalRelaion]and (conjunction) or (disjunction, atternative)

not (complement) [ConceptRelation]

equ (equivalent) icl (included/a kind of)

tof (type of) pof (part of) [ConnectingRelaion]

cau (causal) adv (adversative) adt (additive)

cot (contrastive) par (parallel) att (attached) [ReferringRelaion]

rfi (referred by identically)

rfp (referred by partially) rfw (referred by wholly) [AttensionRelation]

ent (entry) main (main element) in Connexor

foc (focus) qfo (question focus) tpc (topic)

com (comment)

Page 22: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

22

Rich Attributes in UNL and CDL.nl Rich Attributes in UNL and CDL.nl

Time with respect to speaker @past @present @future Writer’s view on aspect of event @begin @complete @continue @custom @end @experience @progress @repeat @state Writer’s view of reference

@generic @def @indef @not @ordinal Writer’s view of emphasis, focus and topic

@emphasis @entry @qfocus @theme @title @topic Writer’s attitudes

@affirmative @confirmation @exclamation @imperative @interrogative @invitation @politeness @respect @vocative Writer’s View of reference @generic @def @indef @not @ordinal

Express subjectivity evaluation of the writer/speaker for the sentence. Ex.) tense, aspect, mood, etc.

Writer’s feeling and judgements @ability @get-benefit @give-benefit

@conclusion @consequence @sufficient @grant @grant-not @although @discontented @expectation @wish

@insistence @intention @want @will @need @obligation @obligation-not @should @unavoidable @certain @inevitable @may @possible @probable @rare @regret @unreal

@admire @blame @contempt @regret @surprised @troublesome

Describing logical characters and properties of concepts@transitive @symmetric @identifiable@disjoint

Modifying attribute on aspect @just @soon @yet @not

Attribute for convention @passive @pl @angle_bracket @brace

@double_parenthesis @double_quote @parenthesis @single_quote @square_bracket

Page 23: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

23

Discourse (Inter-sentence) Relations Discourse (Inter-sentence) Relations are missing in current CDL.nlare missing in current CDL.nl

derivation causes conditional inference purpose trigger

compromise conflict contrast unconditional

Discourse Relations at ISO/TC37/SC4/TDG3 (34 types) detail element example extraction general-specific minimum part process-step Restatement

constraint supplement

background content evaluation

comparison disjunction dissimilar manner otherwise proportion similar strongComparison

Page 24: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

24

Concept Description LevelsConcept Description Levels

There are several choices for the deep semantic-level description depending on applications. On the other hand, a certain consensus has been made wrt “Concept Description” which is slightly below the surface level, through decades-long researches on NLP, machine translation and electric dictionaries.

Whereas a complete consensus has not been achieved yet regarding the Concept Description level and its description scheme, it is meaningful to set up a common concept description format as an international standard today.

Surface Level

Deep SemanticLevel

ConceptDescription

Page 25: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

25

Hierarchical Construction of Hierarchical Construction of Concept Representation in CDL.nlConcept Representation in CDL.nl

elementary thing/entitycorresponding to disambiguated word sense

composite entity

single event(single sentence)consisting ofproposition and modality components

compositeconcept/event(complex sentence)

situation (discourse)

predicate, case components, predicate-modification components, etc.

temporal and causal relations, etc., and coreference

agent-patient relation, phrasal relation, etc.

Page 26: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

26

Current Major Issues in CDL.nlCurrent Major Issues in CDL.nl Semi-automatic Conversion from Text. (Text generation from CDL.nl is not so difficult.)

Semantic Retrieval of CDL Data (The design of a CDL Query Language (CDQL)

and its processing mechanism )

Killer Application(s) -- information exchange and share beyond language barrier. -- semantic patent document retrieval.

23/04/22 26

Page 27: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

27

Approaches for Generating CDL DataApproaches for Generating CDL Data Manual Coding & Editing

Even in this case, a graphical input editor is necessary.

Graphical Input & Editing ( Hasida’s Semantic Authoring)

Some Manual Tagging to Text, then Conversion into CDL.

Semi-automatic Conversion from Text (1) Graphical interface for selecting a right one among possible

candidates.

Semi-automatic Conversion from Text (2) Manual disambiguation of the word sense (a pull-down menu

selection), then automatic conversion into CDL. Full Automatic Conversion (ultimate goal)

Our currentapproach

An approach taken at UNDL foundation

Page 28: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

28

Semantic Authoring Semantic Authoring (by K. Hasida):(by K. Hasida): A Graphical Input ApproachA Graphical Input Approach

Coarse Grain

Fine Grain

Page 29: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

29

Semantic Parsing Language processing is going through:

Syntactic parsing Dependency parsing Shallow semantic parsing

Semantic Role Labeling Given a sentence:

The brave soldiers fought with their enemies for their country in the War

Assign predicate-argument roles to sentence elements. (Who did What to Whom, When, Where, Why, How, etc.)

[ARG0The brave soldiers] [rel fought] with [ARG1-with their enemies]

for [ARG2-for their country] in [ARGM-loc the War] (in PropBank) Corpora: PropBank, FrameNet, …

Page 30: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

30

Dependency Parser as Dependency Parser as a lower basisa lower basis of our Semantic Analysisof our Semantic Analysis

Dependency Functions close to the semantic rolemain (main element)

agt (agent) : The agent by-phase in passive sentences. Ex) The dog was chased by the boy.

ins (instrument)

tmp (time)

dur (duration) Ex) ...experience in the past 10 years.

man (manner)

loc (location)

sou (source) Ex) ... move away from the street.

goa (goal) Ex) ... shift to a full power.

pth (path) Ex) ... travel from Tokyo to Beijing.

cnt (contingency (purpose or reason)) Ex) ... unable to say why he was too ....

cod (condition)

qn (quantifier)

Syntactic Functionspcomp (prepositional complement) Ex) They are in that red car.phr (verb particle) Ex) She looked up the word in the dictionary.subj (subject)obj (object)comp (subject complement) Ex) John remains a boy. dat (indirect object) Ex) John gave her an apple.oc (object complement) Ex) John called him a fool.corpred (copredicative) Ex) John regards him as foolish.com (comitative) Ex) Drinking with you is nice.voc (vocative) Ex) John, come here!frq (frequency)qua (quantity)meta (clause adverbial) Ex) So far, he has been ….cla (clause initial adverbial) Ex) Under his guidance, they can ....ha (heuristic prepositional phrase attachment)

Ex) escape trough ..., fight for ...det (determiner)neg (negator) notattr (attributive nominal) Ex) industrial editormod (other postmodifer) Ex) … of …ad (attributive adverbial) Ex) So much for modern technology, cc (coordination) Ex) and

Connexor Machines Text Analyser

Page 31: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

31

Named Entity RecognitionNamed Entity Recognition In Connexor Machinese Text Analyser

+ org (organization, company) + loc (location) + ind (individual) + name (name) + role (occupation, title)

This info. is useful as a lexical feature for the semantic parsing.

Page 32: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

32

Conversion of text into CDL.nl Conversion of text into CDL.nl through Shallow Semantic Parsingthrough Shallow Semantic Parsing Original text The records retrieved in answer to queries become information that can be

used to make decisions . #

Separate each word with a ID The \ w37 records \ w38 retrieved \ w39 in \ w40 answer \ w41 to \ w42 queries \ w43

become \ w44 information \ w45 that \ w46 can \ w47 be \ w48 used \ w49 to \ w50 make \ w51 decisions \ w52 . \ w53 #

Connexor Analiser’s output det: ( w38 w37 ) subj: ( w44 w38 ) mod: ( w38 w39 ) loc: ( w39 w40 ) pcomp: ( w40 w41 ) mod: ( w41 w42 ) pcomp: ( w42 w43 ) main: ( w36 w44 ) comp: ( w44 w45 ) subj: ( w47 w46 ) v-ch: ( w48 w47 ) v-ch: ( w49 w48 ) mod: ( w45 w49 ) pm: ( w51 w50 ) cnt: ( w49 w51 ) obj: ( w51 w52 ) obj: ( w51 w53 ) #

Hand-coded partial CDL.nl relations obj(w39 w38) gol(w39 w41) pur(w39 w43) aoj(w44 w38) obj(w44 w45) obj(w49 w45) pur(w49 w51) obj(w51 w52) #

Page 33: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

33

Relation/Role Set ComparisonRelation/Role Set Comparison PropbankPropbank

describes how a verb relates to its arguments. FrameNetFrameNet

describes how to describe words with its arguments in a related common scenario. Common disadvantages of FrameNet & Propbank role set:

The set covers only predicate-argument roles; they don’t consider any other types of relationships between entities.

CDL.nlCDL.nl CDL relation set describes how words are correlated and what the meanings of

their relationships are. Advantages of CDL.nl relation set

Each relation in the set is pre-defined along with distinctive information from other similar relations.

It describes not only predicate-argument relations, but also those between each pair of entities there exists a meaningful relationship. Thus it has better coverage.

The set has been chosen so that every concept expressed in texts can be sufficiently encoded.

It is universal, i.e., language independent, and can be applied to any language.

Page 34: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

34

CDL Relation SetCDL Relation Set Used to describes shallow semantic structure of text. Relations have been chosen to be able to sufficiently represent the

semantic concepts of texts, and are predefined. The set of relations contains all relation types which are organized

roughly into three groups: intra-event relations (22)

agt(agent), aoj(thing with attribute), cag(co-agent), cao(co-thing with attribute), ptn(partner), …..

inter-entity relations (13) and(conjunction), con(condition), seq(sequence), …..

qualification relations (9) mod(modification), pos(possessor), qua(quantity), …..

Page 35: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

35

Frequencies of CDL RelationsFrequencies of CDL Relations

1011121719202123242527#rel

CobOplCaoPlfPtnPltInsPerCooViaIclnam

QuaPurTimGolPlcManAgtAndAojObjModnam

26928932139544678810461122206926973128#relConNamEquMetBasDurCntSrcRsnScnPosnam

4

Seq47

2

To46

1

Iof41 4149586163657186#rel

067888910#rel

CagTmfFmtOrFrmPofTmtBennam

Data sparseness : The whole number of relation:13487 Relation type: 44 Average num per relation: 306.5

Page 36: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

36

Feature spacesFeature spaces Combine information from different language processes:

syntactic analysis tells the details of the word forms used in the text and the syntactic

structures among words. dependency analysis

A dependency relation specifies an asymmetric relationship between words, where one word is a dependent of the other word, which is called its governor.

lexical construction Lexical meaning contains two parts of information: word sense and

semantic behavior which is all the semantic relationships the word may contain.

Page 37: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

37

Syntax and Dependency FeaturesSyntax and Dependency Features Connexor Machinese

Text Analyser based on a functional

dependency grammar

Syntax Features Morphology features Syntactic features

Dependency Features Relation type Dependency Path

Page 38: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

38

Identification of Entity Pair Identification of Entity Pair with a Semantic Relationwith a Semantic RelationTesting all possible pairs is not efficient.

Step 1: For each input sentence, generate a dependency tree that specifies the syntactic head in the sentence.

Step 2: Find a headNode set from the dependency tree. Each can be a headword of a head entity to govern a relation. We select nodes which have subtree, and omit those which cannot be headNodes by creating a head stoplist.

Step3: For each headNode, check its subtrees to find those that can be tail entities to the headNode. We create a tail stoplist containing those that cannot be root nodes of subtrees of tail entities. Repeat this process.

Step 4: A simple post-processing is applied to correct the boundaries within which the dependency tree does not show correct relationship.

Entity pairs[fought, (the brave soldiers)][fought, (their enemies)][fought, (their country)][fought, (the War)][soldiers, brave][enemies, their][country, their]……..

A dependency tree generated fromConnexor Machinese Analyser

Page 39: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

39

main:

root

in

forwith

soldiers

braveThe enemies

their

country

their

War

the

foughtsubj:

attr:det:

pcomp:

attr:

attr:

det:

loc:

phr:ha:

pcomp:

pcomp:

Syntactic andDependency-pathfeatures

Lexical features fromWordNet,VerbNet andUNLKB.

Features for CDL Relation RecognitionFeatures for CDL Relation Recognition

Some labels of Connexor Machinese Analyser: ha (prepositional phase attachment), phr (verb particle), pcomp (subject complement)

Page 40: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

40

Experimental SettingExperimental Setting Datasets:

Manual-annotated dataset (1700 sentences documents) Contains 13487 CDL.nl relations (44 types of relations)

We choose to train on top 36 relation types with large number of training instances.

Tools SVM-light software Connexor Machinese Text Analyser

Scheme 10-fold cross validation One-vs-all

Page 41: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

41

The table shows the performance of the SVM using individual kernels incrementally.

ResultsResults

S: syntactic featuresD: dependency featuresL: lexical features

KS 80.10 86.35 83.11KD   85.43 83.57 84.49KL   73.98 82.75 78.12KS+D 87.19 86.27           86.73KS+D+L 87.35             88.07           87.71

Kernel       Precision         Recall          F-value

Page 42: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

42

Supporting Input EditorSupporting Input Editor Selection among possible candidates proposed by a

computer analysis. (Like Japanese Input Front-end Processor.)

Graphical verification and editing.

Page 43: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

43

Semantic Search with CDL.nl Semantic Search with CDL.nl beyond Keyword-based Searchbeyond Keyword-based Search

Baseline: Combination of keywords (The use of bi-gram, tri-gram,… does not lead to the improvement of

performance.) A pair of dependent words leads to a slight improvement. Search using natural language queries is preferable in a sense.

However, when the search result is unsatisfactory, it is not obvious for a user how to modify the query sentence.

The CDL.nl-based Search allows a more specified search based on a set of words with named dependency relations, rather than with the simple (non-named) dependency.

It also allows a search with using more specific word concepts such as one modified by attributes and/or larger concept units than a single word.

It allows a search taking account of semantic relevancy, such as similarity between two words, a relation between words in a sentence, etc.

Page 44: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

44

Interface for CDL.nl Data Retrieval Interface for CDL.nl Data Retrieval (Query)(Query)

Query by Natural Language

SQL-like Query Language: CDQL

Graphical Query Interface for CDL.nl data

Page 45: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

45

Approach to Implementing CDQLApproach to Implementing CDQL 1st Step

It is not easy to implement it from scratch. Thus we utilize SPARQL which is the query language

for RDF data. SPARQL is backed by Jena RDB.

Next Step Maybe original implementation for the CDQL

processing.

Page 46: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

46

CDL.nl Data Retrieval SystemCDL.nl Data Retrieval System

Page 47: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

47

CDL to RDFCDL to RDF

Page 48: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

48

CDL.nl Data converted into CDL.nl Data converted into RDF Graphical FormRDF Graphical Form

Page 49: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

49

CDL Data Retrieval via SPARQL :: CDL Data Retrieval via SPARQL :: a simple case a simple case

Query (the RealizationLabel of) a person to whom John reported.

Page 50: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

50

CDL.nl Data Retrieval CDL.nl Data Retrieval via CDQL (an Extended SPARQL)via CDQL (an Extended SPARQL)

Query:: What did John report.

Page 51: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

51

Semantically Flexible MatchingSemantically Flexible Matching

Query::What did John take?

Page 52: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

52

Toward the Foundation Toward the Foundation of Next-generation Webof Next-generation Web

Page 53: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

Immediate Applications ofImmediate Applications ofRelation ExtractionRelation Extraction from Textsfrom Texts

Jie Yang, Dat Nguyen and Mitsuru IshizukaJie Yang, Dat Nguyen and Mitsuru IshizukaDept. of Creative Informatics &Dept. of Creative Informatics &

Dept. of Info. and Communication Eng.Dept. of Info. and Communication Eng.School of Information Science and TechnologySchool of Information Science and Technology

Page 54: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

54

Relation Extraction from WikipediaRelation Extraction from WikipediaWilliam Henry Gates III   (born October 28, 1955) is the co-founder, chairman, former chief software architect, and former CEO of Microsoft Corporation. He is also the founder of Corbis, a digital image archiving company…

Microsoft Corporation,…Headquartered in Redmond, Washington, USA, its best selling products are the Microsoft Windows operating system and the Microsoft Office suite of productivity software.

(Microsoft, location, Redmond)

(Microsoft, product, MS Office)(Microsoft, product, MS Windows)

…(Microsoft, founder, Bill Gates)

(Microsoft, CEO, Bill Gates)(Microsoft, chairman, Bill Gates)

(Corbis, founder, Bill Gates)…

Page 55: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

55

System FrameworkSystem Framework

Entity ClassifierMicrosoft: ORGBill Gates: PER

Albuquerque: LOC

Entity type feature

Wikipedia

Pre-processing

Tag & link extractor

SentenceSplitter

Tokenizer

Phrase Chunker

Microsoft was…The company was

…by Bill Gates… in Albuquerque…

Principal Entity Detector

Secondary Entity Detector

SentenceDetector

MS co-founded… Bill GatesMS … in Albuquerque…

Keyword Extractor

FOUNDER: found, establish…LOCATION: headquartered, situated…

Relation Extractor

FOUNDER LOCATION

Dependency trees

Core Trees

Sub tree feature

SVMclassifiers

Structured knowledge

Page 56: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

56

Triple Tagging allowing rich info. Triple Tagging allowing rich info. in social tagging (folksonomy)in social tagging (folksonomy)

Page 57: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

57

Triple Tag ExtractionTriple Tag Extraction

Page 58: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

58

TripleTag EditorTripleTag Editor

Page 59: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

59

Semantic Retrieval of Patent DocumentsSemantic Retrieval of Patent Documents

Represent patent texts in CDL.nl (started in 2008). Also contributes to the translation of patents.

Page 60: Mitsuru Ishizuka Dept. of Creative Informatics & Dept. of Info. and Communication Eng

60

I have introduced our research on Semantic Computing

centered around CDL (Concept Description Language).

Thank You

Mitsuru IshizukaMitsuru IshizukaUniv. of TokyoUniv. of Tokyo