114
Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Scienc e University of Tokyo Japan

Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

  • View
    225

  • Download
    4

Embed Size (px)

Citation preview

Page 1: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Information Extractionfrom

Scientific Texts

Junichi Tsujii

Graduate School of Science

University of Tokyo

Japan

Page 2: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Texts are one of the major sources of information and knowledge.

However, they are not transparent.

They have to be systematically integrated with the other sources like data bases, numerical data,etc.

Natural Language Processing--IE

Page 3: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Information ExtractionModule

•Identify & classify terms•Identify events

Raw(OCR) TextStructure

Annotated

Corpus

Document Named-Entity Event

Database

Ontology Markuplanguage

Data model

Background Knowledge

MEDLINE

Retrieval Module

•Request enhancement•Spawn request•Classify documents

Security

User

•IR Request•Abstract•Full Paper

User

•IR Request•Abstract•Full Paper

Interface Module

•GUI•HTML conversion•System integration

Concept Module

Corpus Module

•Markup generation / compilation•Annotated corpus construction

Database Module

•DB design / access / management•DB construction•BK design / construction / compilation

Overview of GENIA System

Page 4: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

1. What is IE ?

2. General Framework of NLP

3. Basic IE techniques

4. IE in Biology

Plan

Automatic Term Recognition (S. Ananiadou)

Page 5: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

What is IE ?

Page 6: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Application Tasks of NLP

(1)Information Retrieval/Detection

(2)Passage Retrieval

(3)Information Extraction

(5)Text Understanding

(4) Question/Answering Tasks

To search and retrieve documents in response to queriesfor information

To search and retrieve part of documents in responseto queries for information

To extract information that fits pre-defined database schemas or templates, specifying the output formats

To answer general questions by using texts as knowledgebase: Fact retrieval, combination of IR and IE

To understand texts as people do: Artificial Intelligence

Page 7: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

(1)Information Retrieval/Detection

(2)Passage Retrieval

(3)Information Extraction

(5)Text Understanding

(4) Question/Answering Tasks

Ranges of Queries

Pre-Defined: Fixed aspectsof information carried in texts

Page 8: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house toproduce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20million new Taiwan dollars, will start production in January 1990with production of 20,000 iron and “metal wood” clubs a month.

Example of IE: FASTUS(1993)

TIE-UP-1Relationship: TIE-UPEntities: “Bridgestone Sport Co.” “a local concern” “a Japanese trading house”Joint Venture Company: “Bridgestone Sports Taiwan Co.”Activity: ACTIVITY-1Amount: NT$200000000

ACTIVITY-1Activity: PRODUCTIONCompany: “Bridgestone Sports Taiwan Co.”Product: “iron and ‘metal wood’ clubs”Start Date: DURING: January 1990

Page 9: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house toproduce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20million new Taiwan dollars, will start production in January 1990with production of 20,000 iron and “metal wood” clubs a month.

Example of IE: FASTUS(1993)

TIE-UP-1Relationship: TIE-UPEntities: “Bridgestone Sport Co.” “a local concern” “a Japanese trading house”Joint Venture Company: “Bridgestone Sports Taiwan Co.”Activity: ACTIVITY-1Amount: NT$200000000

ACTIVITY-1Activity: PRODUCTIONCompany: “Bridgestone Sports Taiwan Co.”Product: “iron and ‘metal wood’ clubs”Start Date: DURING: January 1990

Page 10: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house toproduce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20million new Taiwan dollars, will start production in January 1990with production of 20,000 iron and “metal wood” clubs a month.

Example of IE: FASTUS(1993)

TIE-UP-1Relationship: TIE-UPEntities: “Bridgestone Sport Co.” “a local concern” “a Japanese trading house”Joint Venture Company: “Bridgestone Sports Taiwan Co.”Activity: ACTIVITY-1Amount: NT$200000000

ACTIVITY-1Activity: PRODUCTIONCompany: “Bridgestone Sports Taiwan Co.”Product: “iron and ‘metal wood’ clubs”Start Date: DURING: January 1990

Page 11: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house toproduce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20million new Taiwan dollars, will start production in January 1990with production of 20,000 iron and “metal wood” clubs a month.

Example of IE: FASTUS(1993)

TIE-UP-1Relationship: TIE-UPEntities: “Bridgestone Sport Co.” “a local concern” “a Japanese trading house”Joint Venture Company: “Bridgestone Sports Taiwan Co.”Activity: ACTIVITY-1Amount: NT$200000000

ACTIVITY-1Activity: PRODUCTIONCompany: “Bridgestone Sports Taiwan Co.”Product: “iron and ‘metal wood’ clubs”Start Date: DURING: January 1990

Page 12: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house toproduce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20million new Taiwan dollars, will start production in January 1990with production of 20,000 iron and “metal wood” clubs a month.

Example of IE: FASTUS(1993)

TIE-UP-1Relationship: TIE-UPEntities: “Bridgestone Sport Co.” “a local concern” “a Japanese trading house”Joint Venture Company: “Bridgestone Sports Taiwan Co.”Activity: ACTIVITY-1Amount: NT$200000000

ACTIVITY-1Activity: PRODUCTIONCompany: “Bridgestone Sports Taiwan Co.”Product: “iron and ‘metal wood’ clubs”Start Date: DURING: January 1990

Page 13: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

FASTUS

1.Complex Words: Recognition of multi-words and proper names

2.Basic Phrases:Simple noun groups, verb groups and particles

3.Complex phrases:Complex noun groups and verb groups

4.Domain Events:Patterns for events of interest to the application

Basic templates are to be built.

5. Merging Structures:Templates from different parts of the texts are merged if they provide information about the same entity or event.

Based on finite states automata (FSA)

set upnew Twaiwan dallors

a Japanese trading househad set up

production of 20, 000 iron and metal wood clubs

[company][set up][Joint-Venture]with[company]

Page 14: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house toproduce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20million new Taiwan dollars, will start production in January 1990with production of 20,000 iron and “metal wood” clubs a month.

Example of IE: FASTUS(1993)

TIE-UP-1Relationship: TIE-UPEntities: “Bridgestone Sport Co.” “a local concern” “a Japanese trading house”Joint Venture Company: “Bridgestone Sports Taiwan Co.”Activity: ACTIVITY-1Amount: NT$200000000

ACTIVITY-1Activity: PRODUCTIONCompany: “Bridgestone Sports Taiwan Co.”Product: “iron and ‘metal wood’ clubs”Start Date: DURING: January 1990

Page 15: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Information Extraction

………. Jurgen Pfrang, 51, reportedly stumbled upon the robbers on thesecond floor of his Nanjing home early on Sunday. The deputy general manager of Yaxing Benz, a Sino-German joint venture that makes buses and bus chassis in nearby Yangzhou,was hacked to death with 45 cm watermelon knives. ……….

Name of the Venture: Yaxing BenzProducts: buses and bus chassisLocation: Yangzhou,ChinaCompanies involved: (1)Name: X? Country: German (2)Name: Y? Country: China

Page 16: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Information Extraction

A German vehicle-firm executive was stabbed to death ….………. Jurgen Pfrang, 51, reportedly stumbled upon the robbers on thesecond floor of his Nanjing home early on Sunday. The deputy general manager of Yaxing Benz, a Sino-German joint venture that makes buses and bus chassis in nearby Yangzhou,was hacked to death with 45 cm watermelon knives. ……….

Crime-Type: Murder Type: StabbingThe killed: Name: Jurgen Pfrang Age: 51 Profession: Deputy general managerLocation: Nanjing, China

Different template for crimes

Page 17: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

(1)Information Retrieval/Detection

(2)Passage Retrieval

(3)Information Extraction

(5)Text Understanding

(4) Question/Answering Tasks

Interpretation of Texts

User

User

System

System

System

Page 18: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Collection of Texts

IR System

Characterization of Texts

Queries

Page 19: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Collection of Texts

IR System

Characterization of Texts

Queries

Interpretation

Knowledge

Page 20: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Collection of Texts

PassageIR System

Characterization of Texts

Queries

Interpretation

Knowledge

Page 21: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Collection of Texts

PassageIR System

Characterization of Texts

Queries

Interpretation

Knowledge

IE System

Texts Templates

Structuresof

SentencesNLP

Page 22: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Interpretation

Knowledge

IE System

Texts Templates

Page 23: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Interpretation

Knowledge

IE System

Texts TemplatesPredefined

General Frameworkof

NLP/NLU

IE as compromise NLP

Page 24: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

(1)Information Retrieval/Detection

(2)Passage Retrieval

(3)Information Extraction

(5)Text Understanding

(4) Question/Answering Tasks

Performance Evaluation

Rather clear

A bit vague

Rather clear

A bit vague

Very vague

Page 25: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

N

N: Correct DocumentsM:Retrieved DocumentsC: Correct Documents that are actually retrieved

M

C

Query

Collection of Documents

Precision:

Recall:

CMCN

F-Value:

P

R

P+R2P ・ R

Page 26: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

N

N: Correct TemplatesM:Retrieved TemplatesC: Correct Templates that are actually retrieved

M

C

Query

Collection of Documents

Precision:

Recall:

CMCN

F-Value:

P

R

P+R2P ・ R

More complicated due to partially filled templates

Page 27: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

General Framework of NLP

Page 28: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

General Framework of NLP

Morphological andLexical Processing

Syntactic Analysis

Semantic Analysis

Context processingInterpretation

John runs.

Page 29: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

General Framework of NLP

Morphological andLexical Processing

Syntactic Analysis

Semantic Analysis

Context processingInterpretation

John runs.

John run+s. P-N V 3-pre N plu

Page 30: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

General Framework of NLP

Morphological andLexical Processing

Syntactic Analysis

Semantic Analysis

Context processingInterpretation

John runs.

John run+s. P-N V 3-pre N plu

S

NP

P-N

John

VP

V

run

Page 31: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

General Framework of NLP

Morphological andLexical Processing

Syntactic Analysis

Semantic Analysis

Context processingInterpretation

John runs.

John run+s. P-N V 3-pre N plu

S

NP

P-N

John

VP

V

runPred: RUN Agent:John

Page 32: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

General Framework of NLP

Morphological andLexical Processing

Syntactic Analysis

Semantic Analysis

Context processingInterpretation

John runs.

John run+s. P-N V 3-pre N plu

S

NP

P-N

John

VP

V

runPred: RUN Agent:John

John is a student.He runs.

Page 33: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

General Framework of NLP

Morphological andLexical Processing

Syntactic Analysis

Semantic Analysis

Context processingInterpretation

Domain AnalysisAppelt:1999

Tokenization

Part of Speech Tagging

Term recognition(Ananiadou)

Inflection/Derivation

Compounding

Page 34: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

General Framework of NLP

Morphological andLexical Processing

Syntactic Analysis

Semantic Analysis

Context processingInterpretation

Difficulties of NLP

(1) Robustness: Incomplete Knowledge

Page 35: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

General Framework of NLP

Morphological andLexical Processing

Syntactic Analysis

Semantic Analysis

Context processingInterpretation

Difficulties of NLP

(1) Robustness: Incomplete Knowledge Incomplete Lexicons

Open class words TermsTerm recognitionNamed Entities Company names Locations Numerical expressions

Page 36: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

General Framework of NLP

Morphological andLexical Processing

Syntactic Analysis

Semantic Analysis

Context processingInterpretation

Difficulties of NLP

(1) Robustness: Incomplete Knowledge

Incomplete Grammar Syntactic Coverage Domain Specific Constructions Ungrammatical Constructions

Page 37: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Syntactic Analysis

General Framework of NLP

Morphological andLexical Processing

Semantic Analysis

Context processingInterpretation

Difficulties of NLP

(1) Robustness: Incomplete Knowledge

Incomplete Domain Knowledge Interpretation Rules

PredefinedAspects of

Information

Page 38: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

General Framework of NLP

Morphological andLexical Processing

Syntactic Analysis

Semantic Analysis

Context processingInterpretation

Difficulties of NLP

(1) Robustness: Incomplete Knowledge

(2) Ambiguities:Combinatorial

Explosion

Page 39: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

General Framework of NLP

Morphological andLexical Processing

Syntactic Analysis

Semantic Analysis

Context processingInterpretation

Difficulties of NLP

(1) Robustness: Incomplete Knowledge

(2) Ambiguities:Combinatorial

Explosion

Most words in Englishare ambiguous in terms of their part of speeches. runs: v/3pre, n/plu clubs: v/3pre, n/plu and two meanings

Page 40: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

General Framework of NLP

Morphological andLexical Processing

Syntactic Analysis

Semantic Analysis

Context processingInterpretation

Difficulties of NLP

(1) Robustness: Incomplete Knowledge

(2) Ambiguities:Combinatorial

Explosion

Structural Ambiguities

Predicate-argument Ambiguities

Page 41: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Structural Ambiguities

(1)Attachment Ambiguities John bought a car with large seats. John bought a car with $3000.

(2) Scope Ambiguities

young women and men in the room

(3)Analytical Ambiguities Visiting relatives can be boring.

The manager of Yaxing Benz, a Sino-German joint ventureThe manager of Yaxing Benz, Mr. John Smith

John bought a car with Mary.$3000 can buy a nice car.

Semantic Ambiguities(1)

Semantic Ambiguities(2)

Every man loves a woman.

Co-reference Ambiguities

Page 42: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

General Framework of NLP

Morphological andLexical Processing

Syntactic Analysis

Semantic Analysis

Context processingInterpretation

Difficulties of NLP

(1) Robustness: Incomplete Knowledge

(2) Ambiguities:Combinatorial

Explosion

Structural Ambiguities

Predicate-argument Ambiguities

CombinatorialExplosion

Page 43: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Note:

Ambiguities vs Robustness

More comprehensive knowledge: More Robust

big dictionariescomprehensive grammar

More comprehensive knowledge: More ambiguities

Adaptability: Tuning, Learning

Page 44: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Framework of IE

IE as compromise NLP

Page 45: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Syntactic Analysis

General Framework of NLP

Morphological andLexical Processing

Semantic Analysis

Context processingInterpretation

Difficulties of NLP

(1) Robustness: Incomplete Knowledge

Incomplete Domain Knowledge Interpretation Rules

PredefinedAspects of

Information

Page 46: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Syntactic Analysis

General Framework of NLP

Morphological andLexical Processing

Semantic Analysis

Context processingInterpretation

Difficulties of NLP

(1) Robustness: Incomplete Knowledge

Incomplete Domain Knowledge Interpretation Rules

PredefinedAspects of

Information

Page 47: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Techniques in IE

(1) Domain Specific Partial Knowledge: Knowledge relevant to information to be extracted

(2) Ambiguities: Ignoring irrelevant ambiguities Simpler NLP techniques

(4) Adaptation Techniques: Machine Learning, Trainable systems

(3) Robustness: Coping with Incomplete dictionaries (open class words) Ignoring irrelevant parts of sentences

Page 48: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

General Framework of NLP

Morphological andLexical Processing

Syntactic Analysis

Semantic Anaysis

Context processingInterpretation

Open class words: Named entity recognition (ex) Locations Persons Companies Organizations Position names

Domain specific rules: <Word><Word>, Inc. Mr. <Cpt-L>. <Word>Machine Learning: HMM, Decision TreesRules + Machine Learning

Part of Speech TaggerFSA rulesStatistic taggers

95 %

F-Value90

DomainDependent

Local ContextStatistical Bias

Page 49: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

General Framework of NLP

Morphological andLexical Processing

Syntactic Analysis

Semantic Anaysis

Context processingInterpretation

FASTUS

1.Complex Words: Recognition of multi-words and proper names

2.Basic Phrases:Simple noun groups, verb groups and particles

3.Complex phrases:Complex noun groups and verb groups

4.Domain Events:Patterns for events of interest to the application

Basic templates are to be built.

5. Merging Structures:Templates from different parts of the texts are merged if they provide information about the same entity or event.

Based on finite states automata (FSA)

Page 50: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

General Framework of NLP

Morphological andLexical Processing

Syntactic Analysis

Semantic Anaysis

Context processingInterpretation

FASTUS

1.Complex Words: Recognition of multi-words and proper names

2.Basic Phrases:Simple noun groups, verb groups and particles

3.Complex phrases:Complex noun groups and verb groups

4.Domain Events:Patterns for events of interest to the application

Basic templates are to be built.

5. Merging Structures:Templates from different parts of the texts are merged if they provide information about the same entity or event.

Based on finite states automata (FSA)

Page 51: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

General Framework of NLP

Morphological andLexical Processing

Syntactic Analysis

Semantic Analysis

Context processingInterpretation

FASTUS

1.Complex Words: Recognition of multi-words and proper names

2.Basic Phrases:Simple noun groups, verb groups and particles

3.Complex phrases:Complex noun groups and verb groups

4.Domain Events:Patterns for events of interest to the application

Basic templates are to be built.

5. Merging Structures:Templates from different parts of the texts are merged if they provide information about the same entity or event.

Based on finite states automata (FSA)

Page 52: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Chomsky Hierarchy Hierarchy of Grammar of Automata

Regular Grammar Finite State Automata

Context Free Grammar Push Down Automata

Context Sensitive Grammar Linear Bounded Automata

Type 0 Grammar Turing Machine

Computationally more complex, Less Efficiency

Page 53: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Chomsky Hierarchy Hierarchy of Grammar of Automata

Regular Grammar Finite State Automata

Context Free Grammar Push Down Automata

Context Sensitive Grammar Linear Bounded Automata

Type 0 Grammar Turing Machine

Computationally more complex, Less Efficiency

A Bn n

Page 54: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

0

1

2

3

4

PN ’s

ADJ

Art

N

PN

P

’s

Art

John’s interestingbook with a nice cover

Page 55: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

0

1

2

3

4

PN ’s

ADJ

Art

N

PN

P

’s

Art

John’s interestingbook with a nice cover

Page 56: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

0

1

2

3

4

PN ’s

ADJ

Art

N

PN

P

’s

Art

John’s interestingbook with a nice cover

Page 57: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

0

1

2

3

4

PN ’s

ADJ

Art

N

PN

P

’s

Art

John’s interestingbook with a nice cover

Page 58: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

0

1

2

3

4

PN ’s

ADJ

Art

N

PN

P

’s

Art

John’s interestingbook with a nice cover

Page 59: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

0

1

2

3

4

PN ’s

ADJ

Art

N

PN

P

’s

Art

John’s interestingbook with a nice cover

Page 60: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

0

1

2

3

4

PN ’s

ADJ

Art

N

PN

P

’s

Art

John’s interestingbook with a nice cover

Page 61: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

0

1

2

3

4

PN ’s

ADJ

Art

N

PN

P

’s

Art

John’s interestingbook with a nice cover

Page 62: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

0

1

2

3

4

PN ’s

ADJ

Art

N

PN

P

’s

Art

John’s interestingbook with a nice cover

Page 63: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

0

1

2

3

4

PN ’s

ADJ

Art

N

PN

P

’s

Art

John’s interestingbook with a nice cover

Page 64: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

0

1

2

3

4

PN ’s

ADJ

Art

N

PN

P

’s

Art

John’s interestingbook with a nice cover

Pattern-maching

PN ’s (ADJ)* N P Art (ADJ)* N

{PN ’s/ Art}(ADJ)* N(P Art (ADJ)* N)*

Page 65: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

General Framework of NLP

Morphological andLexical Processing

Syntactic Analysis

Semantic Analysis

Context processingInterpretation

FASTUS

1.Complex Words: Recognition of multi-words and proper names

2.Basic Phrases:Simple noun groups, verb groups and particles

3.Complex phrases:Complex noun groups and verb groups

4.Domain Events:Patterns for events of interest to the application

Basic templates are to be built.

5. Merging Structures:Templates from different parts of the texts are merged if they provide information about the same entity or event.

Based on finite states automata (FSA)

Page 66: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house toproduce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20million new Taiwan dollars, will start production in January 1990with production of 20,000 “metal wood” clubs a month.

Example of IE: FASTUS(1993)

1.Complex words 2.Basic Phrases: Bridgestone Sports Co.: Company name said : Verb Group Friday : Noun Group it : Noun Group had set up : Verb Group a joint venture : Noun Group in : Preposition Taiwan : Location

AttachmentAmbiguitiesare not madeexplicit

Page 67: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house toproduce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20million new Taiwan dollars, will start production in January 1990with production of 20,000 “metal wood” clubs a month.

Example of IE: FASTUS(1993)

1.Complex words 2.Basic Phrases: Bridgestone Sports Co.: Company name said : Verb Group Friday : Noun Group it : Noun Group had set up : Verb Group a joint venture : Noun Group in : Preposition Taiwan : Location

{{ }}

a Japanese tea house

a [Japanese tea] housea Japanese [tea house]

Page 68: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house toproduce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20million new Taiwan dollars, will start production in January 1990with production of 20,000 “metal wood” clubs a month.

Example of IE: FASTUS(1993)

1.Complex words 2.Basic Phrases: Bridgestone Sports Co.: Company name said : Verb Group Friday : Noun Group it : Noun Group had set up : Verb Group a joint venture : Noun Group in : Preposition Taiwan : Location

Structural Ambiguities of NP are ignored

Page 69: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house toproduce golf clubs to be supplied to Japan.

The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20million new Taiwan dollars, will start production in January 1990with production of 20,000 “metal wood” clubs a month.

Example of IE: FASTUS(1993)

2.Basic Phrases: Bridgestone Sports Co.: Company name said : Verb Group Friday : Noun Group it : Noun Group had set up : Verb Group a joint venture : Noun Group in : Preposition Taiwan : Location

3.Complex Phrases

Page 70: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

[COMPNY] said Friday it [SET-UP] [JOINT-VENTURE] in [LOCATION] with [COMPANY] and [COMPNY] toproduce [PRODUCT] to be supplied to [LOCATION].

[JOINT-VENTURE], [COMPNY], capitalized at 20 million [CURRENCY-UNIT] [START] production in [TIME]with production of 20,000 [PRODUCT] a month.

Example of IE: FASTUS(1993)

2.Basic Phrases: Bridgestone Sports Co.: Company name said : Verb Group Friday : Noun Group it : Noun Group had set up : Verb Group a joint venture : Noun Group in : Preposition Taiwan : Location

3.Complex Phrases

Some syntactic structureslike …

Page 71: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

[COMPNY] said Friday it [SET-UP] [JOINT-VENTURE] in [LOCATION] with [COMPANY] toproduce [PRODUCT] to be supplied to [LOCATION].

[JOINT-VENTURE] capitalized at [CURRENCY] [START] production in [TIME]with production of [PRODUCT] a month.

Example of IE: FASTUS(1993)

2.Basic Phrases: Bridgestone Sports Co.: Company name said : Verb Group Friday : Noun Group it : Noun Group had set up : Verb Group a joint venture : Noun Group in : Preposition Taiwan : Location

3.Complex Phrases

Syntactic structures relevantto information to be extractedare dealt with.

Page 72: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Syntactic variations

GM set up a joint venture with Toyota.GM announced it was setting up a joint venture with Toyota.GM signed an agreement setting up a joint venture with Toyota.GM announced it was signing an agreement to set up a joint venture with Toyota.

Page 73: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Syntactic variations

GM set up a joint venture with Toyota.GM announced it was setting up a joint venture with Toyota.GM signed an agreement setting up a joint venture with Toyota.GM announced it was signing an agreement to set up a joint venture with Toyota.

[SET-UP]

GM plans to set up a joint venture with Toyota.GM expects to set up a joint venture with Toyota.

S

NP VP

V NP

N VP

V

GM

signed

agreement

setting up

Page 74: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Syntactic variations

GM set up a joint venture with Toyota.GM announced it was setting up a joint venture with Toyota.GM signed an agreement setting up a joint venture with Toyota.GM announced it was signing an agreement to set up a joint venture with Toyota.

[SET-UP]

GM plans to set up a joint venture with Toyota.GM expects to set up a joint venture with Toyota.

S

NP VP

VGM

set up

Page 75: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

[COMPNY] [SET-UP] [JOINT-VENTURE] in [LOCATION] with [COMPANY] toproduce [PRODUCT] to be supplied to [LOCATION].

[JOINT-VENTURE] capitalized at [CURRENCY] [START] production in [TIME]with production of [PRODUCT] a month.

Example of IE: FASTUS(1993)

3.Complex Phrases4.Domain Events

[COMPANY][SET-UP][JOINT-VENTURE]with[COMPNY][COMPANY][SET-UP][JOINT-VENTURE] (others)* with[COMPNY]

The attachment positions of PP are determined at this stage.Irrelevant parts of sentences are ignored.

Page 76: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Complications caused by syntactic variations

Relative clause

The mayor, who was kidnapped yesterday, was found dead today.

[NG] Relpro {NG/others}* [VG] {NG/others}*[VG][NG] Relpro {NG/others}* [VG]

Page 77: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Complications caused by syntactic variations

Relative clause

The mayor, who was kidnapped yesterday, was found dead today.

[NG] Relpro {NG/others}* [VG] {NG/others}*[VG][NG] Relpro {NG/others}* [VG]

Page 78: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Complications caused by syntactic variations

Relative clause

The mayor, who was kidnapped yesterday, was found dead today.

[NG] Relpro {NG/others}* [VG] {NG/others}*[VG][NG] Relpro {NG/others}* [VG]

Basic patterns Surface PatternGenerator

Patterns usedby Domain Event

Relative clause constructionPassivization, etc.

Page 79: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

FASTUS

1.Complex Words:

2.Basic Phrases:

3.Complex phrases:

4.Domain Events:Patterns for events of interest to the application

Basic templates are to be built.

5. Merging Structures:Templates from different parts of the texts are merged if they provide information about the same entity or event.

Based on finite states automata (FSA)

Piece-wise recognitionof basic templates

Reconstructing informationcarried via syntactic structuresby merging basic templates

NP, who was kidnapped, was found.

Page 80: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

FASTUS

1.Complex Words:

2.Basic Phrases:

3.Complex phrases:

4.Domain Events:Patterns for events of interest to the application

Basic templates are to be built.

5. Merging Structures:Templates from different parts of the texts are merged if they provide information about the same entity or event.

Based on finite states automata (FSA)

Piece-wise recognitionof basic templates

Reconstructing informationcarried via syntactic structuresby merging basic templates

NP, who was kidnapped, was found.

Page 81: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

FASTUS

1.Complex Words:

2.Basic Phrases:

3.Complex phrases:

4.Domain Events:Patterns for events of interest to the application

Basic templates are to be built.

5. Merging Structures:Templates from different parts of the texts are merged if they provide information about the same entity or event.

Based on finite states automata (FSA)

Piece-wise recognitionof basic templates

Reconstructing informationcarried via syntactic structuresby merging basic templates

NP, who was kidnapped, was found.

Page 82: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Current state of the arts of IE

1. Carefully constructed IE systems F-60 level (interannotater agreement: 60-80%) Domain: telegraphic messages about naval operation (MUC-1:87, MUC-2:89) news articles and transcriptions of radio broadcasts Latin American terrorism (MUC-3:91, MUC-4:1992) News articles about joint ventures (MUC-5, 93) News articles about management changes (MUC-6, 95) News articles about space vehicle (MUC-7, 97) 2. Handcrafted rules (named entity recognition, domain events, etc)

Automatic learning from texts: Supervised learning : corpus preparation

Non-supervised, or controlled learning

Page 83: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

IE in Biology

Page 84: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

CSNDB( National Institute of Health Sciences)

• A data- and knowledge- base for signaling pathways of human cells.– It compiles the information on biological molecules,

sequences, structures, functions, and biological reactions which transfer the cellular signals.

– Signaling pathways are compiled as binary relationships of biomolecules and represented by graphs drawn automatically.

– CSNDB is constructed on ACEDB and inference engine CLIPS, and has a linkage to TRANSFAC.

– Final goal is to make a computerized model for various biological phenomena.

Page 85: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Example. 1

• A Standard Reaction Excerpted @[Takai98]

Signal_Reaction:

“EGF receptor Grb2” From_molecule “EGF receptor”To_molecule “Grb2”Tissue “liver”Effect “activation”Interaction

“SH2+phosphorylated Tyr”Reference [Yamauchi_1997]

Page 86: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Example. 3

• A Polymerization Reaction Excerpted @[Takai98]

Signal_Reaction:

“Ah receptor + HSP90 ” Component “Ah receptor” “HSP90”Effect “activation dissociation”Interaction

“PAS domain” “of Ah receptor”

Activity “inactivation of Ah receptor”

Reference [Powell-Coffman_1998]

Page 87: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

FASTUS

1.Complex Words: Recognition of multi-words and proper names

2.Basic Phrases:Simple noun groups, verb groups and particles

3.Complex phrases:Complex noun groups and verb groups

4.Domain Events:Patterns for events of interest to the application

Basic templates are to be built.

5. Merging Structures:Templates from different parts of the texts are merged if they provide information about the same entity or event.

Based on finite states automata (FSA)

Page 88: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

FASTUS

1.Complex Words: Recognition of multi-words and proper names

2.Basic Phrases:Simple noun groups, verb groups and particles

3.Complex phrases:Complex noun groups and verb groups

4.Domain Events:Patterns for events of interest to the application

Basic templates are to be built.

5. Merging Structures:Templates from different parts of the texts are merged if they provide information about the same entity or event.

Based on finite states automata (FSA)Is separation of stages possible ?

Page 89: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

FASTUS

1.Complex Words: Recognition of multi-words and proper names

2.Basic Phrases:Simple noun groups, verb groups and particles

3.Complex phrases:Complex noun groups and verb groups

4.Domain Events:Patterns for events of interest to the application

Basic templates are to be built.

5. Merging Structures:Templates from different parts of the texts are merged if they provide information about the same entity or event.

Based on finite states automata (FSA)Is separation of stages possible ?

Open word classes: techical terms very long specific formation rules many semantic classes acronyms variants fairly ambiguous [[Term recognition]]

Coordination across wordformation A or B and C D

Page 90: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

FASTUS

1.Complex Words: Recognition of multi-words and proper names

2.Basic Phrases:Simple noun groups, verb groups and particles

3.Complex phrases:Complex noun groups and verb groups

4.Domain Events:Patterns for events of interest to the application

Basic templates are to be built.

5. Merging Structures:Templates from different parts of the texts are merged if they provide information about the same entity or event.

Based on finite states automata (FSA)Is separation of stages possible ?

Page 91: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Syntax/Semantics

An active phorbol ester must therefore, presumablyby activation of protein kinase C, cause dissociationof a cytoplasmic complex of NF-kappa B and I kappa Bby modifying I kappa B.

E1: An active phorbol ester activates protein kinase C.

Page 92: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Syntax/Semantics

An active phorbol ester must therefore, presumablyby activation of protein kinase C, cause dissociationof a cytoplasmic complex of NF-kappa B and I kappa Bby modifying I kappa B.

E1: An active phorbol ester activates protein kinase C.

E2: The active phorbol ester modifies I kappa B.

Page 93: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Syntax/Semantics

An active phorbol ester must therefore, presumablyby activation of protein kinase C, cause dissociationof a cytoplasmic complex of NF-kappa B and I kappa Bby modifying I kappa B.

E1: An active phorbol ester activates protein kinase C.

E2: The active phorbol ester modifies I kappa B.

E3: It dissociates a cytoplasmic complex of NF-kappa B and I kappa B.

Part-Whole

Page 94: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Syntax/Semantics

An active phorbol ester must therefore, presumablyby activation of protein kinase C, cause dissociationof a cytoplasmic complex of NF-kappa B and I kappa Bby modifying I kappa B.

E1: An active phorbol ester activates protein kinase C.

E2: The active phorbol ester modifies I kappa B.

E3: It dissociates a cytoplasmic complex of NF-kappa B and I kappa B.

Part-Whole

Page 95: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Full parser based on good grammar formalisms

1. Several attempts of using full parsers : To improve the Precision

2. Systematic treatment of interaction of the different phases : Unification-based grammar formalisms

The two papers in the NLP session of PSB 2001

Page 96: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Experiment(A.Yakushiji et.al, PSB2001)

XHPSG: HPSG-like Grammar translated from XTAG of U-Penn (Y.Tateishi, TAG+ workshop 98)

Terms (Compound nouns) are chunked beforehand.

Automatic conversion: Detailed, empirical comparison of grammars of different formalisms (+LFG)

180 sentences from abstracts in MEDLINE The average parse time per sentence: 2.7 sec by a naïve parser (This can be improved by the multi-stage parser by 50 times)

Page 97: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Argument Frame Extractor

133 argument structures, marked by a domain specialist in 97 sentences among the 180 sentences

Extracted Uniquely

Extracted with ambiguity

ParsingFailures

Extractable from pp’s

31

32

26

Not extractable 27

Memory limitation,etc 17

68%

Page 98: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Ontology: Knowledge of the Domain

Open class words: Named entity recognition (ex) Locations Persons Companies Organizations Position names

More refined semanticclasses with part-wholerelationships, properties,Etc.

Acronyms, variants,Etc.

Page 99: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Ontology: Knowledge of the Domain

Open class words: Named entity recognition (ex) Locations Persons Companies Organizations Position names

More refined semanticclasses with part-wholerelationships, properties,Etc.

Acronyms, variants,Etc.

Page 100: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

A database for all sort of biological terms collected from genome databases and biological texts.

It will contain 2 million terms in 2001 and 5 million terms until 2005.

Terms are classified by biochemical and terminological attributes, grounded on their resources.

A database for all sort of biological terms collected from genome databases and biological texts.

It will contain 2 million terms in 2001 and 5 million terms until 2005.

Terms are classified by biochemical and terminological attributes, grounded on their resources.

Biological ontology committee Japan organized by T. Takagi and T. Takai, U.Tokyo in Genome Projects of MESSC (2000.4 ~ 2005.3)

Bio Term BankBio Term Bank BTB

Page 101: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Ontology: Knowledge of the Domain

Open class words: Named entity recognition (ex) Locations Persons Companies Organizations Position names

More refined semanticclasses with part-wholerelationships, properties,Etc.

Acronyms, variants,Etc.

Page 102: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

GENIA ontology (current version)

+-name-+-source-+-natural-+-organism-+-multi-cell organism   | | | +-mono-cell organism   | | | +-virus  | | +-tissue   | | +-cell type   | | +-sub-location of cells   | +-artificial-+-cell line   |  +-substance-+-compound-+-organic-+-amino-+-protein-+-protein family or group | | +-protein complex | | +-individual protein molecule | | +-subunit of protein complex | | +-substructure of protein | | +-domain or region of protein | +-peptide | +-amino acid monomer | +-nucleic-+-DNA-+-DNA family or group | +-individual DNA molecule | +-domain or region of DNA | +-RNA-+-RNA family or group +-individual RNA molecule +-domain or region of RNA

Page 103: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Expansion of GENIA Ontology

• Try to tag all NPs in some MEDLINE abstracts and find the classes that appears in abstracts but not in current ontology

• Find frequent verbs and what class of arguments they take

Page 104: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Expansion of GENIA Ontology

• Chemical class of substance and their substrucutres• Sources • Biological role, or function, of substances• Reaction

– Biological reaction– Pathway– Disease

• Structure themselves• Experiment , experimental results, and researchers• Measure

Page 105: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Example of Entities in Expanded

• Biological role, or function, of substances– receptor, inhibitor, …

• Biological reaction– activation, binding, inhibition, apoptosis, G2 arrest– pathway, signal– immune dysfunction, Ataxia telangiectasia (AT)

• Structure themselves– alpha-helix,

• Experiment, experimental results, researchers– our results, these studies, we

Page 106: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Verbs Related to Biological EventsFrequent Verbs in 100 MEDLINE Abstracts

Verb Count Verb Count Verb Count Verb Countbe 255 involve 16 determine 9 explain 6induce 56 identify 16 construct 9 exert 6bind 50 act 15 associate 9 enhance 6show 49 stimulate 14 reduce 8 display 6suggest 42 provide 14 prevent 8 characterize 6activate 42 express 13 locate 8 participate 5factor 36 affect 13 line 8 localize 5demonstrate 35 type 12 differ 8 investigate 5inhibit 26 report 12 trigger 7 imply 5have 25 form 12 synergize 7 establish 5reveal 21 contribute 12 examine 7 conclude 5require 21 study 11 block 7 compare 5regulate 21 observe 11 become 7 use 4indicate 21 lead 11 analyze 7 transform 4find 21 function 11 target 6 transfect 4result 20 assay 11 signal 6 test 4play 19 appear 11 remain 6 suppress 4interact 18 occur 10 produce 6 support 4mediate 17 increase 10 present 6 substitute 4contain 17 phosphorylate 9 possess 6 share 4

Page 107: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Verbs Related to Biological EventsVerbs that take biological entities as arguments

• induce– noun BE INDUCED BY noun activation of these PROTEIN was induced by PROTEIN

– noun INDUCE noun PROTEIN induced the tyrosine phosphorylation

• bind– noun BIND TO noun the drugs bind to two different PROTEIN – noun BIND noun motifs previously found to bind the cellular factors– noun BINDING noun the TATA-box binding protein– the BINDING of noun the binding of PROTEIN

semantic class: substance structure source experiment fact reaction

Page 108: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Verbs Related to Biological EventsVerbs that take description entities

• report– noun REPORT that-clause we report here that PROTEIN is activated by PROTEIN– noun REPORT noun we report the characterization of PROTEIN– noun REPORT noun we report a novel structure of PROTEIN

semantic class: substance structure source experiment fact reaction

Page 109: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Verbs Related to Biological EventsVerbs whose arguments depend on syntactic patterns

• show– noun BE SHOWN to-infinitive PROTEIN has been shown to trigger cellular PROTEIN activity – noun SHOW that-clause the data show that PROTEIN stimulation is also not sufficient – noun SHOW noun SOURCE showed a dose-dependent inhibition of PROTEIN activity

semantic class: substance source experiment fact

Page 110: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Verbs Related to Biological EventsVerbs that take both entities

• indicate– noun INDICATE that-clause the data indicate that PROTEIN is required in CELL prolifiration – noun INDICATE noun these findings indicate an unexpected role of DNA

– noun INDICATE that-clause the structure indicates that it represents a unique class of PROTEIN– noun INDICATE noun the structure indicates mechanisms for allosteric effector action

semantic class: substance structure source experiment fact reaction role

Page 111: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Example of NE AnnotationUI - 85146267

TI - Characterization of <NE ti="3" class="protein" nm="aldosterone binding site" mt="SV" subclass="family_or_group" unsure="Class" cmt="">aldosterone binding sites</NE ti="3"> in circulating <NE ti="2" class="cell_type" nm="human mononuclear leukocyte" mt="SV" unsure="OK" cmt="">human mononuclear leukocytes</NE ti="2">.

AB - <NE ti="4" class="protein" nm="Aldosterone binding sites" mt="SV" subclass="family_or_group" unsure="Class" cmt="">Aldosterone binding sites</NE ti="4"> in <NE ti="1" class="cell_type" nm="human mononuclear leukocyte" mt="SV" unsure="OK" cmt="">human mononuclear leukocytes</NE ti="1"> were characterized after separation of cells from blood by a Percoll gradient. After washing and resuspension in <NE ti="5" class="other_organic_compounds" nm="RPMI-1640 medium" mt="SV" unsure="OK" cmt="">RPMI-1640 medium</NE ti="5">, cells were incubated at 37 degrees C for 1 h with different concentrations of <NE ti="6" class="other_organic_compounds" nm="[3H]aldosterone" mt="SV" unsure="OK" cmt="">[3H]aldosterone</NE ti="6"> plus a 100-fold concentration of <NE ti="7" class="other_organic_compounds" nm="RU-26988" mt="SV" unsure="OK" cmt="">RU-26988 </NE ti="7">(<NE ti=“17" class="other_organic_compounds" nm="11 alpha, 17 alpha-dihydroxy-17 beta-propynylandrost-1,4,6-trien-3-one" mt="SV" unsure="OK" cmt="">11 alpha, 17 alpha-dihydroxy-17 beta-propynylandrost-1,4,6-trien-3-one</NE ti=“17">), with or without an excess of unlabeled <NE ti="8" class="other_organic_compounds" nm="aldosterone" mt="SV" unsure="OK" cmt="">aldosterone</NE ti="8">. <NE ti="9" class="other_organic_compounds" nm="Aldosterone" mt="SV" unsure="OK" cmt="">Aldosterone</NE ti="9"> binds to a single class of <NE ti="10" class="protein" nm="receptor" mt="SV" subclass="family_or_group" unsure="OK" cmt="">receptors</NE ti="10"> with an affinity of 2.7 +/- 0.5 nM (means +/- SD, n = 14) and a capacity of 290 +/- 108 sites/cell (n = 14). The specificity data show a hierarchy of affinity of <NE ti="11" class="other_organic_compounds" nm="desoxycorticosterone" mt="SV" unsure="OK" cmt="">desoxycorticosterone</NE ti="11"> = <NE ti="12" class="other_organic_compounds" nm="corticosterone" mt="SV" unsure="OK" cmt="">corticosterone</NE ti="12"> = <NE ti="13" class="other_organic_compounds" nm="aldosterone" mt="SV" unsure="OK" cmt="">aldosterone</NE ti="13"> greater than <NE ti="14" class="other_organic_compounds" nm="hydrocortisone" mt="SV" unsure="OK" cmt="">hydrocortisone</NE ti="14"> greater than <NE ti="15" class="other_organic_compounds" nm="dexamethasone" mt="SV" unsure="OK" cmt="">dexamethasone</NE ti="15">. The results indicate that <NE ti="17" class="cell_type" nm="mononuclear leukocyte" mt="SV" unsure="OK" cmt="">mononuclear leukocytes</NE ti="17"> could be useful for studying the physiological significance of these <NE ti="16" class="protein" nm="mineralocorticoid receptor" mt="SV" subclass="family_or_group" unsure="OK" cmt="">mineralocorticoid receptors</NE ti="16"> and their regulation in humans.

Page 112: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Available from our website:

Definition of ontological classes Manual of GMPL: extention of XML to annonate texts Manual of Text Annotation

Soon: Annotated texts (1000 abstracts) by the end of March

Page 113: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

1. IE can contribute to Bio-informatics significantly.

2. However, the domains in Bio-chemistry seem morestructurally rich than the domains we have dealt with so far. Term formation, rich ontologies, complex syntactic structures.

3. It requires substantial efforts in resource building.

4. However, those resources can contribute to other applications : Knowledge sharing, Intelligent IR, Knowledge discovery

One of the crucial techniques is ATR ….

Page 114: Information Extraction from Scientific Texts Junichi Tsujii Graduate School of Science University of Tokyo Japan

Information ExtractionModule

•Identify & classify terms•Identify events

Raw(OCR) TextStructure

Annotated

Corpus

Document Named-Entity Event

Database

Ontology Markuplanguage

Data model

Background Knowledge

MEDLINE

Retrieval Module

•Request enhancement•Spawn request•Classify documents

Security

User

•IR Request•Abstract•Full Paper

User

•IR Request•Abstract•Full Paper

Interface Module

•GUI•HTML conversion•System integration

Concept Module

Corpus Module

•Markup generation / compilation•Annotated corpus construction

Database Module

•DB design / access / management•DB construction•BK design / construction / compilation

Overview of GENIA System