Human Language Technology in Musing Horacio Saggion (U. of Sheffield) & Thierry Declerck (DFKI)


Page 1: Human Language Technology  in Musing

Human Language Technology in Musing

Horacio Saggion (U. of Sheffield) & Thierry Declerck (DFKI)

Page 2: Human Language Technology  in Musing

Outline

• Role of HLT in BI
• Information Extraction (IE) and Semantic Annotation
• IE development
• Overview of GATE system
• Ontology-based IE in Musing
• Identity Resolution in Musing
• Opinion Mining in Musing

Page 3: Human Language Technology  in Musing

Human Language Technology in Business Intelligence

• Business Intelligence (BI) is the process of finding, gathering, aggregating, and analysing information for decision making
• BI has relied on structured, quantitative information for decision making and has hardly ever used the qualitative information found in unstructured sources, which the industry is keen to use
• Human language technology is used in the processes of
    • gathering information, through Information Extraction
    • aggregating information, through cross-source coreference or identity resolution

Page 4: Human Language Technology  in Musing

Information Extraction (IE)

• IE pulls facts from the document collection
• It is based on the idea of a scenario template
    • some domains can be represented in the form of one or more templates
    • templates contain slots representing semantic information
    • IE instantiates the slots with values: strings from the text or associated values
• IE is domain dependent and has to be adapted to each application domain, either manually or by machine learning

Page 5: Human Language Technology  in Musing

IE Example: Company Agreements

SENER and Abu Dhabi’s $15 billion renewable energy company MASDAR new joint venture Torresol Energy has announced an ambitious solar power initiative to develop, build and operate large Concentrated Solar Power (CSP) plants worldwide… SENER Grupo de Ingeniería will control 60% of Torresol Energy and MASDAR, the remaining 40%. The Spanish holding will contribute all its experience in the design of high technology that has positioned it as a leader in world engineering. For its part, MASDAR will contribute with this initiative to diversifying Abu Dhabi’s economy and strengthening the country’s image as an active agent in the global fight for the sustainable development of the Planet.

Filled template:

    COMPANY-1      SENER Grupo de Ingeniería
    COMPANY-2      MASDAR
    % COMP-1       60%
    % COMP-2       40%
    NEW COMPANY    Torresol Energy
    AGREEMENT      Joint Venture
    PURPOSE        “…develop, build, and operate CSP plants worldwide…”
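
A minimal sketch of the filled template as a data structure, assuming Python. The class name and slot names (`company_1`, `share_1`, and so on) are hypothetical and chosen for illustration; only the slot values come from the example.

```python
from dataclasses import dataclass, asdict


@dataclass
class JointVentureTemplate:
    """One scenario template; each field is a slot (names are illustrative)."""
    company_1: str
    company_2: str
    share_1: float  # percentage of the new company controlled by company_1
    share_2: float  # percentage of the new company controlled by company_2
    new_company: str
    agreement: str
    purpose: str


# Slots instantiated with strings from the text or associated values.
torresol = JointVentureTemplate(
    company_1="SENER Grupo de Ingeniería",
    company_2="MASDAR",
    share_1=60.0,
    share_2=40.0,
    new_company="Torresol Energy",
    agreement="Joint Venture",
    purpose="develop, build, and operate CSP plants worldwide",
)

for slot, value in asdict(torresol).items():
    print(f"{slot:12} {value}")
```

The percentages are stored as numbers rather than as the strings "60%" and "40%", which is an example of an "associated value" derived from the text.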

Page 6: Human Language Technology  in Musing

Uses of the extracted information

• A template can be used to populate a database (slots in the template are mapped to the DB schema)
• A template can be used to generate a short summary of the input text
    • “SENER and MASDAR will form a joint venture to develop, build, and operate CSP plants”
• The database can be used for querying and reasoning
    • e.g. find all company agreements where company X is the principal investor
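
The database use can be sketched as follows, assuming Python's built-in `sqlite3` module. The table layout, the column names and the reading of "principal investor" as "holder of the larger share" are assumptions made for this illustration, not part of the slides.

```python
import sqlite3

# Filled templates, written by hand here; in practice they would come from IE.
templates = [
    {"company_1": "SENER Grupo de Ingeniería", "company_2": "MASDAR",
     "share_1": 60.0, "share_2": 40.0,
     "new_company": "Torresol Energy", "agreement": "Joint Venture"},
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE agreement (company_1 TEXT, company_2 TEXT, "
    "share_1 REAL, share_2 REAL, new_company TEXT, agreement TEXT)"
)
# Each template slot maps to the column of the same name.
conn.executemany(
    "INSERT INTO agreement VALUES (:company_1, :company_2, :share_1, "
    ":share_2, :new_company, :agreement)",
    templates,
)


def agreements_led_by(company):
    """New companies in which `company` holds the larger share (assumed meaning
    of 'principal investor')."""
    rows = conn.execute(
        "SELECT new_company FROM agreement "
        "WHERE (company_1 = ? AND share_1 > share_2) "
        "   OR (company_2 = ? AND share_2 > share_1)",
        (company, company),
    )
    return [row[0] for row in rows]


print(agreements_led_by("SENER Grupo de Ingeniería"))
```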

Page 7: Human Language Technology  in Musing

Information Extraction Tasks

• Named Entity recognition (NE)
    • Finds and classifies names in text
• Coreference Resolution (CO)
    • Identifies identity relations between entities in texts
• Template Element construction (TE)
    • Adds descriptive information to NE results
• Scenario Template production (ST)
    • Instantiates scenarios using TEs

Page 8: Human Language Technology  in Musing

Examples

• NE: SENER, SENER Grupo de Ingeniería, Abu Dhabi, $15 billion, Torresol Energy, MASDAR, etc.
• CO: SENER = SENER Grupo de Ingeniería = The Spanish holding
• TE: SENER (based in Spain); MASDAR (based in Abu Dhabi), etc.
• ST: combine entities in one scenario (as shown in the example)

Page 9: Human Language Technology  in Musing

Named Entity Recognition

• It is the cornerstone of many NLP applications, in particular of IE
• Identification of named entities in text
• Classification of the found strings into categories or types
• General types are Person Names, Organizations, Locations
• Others are Dates, Numbers, e-mails, Addresses, etc.
• Domains may have specific NEs: film names, drug names, programming languages, names of proteins, etc.

Page 10: Human Language Technology  in Musing

Approaches to NER

• Two approaches: (1) knowledge-based approach, based on humans defining rules; (2) machine learning approach, possibly using an annotated corpus
• Knowledge-based approach
    • Word-level information is useful in recognising entities:
        • capitalization, type of word (number, symbol)
    • Specialized lexicons (gazetteer lists) are usually created by hand, although methods exist to compile them from corpora
        • Lists of known continents, countries, cities, person first names
        • On-line resources are available to pull out that information

Page 11: Human Language Technology  in Musing

Approaches to NER

• Knowledge-based approach
    • rules are used to combine different pieces of evidence
    • a known first name followed by a sequence of words with an upper-case initial may indicate a person name
    • a word with an upper-case initial followed by a company designator (e.g., Co., Ltd.) may indicate a company name
    • a cascade approach is generally used, where some basic names are identified first and are later combined into more complex names
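
The two rules above can be sketched in a few lines, assuming Python and whitespace-separated tokens. The gazetteer entries and the function names are made up for this illustration, and the company rule is extended here to accept a sequence of upper-initial words before the designator.

```python
# Toy gazetteers; real lists would be far larger (hypothetical entries).
FIRST_NAMES = {"John", "Mary"}
COMPANY_DESIGNATORS = {"Co.", "Ltd.", "Inc."}


def is_upper_initial(token):
    return token[:1].isupper()


def find_entities(tokens):
    """Return (type, text) pairs found by two hand-written rules."""
    entities = []
    i = 0
    while i < len(tokens):
        # Rule 1: known first name + following upper-initial words -> Person
        if tokens[i] in FIRST_NAMES:
            j = i + 1
            while j < len(tokens) and is_upper_initial(tokens[j]):
                j += 1
            entities.append(("Person", " ".join(tokens[i:j])))
            i = j
            continue
        # Rule 2: upper-initial words + company designator -> Organization
        if is_upper_initial(tokens[i]):
            j = i
            while (j < len(tokens) and is_upper_initial(tokens[j])
                   and tokens[j] not in COMPANY_DESIGNATORS):
                j += 1
            if j < len(tokens) and tokens[j] in COMPANY_DESIGNATORS:
                entities.append(("Organization", " ".join(tokens[i:j + 1])))
                i = j + 1
                continue
        i += 1
    return entities


tokens = "Yesterday John Smith joined Acme Widgets Ltd. in London".split()
print(find_entities(tokens))
```

Note that "Yesterday" and "London" are capitalised but match neither rule, which shows why capitalisation alone is weak evidence and is combined with gazetteer lookups.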

Page 12: Human Language Technology  in Musing

Machine Learning Approach

• Given a corpus annotated with named entities, we want to create a classifier which decides whether a string of text is an NE or not
    • …<person>Mr. John Smith</person>…
    • …<date>16th May 2005</date>
• Each named entity instance is transformed for the learning problem
    • …<person>Mr. John Smith</person>…
    • “Mr.” is the beginning of the NE person
    • “Smith” is the end of the NE person
• The problem is transformed into a binary classification problem
    • is the token the beginning of an NE person?
    • is the token the end of an NE person?
• The token itself and its context are used as features for the classifier
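
A rough sketch of this transformation, assuming Python, inline tags of the form shown above, and whitespace tokenisation. The function name is hypothetical. It only produces the begin/end labels; a real system would also attach context features (neighbouring tokens, capitalisation, and so on) before training the two classifiers.

```python
import re

# Matches the two tag types used in the example annotations.
TAG = re.compile(r"<(person|date)>(.*?)</\1>")


def boundary_instances(annotated, ne_type):
    """Turn inline-annotated text into (token, is_begin, is_end) triples."""
    instances = []
    pos = 0
    for m in TAG.finditer(annotated):
        # Tokens outside any entity are negative examples for both classifiers.
        for tok in annotated[pos:m.start()].split():
            instances.append((tok, False, False))
        inside = m.group(2).split()
        hit = m.group(1) == ne_type
        for k, tok in enumerate(inside):
            instances.append((tok, hit and k == 0, hit and k == len(inside) - 1))
        pos = m.end()
    for tok in annotated[pos:].split():
        instances.append((tok, False, False))
    return instances


text = ("Yesterday <person>Mr. John Smith</person> arrived on "
        "<date>16th May 2005</date>")
rows = boundary_instances(text, "person")
for row in rows:
    print(row)
```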

Page 13: Human Language Technology  in Musing

Named Entity Recognition

Page 14: Human Language Technology  in Musing

Linguistic Processors in IE

• Tokenisation and sentence identification
• Part-of-speech tagging
• Morphological analysis
• Named entity recognition
• Full or partial parsing and semantic interpretation
• Discourse analysis (co-reference resolution)

Page 15: Human Language Technology  in Musing

System development cycle

1. Define the extraction task
2. Collect representative corpus (set of documents)
3. Manually annotate the corpus to create a gold standard
4. Create system based on a part of the corpus: create identification and extraction rules
5. Evaluate performance against part of the gold standard
6. Return to step 3, until desired performance is reached

Page 16: Human Language Technology  in Musing

Corpora and System Development

• “Gold standard” corpora are typically divided into a training portion, sometimes a testing portion, and an unseen evaluation portion
• Rules and/or ML algorithms are developed on the training part
• They are tuned on the testing portion in order to optimise
    • rule priorities, rule effectiveness, etc.
    • parameters of the learning algorithm and the features used
• Evaluation set: the best system configuration is run on this data and the system performance is obtained
• No further tuning once the evaluation set is used!

Page 17: Human Language Technology  in Musing

Performance Evaluation

• Precision (P) = correct answers (system) / answers (system)
• Recall (R) = correct answers (system) / answers (human)
• The F-measure expresses the trade-off between P and R: $F_\beta = \frac{(\beta^2 + 1)\,P\,R}{\beta^2 P + R}$
• Depending on β, more importance is given to P or R (β = 1: both are equally important; β > 1 favours R; β < 1 favours P)
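
The three measures can be computed directly from answer counts. The sketch below assumes Python; the function and parameter names are illustrative, and returning 0.0 for empty denominators is a convention chosen here.

```python
def precision_recall_f(correct, system, human, beta=1.0):
    """Return (P, R, F_beta) from answer counts.

    correct: system answers that match the gold standard
    system:  all answers produced by the system
    human:   all answers in the gold standard
    """
    p = correct / system if system else 0.0
    r = correct / human if human else 0.0
    denom = beta ** 2 * p + r
    f = (beta ** 2 + 1) * p * r / denom if denom else 0.0
    return p, r, f


# Example: 10 system answers, 8 of them correct, 16 answers in the gold standard.
p, r, f1 = precision_recall_f(8, 10, 16)
_, _, f2 = precision_recall_f(8, 10, 16, beta=2.0)
_, _, f_half = precision_recall_f(8, 10, 16, beta=0.5)
print(p, r, round(f1, 3), round(f2, 3), round(f_half, 3))
```

In this example recall is lower than precision, so the recall-weighted score (β = 2) comes out below F1 and the precision-weighted score (β = 0.5) comes out above it.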

Page 18: Human Language Technology  in Musing

GATE (Cunningham & al. ’02): General Architecture for Text Engineering

• Framework for the development and deployment of natural language processing applications (http://gate.ac.uk)
• A graphical user interface gives users (computational linguists) access to, composition of, and visualisation of different components, and supports experimentation
• A Java library (gate.jar) for programmers to implement and package applications

Page 19: Human Language Technology  in Musing

Component Model

• Language Resources (LR): data
• Processing Resources (PR): algorithms
• Visualisation Resources (VR): graphical user interfaces (GUI)
• Components are extendable and user-customisable
    • for example, adapting an information extraction application to a new domain
    • or to a new language, where the change involves adapting a module for word recognition and sentence recognition

Page 20: Human Language Technology  in Musing

Documents in GATE

• A document is created from a file located on your disk or in a remote place, or from a string
• A GATE document contains the “text” of your file and sets of annotations
• When the document is created, if a format analyser for your file type is available, format “parsing” is applied and annotations are created
    • xml, sgml, html, etc.
• Documents also store features, useful for representing metadata about the document
    • some features are created by GATE
• GATE documents and annotations are LRs

Page 21: Human Language Technology  in Musing

Documents in GATE

• Annotations
    • have types (e.g. Token)
    • belong to particular annotation sets
    • have start and end offsets: where in the document
    • have features and values, which are used to store orthographic, grammatical, semantic information, etc.
• Documents can be grouped in a Corpus (set of documents), useful for processing a set of documents together
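
The document and annotation model can be illustrated with a small sketch. This is plain Python written for this purpose, not the GATE Java API; all class, method and feature names below are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    type: str
    start: int  # offset of the first character
    end: int    # offset one past the last character
    features: dict = field(default_factory=dict)


@dataclass
class Document:
    text: str
    # annotation set name -> list of annotations
    annotation_sets: dict = field(default_factory=dict)
    features: dict = field(default_factory=dict)  # document metadata

    def annotate(self, set_name, ann_type, start, end, **features):
        ann = Annotation(ann_type, start, end, features)
        self.annotation_sets.setdefault(set_name, []).append(ann)
        return ann

    def covered_text(self, ann):
        return self.text[ann.start:ann.end]


doc = Document("MASDAR is based in Abu Dhabi.", features={"source": "example"})
org = doc.annotate("Default", "Organization", 0, 6, orth="allCaps")
loc = doc.annotate("Default", "Location", 19, 28)
corpus = [doc]  # a corpus groups documents so they can be processed together

print(doc.covered_text(org), "|", doc.covered_text(loc))
```

Annotations point into the text by offset instead of modifying it, so several annotation sets can describe the same text independently.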

Page 22: Human Language Technology  in Musing

Documents in GATE

[Figure: annotated document view; callout labels: semantics, names in text, information]

Page 23: Human Language Technology  in Musing

What to annotate:Annotation Schemas

<?xml version="1.0"?>
<schema xmlns="http://www.w3.org/2000/10/XMLSchema">
  <!-- XSchema definition for token-->
  <element name="Address">
    <complexType>
      <attribute name="kind" use="optional">
        <simpleType>
          <restriction base="string">
            <enumeration value="email"/>
            <enumeration value="url"/>
            <enumeration value="phone"/>
            <enumeration value="ip"/>
            <enumeration value="street"/>
            <enumeration value="postcode"/>
            <enumeration value="country"/>
            <enumeration value="complete"/>
          </restriction>
          …

Page 24: Human Language Technology  in Musing

Manual Annotation

Page 25: Human Language Technology  in Musing

Annotation in GATE GUI

The following tasks can be carried out manually in the GATE GUI:
• Adding annotation sets
• Adding annotations
• Resizing them (changing boundaries)
• Deleting
• Changing highlighting colour
• Setting features and their values

Page 26: Human Language Technology  in Musing

Text Processing Tools

• Tokenisation
• Sentence identification
• Part-of-speech tagging
• Gazetteer list lookup
• Regular grammars over annotations
• All these resources take a GATE document as a runtime parameter, and they produce annotations over it

Page 27: Human Language Technology  in Musing

NER in GATE

• Implemented in the JAPE language (part of GATE)
    • Regular expressions over annotations
    • Provides access to and manipulation of annotations produced by other modules
• Rules are hand-coded, so some linguistic expertise is needed here
• Uses annotations from the tokeniser, POS tagger, and gazetteer modules (lists of keywords)
• Use of contextual information
• Rule priority based on pattern length, rule status and rule ordering
• Common entities: persons, locations, organisations, dates, addresses

Page 28: Human Language Technology  in Musing

JAPE Language

• A JAPE grammar rule consists of a left hand side (LHS) and a right hand side (RHS)
    • LHS = what to match (the pattern)
    • RHS = how to annotate the found sequence
    • LHS --> RHS
• A JAPE grammar is a sequence of grammar rules
• Grammars are compiled into finite state machines
• Rules have a priority (number)
• There is a way to control how to match
    • the Options parameter in the grammar files
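
To illustrate how priorities interact with matching, here is a rough sketch of choosing between competing matches: the longest match wins, ties go to the higher priority, and remaining ties go to the rule defined earlier. This is plain Python written for illustration, under the assumption that it mirrors the "appelt" control style; it is not GATE code, and the `Match` class and `pick_winner` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Match:
    rule: str
    start: int
    end: int
    priority: int
    order: int  # position of the rule in the grammar file (0 = first)


def pick_winner(candidates):
    """Choose one match among candidates that start at the same offset.

    Longest match wins; ties go to the higher priority, then to the
    rule that appears earlier in the grammar.
    """
    return max(candidates,
               key=lambda m: (m.end - m.start, m.priority, -m.order))


candidates = [
    Match("ShortHighPriority", 0, 5, 10, 0),
    Match("LongLowPriority", 0, 9, 1, 1),
    Match("LongMidPriority", 0, 9, 5, 2),
]
print(pick_winner(candidates).rule)
```

The short match loses even though it has the highest priority number, because length is compared first.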

Page 29: Human Language Technology  in Musing

JAPE Grammar

In a file named something.jape we write a JAPE grammar (phase):

    Phase: example1
    Input: Token Lookup
    Options: control = appelt

    Rule: PersonMale
    Priority: 10
    (
      {Lookup.majorType == first_name, Lookup.minorType == male}
      ({Token.orth == upperInitial})*
    ):annotate
    -->
    :annotate.Person = { gender = male }

    ….(more rules here)

Page 30: Human Language Technology  in Musing

Main JAPE grammar

• Combines a number of single JAPE files; in general it is named “main.jape”

    MultiPhase: CascadeOfGrammars
    Phases:
    grammar1
    grammar2
    grammar3

Page 31: Human Language Technology  in Musing

ANNIE System

• A Nearly New Information Extraction System
    • recognizes named entities in text
    • “packed” application combining/sequencing the following components: document reset, tokeniser, splitter, tagger, gazetteer lookup, NE grammars, name coreference
    • can be used as a starting point to develop a new named entity recogniser

Page 32: Human Language Technology  in Musing

Ontology-based Information Extraction

• The application domain (concepts, relations, instances, etc.) is modelled through an ontology or set of ontologies (we have different yet interrelated domains)
• Ontology-based Information Extraction identifies in text instances of concepts and relations expressed in the ontology
    • the extraction task is modelled through “RDF templates”
    • X is a company; Z is a person; Z is manager of X; etc.
• Documents are enriched with links to the ontology through automatic annotation
• Extracted information is used to populate a knowledge repository (KR)
• Updating the KR involves a process of identity resolution
• In the case of the GATE system, there is an API to manipulate the ontology, and the ontology can be manipulated in extraction grammars
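
One way to picture an “RDF template” is as a list of triples with variables that extraction fills in. The sketch below assumes Python; the vocabulary (`rdf:type`, `Company`, `managerOf`) and the example values are illustrative and are not taken from the MUSING ontology.

```python
# An "RDF template" written as triples with variables (terms starting with "?").
# The vocabulary used here is illustrative only.
TEMPLATE = [
    ("?X", "rdf:type", "Company"),
    ("?Z", "rdf:type", "Person"),
    ("?Z", "managerOf", "?X"),
]


def instantiate(template, bindings):
    """Replace variables with extracted values to obtain concrete triples."""
    return [tuple(bindings.get(term, term) for term in triple)
            for triple in template]


# Bindings as they might be produced by extraction (made-up values).
triples = instantiate(TEMPLATE, {"?X": "Acme Ltd.", "?Z": "John Smith"})
for t in triples:
    print(t)
```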

Page 33: Human Language Technology  in Musing

Ontology-based IE in MUSING

[Architecture diagram. Labelled components: Data Source Provider; Document Collector; Document; MUSING Data Repository; Ontology-based Information Extraction System; MUSING Ontology; Annotated Document; Ontology Population; Knowledge Base; Instances & Relations; Ontology Curator; Domain Expert; Annotation Tool; Manually Annotated Documents; MUSING Application; Region Selection Model; Enterprise Intelligence; Economic Indicators; Company Information; Region Rank; Report; User; User Input]

Page 34: Human Language Technology  in Musing

Company Information in MUSING

Page 35: Human Language Technology  in Musing

Data Sources in MUSING

• Data sources are provided by MUSING partners and include balance sheets, company profiles, press data, web data, etc. (some private data)
    • Il Sole 24 ORE: Italian financial newspaper
    • Some English press data: Financial Times
    • Companies’ web pages (main, “about us”, “contact us”, etc.)
    • Wikipedia, CIA Fact Book, etc.
    • CreditReform (data provider): company profiles; payment information
    • European Business Registry (data provider): profiles, appointments
    • Discussion forums
    • Log files for IT-related applications

Page 36: Human Language Technology  in Musing
Page 37: Human Language Technology  in Musing

Creation of Gold Standards with an Annotation Tool

• Web-based tool for ontology-based (human) annotation
• The user can
    • select a document from a pool of documents
    • load an ontology
    • annotate pieces of text with respect to the ontology
    • correct/save the results back to the pool of documents

Page 38: Human Language Technology  in Musing

Joint Venture Annotation

Page 39: Human Language Technology  in Musing
Page 40: Human Language Technology  in Musing

Region Information Annotation

Page 41: Human Language Technology  in Musing
Page 42: Human Language Technology  in Musing

MUSING applications requiring HLT

• A number of applications have been specified to demonstrate the use of semantic-based technology in BI; some examples include
    • Collecting company information from multiple multilingual sources (English, German, Italian) to provide up-to-date information on competitors
    • Identifying chances of success in regions in a particular country
    • Semi-automatic form filling in several Musing applications
    • Identifying appropriate partners to do business with
    • Creation of a joint ventures database from multiple sources

Page 43: Human Language Technology  in Musing

Natural Language Processing Technology

• The main components adapted for MUSING applications are gazetteer lists and grammars used for named entity recognition
• New components include
    • an ontology mapping component: entities are mapped into specific classes in the given ontology
    • a component that creates RDF statements for ontology population based on the application specification
        • for example, create a company instance with all its properties as found in the text

Page 44: Human Language Technology  in Musing

Tools to develop the extraction system

• Given a human-annotated set of documents (corpus), we can index the documents using the human and automatic annotations (e.g. tokens, lookups, POS) with the ANNIC tool
• The developer can then devise semantic tagging rules by observing annotations in context
• An alternative is to use the ML capabilities of the GATE system (supervised learning)

Page 45: Human Language Technology  in Musing

Identifying Patterns

Page 46: Human Language Technology  in Musing

Identifying Patterns

Page 47: Human Language Technology  in Musing

Identifying Patterns

Page 48: Human Language Technology  in Musing

Identifying Patterns

Page 49: Human Language Technology  in Musing

Identifying Patterns

Page 50: Human Language Technology  in Musing

Extracting Company Information

• Extracting information about a company requires, for example, identifying the Company Name, Company Address, Parent Organization, Shareholders, etc.
• These associated pieces of information should be asserted as property values of the company instance
• Statements for populating the ontology need to be created (“Alcoa Inc” hasAlias “Alcoa”; “Alcoa Inc” hasWebPage “http://www.alcoa.com”, etc.)
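
A small sketch of turning the extracted properties of one company into statements, assuming Python. The function name and the dictionary layout are hypothetical; the property names and values are the ones quoted above.

```python
def company_statements(name, properties):
    """Return (subject, property, value) statements for one company instance."""
    statements = []
    for prop, values in properties.items():
        # A property may have one value or several (e.g. several aliases).
        if isinstance(values, str):
            values = [values]
        for value in values:
            statements.append((name, prop, value))
    return statements


extracted = {
    "hasAlias": ["Alcoa"],
    "hasWebPage": "http://www.alcoa.com",
}
statements = company_statements("Alcoa Inc", extracted)
for s in statements:
    print(s)
```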

Page 51: Human Language Technology  in Musing

Extraction Demo

Extracting Company Information

Page 52: Human Language Technology  in Musing

Some details

• Rule-based system
    • reuse of some default components for NE recognition, plus implementation of document structure analysers for each target source
    • lexicon/gazetteer list developed specifically for the application to identify keywords that mark the presence of concepts
    • regular grammars that represent “typical” ways in which information (concepts, relations) is expressed in text
    • mapping to the ontology, plus RDF statements for ontology population
• Current performance: F-score of around 80%

Page 53: Human Language Technology  in Musing

Rule Example

    (
      {Lookup.majorType == produce} (KIND)?
    )
    (
      ({NP}|(LIST)) ({Lookup.majorType == equipment})?
    ):mention
    -->
    {
      // get the mention annotations in a list
      List annList = new ArrayList((AnnotationSet)bindings.get("mention"));
      // sort the list by offset
      Collections.sort(annList, new OffsetComparator());
      // iterate through the matched annotations
      for(int i = 0; i < annList.size(); i++) {
        Annotation anAnn = (Annotation)annList.get(i);
        if (anAnn.getType().equals("NP")) {
          // add features and values to the annotation: link to the ontology
          FeatureMap features = Factory.newFeatureMap();
          features.put("class", "Product");
          // create the annotation
          annotations.add(anAnn.getStartNode(), anAnn.getEndNode(), "Mention", features);
        }
      }
    }

Page 54: Human Language Technology  in Musing

Some details

• “produces X, Y, and Z”
    • Alcoa is currently the biggest producer of aluminium and alumina (the essential component in the production of the precious metal) …
• “Offers services including: X, Y, and Z”
    • The Group offers a wide range of services: insurance contracts, long and short-term loans, savings accounts and financial advice on what to invest in and savings accounts….
• Lexicon/expressions used
    • produce = produce, produces, manufacture, manufactures…
    • equipment = equipment, apparatus, tools, etc.
    • kind = form, forms, type, kind, etc.
    • LIST = sequence of NPs
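
The “produces X, Y, and Z” pattern can be approximated with regular expressions over plain text. This is a simplified sketch in Python, not the JAPE grammar used in the system: it works on raw strings instead of NP annotations, and the names `TRIGGER`, `SEPARATOR` and `products` are made up for the illustration.

```python
import re

# Trigger words from the "produce" lexicon, followed by the rest of the clause.
TRIGGER = re.compile(r"\b(?:produces?|manufactures?)\s+(?P<items>[^.;]+)")
# List separators: a comma (optionally followed by "and"), or a bare "and".
SEPARATOR = re.compile(r",\s*(?:and\s+)?|\s+and\s+")


def products(sentence):
    """Return the items listed after a 'produce' trigger word, if any."""
    m = TRIGGER.search(sentence)
    if not m:
        return []
    return [item.strip()
            for item in SEPARATOR.split(m.group("items"))
            if item.strip()]


print(products("The company produces aluminium, alumina, and packaging."))
```

Working over NP annotations, as the grammar does, avoids treating arbitrary trailing words as products; the string version above has no such protection.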

Page 55: Human Language Technology  in Musing

Region Selection Application

Given information on a company and the desired form of internationalisation (e.g., export, direct investment, alliance), the application provides a ranking of regions indicating the most suitable places for the type of business

A number of social, political, geographical and economic indicators or variables (such as the surface, labour costs, tax rates, population and literacy rates of regions) have to be collected to feed a statistical model

Region Information

Indicators such as:
  Economic Stability Indicators: exports, imports, etc.
  Industry Indicators: presence of foreign firms, number of procedures to start business, etc.
  Infrastructure Indicators: drinking water, length of highway system, hospitals, telephones, etc.
  Labour Availability Indicators: employment rate, libraries, medical colleges, etc.
  Market Size Indicators: GDP, surface, etc.
  Resources Indicator: Agricultural land, Forest, number of strikes, etc.

Region Information – annotation examples

“the net irrigated area totals 33,500 square kilometres” and “The land drained by these rivers is agriculturally rich” – AGRIC-LAND (agricultural land)

“Males constitute 50.3 million” – URBM (urban population)

“64.14% of the people are employed in allied activities” – EMP (employment)

“The three airports in Himachal Pradesh are….” – AIRP_V (air freight)

“In rural areas over 65% of the population have no access to safe drinking water” – WCHAN (water channels)

Region Selection Application

Data sources used for the OBIE application are statistics from governmental sources and available region profiles found on the Web (e.g. Wikipedia)

Gazetteer lists contain location names and associated information together with keywords to help identify the key information

Grammars use contextual information and named entities to identify the target variables

Extraction performance obtained: F-score > 80%

Walk-through Example

From the Wikipedia article on Andhra Pradesh (a state of India):

• Andhra Pradesh has 1330 Arts, Science and Commerce colleges, 238 Engineering colleges and 53 Medical colleges. The student to teacher ratio is 19:1 in the higher education. According to census taken in 2001, Andhra Pradesh has an overall literacy rate of 60.5%. While male literacy rate is at 70.3%, the female literacy rate however is only at 50.4%, a cause for concern.

Walk-through Example

According to census taken in 2001, Andhra Pradesh has an overall literacy rate of 60.5%.

keywords and phrases

Walk-through Example

According to census taken in 2001, Andhra Pradesh has an overall literacy rate of 60.5%.

with a rule-generated GATE annotation:

Type      Mention
Features
  article_region_code   India_AP
  indicator_value       60.50%
  key                   LIT_T
  year                  2001
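
How keywords and context yield these features can be sketched with two regular expressions. This is only an illustration: the real application uses gazetteers and JAPE grammars, the class name and patterns are hypothetical, and the value is kept as written in the text ("60.5%") instead of being normalised to "60.50%".

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of one indicator rule (total literacy rate, key LIT_T).
final class LiteracyMentionSketch {
    // Keyword phrase followed by a percentage, e.g. "overall literacy rate of 60.5%".
    private static final Pattern VALUE =
        Pattern.compile("overall literacy rate of (\\d+(?:\\.\\d+)?%)");
    // A four-digit year introduced by "census taken in".
    private static final Pattern YEAR = Pattern.compile("census taken in (\\d{4})");

    // Builds the feature map of a Mention, or returns null when no value is found.
    // regionCode is assumed to come from the article being processed.
    static Map<String, String> extract(String sentence, String regionCode) {
        Matcher value = VALUE.matcher(sentence);
        if (!value.find()) {
            return null;
        }
        Map<String, String> features = new LinkedHashMap<>();
        features.put("article_region_code", regionCode);
        features.put("indicator_value", value.group(1));
        features.put("key", "LIT_T");
        Matcher year = YEAR.matcher(sentence);
        if (year.find()) {
            features.put("year", year.group(1));
        }
        return features;
    }

    public static void main(String[] args) {
        String s = "According to census taken in 2001, Andhra Pradesh has an "
                 + "overall literacy rate of 60.5%.";
        System.out.println(extract(s, "India_AP"));
    }
}
```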

Walk-through Example

According to census taken in 2001, Andhra Pradesh has an overall literacy rate of 60.5%.

with additional mapped features:

Type      Mention
Features
  article_region_code   India_AP
  region_instance       http://musing.deri.at/ontologies/v0.5/int/region#AndhraPradesh
  indicator_value       60.50%
  key                   LIT_T
  indicator_instance    http://musing.deri.at/ontologies/v0.5/int/indicator#LIT_T
  year                  2001

RDF output

A program checks the features of the Mention annotation and fills in an appropriate template to generate RDF triples.

In this particular region extraction application, this RDF will create an instance of Measurement with appropriate property values, so the knowledge base can be updated with the extracted information.
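
The template-filling step can be sketched as follows. This is an assumption-laden illustration rather than the MUSING program: the class name is hypothetical, a single caller-supplied id is reused for the Measurement, TimeSlice and ProperInstantYear identifiers, and namespace declarations are omitted.

```java
import java.util.Map;

// Hypothetical template filler; not the MUSING implementation.
final class RdfTemplateSketch {
    // Escapes the characters that are special in XML text and attribute values.
    private static String esc(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Fills a Measurement template from the features of a Mention annotation.
    static String toRdf(Map<String, String> f, int id) {
        String[] required =
            {"region_instance", "indicator_instance", "indicator_value", "year"};
        for (String key : required) {
            if (f.get(key) == null) {
                // Fail early rather than emit a partial Measurement.
                throw new IllegalArgumentException("missing feature: " + key);
            }
        }
        String xsd = "http://www.w3.org/2001/XMLSchema#";
        return "<indicator:Measurement rdf:ID=\"Measurement_" + id + "\">\n"
             + "  <time:hasTimeSlice>\n"
             + "    <time:TimeSlice rdf:ID=\"TimeSlice_" + id + "\">\n"
             + "      <time:hasTemporalEntity>\n"
             + "        <time:ProperInstantYear rdf:ID=\"ProperInstantYear_" + id + "\">\n"
             + "          <time:year rdf:datatype=\"" + xsd + "int\">"
             + esc(f.get("year")) + "</time:year>\n"
             + "        </time:ProperInstantYear>\n"
             + "      </time:hasTemporalEntity>\n"
             + "    </time:TimeSlice>\n"
             + "  </time:hasTimeSlice>\n"
             + "  <indicator:hasValue rdf:datatype=\"" + xsd + "string\">"
             + esc(f.get("indicator_value")) + "</indicator:hasValue>\n"
             + "  <indicator:hasPoliticalRegion rdf:resource=\""
             + esc(f.get("region_instance")) + "\"/>\n"
             + "  <indicator:hasIndicator rdf:resource=\""
             + esc(f.get("indicator_instance")) + "\"/>\n"
             + "</indicator:Measurement>\n";
    }
}
```

Rejecting a Mention that lacks any of the four features is a design choice of this sketch; it keeps incomplete measurements out of the knowledge base.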

RDF output

<indicator:Measurement rdf:ID="Measurement_173">
  <time:hasTimeSlice>
    <time:TimeSlice rdf:ID="TimeSlice_91">
      <time:hasTemporalEntity>
        <time:ProperInstantYear rdf:ID="ProperInstantYear_33">
          <time:year rdf:datatype="http://www.w3.org/2001/XMLSchema#int">2001</time:year>
        </time:ProperInstantYear>
      </time:hasTemporalEntity>
    </time:TimeSlice>
  </time:hasTimeSlice>
  <indicator:hasValue rdf:datatype="http://www.w3.org/2001/XMLSchema#string">60.5%</indicator:hasValue>
  <indicator:hasPoliticalRegion rdf:resource="http://musing.deri.at/ontologies/v0.5/int/region#AndhraPradesh"/>
  <indicator:hasIndicator rdf:resource="http://musing.deri.at/ontologies/v0.5/int/indicator#LIT_T"/>
</indicator:Measurement>

Region Information

Extracted Information

Ontology Population

Creates instances of concepts and relations in the ontology, or links entities found in text with referents already in the ontology

The asserted instances (or updated properties) can be used to process new documents (i.e. for further links to the ontology)

Problems:
  decide if an entity extracted from text is a known entity
    • is the company “Metaware” found in this text the “Metaware” we have in the ontology?
  decide if found information should replace existing information or be asserted as a new instance

Identity Resolution in MUSING

Same person name, different entity

P1) Antony John was born in 1960 in Gilfach Goch, a mining town in the Rhondda Valley in Wales. He moved to Canada in 1970 where the woodlands and seasons of Southwestern Ontario provided a new experience for the young naturalist...

P2) Antony John - Managing Director. After working for National Westminster Bank for six years, in 1986, Antony established a private financial service practice. For 10 years he worked as a Director of Hill Samuel Asset Management and between 1999 and 2003 he was an Executive Director at the private Swiss bank, Lombard Odier Darier Hentsch. Antony joined IMS in 2003 as a Partner. Antony's PA is Heidi Beasley...

Identity Resolution in MUSING

Same company name, different company

C1) Operating in the market where knowledge processes meet software development, Metaware can support organizations in their attempts to become more competitive. Metaware combines its knowledge of company processes and information technology in its services and software. By using intranet and workflow applications, Metaware offers solutions for quality control, document management, knowledge management, complaints management, and continuous improvement.

C2) Metaware S.r.l. is a small but highly technical software house specialized in engineering software and systems solutions based on internet and distributed systems technology. Metaware has participated in a number of RTD cooperative projects and has a consolidated partnership relationship with Engineering.

Approaches to Identity Resolution in MUSING

Text based approach
  clustering informed by semantic analysis and summarization
  extract sentences containing entity of interest and create a summary
  extract semantic information from summaries and create term vectors for clustering
  apply agglomerative clustering to the set of vectors
  good performance on Person information
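
The last two steps of the text based approach (term vectors, then agglomerative clustering) can be sketched as below. The sketch makes several assumptions that the slides do not state: plain bag-of-words vectors stand in for the semantic information extracted from summaries, similarity is cosine, linkage is single-link, and merging stops at a caller-supplied threshold. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MentionClusteringSketch {
    // Bag-of-words term vector of a summary (lower-cased, split on non-letters).
    static Map<String, Integer> termVector(String text) {
        Map<String, Integer> v = new HashMap<>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (!t.isEmpty()) {
                v.merge(t, 1, Integer::sum);
            }
        }
        return v;
    }

    // Cosine similarity between two term vectors; 0 if either is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            na += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int x : b.values()) {
            nb += x * x;
        }
        return (na == 0 || nb == 0) ? 0.0 : dot / Math.sqrt(na * nb);
    }

    // Single-link similarity: the best similarity over all cross-cluster pairs.
    static double linkage(List<Integer> a, List<Integer> b,
                          List<Map<String, Integer>> vectors) {
        double max = 0.0;
        for (int i : a) {
            for (int j : b) {
                max = Math.max(max, cosine(vectors.get(i), vectors.get(j)));
            }
        }
        return max;
    }

    // Repeatedly merges the two most similar clusters until no pair reaches
    // the threshold. Returns clusters as lists of input indices.
    static List<List<Integer>> cluster(List<Map<String, Integer>> vectors,
                                       double threshold) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < vectors.size(); i++) {
            List<Integer> single = new ArrayList<>();
            single.add(i);
            clusters.add(single);
        }
        while (clusters.size() > 1) {
            int bestA = -1, bestB = -1;
            double best = threshold;
            for (int a = 0; a < clusters.size(); a++) {
                for (int b = a + 1; b < clusters.size(); b++) {
                    double s = linkage(clusters.get(a), clusters.get(b), vectors);
                    if (s >= best) {
                        best = s;
                        bestA = a;
                        bestB = b;
                    }
                }
            }
            if (bestA < 0) {
                break; // no pair is similar enough to merge
            }
            clusters.get(bestA).addAll(clusters.remove(bestB));
        }
        return clusters;
    }
}
```

Each resulting cluster is read as one real-world entity, so two summaries about different people named "Antony John" should end up in different clusters when their vocabularies barely overlap.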

Identity Resolution in MUSING

Identity Resolution Framework using Ontology – Milena Yankova (OntoText)
  input = entity + property values as specified in an ontology
  output = updated ontology
  identity rules are defined for each entity type in the ontology (e.g. companies, people)
  rules combine different similarity criteria to compute a numeric score

Identity Resolution in MUSING

Identity Resolution Framework
  pre-filtering component: selects candidates from the ontology using some extracted properties found in text
    • for companies select those with some name similarity
  evidence collection component: computes different identity criteria and produces a score
    • compute the distance between the company names
    • identify if one location (Scotland) is part of another location (UK)
  decision maker component: decides on the most similar candidate
    • a similarity threshold is set optimising over training data (set at 0.40 for company information)
  data integration component: updates the ontology
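
The first three components can be sketched as one small pipeline. Only the 0.40 threshold comes from the slide. Everything else is an assumption made for illustration: normalised edit distance as the name criterion, a hard-coded part-of relation in place of an ontology lookup, equal weights for the two criteria, and a 0.5 pre-filter cut-off. All names are hypothetical.

```java
import java.util.List;

final class IdentityResolutionSketch {
    static final double THRESHOLD = 0.40; // reported on the slide for companies

    static final class Company {
        final String name;
        final String location;
        Company(String name, String location) {
            this.name = name;
            this.location = location;
        }
    }

    // Levenshtein edit distance, computed row by row.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int sub = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                cur[j] = Math.min(sub, Math.min(prev[j] + 1, cur[j - 1] + 1));
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    // Name similarity in [0, 1]: 1 minus the length-normalised edit distance.
    static double nameSimilarity(String a, String b) {
        a = a.toLowerCase();
        b = b.toLowerCase();
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0 : 1.0 - (double) editDistance(a, b) / max;
    }

    // Toy part-of relation (assumption): a real system would query the ontology.
    static boolean partOf(String inner, String outer) {
        return inner.equals(outer) || (inner.equals("Scotland") && outer.equals("UK"));
    }

    // Evidence collection: equal-weight average of the two criteria (assumed weights).
    static double score(Company extracted, Company candidate) {
        double name = nameSimilarity(extracted.name, candidate.name);
        boolean related = partOf(extracted.location, candidate.location)
                       || partOf(candidate.location, extracted.location);
        return 0.5 * name + 0.5 * (related ? 1.0 : 0.0);
    }

    // Pre-filtering plus decision making: the best candidate scoring at or
    // above the threshold, or null when the entity looks new.
    static Company resolve(Company extracted, List<Company> knowledgeBase) {
        Company best = null;
        double bestScore = THRESHOLD;
        for (Company c : knowledgeBase) {
            if (nameSimilarity(extracted.name, c.name) < 0.5) {
                continue; // pre-filter: assumed cut-off on name similarity
            }
            double s = score(extracted, c);
            if (s >= bestScore) {
                best = c;
                bestScore = s;
            }
        }
        return best;
    }
}
```

A null result corresponds to asserting a new instance; a non-null result is the entry the data integration component would update.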

Identity Resolution in MUSING

Identity Resolution Experiments
  ontology pre-populated with data from provider (database to ontology KB) – UK companies
  UK company profiles fed to our company profile analyser to produce RDF templates for UK companies
  match attempted between extracted companies and the KB
    • f-score = 0.89
  Note: this is a first set of experiments, concentrated on one type of entity

Opinion Mining in MUSING: Initial Experiments

Opinion mining (OM) consists of identifying what opinion a particular discourse expresses (it is not concerned with what the text is about).

MUSING partners are interested in tracking opinions about business entities: persons, organizations, products & services, etc.

The extracted opinions will be combined with qualitative information in order to create the reputation of a company or person

The field of OM is very active thanks to initiatives such as:
  the TREC 2006 Blog mining for opinion retrieval
  NTCIR Workshop on Evaluation of Information Access Technologies
  Text Analysis Conference with an opinion summarization task

Opinions on the Web

sentiment

opinion

positive opinions

negative opinions

negative opinion, but less evident

OM Approach

We see OM as a classification problem

Interested in:
  differentiating between positive and negative opinions
  recognising fine-grained evaluative texts (1-star to 5-star classification)

We use a supervised learning approach (Support Vector Machines) that uses linguistic features

Corpus

92 texts from a Web Consumer forum
  Each text contains a review about a particular company/service/product and a thumbs up/down – texts are short (one/two paragraphs)
  67% negative and 33% positive

600 texts from another Web forum containing reviews on companies or products
  Each text is short and is associated with a 1 to 5 stars review
  * ~ 8%; ** ~ 2%; *** ~ 3%; **** ~ 20%; ***** ~ 67%

Each document is processed with default GATE analysers: tokenisation; sentence identification; parts of speech tagging; morphological analysis

Word-based n-gram features (n = 1, 2, 3) are used to represent the texts: string, root, category, and orthography of each word
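
As an illustration of the n-gram representation, the sketch below builds all 1-, 2- and 3-grams over one token-level feature sequence (for example the word roots). The class name, method name and the underscore used to join tokens are assumptions; in the actual experiments the token features come from the GATE analysers listed above.

```java
import java.util.ArrayList;
import java.util.List;

final class NGramFeatureSketch {
    // Returns all n-grams of length 1..maxN over a sequence of token features,
    // shortest first, each joined with "_".
    static List<String> ngrams(List<String> tokens, int maxN) {
        List<String> out = new ArrayList<>();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= tokens.size(); i++) {
                out.add(String.join("_", tokens.subList(i, i + n)));
            }
        }
        return out;
    }
}
```

For the roots "very", "bad", "service" and maxN = 3 this yields six features: three unigrams, two bigrams and one trigram.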

Binary classification

A support vector machine algorithm using the word-level features was used for training and evaluation in a 10-fold cross-validation experiment

In the binary classification problem: 80% accuracy is obtained when using root and orthography as features (unigrams)

Higher n-grams decrease performance
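
The 10-fold cross-validation protocol can be sketched as a partition of document indices. This is a generic illustration, not the experimental setup itself: the slides do not say whether folds were stratified or how they were drawn, so the seeded shuffle and round-robin assignment here are assumptions, as are the names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class CrossValidationSketch {
    // Splits item indices 0..nItems-1 into k folds of nearly equal size.
    // In round r, fold r is the test set and the other k-1 folds form
    // the training set.
    static List<List<Integer>> folds(int nItems, int k, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < nItems; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            folds.add(new ArrayList<>());
        }
        for (int i = 0; i < indices.size(); i++) {
            folds.get(i % k).add(indices.get(i)); // round-robin assignment
        }
        return folds;
    }
}
```

With the 92-text corpus and k = 10, two folds hold 10 texts and eight hold 9, so every text is used for testing exactly once.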

Fine-grained classification

Same learning system used to produce the 5 star classification

74% overall classification accuracy using word root only

1* classification accuracy = 80%; 5* classification accuracy = 75%

2*, 3*, 4* are difficult to classify because they either share vocabulary with the extreme cases or are vague

Linguistic Information in OM

Opinion words in the context of target entity (e.g. company)

Use of positive/negative expressions
  Banca Italese fa più utili e accelera sulla crescita (“Banca Italese makes more profits and accelerates its growth”)

Rules which combine syntactic information with constituent polarity to deduce the polarity of chunks
  combination of polarities in syntactic chunks (“più utili”, more profits, vs “più perdite”, more losses)

Rules to combine chunks to produce polarity of full sentences
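
The two rule levels (chunk polarity, then sentence polarity) can be sketched as follows. The lexicon entries, the treatment of "meno" (less) as a polarity-flipping modifier, and the sum-and-sign sentence rule are all assumptions made for illustration; they are not the MUSING rules. Accents are dropped in the code ("piu") to keep the source ASCII-only.

```java
import java.util.HashMap;
import java.util.Map;

final class ChunkPolaritySketch {
    // Tiny illustrative lexicon (assumption): +1 positive, -1 negative.
    static final Map<String, Integer> NOUN_POLARITY = new HashMap<>();
    static {
        NOUN_POLARITY.put("utili", 1);     // profits
        NOUN_POLARITY.put("perdite", -1);  // losses
        NOUN_POLARITY.put("crescita", 1);  // growth
    }

    // Chunk rule: a modifier such as "piu" (more) keeps the polarity of its
    // head noun, while "meno" (less) flips it. Unknown heads are neutral (0).
    static int chunkPolarity(String modifier, String head) {
        int polarity = NOUN_POLARITY.getOrDefault(head, 0);
        return "meno".equals(modifier) ? -polarity : polarity;
    }

    // Sentence rule: the sign of the sum of the chunk polarities.
    static int sentencePolarity(int... chunkPolarities) {
        int sum = 0;
        for (int p : chunkPolarities) {
            sum += p;
        }
        return Integer.signum(sum);
    }
}
```

Under these assumptions "piu utili" is positive and "piu perdite" negative, which is the contrast the slide draws; a sentence with one positive chunk for each of "utili" and "crescita" comes out positive.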

Final Remarks

Musing is deploying ontology-based information extraction technology for business intelligence

A number of information extraction applications have been developed using a rule-based system

Future applications will use machine learning capabilities we are developing

The ontology is the target of the IE applications; however, we are working towards the integration of the ontology in the extraction system to support, for example, instance identification and tracking

Thanks to Adam Funk and Diana Maynard for developing and packaging the IE applications
