Software Architecture for Language Engineering (SALE)
– where next?
http://gate.ac.uk/ http://nlp.shef.ac.uk/
Hamish Cunningham
IBM TJ Watson, 1st August 2003
2(39)
Structure of the Talk
1. SALE and its context
• Definitions
• The Knowledge Economy and HLT
• Software Lifecycle
2. GATE, a General Architecture for Text Engineering
• History
• Summary of Features and Principles
• Component-based development
• Unicode support
• Measurement
• CREOLE: some components
• Users and Projects
3. Where Next (give up and go home)?
• Future context
• Desirables
• Conclusion
3(39)
SALE: definitions
• Computational Linguistics: science of language that uses computation as an investigative tool.
• Natural Language Processing: science of computation whose subject matter is data structures and algorithms for human language processing.
• Language Engineering: building systems whose cost and outputs are measurable and predictable.
• Software Architecture: macro-level organisational principles for families of systems. In this context the term is also used to cover infrastructure.
• SALE: software infrastructure, architecture and development tools for applied NLP and LE.
4(39)
The Knowledge Economy and Human Language
Gartner, December 2002:
• taxonomic and hierarchical knowledge mapping and indexing will be prevalent in almost all information-rich applications
• through 2012 more than 95% of human-to-computer information input will involve textual language
A contradiction: formal knowledge in semantics-based systems vs. ambiguous informal natural language
The challenge: to reconcile these two opposing tendencies
5(39)
IE and Knowledge: Closing the Language Loop
[Diagram labels: Human Language; Formal Knowledge (ontologies and instance bases); (A)IE; CLIE; (M)NLG; Controlled Language; OIE; Semantic Web, Semantic Grid, Semantic Web Services]
KEY
MNLG: Multilingual Natural Language Generation
OIE: Ontology-aware Information Extraction
AIE: Adaptive IE
CLIE: Controlled Language IE
6(39)
Software lifecycle in collaborative research
Project Proposal: We love each other. We can work so well together. We can hold workshops on Santorini together. We will solve all the problems of AI that our predecessors were too stupid to.
Analysis and Design: Stop work entirely, for a period of reflection and recuperation following the stress of attending the kick-off meeting in Luxembourg.
Implementation: Each developer partner tries to convince the others that program X that they just happen to have lying around on a dusty disk-drive meets the project objectives exactly and should form the centrepiece of the demonstrator.
Integration and Testing: The lead partner gets desperate and decides to hard-code the results for a small set of examples into the demonstrator, and have a fail-safe crash facility for unknown input ("well, you know, it's still a prototype...").
Evaluation: Everyone says how nice it is, how it solves all sorts of terribly hard problems, and how if we had another grant we could go on to transform information processing the World over (or at least the European business travel industry).
7(39)
Where did GATE come from?
Early to mid-1990s (e.g. in TIPSTER):
• Increasing trend towards multi-site collaborative projects
• Role of engineering in scalable, reusable, and portable HLT
• Support for large data, in multiple media, languages, formats, and locations
• Lower cost of creation of language processing components
• Promote quantitative evaluation metrics via tools and a level playing field
GATE history:
• 1996 – 2002: GATE version 1, proof of concept
• March 2002: version 2, rewritten in Java, component based, LGPL, more users
• Fall 2003: new development cycle
8(39)
GATE is...
• An architecture: a macro-level organisational picture for LE software systems.
• A framework: for programmers, GATE is an object-oriented class library that implements the architecture.
• A development environment: for language engineers, computational linguists et al., a graphical development environment.
GATE comes with...
• Some free components... and wrappers for other people's components
• Tools for: evaluation; visualisation and editing; persistence; IR; IE; dialogue; ontologies; etc.
• Free software (LGPL). Download at http://gate.ac.uk/download/
9(39)
Architectural principles
• Non-prescriptive, theory neutral (strength and weakness)
• Re-use, interoperation, not reimplementation (e.g. diverse XML support, integration of Protégé, Jena, Weka...)
• (Almost) everything is a component, and component sets are user-extendable
• (Almost) all operations are available both from API and GUI
10(39)
Component-based development
CREOLE: a Collection of REusable Objects for Language Engineering:
• Java Beans: an OO way of chunking software
• GATE components: modified Java Beans with XML configuration
• The minimal component = 10 lines of Java, 10 lines of XML, 1 URL
Why bother?
• Allows the system to load arbitrary language processing components
11(39)
CREOLE lifecycle
• Bootstrap: stub Java class, Makefile, config
• Registration: URL / JAR / creole.xml
• Instantiation: class loading, parameterisation, bean object creation
– load-time parameters, e.g. a document’s charset
– run-time parameters, e.g. a parser’s lexicon
Three types of beans (not a new religion!):
• Language Resources, e.g. doc, corpus, lexicon
• Processing Resources, e.g. tagger, stat modeller
• Visual Resources, e.g. doc editor, syntax editor
12(39)
Language Resources (LRs)
• GATE LRs are documents, ontologies, corpora, lexicons, ...
• LRs can be associated with DataStores (Oracle, PostgreSQL, XML, Java Serialisation)
• Documents / corpora:
– Diverse document formats: text, HTML, XML, email, RTF, SGML
– Optional format-preserving analysis and saving of markup
• Standoff annotation model (start, end, type, features), derivative of TIPSTER, compatible with ATLAS and XCES
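To make the (start, end, type, features) tuple concrete, here is a minimal Java sketch of a standoff annotation. The class names `StandoffAnnotation` and `StandoffSketch` are hypothetical and are not GATE's own classes; the point is only that annotations reference character offsets and leave the document text untouched.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for a standoff annotation; not GATE's own class.
final class StandoffAnnotation {
    final int start;                     // offset of the first character covered
    final int end;                       // offset one past the last character covered
    final String type;                   // e.g. "Token", "Organization"
    final Map<String, Object> features;  // arbitrary attribute/value pairs

    StandoffAnnotation(int start, int end, String type, Map<String, Object> features) {
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("invalid span: " + start + ".." + end);
        }
        this.start = start;
        this.end = end;
        this.type = type;
        this.features = features;
    }

    // The annotated text is recovered from the unmodified document content.
    String coveredText(String documentText) {
        return documentText.substring(start, end);
    }
}

final class StandoffSketch {
    // Returns all annotations of the given type, in insertion order.
    static List<StandoffAnnotation> ofType(List<StandoffAnnotation> all, String type) {
        List<StandoffAnnotation> result = new ArrayList<StandoffAnnotation>();
        for (StandoffAnnotation a : all) {
            if (a.type.equals(type)) {
                result.add(a);
            }
        }
        return result;
    }

    // Two annotations over the same span: overlap is unproblematic in a standoff model.
    static List<StandoffAnnotation> demo() {
        Map<String, Object> features = new HashMap<String, Object>();
        features.put("kind", "company");
        List<StandoffAnnotation> annotations = new ArrayList<StandoffAnnotation>();
        annotations.add(new StandoffAnnotation(0, 3, "Organization", features));
        annotations.add(new StandoffAnnotation(0, 3, "Token", new HashMap<String, Object>()));
        return annotations;
    }
}
```

Because the text itself is never edited, several annotations of different types can cover the same span, as the two entries in `demo()` do.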
13(39)
Processing Resources (PRs)
• Algorithmic components known as PRs: beans with execute methods.
• Controllers: execute a set of PRs
– SerialController: sequential run of arbitrary PR set
– SerialAnalyserController: analyser PRs over corpus
– Conditional controllers: execution depends on features
– Parallel controller?
• PRs + Controller = Applications
• Application parameterisation state can be saved and restored, and used for embedding / batching
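The "PRs + Controller = Applications" idea can be sketched in a few lines of plain Java. `SketchResource` and `SerialControllerSketch` are hypothetical names invented for illustration; GATE's real interfaces carry parameters, documents and corpora, which are left out here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a processing resource: a bean with an execute method.
interface SketchResource {
    // Processes the shared document state in place.
    void execute(StringBuilder document);
}

// Hypothetical serial controller: runs its resources in the order they were added.
final class SerialControllerSketch {
    private final List<SketchResource> resources = new ArrayList<SketchResource>();

    SerialControllerSketch add(SketchResource pr) {
        resources.add(pr);
        return this;
    }

    void execute(StringBuilder document) {
        for (SketchResource pr : resources) {
            pr.execute(document);
        }
    }

    // A two-step "application": normalise whitespace, then append a token count.
    static String demo() {
        StringBuilder doc = new StringBuilder("  Hello GATE  ");
        new SerialControllerSketch()
            .add(new SketchResource() {
                public void execute(StringBuilder d) {
                    String trimmed = d.toString().trim();
                    d.setLength(0);
                    d.append(trimmed);
                }
            })
            .add(new SketchResource() {
                public void execute(StringBuilder d) {
                    // Count before appending, so the suffix is not counted as a token.
                    int count = d.toString().split("\\s+").length;
                    d.append(" [tokens=").append(count).append("]");
                }
            })
            .execute(doc);
        return doc.toString();
    }
}
```

A conditional controller would differ only in checking some condition before each `pr.execute(document)` call.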
14(39)
Visual Resources (VRs)
15(39)
VRs (2): Coreference
16(39)
VRs (3): Syntax
17(39)
GATE Unicode Kit (GUK)
Complements Java’s facilities
• Support for defining Input Methods (IMs)
• currently 30 IMs for 17 languages
• Pluggable in other applications (e.g. JEdit)
Editing Multilingual Data
18(39)
Processing Multilingual Data
All processing, visualisation and editing tools use GUK
19(39)
Performance Evaluation
• At document level – annotation diff
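As a rough illustration of what a document-level annotation diff measures, the sketch below scores a system's annotations against a gold standard using precision, recall and F1. It assumes strict matching only (identical start, end and type) and uses hypothetical names; it is not GATE's annotation diff tool.

```java
import java.util.HashSet;
import java.util.Set;

final class AnnotationDiffSketch {
    // Encodes one annotation as "start:end:type" so that set membership means a strict match.
    static String key(int start, int end, String type) {
        return start + ":" + end + ":" + type;
    }

    // Returns {precision, recall, f1} for a response set scored against a gold set.
    static double[] score(Set<String> gold, Set<String> response) {
        Set<String> correct = new HashSet<String>(response);
        correct.retainAll(gold);
        double precision = response.isEmpty() ? 0.0 : (double) correct.size() / response.size();
        double recall = gold.isEmpty() ? 0.0 : (double) correct.size() / gold.size();
        double f1 = (precision + recall == 0.0)
                ? 0.0
                : 2 * precision * recall / (precision + recall);
        return new double[] { precision, recall, f1 };
    }

    static double[] demo() {
        Set<String> gold = new HashSet<String>();
        gold.add(key(0, 3, "Organization"));
        gold.add(key(15, 21, "Person"));

        Set<String> response = new HashSet<String>();
        response.add(key(0, 3, "Organization"));  // correct
        response.add(key(15, 20, "Person"));      // wrong end offset: no strict match
        response.add(key(30, 36, "Location"));    // spurious
        return score(gold, response);
    }
}
```

In `demo()` one of three responses is correct against two gold annotations, so precision is 1/3, recall is 1/2 and F1 is 0.4.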
20(39)
Regression Test
At corpus level: corpus benchmark tool, tracking the system’s performance over time
21(39)
More CREOLE
1. JAPE, FSTs over annotations
2. ANNIE, A Nearly-New IE system
3. DAML+OIL, Protégé, Ontology-Aware IE
4. Information Retrieval, Lucene
5. WordNet
6. Machine Learning support
22(39)
FSTs over annotations
JAPE: a Java Annotation Patterns Engine
• Light, robust regular-expression-based processing
• Cascaded finite state transduction
• Low-overhead development of new components
• Simplifies multi-phase regex processing
Rule: Company1
Priority: 25
(
  ( {Token.orthography == upperInitial} )+
  {Lookup.kind == companyDesignator}
):match
-->
:match.NamedEntity = { kind=company, rule="Company1" }
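To show what the Company1 pattern accepts, here is a deliberately simplified Java reading of it: one or more tokens with upper-case initial orthography, followed by a token covered by a company-designator lookup. The flat `Item` list is an assumption made for brevity; JAPE itself runs a finite state transducer over an annotation graph, and this sketch does not reproduce its priorities or matching styles.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CompanyRuleSketch {
    // Hypothetical flattened view: one token plus the gazetteer lookup kind covering it, if any.
    static final class Item {
        final String text;
        final String orthography;  // e.g. "upperInitial", "lowercase"
        final String lookupKind;   // null when no gazetteer entry covers the token

        Item(String text, String orthography, String lookupKind) {
            this.text = text;
            this.orthography = orthography;
            this.lookupKind = lookupKind;
        }
    }

    private static boolean isDesignator(Item item) {
        return "companyDesignator".equals(item.lookupKind);
    }

    // Returns the text of every span matching (upperInitial)+ companyDesignator.
    static List<String> match(List<Item> items) {
        List<String> matches = new ArrayList<String>();
        int i = 0;
        while (i < items.size()) {
            int j = i;
            // Consume the ( {Token.orthography == upperInitial} )+ part.
            while (j < items.size()
                    && "upperInitial".equals(items.get(j).orthography)
                    && !isDesignator(items.get(j))) {
                j++;
            }
            if (j > i && j < items.size() && isDesignator(items.get(j))) {
                StringBuilder span = new StringBuilder();
                for (int k = i; k <= j; k++) {
                    if (k > i) {
                        span.append(' ');
                    }
                    span.append(items.get(k).text);
                }
                matches.add(span.toString());  // corresponds to :match.NamedEntity
                i = j + 1;
            } else {
                i = (j > i) ? j : i + 1;       // skip the failed run, or advance by one
            }
        }
        return matches;
    }

    static List<String> demo() {
        List<Item> items = Arrays.asList(
            new Item("We", "upperInitial", null),
            new Item("visited", "lowercase", null),
            new Item("Acme", "upperInitial", null),
            new Item("Widgets", "upperInitial", null),
            new Item("Ltd", "upperInitial", "companyDesignator"),
            new Item("yesterday", "lowercase", null));
        return match(items);
    }
}
```

The leading "We" is upper-initial but is not followed by a designator, so only "Acme Widgets Ltd" is reported.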
23(39)
Info Extraction Components
The ANNIE system: a reusable and easily extendable set of components
24(39)
Populating Ontologies with IE
25(39)
Protégé and Ontology Management
26(39)
Information Retrieval
Currently based on the Lucene IR engine
27(39)
WordNet support
28(39)
Machine Learning support
• Uses classification.
[Attr1, Attr2, Attr3, … Attrn] → Class
• Classifies annotations.
(Documents can be classified as well using a simple trick.)
• Annotations of a particular type are selected as instances.
• Attributes refer to instance annotations.
• Attributes have a position relative to the instance annotation they refer to.
29(39)
Attributes
Attributes can be:
– Boolean
The [lack of] presence of an annotation of a particular type [partially] overlapping the referred instance annotation.
– Nominal
The value of a particular feature of the referred instance annotation. The complete set of acceptable values must be specified a-priori.
– Numeric
The numeric value (converted from String) of a particular feature of the referred instance annotation.
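The three attribute kinds, together with the relative position mentioned earlier, can be sketched as follows. The `Instance` class and the method names are hypothetical; returning `null` for a missing or undeclared value is a choice made for this sketch, not documented GATE behaviour.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class AttributeSketch {
    // Hypothetical instance: a token's features plus the types of annotations overlapping it.
    static final class Instance {
        final Map<String, String> features = new HashMap<String, String>();
        final Set<String> overlappingTypes = new HashSet<String>();
    }

    // Resolves a position relative to the current instance; null when it falls off the sequence.
    private static Instance at(List<Instance> seq, int index, int position) {
        int target = index + position;
        return (target < 0 || target >= seq.size()) ? null : seq.get(target);
    }

    // Boolean attribute: does an annotation of annType overlap the referred instance?
    static Boolean booleanAttr(List<Instance> seq, int index, int position, String annType) {
        Instance inst = at(seq, index, position);
        return inst == null ? null : Boolean.valueOf(inst.overlappingTypes.contains(annType));
    }

    // Nominal attribute: a feature value, accepted only if it was declared in advance.
    static String nominalAttr(List<Instance> seq, int index, int position,
                              String feature, Set<String> allowedValues) {
        Instance inst = at(seq, index, position);
        if (inst == null) {
            return null;
        }
        String value = inst.features.get(feature);
        return allowedValues.contains(value) ? value : null;
    }

    // Numeric attribute: a String feature value converted to a number.
    static Double numericAttr(List<Instance> seq, int index, int position, String feature) {
        Instance inst = at(seq, index, position);
        if (inst == null || inst.features.get(feature) == null) {
            return null;
        }
        return Double.valueOf(inst.features.get(feature));
    }

    // Two tokens, "Dr" and "Smith"; a Title annotation overlaps the first.
    static List<Instance> demo() {
        Instance dr = new Instance();
        dr.features.put("category", "NNP");
        dr.features.put("length", "2");
        dr.overlappingTypes.add("Title");
        Instance smith = new Instance();
        smith.features.put("category", "NNP");
        smith.features.put("length", "5");
        return Arrays.asList(dr, smith);
    }
}
```

For the second token, position 0 refers to "Smith" itself and position -1 to the preceding "Dr", so a Boolean Title attribute at position -1 is true.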
30(39)
Implementation
Machine Learning PR in GATE.
Has two functioning modes:
– training
– application
Uses an XML file for configuration:
<?xml version="1.0" encoding="windows-1252"?>
<ML-CONFIG>
  <DATASET> … </DATASET>
  <ENGINE> … </ENGINE>
</ML-CONFIG>
31(39)
The <DATASET> element
<DATASET>
  <INSTANCE-TYPE>Token</INSTANCE-TYPE>
  <ATTRIBUTE>
    <NAME>POS_category(0)</NAME>
    <TYPE>Token</TYPE>
    <FEATURE>category</FEATURE>
    <POSITION>0</POSITION>
    <VALUES>
      <VALUE>NN</VALUE>
      <VALUE>NNP</VALUE>
      …
    </VALUES>
    [<CLASS/>]
  </ATTRIBUTE>
  …
</DATASET>
32(39)
The <ENGINE> element
<ENGINE>
  <WRAPPER>gate.creole.ml.weka.Wrapper</WRAPPER>
  <OPTIONS>
    <CLASSIFIER>weka.classifiers.j48.J48</CLASSIFIER>
    <CLASSIFIER-OPTIONS>-K 3</CLASSIFIER-OPTIONS>
    <CONFIDENCE-THRESHOLD>0.85</CONFIDENCE-THRESHOLD>
  </OPTIONS>
</ENGINE>
Now: WEKA
Soon: Torch? YASMET? TIMBL?
33(39)
Attributes Position
Instances type: Token
34(39)
Standard Use Scenario
Training
• Prepare training annotations.
• Run the ML PR in training mode.
• Export the dataset as .arff and perform experiments using the WEKA interface in order to find the best attribute set / algorithm / algorithm options.
• Update the configuration file accordingly.
• Run the ML PR again to collect the actual data.
• [ Save the learnt model. ]
Application
• [ Load the previously saved model. ]
• Run the ML PR in application mode.
• [ Save the learnt model. ]
35(39)
Using Other ML Libraries
The MLEngine Interface
• void addTrainingInstance(List attributes)
Adds a new training instance to the dataset.
• Object classifyInstance(List attributes)
Classifies a new instance.
• void init()
This method will be called after an engine is created and has its dataset and options set.
• void setDatasetDefinition(DatasetDefintion definition)
Sets the definition for the dataset used.
• void setOptions(org.jdom.Element options)
Sets the options from an XML JDom element.
• void setOwnerPR(ProcessingResource pr)
Registers the PR using the engine with the engine.
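A toy engine shows roughly what plugging in another learner would involve. To stay self-contained, the sketch declares its own reduced interface, `SketchMLEngine`, with only the three methods that carry the learning logic; the dataset, options and owner-PR setters listed above are omitted. Treating the last list element as the class label is an assumption of this sketch (in GATE the class attribute is marked in the dataset definition).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, reduced version of the interface listed above; not GATE's own type.
interface SketchMLEngine {
    void init();
    void addTrainingInstance(List<Object> attributes);
    Object classifyInstance(List<Object> attributes);
}

// Toy learner: memorises seen attribute vectors and falls back to the majority class.
final class MemorisingEngine implements SketchMLEngine {
    private Map<List<Object>, Object> seen;
    private Map<Object, Integer> classCounts;

    public void init() {
        seen = new HashMap<List<Object>, Object>();
        classCounts = new HashMap<Object, Integer>();
    }

    // Assumption: the last element of the list is the class label.
    public void addTrainingInstance(List<Object> attributes) {
        int last = attributes.size() - 1;
        Object label = attributes.get(last);
        seen.put(new ArrayList<Object>(attributes.subList(0, last)), label);
        Integer count = classCounts.get(label);
        classCounts.put(label, count == null ? 1 : count + 1);
    }

    // Expects the attribute values without the class label.
    public Object classifyInstance(List<Object> attributes) {
        Object label = seen.get(attributes);
        if (label != null) {
            return label;
        }
        Object best = null;
        int bestCount = 0;
        for (Map.Entry<Object, Integer> e : classCounts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    static MemorisingEngine trainedDemo() {
        MemorisingEngine engine = new MemorisingEngine();
        engine.init();
        engine.addTrainingInstance(Arrays.<Object>asList("NNP", "upperInitial", "Person"));
        engine.addTrainingInstance(Arrays.<Object>asList("NN", "lowercase", "Other"));
        engine.addTrainingInstance(Arrays.<Object>asList("VB", "lowercase", "Other"));
        return engine;
    }
}
```

Training mode corresponds to repeated `addTrainingInstance` calls and application mode to `classifyInstance`; a wrapper around a real library would delegate both to that library.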
36(39)
A bit of a nuisance (GATE users)
GATE team projects.
Past:
• Conceptual indexing: MUMIS: automatic semantic indices for sports video
• MUSE, cross-genre entity finder
• HSL, Health-and-safety IE
• Old Bailey: collaboration with HRI on 17th century court reports
• Multiflora: plant taxonomy text analysis for biodiversity research e-science
Present:
• Advanced Knowledge Technologies: €12m UK five site collaborative project
• EMILLE: S. Asian languages corpus
• ACE / TIDES: Arabic, Chinese NE
• JHU summer workshop on semantic tagging
Future:
• Five new projects (below)
Thousands of users at hundreds of sites. A representative sample:
• the American National Corpus project
• the Perseus Digital Library project, Tufts University, US
• Longman Pearson publishing, UK
• Merck KGaA, Germany
• Canon Europe, UK
• Knight Ridder, US
• BBN (leading HLT research lab), US
• SMEs inc. Sirma AI Ltd., Bulgaria
• Imperial College, London, the University of Manchester, UMIST, the University of Karlsruhe, Vassar College, the University of Southern California and a large number of other UK, US and EU Universities
• UK and EU projects inc. MyGrid, CLEF, dotkom, AMITIES, Cub Reporter, EMILLE, Poesia...
37(39)
Where Next (1)?
• Can Universities cope with the long term?
• User survey
• Future context:
– SEKT: Knowledge Management
– KnowledgeWeb: OntoWeb II
– PrestoSpace: audiovisual preservation (FSTs for users?)
– hTechSight: knowledge portal for petrochemicals
– ETCSL: Electronic Text Corpus of Sumerian Language
– DERI: Digital Enterprise Research Institute
– PhDs: INK, PIE
38(39)
Where Next (2)?
• Some desirables:
– Corpus tools (ANNIC in progress)
– Audiovisual documents
– WS-based backend server, for ML, active learning etc.
– Better dialogue support (cf. AMITIES, Galaxy)
– Better MT support
– PDF documents
– JAPE debugger, editor, 101 language extensions (e.g. quantified ops, deletion, ontology callouts)
– Cleverer treatment of large documents in the GUI
– PR reloading
39(39)
Conclusion
GATE is:
• Addressing the need for scalable, reusable, and portable HLT solutions
• Supporting large data, in multiple media, languages, formats, and locations
• Lowering the cost of creation of new language processing components
• Promoting quantitative evaluation metrics via tools and a level playing field
• Promoting experimental repeatability by developing and supporting free software
http://gate.ac.uk/