Apache UIMA Introduction
Gestione delle Informazioni su Web - 2010/2011
Tommaso Teofili
tommaso [at] apache [dot] org
UIM ?
Unstructured Information Management
A wide topic: text, audio, video
Different (possibly mixed) approaches (NLP, Machine Learning, IR, Ontologies, Automated reasoning, Knowledge Sources)
Apache UIMA
Apache Software Foundation
Non-profit corporation
“...provides organizational, legal, and financial support for a broad range of open source software projects...”
“...collaborative and meritocratic development process...”
“...pragmatic Apache License...”
Apache UIMA
Architectural framework to manage unstructured data (Java, C++, ...)
Former IBM research project donated to ASF
OASIS Standard for unstructured information management
Apache UIMA - Goals
“Our goal is to support a thriving community of users and developers of UIMA frameworks, tools, and annotators, facilitating the analysis of unstructured content such as text, audio and video”
Apache UIMA - bridging worlds
Apache UIMA - Overview
UIMA supports the development, discovery, composition and deployment of multi-modal analytics for the analysis of unstructured information and its integration with search technologies
Apache UIMA - Multimodal Analysis
Multimodal Analysis means the ability to process a resource from various “points of view”
Example: a video stream from which we want to extract subtitles and also automatically recognize the actors involved
Here, though, we are mainly interested in text...
Sample scenario
Content Management System containing free text articles about movies
We want such articles to be automatically enriched with metadata contained inside the text (movies, directors, actors/actresses, distribution) and linked to “similar” articles (i.e. articles dealing with the same movies or actors)
So that we can search for “similar” articles
Sample scenario - articles about movies
Sample scenario
UIMA can help enrich articles with metadata
Think of filling the instance variables of an Article.java class with proper values
Then persisting it to a database, so that we can query for articles dealing with the same actors
Filling Article with metadata
Sample scenario - metadata
UIMA - Annotations
Apache UIMA - Annotation
The association of metadata, such as a label, with a region of text (or other type of artifact).
For example, the label “Person” associated with a region of text “Fred Center” constitutes an annotation. We say “Person” annotates the span of text from X to Y containing exactly “Fred Center”
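The “Fred Center” example can be made concrete with a small plain-Java sketch. It is only an illustration of the idea (a label plus begin/end offsets into the document); SpanAnnotation and annotateFirst are hypothetical names and this is not the UIMA Annotation class, although UIMA annotations also expose begin/end offsets and the covered text.

```java
// Plain-Java illustration only; this is not the UIMA Annotation class.
final class SpanAnnotation {
    final String label; // e.g. "Person"
    final int begin;    // offset of the first annotated character (inclusive)
    final int end;      // offset just past the last annotated character (exclusive)

    SpanAnnotation(String label, int begin, int end) {
        this.label = label;
        this.begin = begin;
        this.end = end;
    }

    // Returns the region of the document that this annotation covers.
    String coveredText(String document) {
        return document.substring(begin, end);
    }

    // Builds an annotation over the first occurrence of a phrase, or null if absent.
    static SpanAnnotation annotateFirst(String document, String label, String phrase) {
        int begin = document.indexOf(phrase);
        return begin < 0 ? null : new SpanAnnotation(label, begin, begin + phrase.length());
    }

    public static void main(String[] args) {
        String doc = "Yesterday Fred Center gave a talk.";
        SpanAnnotation person = annotateFirst(doc, "Person", "Fred Center");
        // "Person" annotates the span from X (begin) to Y (end)
        System.out.println(person.label + " [" + person.begin + ", " + person.end + "): "
                + person.coveredText(doc));
    }
}
```

Here X and Y from the slide correspond to begin and end; the label itself never changes the document text.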
Apache UIMA - Basic Steps
Domain model definition
Analysis pipeline definition
Arrange components:
Define components that read data from sources
Add and customize analysis components: Patterns, Dictionaries, RegEx, External services, NLP, etc...
Define components that write the extracted information to target storage
Analysis pipeline(s) execution
Defining domain model within UIMA using Type Systems
The Type System is where we describe which metadata we would like to extract
Low representational gap
Like almost everything in UIMA: described (and generated!) using XML
Possible to define multiple Type Systems for different purposes
How does UIMA extract metadata?
Apache UIMA - Analysis Engines
Basic UIMA building blocks
Analyze a document
Infer and record descriptive attributes (about documents/regions)
Generate analysis results
Apache UIMA - AEs
Analysis Engines are described by a descriptor (XML)
Can be Primitive (a single AE) or Aggregate (a pipeline of AEs)
Analysis algorithms can be switched by changing the descriptor instead of the code
Contain Type System definitions
Define Capabilities
Apache UIMA - AnalysisComponent API
initialize : Performs (once) any startup tasks required by this component
process : Processes the resource to analyze, generating analysis results (metadata)
destroy : Frees all resources held; called only once, when the framework has finished using this component
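The call order of the three methods can be sketched in plain Java. SimpleComponent, LoggingComponent and LifecycleDemo are hypothetical, simplified stand-ins written for this illustration; the real UIMA interface uses richer signatures (a context, a CAS, checked exceptions), so treat this only as a picture of the lifecycle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified stand-in for the component contract (not the UIMA interface).
interface SimpleComponent {
    void initialize();
    void process(String document);
    void destroy();
}

// Records every lifecycle call so that the order can be inspected.
final class LoggingComponent implements SimpleComponent {
    final List<String> calls = new ArrayList<String>();

    public void initialize() { calls.add("initialize"); }
    public void process(String document) { calls.add("process:" + document); }
    public void destroy() { calls.add("destroy"); }
}

final class LifecycleDemo {
    // Drives a component the way a framework would: one initialize call, one
    // process call per document, and one destroy call even if processing fails.
    static void run(SimpleComponent component, List<String> documents) {
        component.initialize();
        try {
            for (String document : documents) {
                component.process(document);
            }
        } finally {
            component.destroy();
        }
    }

    public static void main(String[] args) {
        LoggingComponent component = new LoggingComponent();
        run(component, Arrays.asList("doc1", "doc2"));
        System.out.println(component.calls);
    }
}
```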
Apache UIMA - Annotators
Analysis Engine algorithm
Annotator : A software component implemented to produce and record annotations over regions of an artifact (e.g., text document, audio, and video)
Annotators implement the AnalysisComponent interface
Apache UIMA - Roles
AnalysisEngine : High level block responsible for analysis - contains at least one AnalysisComponent
AnalysisComponent : interface for any component responsible for analyzing artifacts
Annotator : implementation of AnalysisComponent responsible for creating Annotations
Apache UIMA - AEs
Analysis Engines in a Pipeline
Apache UIMA - Analysis Results
Where do analysis results end up?
How do annotators represent and share their results?
CAS - Common Analysis Structure
Maintains typed indexes of extracted results
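To picture what “typed indexes of extracted results” means, here is a toy container in plain Java. ToyCas and its methods are hypothetical names made up for this sketch; the real CAS is far richer (type system, feature structures, multiple views). The point is only that one component writes annotations under a type name and a later component reads them back by type.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy container, not the UIMA CAS.
final class ToyCas {
    private final String documentText;
    // type name -> covered spans, stored as {begin, end} pairs in insertion order
    private final Map<String, List<int[]>> index = new HashMap<String, List<int[]>>();

    ToyCas(String documentText) {
        this.documentText = documentText;
    }

    // Records that the region [begin, end) of the document has the given type.
    void addAnnotation(String type, int begin, int end) {
        List<int[]> spans = index.get(type);
        if (spans == null) {
            spans = new ArrayList<int[]>();
            index.put(type, spans);
        }
        spans.add(new int[] {begin, end});
    }

    // Returns the text covered by every annotation of the given type.
    List<String> coveredTexts(String type) {
        List<String> result = new ArrayList<String>();
        List<int[]> spans = index.get(type);
        if (spans != null) {
            for (int[] span : spans) {
                result.add(documentText.substring(span[0], span[1]));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Michael J. Fox stars in Back to the Future.";
        ToyCas cas = new ToyCas(text);
        // A first component records a Person annotation...
        String person = "Michael J. Fox";
        int begin = text.indexOf(person);
        cas.addAnnotation("Person", begin, begin + person.length());
        // ...and a later component looks results up by type.
        System.out.println(cas.coveredTexts("Person"));
        System.out.println(cas.coveredTexts("Movie")); // nothing recorded for this type
    }
}
```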
Common Analysis Structure
Which algorithms lie under AEs?
Apache UIMA & NLP
NLP (Natural Language Processing) is a theoretically motivated range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis for the purpose of achieving human-like language processing for a range of tasks or applications
It’s an AI discipline
Apache UIMA & NLP
“accomplish human-like language processing”
Paraphrase an input text
Translate the text into another language
Answer questions about the contents of the text
Draw inferences from the text
Apache UIMA & NLP
“an NLP-based IR system has the goal of providing more precise, complete information in response to a user’s real information need”
various levels of processing
Apache UIMA - Approaches
Simplest : Write RegEx and Dictionaries and mix them together
NLP-like : Tokenize -> Sentence identification -> PoS Tagging -> Anaphora resolution -> Named Entity Recognition -> Coreference Identification ...
Analysis Engines in a Pipeline
NLP - Language Identification
NLP takes advantage of language specific syntax, forms, rules and meanings
Not easy to write language independent extraction algorithms
Often this is the first block of NLP pipelines
Techniques: Stopwords dictionaries, statistical models, etc.
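The stopword-dictionary technique can be sketched in a few lines of plain Java: count how many tokens of the text appear in each language's stopword list and pick the language with the most hits. StopwordLanguageGuesser and its tiny word lists are assumptions made for this illustration; real lists are much longer, and statistical models are usually more robust on short texts.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of stopword-based language identification.
final class StopwordLanguageGuesser {
    // language code -> stopwords (deliberately tiny, illustrative lists)
    private final Map<String, Set<String>> stopwords = new LinkedHashMap<String, Set<String>>();

    StopwordLanguageGuesser() {
        stopwords.put("en", new HashSet<String>(
                Arrays.asList("the", "and", "of", "is", "in", "to", "a")));
        stopwords.put("it", new HashSet<String>(
                Arrays.asList("il", "la", "e", "di", "che", "un", "per")));
    }

    // Returns the language whose stopwords occur most often, or "unknown" if none occur.
    // On a tie, the language registered first wins.
    String guess(String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        String best = "unknown";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : stopwords.entrySet()) {
            int hits = 0;
            for (String token : tokens) {
                if (entry.getValue().contains(token)) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        StopwordLanguageGuesser guesser = new StopwordLanguageGuesser();
        System.out.println(guesser.guess("The actor is in the movie"));
        System.out.println(guesser.guess("Il regista e la attrice di un film"));
    }
}
```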
NLP - Tokens and Sentences
Humans learn words’ meaning in order to understand whole context semantics
Split the target text into words to be able to analyze their meaning and role
Discover sentences to later assign roles to each token
Easier for English, Italian and similar languages, but what about Chinese?
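A naive regex-based tokenizer and sentence splitter, as a plain-Java sketch. SimpleSegmenter is a hypothetical name and the rules are deliberately crude: words are runs of letters or digits, every other non-space character is its own token, and a sentence ends at '.', '!' or '?' followed by whitespace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of whitespace/punctuation based segmentation.
final class SimpleSegmenter {
    // A token is a run of letters/digits, or a single other non-space character.
    private static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}]+|[^\\s\\p{L}\\p{N}]");

    static List<String> tokens(String text) {
        List<String> result = new ArrayList<String>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    // Splits after '.', '!' or '?' when followed by whitespace.
    // Abbreviations and initials (e.g. "J.") will be split wrongly by this rule.
    static List<String> sentences(String text) {
        List<String> result = new ArrayList<String>();
        for (String sentence : text.trim().split("(?<=[.!?])\\s+")) {
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "I like movies. Do you?";
        System.out.println(tokens(text));
        System.out.println(sentences(text));
    }
}
```

Because this approach relies on spaces and punctuation, a run of Chinese characters written without spaces would come back as one long token, which is why such languages need dedicated segmentation.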
NLP - PoS Tagging
Assign a “Part of Speech” (noun, adjective, verb, etc.) to each token generated in the previous step
Many language/domain specific patterns can be discovered and exploited using just PoS-tagged tokens and sentences
NLP - Chunking & Parsing
Parse sentences into a meaningful set or tree of relationships
Chunks are the sentence building blocks (e.g. verbal forms)
Parse tree highlights the structure of a sentence
Can leverage logic analysis
[Figure: chunking and parsing examples]
NLP - Named Entity Recognition
Answer the questions: where? when? who? how often? how much?
Identify key entities in the text
Common techniques: dictionaries, rules, statistical models
Debugging NER in UIMA
Using UIMA
Define TypeSystem
Define AnalysisEngine descriptor(s)
Implement Annotator(s)
Execute the UIMA pipeline
Sample scenario - extract actors
Tokenize article text
Identify sentences
Tag PoS
Identify Persons using regular expressions and PoS
Use Person annotations, Tokens’ PoS and Sentences to extract relations between terms to identify Persons who are also Actors
Sample scenario - extract persons
I have a dictionary of names (simple to find and/or build)
I use a dictionary based Annotator to extract annotations of first names (NameAnnotation)
I don’t have a dictionary of surnames
Every time a matching name (a NameAnnotation) is found, we look for one or more subsequent tokens (to allow for persons with a double name or surname) whose PoS is “undefined” or a noun (but not a verb) and which start with an uppercase letter
If found, the name + token(s) sequence is annotated as a Person (e.g. “Michael J. Fox”)
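The heuristic above can be sketched in plain Java without UIMA. DictionaryPersonFinder is a hypothetical class written for this illustration, and it simplifies the slide's rule: no PoS tagger is available here, so the “undefined or noun” check is replaced by the capitalization test alone, and tokens are split on whitespace.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: first-name dictionary + following capitalized tokens = Person.
final class DictionaryPersonFinder {
    private final Set<String> firstNames;

    DictionaryPersonFinder(Collection<String> firstNames) {
        this.firstNames = new HashSet<String>(firstNames);
    }

    // An initial such as "J." keeps its dot and does not end the name.
    private static boolean isInitial(String token) {
        return token.matches("\\p{Lu}\\.");
    }

    // Strips leading/trailing punctuation, except from initials.
    private static String clean(String token) {
        return isInitial(token) ? token : token.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
    }

    // True when the token ends with punctuation that closes the current phrase.
    private static boolean closesPhrase(String token) {
        return !isInitial(token) && token.matches(".*\\p{Punct}");
    }

    private static boolean startsUppercase(String token) {
        return !token.isEmpty() && Character.isUpperCase(token.charAt(0));
    }

    List<String> findPersons(String text) {
        String[] tokens = text.trim().split("\\s+");
        List<String> persons = new ArrayList<String>();
        int i = 0;
        while (i < tokens.length) {
            String first = clean(tokens[i]);
            int j = i + 1;
            if (firstNames.contains(first) && !closesPhrase(tokens[i])) {
                StringBuilder person = new StringBuilder(first);
                // Consume the capitalized tokens that directly follow the first name.
                while (j < tokens.length && startsUppercase(clean(tokens[j]))) {
                    person.append(' ').append(clean(tokens[j]));
                    boolean last = closesPhrase(tokens[j]);
                    j++;
                    if (last) {
                        break;
                    }
                }
                // A first name alone is not enough: at least one following token is required.
                if (j > i + 1) {
                    persons.add(person.toString());
                }
            }
            i = j;
        }
        return persons;
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<String>();
        names.add("Michael");
        names.add("Christopher");
        DictionaryPersonFinder finder = new DictionaryPersonFinder(names);
        System.out.println(finder.findPersons(
                "Michael J. Fox stars as Marty McFly. Christopher Lloyd plays Doc."));
    }
}
```

Note that “Marty McFly” is not found in this run, simply because “Marty” is not in the example dictionary; the quality of the first-name dictionary drives recall.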
From Persons to Actors
Getting actors can be simple if we know that Persons who are also Actors perform some well-known actions, or that widely used patterns exist
e.g.: a Person who “stars as” CharacterInTheMovie (which may itself end up tagged as a Person) is also an Actor
e.g.: if the snippet “CharacterInTheMovie (Person)” exists, then Person is usually an Actor
Then we could build an ActorAnnotator
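The two patterns can be tried out with regular expressions in plain Java. ActorPatternFinder is a hypothetical helper for this sketch, not an actual ActorAnnotator: it takes already-extracted Person strings and checks the raw text, whereas a UIMA annotator would work on the Person annotations and their offsets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of pattern-based actor detection.
final class ActorPatternFinder {
    // Returns the persons that match one of the two patterns:
    //   1. "<Person> stars as ..."
    //   2. "<Capitalized word> (<Person>)", i.e. a character name followed by the actor
    static List<String> findActors(String text, List<String> persons) {
        List<String> actors = new ArrayList<String>();
        for (String person : persons) {
            String quoted = Pattern.quote(person); // treat the name as a literal
            Pattern starsAs = Pattern.compile(quoted + "\\s+stars\\s+as\\b");
            Pattern afterCharacter = Pattern.compile(
                    "\\p{Lu}[\\p{L}.]*\\s+\\(\\s*" + quoted + "\\s*\\)");
            if (starsAs.matcher(text).find() || afterCharacter.matcher(text).find()) {
                actors.add(person);
            }
        }
        return actors;
    }

    public static void main(String[] args) {
        String text = "Michael J. Fox stars as Marty McFly. "
                + "Doc Brown (Christopher Lloyd) builds a time machine. "
                + "Robert Zemeckis directed the movie.";
        List<String> persons = Arrays.asList(
                "Michael J. Fox", "Christopher Lloyd", "Robert Zemeckis", "Marty McFly");
        System.out.println(findActors(text, persons));
    }
}
```

In this example the character “Marty McFly” and the director are Persons but match neither pattern, so they are not reported as actors.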
1. Define TypeSystem
Define at least one Type in the Type System for each object in the domain model
Useful to define more fine grained Types (for values of type properties, called Features)
If we want to extract information about articles we create an Article type inside the Type System
We’ll also need to create annotations/entities for movies, actors, directors, etc...
2. Define AnalysisEngine descriptor
Define which type system it’s going to use
Define which capabilities the analysis engine has: which annotations it needs as input and which annotations it may generate
Define configuration parameters for the underlying algorithm
Define resources needed by the analysis engine
3. Implement Annotator
Create a new class extending JCasAnnotator_ImplBase
Implement the process() method, which actually does the job
The algorithm is implemented in (or called from) the process() method
You can use configuration parameters/resources defined in the descriptor
Optionally override the initialize() and destroy() methods
DummyPersonAnnotator
4. Execute the UIMA pipeline
Instantiate the AnalysisEngine with its descriptor as a parameter
Create a CAS which will contain the text to be analyzed and the annotations extracted
Run the AnalysisEngine on the given CAS
Browse results
Execute a UIMA pipeline
What’s next
UIMA Use cases
Using UIMA in search engines
Hands on code (assignment)
References
http://www.apache.org
http://uima.apache.org
http://www.oasis-open.org
http://uima.apache.org/d/uimaj-2.3.1/index.html
http://uima.apache.org/d/uimaj-2.3.1/overview_and_setup.html#ugr.ovv.eclipse_setup
http://www.manning.com/ingersoll/
https://github.com/tteofili/samplett/tree/master/giw1011