Practical Project of the 2006 Joint International Master’s Degree

Agenda

Introduction
Technologies in use
Architecture
Demonstration
Remaining issues
Work packages for Semester II
Questions & comments

Introduction

Practical project during the course of studies
Timeframe: two terms
Topic: prototype of a semantic search engine using UIMA

Objectives of the first semester
  Study the UIMA framework and the OpenNLP library
  Search for players, teams, matches and dates
  Semantic search for goal events
  Implement an executable prototype

Technologies in Use

UIMA framework
OpenNLP
Java / JavaServer Pages
Tomcat server
Python (web crawler)

Architecture: Overview

[Architecture diagram. Labels: Input; unstructured information (plain text); converter (parser); UIMA framework; OpenNLP; CAS; NLP annotator 1 (sentence detection, word detection, paragraph detection); Date & Time annotator; Player annotator; Match annotator; Goal-Event annotator; persistent search index; user interface; Output]

Architecture: Web crawler

Web crawler used to preselect texts
  Implemented in Python
  Crawls approx. 2,500 pages in 20 minutes
  Presently based on keywords
  Transfer of results to Jimgle is still manual
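The keyword-based preselection mentioned above could look roughly like the following sketch. The slides do not name the crawler's keywords or its scoring rule, so the keyword set, the `min_score` threshold and both function names are assumptions made purely for illustration.

```python
import re

# Hypothetical keyword list; the slides do not say which keywords were used.
KEYWORDS = {"goal", "match", "world cup"}


def keyword_score(text, keywords=KEYWORDS):
    """Count how many distinct keywords occur in the text (case-insensitive)."""
    lowered = text.lower()
    return sum(
        1 for kw in keywords
        if re.search(r"\b" + re.escape(kw) + r"\b", lowered)
    )


def preselect(pages, min_score=2):
    """Keep only the pages (url -> text) that mention at least min_score keywords."""
    return {
        url: text for url, text in pages.items()
        if keyword_score(text) >= min_score
    }


pages = {
    "http://example.org/a": "Klose scored the first goal of the match.",
    "http://example.org/b": "The weather will be sunny tomorrow.",
}
selected = preselect(pages)
```

Only pages that pass this filter would then be handed over to the annotation pipeline, a step the slides describe as still manual.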

Architecture: NLP-Annotator

Usage of the OpenNLP tools & API
Rule-based approach
Tagging of paragraphs, sentences and words
Part-of-speech tagging

Implementation in UIMA as a separate annotator
Results are used by subsequent annotators
Internal usage only, not displayed in the search index
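To make the idea of paragraph, sentence and word annotations concrete, here is a minimal sketch that produces typed spans with character offsets, similar in spirit to the annotations a CAS holds. It does not use OpenNLP or UIMA; the regular expressions are deliberately naive stand-ins (abbreviations such as "e.g." would be split incorrectly), and the `Annotation` tuple and `annotate` function are hypothetical names.

```python
import re
from collections import namedtuple

# A typed span over the source text: (type, begin offset, end offset).
Annotation = namedtuple("Annotation", "type begin end")

PARAGRAPH = re.compile(r"\S(?:[^\n]|\n(?!\n))*")   # text up to a blank line
SENTENCE = re.compile(r"[^\s.!?][^.!?]*[.!?]?")    # naive: ends at . ! or ?
WORD = re.compile(r"\w+")


def annotate(text):
    """Return paragraph, sentence and word annotations with absolute offsets."""
    annotations = []
    for para in PARAGRAPH.finditer(text):
        annotations.append(Annotation("Paragraph", para.start(), para.end()))
        for sent in SENTENCE.finditer(para.group()):
            s_begin = para.start() + sent.start()
            annotations.append(Annotation("Sentence", s_begin, s_begin + len(sent.group())))
            for word in WORD.finditer(sent.group()):
                w_begin = s_begin + word.start()
                annotations.append(Annotation("Word", w_begin, w_begin + len(word.group())))
    return annotations
```

Downstream annotators can then work on the word spans instead of re-tokenizing the raw text, which matches the role the slides give this annotator.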

Architecture: Player-Annotator

Identification of players of the 2006 World Cup (WM2006)
Rule-based implementation
Usage of the OpenNLP word annotations
Matching against the player database (XML file)
Consideration of last names and nicknames
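The matching of word annotations against a player database can be sketched as a simple dictionary lookup. The XML layout below (element and attribute names) is an assumption, since the slides only say that the database is an XML file containing last names and nicknames; the three entries are sample data, not the project's actual file.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical schema and sample entries for the player database.
PLAYERS_XML = """<players>
  <player team="Germany" lastName="Klose"/>
  <player team="Germany" lastName="Schweinsteiger" nickname="Schweini"/>
  <player team="Brazil" lastName="Ronaldo"/>
</players>"""


def load_lookup(xml_text):
    """Map every lower-cased last name and nickname to (last name, team)."""
    lookup = {}
    for player in ET.fromstring(xml_text).iter("player"):
        for name in (player.get("lastName"), player.get("nickname")):
            if name:
                lookup[name.lower()] = (player.get("lastName"), player.get("team"))
    return lookup


def annotate_players(text, lookup):
    """Return (begin, end, last name, team) for every word that names a player."""
    found = []
    for word in re.finditer(r"\w+", text):
        entry = lookup.get(word.group().lower())
        if entry:
            found.append((word.start(), word.end(), entry[0], entry[1]))
    return found


lookup = load_lookup(PLAYERS_XML)
```

A call such as `annotate_players("Schweini passes to Klose.", lookup)` resolves the nickname to the canonical last name, so later stages can treat both spellings as the same player.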

Architecture: Date & Time-Annotator

Identification of time and date information
Usage of the OpenNLP word annotations
Presently a custom, rule-based implementation
Detects standard-conformant time & date information
Detection of relative or colloquial time information not implemented yet
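A rule-based date and time detector of the kind described above might be approximated with a few regular expressions. The slides do not state which formats count as standard, so the three patterns here (`YYYY-MM-DD`, `DD.MM.YYYY` and 24-hour `HH:MM`) are assumptions; like the original, the sketch ignores relative or colloquial expressions such as "yesterday".

```python
import re
from datetime import datetime

# Assumed formats; the slides do not list the supported notations.
DATE_PATTERNS = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "%Y-%m-%d"),
    (re.compile(r"\b\d{2}\.\d{2}\.\d{4}\b"), "%d.%m.%Y"),
]
TIME_PATTERN = re.compile(r"\b(?:[01]\d|2[0-3]):[0-5]\d\b")


def annotate_dates_and_times(text):
    """Return (type, begin, end, normalized value) tuples sorted by position."""
    annotations = []
    for pattern, fmt in DATE_PATTERNS:
        for m in pattern.finditer(text):
            try:
                value = datetime.strptime(m.group(), fmt).date()
            except ValueError:
                continue  # looks like a date but is not valid, e.g. 31.02.2006
            annotations.append(("Date", m.start(), m.end(), value.isoformat()))
    for m in TIME_PATTERN.finditer(text):
        annotations.append(("Time", m.start(), m.end(), m.group()))
    return sorted(annotations, key=lambda a: a[1])
```

Normalizing every date to ISO form keeps the index independent of the notation used in the source text.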

Architecture: Match-Annotator

Identification of matches
Based on 3 components
  Detection of locality
  Detection of participating teams
  Detection of the match result
Usage of upstream annotators
  OpenNLP word annotations
  Player annotations
  Date & time annotations
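The combination of the three components (locality, teams, result) can be illustrated with the sketch below. The team and venue lists, the score pattern and the `time_spans` parameter are all assumptions; in particular, `time_spans` stands in for the upstream date and time annotations, which are used here only to avoid mistaking a time like 18:00 for a score.

```python
import re

# Hypothetical lookup lists; the real annotator builds on upstream annotations.
TEAMS = ("Germany", "Costa Rica", "Brazil")
VENUES = ("Munich", "Berlin", "Dortmund")
RESULT = re.compile(r"\b(\d{1,2}):(\d{1,2})\b")


def overlaps(begin, end, spans):
    """True if [begin, end) overlaps any (begin, end) span in spans."""
    return any(begin < s_end and s_begin < end for s_begin, s_end in spans)


def annotate_match(sentence, time_spans=()):
    """Return teams, venue and result if the sentence describes a match, else None."""
    teams = sorted((t for t in TEAMS if t in sentence), key=sentence.index)
    venue = next((v for v in VENUES if v in sentence), None)
    result = next(
        (m for m in RESULT.finditer(sentence)
         if not overlaps(m.start(), m.end(), time_spans)),
        None,
    )
    if len(teams) != 2 or result is None:
        return None
    return {
        "teams": teams,
        "venue": venue,
        "result": (int(result.group(1)), int(result.group(2))),
    }
```

For example, `annotate_match("Germany beat Costa Rica 4:2 in Munich.")` yields both teams in order of appearance, the venue and the score as a pair of integers.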

Architecture: Goal-Event Annotator

Descriptions of goals are too complex for rule-based detection

Therefore: machine learning
  Usage of the OpenNLP library
  Based on statistical information about sentences
  Comprehensive training necessary

Implementation as an OpenNLP component
Integration into UIMA via wrapper classes
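The statistical approach can be illustrated with a tiny sentence classifier. Note that this is a multinomial naive Bayes model with add-one smoothing, chosen here only because it fits in a few lines; it is a stand-in and not the model the project trained with OpenNLP. The class name, the labels and the six training sentences are invented for illustration, and a realistic model would need far more training data, as the slide points out.

```python
import math
import re
from collections import Counter


def tokens(sentence):
    """Lower-cased word tokens of a sentence."""
    return re.findall(r"\w+", sentence.lower())


class NaiveBayes:
    """Multinomial naive Bayes over bag-of-words features (illustrative stand-in)."""

    def __init__(self):
        self.word_counts = {}          # label -> Counter of token frequencies
        self.label_counts = Counter()  # label -> number of training sentences
        self.vocabulary = set()

    def train(self, examples):
        for sentence, label in examples:
            self.label_counts[label] += 1
            counter = self.word_counts.setdefault(label, Counter())
            for tok in tokens(sentence):
                counter[tok] += 1
                self.vocabulary.add(tok)

    def classify(self, sentence):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n / total)  # log prior
            for tok in tokens(sentence):
                score += math.log((counts[tok] + 1) / denom)  # add-one smoothing
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for this sketch.
TRAINING = [
    ("Klose scored with a header in the 17th minute", "goal"),
    ("Podolski shot the ball into the net", "goal"),
    ("He scored the equaliser shortly before the break", "goal"),
    ("The referee showed a yellow card", "other"),
    ("The match was played in Munich", "other"),
    ("Fans celebrated in the streets after the game", "other"),
]

model = NaiveBayes()
model.train(TRAINING)
```

With this toy data, `model.classify("Lahm scored in the sixth minute")` is assigned the `goal` label even though the player name never occurred in training, which is the kind of generalization that fixed rules struggle to provide.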

Architecture: Persistent Indexing

Functionality
  Import of all files in a specific directory
  Annotation of all available texts
  Compilation of XML files with the CAS data of every source text
  Subsequent creation of a search index

Provision of index files for the web server
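The indexing steps above (import a directory, process every text, build and persist an index) can be sketched as a small inverted index. This simplification skips the annotation step and the intermediate XML files with CAS data and indexes plain word tokens instead; the `*.txt` file pattern, the JSON index format and all function names are assumptions.

```python
import json
import re
from collections import defaultdict
from pathlib import Path


def build_index(source_dir):
    """Map every lower-cased token to the sorted list of files containing it."""
    index = defaultdict(set)
    for path in sorted(Path(source_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        for tok in re.findall(r"\w+", text.lower()):
            index[tok].add(path.name)
    return {term: sorted(files) for term, files in index.items()}


def save_index(index, index_file):
    """Persist the index as JSON so that a web server can load it later."""
    Path(index_file).write_text(json.dumps(index, indent=2), encoding="utf-8")


def search(index, *terms):
    """Return the files that contain all given terms (terms are combined with AND)."""
    hits = [set(index.get(term.lower(), ())) for term in terms]
    return sorted(set.intersection(*hits)) if hits else []
```

Writing the index to a file separates the slow batch step from query handling: the web front end only needs to read the finished index.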

Architecture: Graphical User Interface

Linux server with Tomcat installation
Simple operation via web-based GUI
Search queries are handled by JavaServer Pages
Processing of requests by JavaBeans

Demonstration: Search engine

Open Issues: Further proceeding…?

Search for attributes, e.g. Player AND Germany (presently only via OmniFind)
Automate processing of search engine results
Further training of the components
Usability improvements at front end and back end

New scenarios… for the second semester

Automated analysis of e-mails
  Search for phone numbers
  Search for customer contacts of employees
  Find employees with specific skills
  Find links & relations between employees

Competitive analysis
  Compare own products with those of competitors
  Find out about customer opinions in internet portals

Further ideas?

Ideas… for the second semester

Natural-language search queries
Design templates for customizable annotators
Machine learning for the web crawler
Mark annotations in the search results
Automated processing of search results
Implement more annotators via OpenNLP
Provide annotators as web services

Further ideas?

JIMGLE: JIM Master-Project

Questions?

Suggestions?

Thanks for your attention…