Semantic Tech & Business, Washington D.C. November 29, 2011 Getting Started with Unstructured Data Christine Connors & Kevin Lynch TriviumRLG LLC Tuesday, November 29, 2011

Getting Started with Unstructured Data

DESCRIPTION

Slides used as part of a tutorial given at the Semantic Technology in Business conference in Washington D.C. Nov. 29 - Dec. 1, 2011.

Page 1: Getting Started with Unstructured Data

Semantic Tech & Business, Washington D.C.

November 29, 2011

Getting Started with Unstructured Data

Christine Connors & Kevin Lynch

TriviumRLG LLC

Tuesday, November 29, 2011

Page 2: Getting Started with Unstructured Data

Meta

✤ Presenter: Christine Connors

✤ @cjmconnors

✤ Presenter: Kevin Lynch

✤ @kevinjohnlynch

✤ Principals at www.triviumrlg.com

Page 3: Getting Started with Unstructured Data

Agenda

✤ What is unstructured data?

✤ Where do we find it?

✤ How important is it?

✤ How do we visualize it?

✤ Machine processing for actionable data

✤ Tools

Page 4: Getting Started with Unstructured Data

What is unstructured data?

✤ Data which is

✤ Not in a database

✤ Does not adhere to a formal data model

✤ Content

Page 5: Getting Started with Unstructured Data

Isn’t that a misnomer?

✤ Problematic term

✤ The presence of object metadata or aesthetic markup does not alone give ‘structure’ in this sense of the word

✤ Object metadata = machine or applied properties

✤ Aesthetic markup = stylesheets; rendering information

✤ Semi-structured data is typically treated as unstructured for the purposes of machine processing and analysis

Page 6: Getting Started with Unstructured Data

Types of ‘un’structured data

✤ Text-based documents

✤ Word processing, presentations, email, blogs, wikis, tweets, web pages, web components (read/write web)

✤ Audio/video files

Page 7: Getting Started with Unstructured Data

Where do we find it?

✤ Office productivity suites

✤ Content management systems

✤ Digital asset management systems

✤ Web content management systems

✤ Wikis, blogs, comment & discussion threads

✤ Social networking tools

✤ Twitter, Yammer, instant messengers

Page 8: Getting Started with Unstructured Data

Is it really that important?

[Pie chart: unstructured data 85%, structured data 15%]

Page 9: Getting Started with Unstructured Data

What’s in that 80-85%?

✤ Progress reports - created in a word processor

Page 10: Getting Started with Unstructured Data

What’s in that 80-85%?

✤ Dashboards - created in presentation software

Page 11: Getting Started with Unstructured Data

What’s in that 80-85%?

✤ Progress reports - color coded text in a spreadsheet

Page 12: Getting Started with Unstructured Data

What’s in that 80-85%?

✤ Brainstorming - in messaging systems

✤ Decision making - in email

Page 13: Getting Started with Unstructured Data

What’s in that 80-85%?

✤ Business intelligence - on the web and more

Page 14: Getting Started with Unstructured Data

How can we make the data more actionable?

✤ Identify it

✤ Convert to a format you can work with

✤ Add structure, meaning:

✤ information extraction

✤ annotation

✤ content analytics
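
The "convert to a format you can work with" step can be sketched as follows, assuming the source is an HTML page and plain text is the target. This uses only Python's standard `html.parser`; the names `TextExtractor` and `html_to_text` are illustrative and not part of any tool named in these slides.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and ignores the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, joined with single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


page = ("<html><head><style>p {color: red}</style></head>"
        "<body><h1>Progress report</h1><p>Rocket test <b>passed</b></p></body></html>")
print(html_to_text(page))
```

The rendering information (the stylesheet) is dropped and only the content survives, which is the material the later extraction steps work on.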

Page 15: Getting Started with Unstructured Data

What about enterprise search?

✤ First line of defense

✤ Points you at the highest relevancy ranked data via pattern matching and statistical analysis

✤ Does not assist in other visualizations or transformations without further machine processing
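
Relevancy ranking by statistical analysis can be illustrated with a toy TF-IDF scorer. This is a sketch only, not how any particular enterprise search product ranks results; the sample documents, the smoothed IDF formula, and the function names are assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query, documents):
    """Return document indexes ordered by descending TF-IDF score for the query."""
    doc_counts = [Counter(tokenize(doc)) for doc in documents]
    n_docs = len(documents)
    scores = []
    for index, counts in enumerate(doc_counts):
        score = 0.0
        for term in tokenize(query):
            doc_freq = sum(1 for c in doc_counts if term in c)
            if doc_freq == 0:
                continue
            # Smoothed inverse document frequency: rarer terms weigh more.
            idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1.0
            score += counts[term] * idf
        scores.append((score, index))
    # Highest score first; ties keep the original document order.
    ordered = sorted(scores, key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in ordered]


docs = [
    "Quarterly progress report for the rocket program",
    "Lunch menu for the cafeteria",
    "Rocket launch decision recorded in email; rocket test scheduled",
]
print(rank("rocket launch", docs))
```

The result is a ranked list of documents and nothing more, which is why further machine processing is needed before the content can be visualized or transformed.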

Page 16: Getting Started with Unstructured Data

Machine Processing

[Diagram: Unstructured Data; Machine Processing Platform (Natural Language Processing, Statistical Analysis, Rules-based Classification, Semantic Analysis); Index; API; Visualizations; Federated Search; Data Stores]

Page 17: Getting Started with Unstructured Data

Let’s go a little deeper...

Page 18: Getting Started with Unstructured Data

Good News, Bad News

✤ Good: Basic text analysis tools are widely available; cheap or free

✤ Good: The range of information you can now consider has broadened; the intelligence you can bring to bear on that information has increased

✤ Bad: Skillsets not widely available (but they are available!)

✤ Good: You can get started right here, understanding, identifying the sources, and possible approaches

Page 19: Getting Started with Unstructured Data

What Data Doesn’t Do

✤ From Coco Krumme in “Beautiful Data”

✤ Data doesn’t drive everything.

✤ Note: “narrative fallacy,” “confirmation bias,” “paradox of choice”

✤ Data doesn’t: scale (cognitively), alone explain, predict

✤ The real world doesn’t create random variables

✤ Data doesn’t stand alone

Page 20: Getting Started with Unstructured Data

Images

From Oracle 11g presentation at www.nmoug.org/papers/11g_High_Level_April08.ppt

Integrating Unstructured Data

Page 21: Getting Started with Unstructured Data

The Goal: Usable Knowledge

✤ Information extraction is NOT the goal

✤ Information extraction is a means to an end

✤ Knowledge discovery is the goal

✤ To this end, we will perform lots of processing to move from bits to usable meaning

Page 22: Getting Started with Unstructured Data

So many <near> synonyms

✤ Text analytics

✤ Content analytics

✤ Text mining

✤ Data mining

✤ Information extraction

✤ And then there’s Natural Language Processing

Page 23: Getting Started with Unstructured Data

What’s the same?

✤ Moving from bits to meaning requires processing, and a lot of that processing is the same, no matter what you call it

✤ We will focus primarily on textual information today

Page 24: Getting Started with Unstructured Data

Natural Language

✤ From Peter Norvig’s “Natural Language Corpus Data” chapter in “Beautiful Data”

✤ Google’s 1 trillion-word corpus investigating probabilistic language models

✤ 13 million types (unique words, punctuation)

✤ 100k types cover 98% of the corpus

✤ For: word segmentation, spelling correction, language identification, spam detection, author identification

✤ %? = “chooses pain” ; “in sufficient numbers”
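
Word segmentation with a probabilistic language model can be sketched with a tiny unigram model. The word counts below are invented for illustration (a real model would use corpus counts such as those described above), and the unknown-word penalty is an assumption; the sketch only shows the idea of choosing the most probable split.

```python
import math
from functools import lru_cache

# Toy unigram counts, invented for this example.
COUNTS = {"choose": 50, "chooses": 10, "spain": 40, "pain": 30,
          "in": 500, "sufficient": 20, "insufficient": 15, "numbers": 60}
TOTAL = sum(COUNTS.values())


def log_prob(word):
    """Log probability of a word; unseen words get a penalty that grows with length."""
    if word in COUNTS:
        return math.log(COUNTS[word] / TOTAL)
    return math.log(1.0 / (TOTAL * 10 ** len(word)))


@lru_cache(maxsize=None)
def segment(text):
    """Return (log probability, words) for the most probable segmentation of text."""
    if not text:
        return 0.0, ()
    candidates = []
    for split in range(1, len(text) + 1):
        first, rest = text[:split], text[split:]
        rest_score, rest_words = segment(rest)
        candidates.append((log_prob(first) + rest_score, (first,) + rest_words))
    return max(candidates)


print(segment("choosespain")[1])
print(segment("insufficientnumbers")[1])
```

With these toy counts the alternatives "chooses pain" and "in sufficient numbers" are still valid splits; they simply score lower, which is the question the probabilities answer.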

Page 25: Getting Started with Unstructured Data

Information Extraction

✤ Token identification - “tokenization”

✤ Word segmentation

✤ Sentence splitting

✤ Part-of-speech tagging - “POS” tagging (noun, verb, adverb, adjective, etc.)

✤ Phrase identification - noun phrase

✤ Entity extraction - people, places, events, dates, organizations
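
The first two steps above, sentence splitting and tokenization, can be sketched with regular expressions. This is a deliberately naive illustration, not the approach of any specific toolkit; the function names are made up for the example, and the splitter will wrongly break after abbreviations such as "Dr.".

```python
import re


def split_sentences(text):
    """Split after ., ! or ? when followed by whitespace and an uppercase letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())


def tokenize(sentence):
    """Return (token, start, end) triples for words, numbers and punctuation marks."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", sentence)]


text = "The rocket was fired on Tuesday. It is shiny."
for sentence in split_sentences(text):
    print(sentence, tokenize(sentence))
```

Keeping the start and end offsets with each token matters later, because annotations are attached to character positions rather than to copies of the text.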

Page 26: Getting Started with Unstructured Data

Information Extraction

✤ Cluster analysis - group related information, where relationship may not be known

✤ Classification - mapping to specific categories

✤ Dependency identification / Rule generation

✤ Relationship detection - e.g. “Joe” “is CEO” at “IBM”

✤ Coreference resolution (anaphoric reference resolution)

✤ e.g., “Joe is CEO at IBM. He is an IEEE member.”

✤ Summarization - key concepts or key sentences
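
Relationship detection and a very naive form of pronoun resolution can be sketched on the example sentences above. The two patterns are hand-written assumptions that only cover this phrasing; real systems use much richer rules or learned models.

```python
import re

# "<Person> is [the] <role> at|of <Organization>"
RELATION = re.compile(r"\b([A-Z][a-z]+) is (?:the )?([A-Z]{2,}|[a-z]+) (?:at|of) ([A-Z][\w&]*)")
# "He|She is a|an <GROUP> member"
MEMBERSHIP = re.compile(r"\b(?:He|She) is an? ([A-Z]+) member")


def extract_facts(text):
    """Return (subject, predicate, object) facts, resolving He/She to the last person seen."""
    facts = []
    last_person = None
    for sentence in re.split(r"(?<=\.)\s+", text):
        relation = RELATION.search(sentence)
        if relation:
            person, role, organization = relation.groups()
            facts.append((person, role, organization))
            last_person = person
        membership = MEMBERSHIP.search(sentence)
        if membership and last_person:
            facts.append((last_person, "member", membership.group(1)))
    return facts


print(extract_facts("Joe is CEO at IBM. He is an IEEE member."))
```

Resolving "He" to the most recently mentioned person is the simplest possible heuristic and fails as soon as two people are discussed, which is why coreference is listed as its own task.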

Page 27: Getting Started with Unstructured Data

IR and IE

✤ IR (Information Retrieval) versus IE (Information Extraction)

✤ IR retrieves documents from collections; IE retrieves facts and structured information from collections

✤ In IR, the objects of analysis are documents; in IE, the objects of analysis are facts

✤ IE returns knowledge at a deeper level than traditional IR

✤ Results may be imperfect, and linking them back to documents adds value

✤ Sound familiar? (semantic web, linked data)

Page 28: Getting Started with Unstructured Data

Information Extraction

Two primary system types

Knowledge Engineering

✤ Rule based

✤ Developed by experienced language engineers

✤ Make use of human intuition

✤ Require only small amount of training data

✤ Development can be very time consuming

✤ Some changes may be hard to accommodate

Learning Systems

✤ Use statistics or other machine learning

✤ Developers do not need language engineering expertise

✤ Require large amounts of annotated training data

✤ Some changes may require re-annotation of the entire training corpus

From http://gate.ac.uk/sale/talks/gate-course-may11/track-1/module-2-ie/module-2-ie.pdf

Page 29: Getting Started with Unstructured Data

Two views of the semantic web

Machine learning, natural language processing, artificial intelligence and linked data

[Images from Wikipedia; labels: Subject, Predicate, Object; Text]

Page 30: Getting Started with Unstructured Data

Named Entities

✤ What is NER?

✤ Named Entity Recognition

✤ identifying proper names in texts, and classification into a set of predefined categories of interest

✤ Named entity recognition is the cornerstone of Information Extraction, providing a foundation from which to build complex information extraction systems

Page 31: Getting Started with Unstructured Data

Named Entities

✤ Person names

✤ Organizations (companies, government organizations, committees)

✤ Locations (cities, countries, rivers)

✤ Date and time expressions

✤ Measures (percent, money, weight)

✤ Email addresses, web addresses, street addresses

✤ Some domain-specific entities: names of drugs, medical conditions, names of ships, bibliographic references, etc.
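
Several of the entity types above have regular surface forms and can be found with patterns alone. The sketch below covers dates, percentages, money amounts and email addresses; the patterns are simplified assumptions (US-style dates, dollar amounts only) and would miss many real-world variants. Person, organization and location names usually need lists or context instead.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

PATTERNS = {
    "Date": rf"(?:{MONTHS}) \d{{1,2}}, \d{{4}}",
    "Percent": r"\d+(?:\.\d+)?%",
    "Money": r"\$\d[\d,]*(?:\.\d+)?",
    "Email": r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+",
}


def find_entities(text):
    """Return (kind, matched text, start, end) tuples ordered by position."""
    found = []
    for kind, pattern in PATTERNS.items():
        for match in re.finditer(pattern, text):
            found.append((kind, match.group(), match.start(), match.end()))
    return sorted(found, key=lambda item: item[2])


sample = "On November 29, 2011 revenue rose 15% to $1,200.50; contact info@example.com."
for entity in find_entities(sample):
    print(entity)
```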

Page 32: Getting Started with Unstructured Data

NOT Named Entities

✤ Artifacts - Wall Street Journal

✤ Common nouns, referring to named entities

✤ e.g. the company, the committee

✤ Names of groups of people and things named after people

✤ e.g. the Tories, the Nobel Prize

✤ Adjectives derived from names

✤ e.g. Bulgarian, Chinese

✤ Numbers which are not times, dates, percentages or money amounts

From http://gate.ac.uk/sale/talks/ne-tutorial.ppt

Page 33: Getting Started with Unstructured Data

Break Time!

Page 34: Getting Started with Unstructured Data

Open Tools

✤ GATE – General Architecture for Text Engineering, from the University of Sheffield, with many users and excellent documentation.

✤ GATE has customizable document and corpus processing pipelines. GATE is an architecture, a framework, and a development environment, with a clean separation of algorithms, data, and visualization.

Page 35: Getting Started with Unstructured Data

GATE

✤ “The Volkswagen Beetle of language processing”

✤ “...more than a decade of collecting reusable code and building a community has lead [to] a mature ecosystem for solving language processing problems quickly.”

✤ Hamish Cunningham 2010

Page 36: Getting Started with Unstructured Data

GATE – Key Features

✤ Component-based development

✤ Automatic performance measurement

✤ Clean separation between data structures and algorithms

✤ Consistent use of standard mechanisms for components to communicate data

✤ Insulation from data formats

✤ Provision of a baseline set of language components

Page 37: Getting Started with Unstructured Data

GATE – More...

✤ Free – open source, LGPL, Java

✤ Mature, at version 6, actively supported, 15 FTEs

✤ Comprehensive, standards-based, popular

✤ Used by thousands of companies, universities, and research laboratories

✤ Well-known, tested, researched, and very well-documented

Page 38: Getting Started with Unstructured Data

GATE Overview

✤ Architectural principles

✤ Non-prescriptive, theory neutral (strength and weakness)

✤ Re-use, interoperation, not reimplementation (diverse support, lots of plugins)

✤ (Almost) everything is a component, and component sets are user-extendable

✤ Component-based development

✤ CREOLE = modified Java Beans (Collection of REusable Objects for Language Engineering)

✤ The minimal component = 10 lines of Java, 10 lines of XML, 1 URL

Page 39: Getting Started with Unstructured Data

GATE – Family

✤ GATE Developer – an integrated development environment for language processing components bundled with the most widely used Information Extraction system and a comprehensive set of plugins

✤ GATE Embedded – an object library optimized for inclusion in diverse apps

✤ GATE Teamware – web app, a collaborative annotation environment

✤ GATE Cloud – parallel distributed processing

Page 40: Getting Started with Unstructured Data

GATE – Embedded

From http://gate.ac.uk/g8/page/print/2/sale/talks/gate-apis.png

Page 41: Getting Started with Unstructured Data

GATE – Teamware

✤ GATE Teamware – web app, a collaborative annotation environment for high-volume, factory-style semantic annotation, built with workflow

✤ Running in 5 minutes with Teamware virtual server from GATECloud.net (itself open source):

✤ Reusable project templates

✤ Project-specific roles, users

✤ Applying GATE-based processing routines

✤ Project status, annotator activity, statistics

Page 42: Getting Started with Unstructured Data

GATE – First Cousins

✤ Ontotext KIM: UIs demonstrating the multi-paradigm approach to information management, navigation and search

✤ Ontotext Mimir: a massively scalable multi-paradigm index built on Ontotext’s semantic repository family, GATE’s annotation structures database, plus full-text indexing from MG4J

✤ Ontotext FactForge: ~4B Linked Data statements, query-able

Page 43: Getting Started with Unstructured Data

GATE – Ontotext KIM

✤ Ontotext KIM: UIs, tools, GATE Gazetteers, including a Linked Data gazetteer (experimental)

✤ Pre-loaded knowledge base for entities

✤ Tools to upload, query, tailor the knowledge base, algorithms, UI

✤ Can crawl web, including Linked Data, creating semantic index: your servers, theirs, or cloud

✤ Based on GATE and OWLIM

Page 44: Getting Started with Unstructured Data

GATE – Ontotext KIM

From: http://www.ontotext.com/sites/default/files/pictures/diagram.png

Page 45: Getting Started with Unstructured Data

GATE – Ontotext KIM

Structure

Page 46: Getting Started with Unstructured Data

GATE – Ontotext KIM

Patterns

Page 47: Getting Started with Unstructured Data

GATE – Ontotext KIM

Ontology

Page 48: Getting Started with Unstructured Data

GATE – Ontotext KIM

Facets

Page 49: Getting Started with Unstructured Data

GATE – Ontotext MIMIR

✤ Ontotext Mimir: large scale indexing infrastructure supporting hybrid search (text, annotation, meaning); massively scalable multi-paradigm capability, combines MG4J full-text index and BigOWLIM semantic repository; query with text, structural info, and SPARQL

✤ Integrated with GATE, customizable, scalable

✤ Open source components

✤ Can federate multiple MIMIRs

✤ Low acquisition, management cost to scale

Page 50: Getting Started with Unstructured Data

GATE – Multi-paradigm

✤ Why “multi-paradigm?” Proliferation of retrieval technology options

✤ Full text, boolean, proximity, ranking; behavior mining, tag clouds; concept indexing: taxonomic, ontological; annotation-based

✤ Choice depends principally on content volume + value:

✤ High volume, low (average) value: web search

✤ Medium volume, higher (personal) value: social networks, photo sharing, tagging

✤ Low volume, high value: controlled vocabularies, taxonomies, ontologies

Page 51: Getting Started with Unstructured Data

GATE “Resources”

✤ Applications – groups of processes (that run on one or more documents)

✤ Language Resources – documents or document collections (corpus, corpora)

✤ Processing Resources – annotation tools that operate on text in documents

✤ Applications, made up of Processing Resources, operate on Language Resources
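
The relationship between the three resource kinds can be sketched in a few lines. This is a conceptual model only and does not use or imitate the GATE API; the class names and the two toy processing resources are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """A language resource: text plus the annotations added to it."""
    text: str
    annotations: list = field(default_factory=list)


def token_counter(doc):
    """Toy processing resource: records the number of whitespace-separated tokens."""
    doc.annotations.append(("TokenCount", len(doc.text.split())))


def all_caps_detector(doc):
    """Toy processing resource: records whether the text is entirely upper case."""
    doc.annotations.append(("AllCaps", doc.text.isupper()))


@dataclass
class Application:
    """Runs its processing resources, in order, over every document in a corpus."""
    processing_resources: list

    def run(self, corpus):
        for doc in corpus:
            for resource in self.processing_resources:
                resource(doc)
        return corpus


corpus = [Document("Data doesn't stand alone"), Document("BREAK TIME")]
Application([token_counter, all_caps_detector]).run(corpus)
for doc in corpus:
    print(doc.text, doc.annotations)
```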

Page 52: Getting Started with Unstructured Data

Plugins

✤ Applications – an application consists of any number of Processing Resources, run sequentially over documents

✤ Plugins – a plugin is a collection of one or more Processing Resources, bundled together.

✤ A plugin, then, must be loaded before its Processing Resources can be used in an application.

Page 53: Getting Started with Unstructured Data

GATE – Plugins (I)

Page 54: Getting Started with Unstructured Data

GATE – Plugins (II)

Page 55: Getting Started with Unstructured Data

GATE

Page 56: Getting Started with Unstructured Data

GATE Annotations

✤ Annotations are central to understanding GATE

✤ Annotations are associated with each document

✤ Each annotation has:

✤ start and end offsets

✤ an optional set of features

✤ each feature has a name and a value
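
The annotation model described above can be pictured as a small data structure. The sketch is not GATE's own class; the `kind` field (an annotation type such as Person) and the feature names are assumptions added to make the example readable.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A typed span of a document with optional name/value features."""
    kind: str
    start: int  # offset of the first character covered
    end: int    # offset just past the last character covered
    features: dict = field(default_factory=dict)

    def covered_text(self, text):
        """Return the part of the document text this annotation spans."""
        return text[self.start:self.end]


text = "Dr. Head is a staff scientist at We Build Rockets Inc."
org_name = "We Build Rockets Inc."
org_start = text.index(org_name)

person = Annotation("Person", 0, 8, {"title": "Dr."})
organization = Annotation("Organization", org_start, org_start + len(org_name))

print(person.covered_text(text), person.features)
print(organization.covered_text(text), organization.features)
```

Because an annotation stores offsets rather than a copy of the text, several annotations can overlap or nest over the same span without altering the document.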

Page 57: Getting Started with Unstructured Data

GATE Annotations

Page 58: Getting Started with Unstructured Data

GATE Annotations

Page 59: Getting Started with Unstructured Data

Information Extraction

✤ NE: Named Entity recognition and typing

✤ CO: CO-reference resolution

✤ TE: Template Elements

✤ TR: Template Relations

✤ ST: Scenario Templates

✤ Example:

The shiny red rocket was fired on Tuesday. It is the brainchild of Dr. Big Head. Dr. Head is a staff scientist at We Build Rockets Inc.

✤ NE: Entities are “rocket,” “Tuesday,” “Dr. Head” and “We Build Rockets”

✤ CO: “it” refers to the rocket; “Dr. Head” and “Dr. Big Head” are the same

✤ TE: the rocket is “shiny red” and Head’s “brainchild”

✤ TR: Dr. Head works for “We Build Rockets Inc.”

✤ ST: a rocket launching event occurred with the various participants

From http://gate.ac.uk/sale/talks/ne-tutorial.ppt

Page 60: Getting Started with Unstructured Data

ANNIE

✤ A Nearly-New Information Extraction System, packaged with GATE, used throughout examples, and a great place to start

✤ A collection of GATE Processing Resources to perform Information Extraction on unstructured text

✤ “Nearly new” – its name 10 years ago, that stuck

✤ Other information extraction systems include LingPipe and OpenNLP. GATE includes wrappers for LingPipe and OpenNLP, independently developed NLP pipelines. All three systems are provided as pre-built applications through the GATE File menu

Page 61: Getting Started with Unstructured Data

ANNIE

✤ “Processing Resources” inside ANNIE:

✤ Tokenizer, sentence splitter, part-of-speech tagger, gazetteers, named entity tagger, and an orthomatcher

✤ Also included are noun phrase and verb phrase chunkers

✤ Each “Processing Resource” inside ANNIE can be used as part of a pipeline you create to add annotations or modify existing ones

✤ ANNIE is a highly customizable, rule-based system, with very useful defaults

Page 62: Getting Started with Unstructured Data

ANNIE

✤ “Processing Resources” inside ANNIE:

✤ Gazetteer – lookup annotations (lists)

✤ JAPE transducer – date, person, location, organization, money, percent annotations

✤ Orthomatcher – adds match features to named entity annotations (coreference matching)

✤ Document Reset – removes annotations

Page 63: Getting Started with Unstructured Data

IE Steps in ANNIE

✤ “Tokenizer” performs Token identification and word segmentation

✤ “Sentence splitter” identifies sentences

✤ “POS” tagger performs Part-of-speech tagging – (noun, verb, adverb, adjective)

✤ Must run Tokenizer and Sentence Splitter before POS tagger

Page 64: Getting Started with Unstructured Data

IE Steps in ANNIE

✤ “Gazetteers” – lists of names (people, cities, groups); you can modify or add lists

✤ Each list has features (majorType, minorType, language)

✤ Gazetteers generate “Lookup” annotations with features corresponding to the matched list. When the text matches a gazetteer entry, a Lookup annotation is created.

✤ Lookup annotations are used by ANNIE’s Named Entity transducer for entity identification.
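
The gazetteer step can be sketched as a dictionary lookup that emits Lookup annotations carrying the features of the matched list. The entries and the dictionary layout are invented for illustration and are not ANNIE's list format; the sketch also handles single-word entries only.

```python
import re

# Hypothetical gazetteer entries: lowercased name -> features of the list it came from.
GAZETTEER = {
    "washington": {"majorType": "location", "minorType": "city"},
    "ibm": {"majorType": "organization", "minorType": "company"},
    "tuesday": {"majorType": "date", "minorType": "day"},
}


def lookup(text):
    """Return a Lookup annotation for every word that matches a gazetteer entry."""
    annotations = []
    for match in re.finditer(r"\w+", text):
        features = GAZETTEER.get(match.group().lower())
        if features:
            annotations.append({
                "type": "Lookup",
                "start": match.start(),
                "end": match.end(),
                "features": dict(features),
            })
    return annotations


text = "Joe joined IBM on Tuesday."
for annotation in lookup(text):
    print(text[annotation["start"]:annotation["end"]], annotation)
```

A Lookup only says that a string appears on a list; deciding whether "Washington" is a city or a person in a given sentence is left to the named entity rules that run afterwards.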

Page 65: Getting Started with Unstructured Data

ANNIE in GATE

Page 66: Getting Started with Unstructured Data

ANNIE in GATE

Page 67: Getting Started with Unstructured Data

ANNIE in GATE

Page 68: Getting Started with Unstructured Data

ANNIE Sequence

Pipeline sequence matters: tokenizer, sentence splitter, POS tagger, gazetteer
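
Why the sequence matters can be made concrete with a small ordering check. The prerequisite table encodes only the dependencies stated in these slides (POS tagger after tokenizer and sentence splitter; NE transducer after the first four steps; orthomatcher after the NE transducer); treating the other steps as having no prerequisites is an assumption of the sketch, and the function name is made up.

```python
# Step name -> steps that must already have run.
REQUIRES = {
    "tokenizer": set(),
    "sentence splitter": set(),
    "gazetteer": set(),
    "POS tagger": {"tokenizer", "sentence splitter"},
    "NE transducer": {"tokenizer", "sentence splitter", "POS tagger", "gazetteer"},
    "orthomatcher": {"NE transducer"},
}


def first_order_problem(pipeline):
    """Return a message for the first step whose prerequisites are missing, else None."""
    seen = set()
    for step in pipeline:
        missing = REQUIRES.get(step, set()) - seen
        if missing:
            return f"{step} needs {', '.join(sorted(missing))} first"
        seen.add(step)
    return None


good = ["tokenizer", "sentence splitter", "POS tagger", "gazetteer",
        "NE transducer", "orthomatcher"]
print(first_order_problem(good))
print(first_order_problem(["tokenizer", "POS tagger"]))
```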

Page 69: Getting Started with Unstructured Data

IE Steps in ANNIE

✤ “NE Transducer” – Named Entity Transducer performs named entity recognition (NER)

✤ Once we have built up the processing resource pipeline with the previous steps (tokeniser, sentence splitter, POS tagger, gazetteer), we are ready to add the transducer for named entity recognition

✤ More specific information can be added to the features now, including the “kind” of entity, and the rules that were fired

Page 70: Getting Started with Unstructured Data

IE Steps in ANNIE

✤ “OrthoMatcher” – orthographic co-reference matches proper names and their variants.

✤ Will match previously unclassified names, based on relations with classified entities

✤ Matches “Kevin Lynch” with “Dr. Lynch”

✤ Matches acronyms with expansions
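
The two kinds of match listed above can be approximated with simple string rules. This is a toy heuristic, not the OrthoMatcher's actual rule set; the title list and function names are assumptions.

```python
TITLES = {"dr", "mr", "mrs", "ms", "prof"}


def name_tokens(name):
    """Lowercase the name, strip trailing dots, and drop titles such as Dr. or Ms."""
    tokens = [token.strip(".").lower() for token in name.split()]
    return [token for token in tokens if token not in TITLES]


def same_person(name_a, name_b):
    """Treat two names as variants when their last tokens (surnames) agree."""
    tokens_a, tokens_b = name_tokens(name_a), name_tokens(name_b)
    return bool(tokens_a) and bool(tokens_b) and tokens_a[-1] == tokens_b[-1]


def is_acronym_of(short, long_form):
    """Compare the short form with the initials of the capitalized words."""
    initials = "".join(word[0] for word in long_form.split() if word[0].isupper())
    return short.replace(".", "").upper() == initials.upper()


print(same_person("Kevin Lynch", "Dr. Lynch"))
print(is_acronym_of("GATE", "General Architecture for Text Engineering"))
```

Matching on surname alone will merge different people who share a family name, so in practice such matches are restricted to names found within the same document.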

Page 71: Getting Started with Unstructured Data

IE Steps in ANNIE

✤ Tokenizer, sentence splitter, and OrthoMatcher are language-, domain-, and application-independent

✤ Part-of-speech tagger is language-dependent and application-independent

✤ Gazetteer lists are starting points (60K entries)

✤ ANNIE is a way to get started, with a framework for identifying the kinds of elements that matter to your work, and for quickly testing your ideas against existing data

Page 72: Getting Started with Unstructured Data

Annotations In Context

Page 73: Getting Started with Unstructured Data

Rules-based Classification

✤ Once a stand-alone project, now often part of annotation services

✤ Regex, Boolean and naive Bayesian algorithms executed on tokens

✤ And, Or, Not, Near (x), Multi, Stem, Exact, Phrase, et al (vendor or source dependent)

✤ Assigns documents to a taxonomic category

✤ Allows for greater control over depth and breadth of categories

✤ Human aided, machine processed
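
A few of the operators above (And, Or, Not, Near) can be sketched as small functions over a token list, with each taxonomy category defined by one rule. The category names and rules are invented for illustration; operator names and semantics vary by vendor, as noted above.

```python
import re


def tokens(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def has(term):
    return lambda toks: term in toks


def all_of(*rules):  # And
    return lambda toks: all(rule(toks) for rule in rules)


def any_of(*rules):  # Or
    return lambda toks: any(rule(toks) for rule in rules)


def none_of(rule):  # Not
    return lambda toks: not rule(toks)


def near(term_a, term_b, distance):  # Near (x): within x tokens of each other
    def rule(toks):
        positions_a = [i for i, tok in enumerate(toks) if tok == term_a]
        positions_b = [i for i, tok in enumerate(toks) if tok == term_b]
        return any(abs(i - j) <= distance for i in positions_a for j in positions_b)
    return rule


# Hypothetical taxonomy: category -> rule written by a person, applied by the machine.
TAXONOMY_RULES = {
    "Launch operations": all_of(near("rocket", "fired", 3), none_of(has("budget"))),
    "Finance": any_of(has("budget"), has("invoice")),
}


def classify(text):
    """Return every category whose rule matches the document text."""
    toks = tokens(text)
    return [category for category, rule in TAXONOMY_RULES.items() if rule(toks)]


print(classify("The shiny red rocket was fired on Tuesday."))
print(classify("Please pay the invoice before the rocket budget review"))
```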

Page 74: Getting Started with Unstructured Data

Rules-based Classification

Page 75: Getting Started with Unstructured Data

Break Time!

Page 76: Getting Started with Unstructured Data

Visualization - Prefuse

Page 77: Getting Started with Unstructured Data

Visualization - Prefuse

Page 78: Getting Started with Unstructured Data

Visualization - Prefuse

Page 79: Getting Started with Unstructured Data

Visualization - Prefuse

Page 80: Getting Started with Unstructured Data

Visualization - Prefuse

Page 81: Getting Started with Unstructured Data

Visualization - Prefuse

Page 82: Getting Started with Unstructured Data

Visualization - Gephi

Page 83: Getting Started with Unstructured Data

Visualization - Gephi

Page 84: Getting Started with Unstructured Data

Visualization - Cytoscape

Page 85: Getting Started with Unstructured Data

Quick!

1. Take one large pile of text (documents, emails, tweets, patents, papers, transcripts, blogs, comments, acts of parliament, and so on and so forth) -- call this your corpus

2. Pick a structured description of interesting things in the text (a telephone directory, or chemical taxonomy, or something from the Linked Data cloud) -- call this your ontology

3. Use GATE Teamware to mark up a gold standard example set of annotations of the corpus (1.) relative to the ontology (2.)

4. Use GATE Developer to build a semantic annotation pipeline to do the annotation job automatically and measure performance against the gold standard

5. Take the pipeline from 4. and apply it to your text pile using GATE Cloud (or embed it in your own systems using GATE Embedded)

6. Use GATE Mimir to store the annotations relative to the ontology in a multiparadigm index server. (For techies: this sits in the backroom as a RESTful web service.)

7. Use Ontotext KIM to add semantic search, knowledge facet search, ontology browsing, entity popularity graphing, time series graphing, annotation structure search and (last but not least) boolean full text search. (More techy stuff: mash up these types of search with your existing UIs.)
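
Measuring a pipeline against a gold standard usually comes down to precision, recall and F1 over the two annotation sets. The sketch below assumes annotations are compared as exact (start, end, type) matches, which is one common convention among several; the sample offsets are invented.

```python
def score(gold, predicted):
    """Return (precision, recall, F1) for exact-match annotations."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations as (start, end, type).
gold = [(0, 8, "Person"), (33, 54, "Organization"), (60, 67, "Date"), (70, 76, "Location")]
predicted = [(0, 8, "Person"), (33, 54, "Organization"), (60, 67, "Location")]

precision, recall, f1 = score(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here the third prediction has the right span but the wrong type, so under exact matching it counts as an error for both precision and recall.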

Page 86: Getting Started with Unstructured Data

Data Warehousing / Business Intelligence

✤ Perspective

✤ Process

✤ Use cases

✤ Implications with unstructured data

Page 87: Getting Started with Unstructured Data

DW/BI Perspective

✤ Structured data is an incomplete version of the “truth”

✤ Until information is quantified, it is not very useful

✤ Discover facts, and give them structure

✤ Complement structured data with unstructured data; try to complete the picture (of the business, the customer, performance)

Page 88: Getting Started with Unstructured Data

DW/BI Process

✤ Extract, then formalize

✤ Give information structure, then associations

✤ Map to existing structures in the data warehouse
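
The process above can be sketched with an in-memory SQLite database: extracted entity mentions are given a table of their own and linked to an existing dimension by name. The table names, columns and sample rows are hypothetical, and matching on a lowercased name is a simplification of real entity resolution.

```python
import sqlite3


def load_mentions(mentions):
    """Store (doc_id, entity_text) mentions and map them to an existing customer dimension."""
    conn = sqlite3.connect(":memory:")
    # Stand-in for a dimension that already exists in the warehouse.
    conn.execute("CREATE TABLE customer_dim (customer_id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO customer_dim VALUES (?, ?)",
                     [(1, "We Build Rockets Inc."), (2, "IBM")])
    # New structure for facts extracted from unstructured sources.
    conn.execute("CREATE TABLE mention_fact (doc_id TEXT, entity_text TEXT, customer_id INTEGER)")
    for doc_id, entity_text in mentions:
        row = conn.execute(
            "SELECT customer_id FROM customer_dim WHERE lower(name) = lower(?)",
            (entity_text,),
        ).fetchone()
        conn.execute("INSERT INTO mention_fact VALUES (?, ?, ?)",
                     (doc_id, entity_text, row[0] if row else None))
    conn.commit()
    return conn


conn = load_mentions([("email-17", "ibm"), ("blog-3", "Acme")])
rows = conn.execute("SELECT doc_id, customer_id FROM mention_fact ORDER BY doc_id").fetchall()
print(rows)
```

Mentions that do not map to a known customer are kept with an empty key rather than discarded, so they can be reviewed or mapped later.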

Page 89: Getting Started with Unstructured Data

DW/BI Use Cases

✤ Report indexing (of metadata, of instances)

✤ Report sections become possible

✤ Self-service for consumers

✤ “BI Search” (of those reports)

✤ Include in portal

✤ As range of reports and users increases, unstructured data approaches have more value

Page 90: Getting Started with Unstructured Data

DW/BI Use Case Ideas

✤ For customers, products, complaints, locations:

✤ Voice recognition indexing

✤ RSS feeds

✤ Wikis, blogs (internal and external)

✤ Instant messages

Page 91: Getting Started with Unstructured Data

DW/BI Implications

✤ Have to store these results

✤ Have to model these results

✤ Have to map these results to something meaningful

✤ Have to include the results in a useful way (Where? Use taxonomies? Which ones?)

✤ Quality, cost, and complexity matter; extracted entities don’t relate directly to performance

✤ Not a replacement, an addition to the technology

Page 92: Getting Started with Unstructured Data

Some Technical Issues

✤ Quality

✤ Integration

✤ Concurrency

✤ Security

✤ Skills

Page 93: Getting Started with Unstructured Data

Additional Open Tools

✤ UIMA – Unstructured Information Management Architecture (IBM’s Watson uses this), originated at IBM, now an Apache project.

✤ Component software architecture with a document processing pipeline similar to GATE. Focus on performance and scalability, with distributed processing (web services).

Page 94: Getting Started with Unstructured Data

UIMA

UIMA’s basic building blocks are Annotators. They iterate over an artifact to discover new types based on existing ones and update the Common Analysis Structure (CAS) for upstream processing.

[Diagram: a Common Analysis Structure (CAS) holding an artifact (e.g., a document) and the analysis results (i.e., artifact metadata). The example text “Fred Center is the CEO of Center Micros” carries Parser annotations (NP, VP, PP), Named Entity annotations (Person, Organization), and a Relationship annotation (CeoOf, Arg1: Person, Arg2: Org). Note on the chart: UIMA CAS representation now aligned with XMI standard.]

Chart by IBM

Page 95: Getting Started with Unstructured Data

UIMA

Image by IBM

Page 96: Getting Started with Unstructured Data

Commercial Tools

✤ Oracle Data Mining (Text Mining)

✤ IBM SPSS

✤ SAS Text Miner

✤ Smartlogic

✤ Lots of acquisitions going on in the “big data” space

✤ HP acquired Autonomy

✤ Oracle acquired Endeca

Page 97: Getting Started with Unstructured Data

A Note on Tools

✤ UIMA and GATE – comprehensive suite of capabilities, with learning curves.

✤ Commercial tools range from unstructured capabilities inside DBMSs like Oracle, to Business Objects business intelligence tools (Business Objects acquired Inxight, a Xerox PARC spin-off).

✤ Your mileage will vary. The biggest differentiator is your knowledge of your data.

Page 98: Getting Started with Unstructured Data

Questions?

Page 99: Getting Started with Unstructured Data

Thank you

Christine Connors

Kevin Lynch

www.triviumrlg.com

Page 100: Getting Started with Unstructured Data

What can unstructured data look like post-processing?
