Learn how to create understanding from big data, and how entity extraction and open analytics create understanding from the deep web.
Value Mining: How Entity Extraction Informs Analysis
June 2012 | Andrew Strite
Agenda
• Big Data and Document Analysis
• Case Study: Federal Agency
– Problem Definition
– Open Analytics & Entity Extraction
– Reporting and Visualization
– Results Assessment
• Questions
The Big Data Problem
Data is becoming the new raw material of business: an economic input almost on par with capital and labor.
“Every day I wake up and ask, ‘How can I flow data better, manage data better, analyze data better?’”
Rollin Ford, the CIO of Wal-Mart
Solution: Document Analysis
“Document Analysis refers to computer-assisted analysis of large numbers of documents in order to answer questions about the content of a document set.”
Source: http://www.text-tech.com/docanalysis/definition.html
Document Analysis
• The goal is to:
– Extract Entities (people, places, things)
– Create Associations between entities (in the form of noun-verb-noun), e.g.:
  • John Doe lives in Washington, D.C.
  • John Doe is married to Jane Doe
  • John Doe is a Virgo
  • John Doe traveled to Mexico on July 6th, 2011
• And…
Document Analysis
• Turn Who, What, When and Where into a unified data structure that supports data analytics and visualization.
Who: people, organizations, facilities, companies
What: events, summaries, facts, themes
When: past, present, future dates
Where: city, state, country, coordinate
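One way to picture such a unified structure is sketched below in Python. This is only an illustration under stated assumptions: the class names (`Entity`, `Association`, `DocumentRecord`), the field names, and the document id are hypothetical and are not taken from Infinit.e or any other product. It stores entities tagged as who/what/when/where, plus noun-verb-noun associations between them, using the John Doe example from the earlier slide.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """One extracted entity. `kind` is "who", "what", "when", or "where"."""
    name: str
    kind: str


@dataclass(frozen=True)
class Association:
    """A noun-verb-noun link between two entities."""
    subject: Entity
    verb: str
    obj: Entity


@dataclass
class DocumentRecord:
    """Hypothetical per-document container for entities and associations."""
    doc_id: str
    entities: list = field(default_factory=list)
    associations: list = field(default_factory=list)

    def by_kind(self, kind):
        """Return the names of all entities of one kind."""
        return [e.name for e in self.entities if e.kind == kind]


# "John Doe traveled to Mexico on July 6th, 2011"
john = Entity("John Doe", "who")
mexico = Entity("Mexico", "where")
trip_date = Entity("2011-07-06", "when")

record = DocumentRecord(doc_id="example-1")  # illustrative id
record.entities += [john, mexico, trip_date]
record.associations.append(Association(john, "traveled to", mexico))

print(record.by_kind("who"))
print(record.by_kind("where"))
```

Keeping every document in the same shape is what lets later queries and visualizations treat very different source files uniformly.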
Case Study: Federal Agency
Overview
A Federal client produced reports for other DoD components and wanted to know:
“Did our reports meet customer needs?”
First step: assess historical reporting
“What were teams writing about and when?”
Problem: Unstructured Data
• Plenty of raw data, but no way to get at it
– 6K+ unstructured documents
– 15+ file types
– No standard formats
  • Teams (Who)
  • Dates (When)
  • Topics (What)
– Some content not relevant
Early Attempts
• Initial client attempts to solve the problem mostly involved manual review
– High document volume = labor intensive
– Assessing relevance = skilled labor
• Total process tied up skilled analysts for hundreds of man-hours.
• Manual review prone to error
– Incomplete attempts corrupted data
Solution: Open Analytics
• Process to design and implement analytical solutions
• Joins open tools and agile engineering techniques
• Goal is to enable organizations to quickly deliver smart analysis and enable top line growth
Mechanism: Infinit.e
Infinit.e is a scalable framework for Collecting, Storing, Enriching, Retrieving, Analyzing, and Visualizing Unstructured documents & Structured records.
Infinit.e Concept
80% Unstructured
• Documents
• Presentations
• Spreadsheets
• Meeting notes
• Email
• IM chats
• Reports
• Social
20% Structured
• Log files
• Databases
• Apps
Unstructured and Structured Data
• Entities
• Events
• Facts
• Sentiment
• Geospatial
• Temporal
• Themes
Infinit.e Data Model
Tablet ownership levels hit 18% in China, the UK and US versus 3% in November 2010
Bernanke, 57, said in his testimony price increases “have begun to moderate” after a jump in oil costs earlier this year
Duke and Progress announced merger plans in January 2012
<Incident> <uid>20101043423</uid> <subject>1 person killed in armed attack by suspected Boko Haram in Maiduguri, Borno, Nigeria</subject> <multipleDays>No</multipleDays> <eventDate>06/04/2011</eventDate></Incident>
Who: people, organizations, facilities, companies
What: events, summaries, facts, themes
When: past, present, future dates
Where: city, state, country, coordinate
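As a rough illustration of how a structured record like the `<Incident>` example could be mapped onto the same What/When fields, here is a minimal Python sketch using only the standard library. The function name `incident_to_record`, the output keys, and the month/day/year reading of `eventDate` are assumptions made for this sketch. The source does not say which date convention applies, and this is not Infinit.e's actual ingestion code.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

INCIDENT_XML = (
    "<Incident>"
    "<uid>20101043423</uid>"
    "<subject>1 person killed in armed attack by suspected Boko Haram "
    "in Maiduguri, Borno, Nigeria</subject>"
    "<multipleDays>No</multipleDays>"
    "<eventDate>06/04/2011</eventDate>"
    "</Incident>"
)


def incident_to_record(xml_text, date_format="%m/%d/%Y"):
    """Map one <Incident> element onto id / what / when fields.

    date_format is an assumption (month/day/year); pass "%d/%m/%Y"
    if the feed uses day/month/year instead.
    """
    root = ET.fromstring(xml_text)
    raw_date = root.findtext("eventDate", default="")
    when = None
    if raw_date:
        when = datetime.strptime(raw_date, date_format).date().isoformat()
    return {
        "id": root.findtext("uid"),
        "what": root.findtext("subject", default=""),
        "when": when,
        # Who and Where would come from running entity extraction
        # over the subject text; left empty in this sketch.
        "who": [],
        "where": [],
    }


incident = incident_to_record(INCIDENT_XML)
print(incident["id"], incident["when"])
```

The structured fields map directly, while Who and Where still depend on extraction from the free-text subject, which is where the unstructured and structured paths meet.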
Applying Infinit.e
Open Analytic and Agile Intelligence architecture
“What were teams writing about and when?”
Harvested Entities
Reporting and Visualization
• Queries performed on the data, providing breakouts by team, topic, and dates
• Flexible visualization
– Built-in visualization framework
– Multiple export options
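The breakouts by team, topic, and date can be pictured as simple grouped counts over the extracted meta-data. The Python sketch below is illustrative only: the records, team names, and the `breakout` helper are made up for the example and do not reflect the client's data or the Infinit.e query API.

```python
from collections import Counter

# Hypothetical extracted meta-data, one dict per document.
documents = [
    {"team": "Team A", "topic": "TOPIC1", "date": "2011-03-14"},
    {"team": "Team A", "topic": "TOPIC2", "date": "2011-03-20"},
    {"team": "Team B", "topic": "TOPIC1", "date": "2011-04-02"},
    {"team": "Team A", "topic": "TOPIC1", "date": "2011-04-18"},
]


def breakout(docs, key):
    """Count documents per group, where key(doc) returns the group label."""
    return Counter(key(doc) for doc in docs)


# "What were teams writing about and when?"
by_team = breakout(documents, lambda d: d["team"])
by_topic_month = breakout(documents, lambda d: (d["topic"], d["date"][:7]))

for group, count in sorted(by_topic_month.items()):
    print(group, count)
```

Because each breakout is just a different key over the same records, new cuts of the data need no re-processing of the source documents.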
Finding Value
• Over the course of 2.5 weeks, we applied the entity-based data model to our client’s document analysis problem
• Major advantages to this approach were:
– Agility
– Precision
– Relevance
Agility
• Automation reduced processing time:
– Manual processing time: ~480 hours
– Automated processing time: 2-3 hours
• Speed enabled iterative development
– Extraction adapted alongside analysts’ understanding of data
– Positive feedback loop
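To put the two processing times side by side, the short Python calculation below works out the implied speedup. It uses only the figures quoted on the slide, and since the manual figure is approximate (~480 hours), the results are rough as well.

```python
MANUAL_HOURS = 480          # "~480 hours" from the slide (approximate)
AUTOMATED_HOURS = (2, 3)    # "2-3 hours" from the slide

# Speedup factor at each end of the automated range.
speedups = [MANUAL_HOURS / hours for hours in AUTOMATED_HOURS]

# Analyst hours freed per full pass over the document set.
hours_saved = [MANUAL_HOURS - hours for hours in AUTOMATED_HOURS]

print(speedups)     # [240.0, 160.0]
print(hours_saved)  # [478, 477]
```

A pass that takes hours rather than weeks is what makes it practical to re-run extraction each time the entity definitions change.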
Precision
• Entity definitions created from original data
– Definitions improved based on feedback
• Automation ensures uniform application across data set
[Figure: entity1, entity2, and entity3 mapped to TOPIC1 and TOPIC2]
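A minimal sketch of how entity definitions might be applied uniformly and then rolled up into topics is shown below in Python. Everything here is hypothetical: the patterns, the entity-to-topic mapping, and the function names are invented for illustration, and the source does not describe the actual definitions or the extraction engine used.

```python
import re

# Hypothetical definitions: entity name -> case-insensitive patterns.
# In practice these would be drafted from the original documents and
# refined as analysts give feedback.
ENTITY_DEFINITIONS = {
    "entity1": [r"\bsupply chain\b", r"\blogistics\b"],
    "entity2": [r"\bcyber\w*\b"],
    "entity3": [r"\btraining\b"],
}

# Hypothetical roll-up: topic -> entities that indicate it.
TOPIC_RULES = {
    "TOPIC1": {"entity1", "entity3"},
    "TOPIC2": {"entity2"},
}

COMPILED = {
    name: [re.compile(pattern, re.IGNORECASE) for pattern in patterns]
    for name, patterns in ENTITY_DEFINITIONS.items()
}


def extract_entities(text):
    """Return the set of entity names whose patterns appear in text."""
    return {
        name
        for name, patterns in COMPILED.items()
        if any(pattern.search(text) for pattern in patterns)
    }


def assign_topics(entities):
    """Return the sorted topics indicated by at least one found entity."""
    return sorted(
        topic for topic, members in TOPIC_RULES.items() if entities & members
    )


found = extract_entities("Logistics and cybersecurity training update")
print(sorted(found), assign_topics(found))
```

Since the same definitions run against every document, a change made after analyst feedback applies to the whole data set on the next pass.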
Relevance
• Entity extraction informs quality control
– Duplicates identified based on similar entities
– Exclude documents based on missing entities
– Minimizes risk of data corruption
– Reduced need for analyst review
[Figure labels: Duplicates; Missing Meta-Data]
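The two quality-control checks above could be sketched as follows in Python. Jaccard similarity over entity sets is one plausible way to measure "similar entities"; the source does not name the measure actually used. The threshold, the required fields, and the sample documents are likewise assumptions for illustration.

```python
from itertools import combinations

REQUIRED_FIELDS = ("team", "date", "topic")  # assumed key meta-data


def jaccard(a, b):
    """Share of entities two documents have in common (0.0 to 1.0)."""
    if not a and not b:
        return 0.0  # two empty sets are not treated as duplicates
    return len(a & b) / len(a | b)


def find_duplicates(docs, threshold=0.8):
    """Return id pairs whose entity sets overlap at or above threshold."""
    return [
        (left["id"], right["id"])
        for left, right in combinations(docs, 2)
        if jaccard(left["entities"], right["entities"]) >= threshold
    ]


def missing_metadata(doc):
    """List the required fields a document lacks."""
    return [name for name in REQUIRED_FIELDS if not doc.get(name)]


docs = [
    {"id": "d1", "team": "Team A", "date": "2011-03-14", "topic": "TOPIC1",
     "entities": {"a", "b", "c", "d"}},
    {"id": "d2", "team": "Team A", "date": "2011-03-14", "topic": "TOPIC1",
     "entities": {"a", "b", "c", "d"}},
    {"id": "d3", "team": None, "date": "2011-04-02", "topic": "TOPIC2",
     "entities": {"a", "x"}},
]

duplicates = find_duplicates(docs)
excluded = [d["id"] for d in docs if missing_metadata(d)]
print(duplicates, excluded)
```

Flagged documents can be dropped or routed to an analyst, so manual review is spent only on the cases the checks cannot settle.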
The Results
• Extracted entities became key meta-data
6K+ unstructured documents became…
…3.5K documents with value to the study
The Results
• Our client was able to complete the research shortly after final extraction
• Confidence in methodology and results bolstered the value of recommendations
• The client is considering similar approaches for future projects
Bottom Line
Using document analysis significantly…
… reduces the time to ingest data.
… cuts right to relevant information.
… builds a framework for future analysis.
Thank You!
Andrew Strite
www.ikanow.com
301.513.1384