Superior Search
How Text Analytics Improves Search, and the Resulting Applications
Jeff Fried CTO and VP Engineering, BA Insight [email protected]
Text Analytics and Search • How they’re connected • How they’re not
Beware of Over-hype • The technology will never be perfect • It doesn’t need to be
Applications • Search-Based Applications • Context Matters
Examples from:
Opinions from:
Search • BI • Text Analytics
10-12 © IDC
Search and Discovery Software
§ Enterprise search engines, information access platforms, and applications for browsing and navigation
§ Text mining and text analytics
§ Categorizers and clustering engines
§ Information infrastructure: metadata extractors, connectors, normalizers, taxonomy tools, controlled vocabularies
§ Language analyzers
§ Rich media search
§ Visualization, conversational systems, question answering systems
Search engines crawl documents to create an index. They match queries to documents using exact or fuzzy matching and may rank results by relevance to return pertinent information.
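The crawl, index, match, and rank steps described above can be sketched in a few lines. This is a minimal illustration only, assuming exact term matching and a raw term-frequency score; the function names (`tokenize`, `build_index`, `search`) and the sample documents are hypothetical and do not come from any product named in this deck.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()

def build_index(docs):
    """Build an inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def search(index, query):
    """Rank documents by the summed frequency of the query terms (exact match)."""
    scores = defaultdict(int)
    for term in tokenize(query):
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    # Highest score first; ties are broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical sample documents.
docs = {
    "d1": "Gold deposits in the Qilian fold belt",
    "d2": "Gold prices and gold futures",
    "d3": "Enterprise search platforms",
}
index = build_index(docs)
print(search(index, "gold futures"))  # ['d2', 'd1']
```

Production engines replace the raw term-frequency score with a relevance model and add fuzzy matching; the inverted-index structure is the part that carries over.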
Grab-Bag of Related Technologies
• Problem – linguistic variations in concept expression
– Technology: natural language processing (NLP)
• Problem – huge numbers of documents that are the same or versions of the same
– Technologies: text mining, text analytics, normalizing & de-duping
• Problem – amount of content exceeds amount of human expertise to analyze & categorize
– Technologies: entity extraction, contextual analysis, categorization
• Problem – understanding trends and relative values expressed in content
– Technology: sentiment analysis
• Problem – retrieving & federating contextually related and relevant content
– Technologies: all of the above
Common Techniques across Applications
Entity Extraction and Search
• Well Established
• Often essential to faceted navigation
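To illustrate why extracted entities matter for faceted navigation, here is a minimal sketch that turns per-document entity lists into facet counts. It assumes the entities have already been extracted and stored as fields on each result; the field names, the `facet_counts` helper, and the sample results are hypothetical.

```python
from collections import Counter, defaultdict

def facet_counts(results, fields):
    """Count how often each entity value occurs per facet field in a result set."""
    counts = defaultdict(Counter)
    for doc in results:
        for field in fields:
            for value in doc.get(field, []):
                counts[field][value] += 1
    # most_common() lists the most frequent values first, as a facet panel would.
    return {field: counts[field].most_common() for field in fields}

# Hypothetical search results with already-extracted entities.
results = [
    {"id": 1, "Person": ["David Chang"], "Location": ["Qilian"]},
    {"id": 2, "Person": ["David Chang"], "Location": ["Palo Alto"]},
    {"id": 3, "Person": [], "Location": ["Qilian"]},
]
print(facet_counts(results, ["Person", "Location"]))
```

Each facet value and its count can then be rendered as a clickable filter next to the result list.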
Categorization, Clustering, and Search
• Mainstream but less common
• “Display names” are important
• Value of taxonomy to search debatable
Sentiment Analysis and Search
• Still “leading edge”
• Essential for some applications
• UGC often used instead of machine-generated ratings
Social Media and Search
• Lots of buzz
• Some real applications and strong showcases
• Surprisingly robust to Twitter language
//twitterviz
So.cl
Social Search Covers a Wide Range
• Relevance – Filtering the document web
• Social Media Content – Filtering the social web
• Trends / Group Insight – Tapping Community Knowledge
• Answers – Trusted Advisor Recommendation
• “Java” (coffee, island, or language?)
• “compliance”
• “What should I do in New York?” • Where are my friends now?
• Why did power go out in Palo Alto?
• How does adoption work?
• (on FB update) anybody give their babies baby Benedryl for travel/jet lag? Want to hear from parents whether they have or not and how it went
“Concept Search” & Semantics
• LOTS of exciting innovations • Simple techniques still predominate
– Synonyms, Query Expansion, and Content Expansion “{auto car} {tire wheel}”
– Taxonomy and similarity search • Parent + sibling matches • Vector-based similarity searching
– Relationship-based searching • Relationship extraction • Relationship-based query orchestration
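The synonym-based query expansion mentioned above can be sketched as follows. This is one reading of the "{auto car} {tire wheel}" notation, taking each brace group as a set of interchangeable terms; the helper names and the Boolean output syntax are assumptions for illustration, not the syntax of any particular engine.

```python
# Synonym groups, following the "{auto car} {tire wheel}" example on the slide.
SYNONYM_GROUPS = [{"auto", "car"}, {"tire", "wheel"}]

def expand_query(query, groups=SYNONYM_GROUPS):
    """Replace each query term with the sorted list of its synonyms (itself included)."""
    expanded = []
    for term in query.lower().split():
        variants = {term}
        for group in groups:
            if term in group:
                variants |= group
        expanded.append(sorted(variants))
    return expanded

def to_boolean(expanded):
    """Render the expanded query as an AND of OR-groups."""
    return " AND ".join("(" + " OR ".join(v) + ")" for v in expanded)

print(to_boolean(expand_query("car tire")))  # (auto OR car) AND (tire OR wheel)
```

Content expansion applies the same groups at index time instead of query time, trading a larger index for simpler queries.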
Are these “Semantic Search”?
The State of Semantics
Content, composites, connections.
Text Analytics and Search • How they’re connected • How they’re not
Beware of Over-hype • The technology will never be perfect • It doesn’t need to be
Applications • Search-Based Applications • Context Matters
Text Analytics Isn’t Perfect
Realistic Expectations for Powerful Technology
Search Isn’t Perfect Human Language is Complex
Analytics! Semantics! Machine Learning!
Linguistics, Statistics, & Gymnastics
Lexicon Base
Language-specific Common Words
Inflection Dictionaries
Part-of-speech Dictionaries
Synonymy Dictionaries
Subject-specific ontologies
Spellcheck dictionaries
Geographical and people’s names
Special terminology lexica
Basic Linguistic Algorithms
Pattern extraction
Stemming / Lemmatization
Part-of-speech Tagging
Language normalization
Vectorization
Applications
Data Cleansing
Categorization
Entity Extraction
Suggest Synonyms
Find similar
Stop word elimination
Spell checking
Machine Translation
Relationship Extraction
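Three of the items listed above (stop word elimination, vectorization, and "find similar") fit together in a small sketch. It assumes a bag-of-words vector and cosine similarity, which is one common choice rather than the method of any specific product; the stop-word list, function names, and sample texts are illustrative only.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list; real lexica are language-specific and much larger.
STOP_WORDS = {"the", "a", "of", "in", "and", "to", "is"}

def vectorize(text):
    """Turn text into a bag-of-words vector after dropping stop words."""
    return Counter(t for t in text.lower().split() if t not in STOP_WORDS)

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def find_similar(query, docs):
    """Return the id of the document whose vector is closest to the query."""
    q = vectorize(query)
    return max(docs, key=lambda d: cosine(q, vectorize(docs[d])))

# Hypothetical sample documents.
docs = {
    "a": "gold deposits in the fold belt",
    "b": "enterprise search and text analytics",
}
print(find_similar("text analytics search", docs))  # b
```

Stemming or lemmatization would normally run before vectorization so that inflected forms share one dimension.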
From Entity Extraction
Acronym
Person Location End of sentence
End of paragraph
Date Base = 2002-03-XX
To Fact Extraction....
Substance Base=„Gold“ Class=„Element“ Number=79 Symbol=Au
Location Base=„Qilian“ Country=„China“ Region=„Asia“ Subregion=„East“
„The Red Valley property lies within the Qilian fold belt which is host to gold deposits.“
Qilian is location of gold
Extracted Fact: Substances x Locations
Substance Base=„Gold“ Class=„Element“ Number=79 Symbol=Au Location=„Qilian“
Location Base=„Qilian“ Country=„China“ Region=„Asia“ Subregion=„East“ Substance=„Gold“
Indicates a gold location
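The step from entity extraction to fact extraction shown above can be approximated with a small sketch. It assumes a lexicon lookup for entities and treats co-occurrence in one sentence as the relationship signal, which is a deliberate simplification: real systems use linguistic patterns rather than mere co-occurrence. The lexicon entries reuse the attribute values from the slide; the function names are hypothetical.

```python
# Lexicon entries with the attribute values shown on the slide.
SUBSTANCES = {"gold": {"Base": "Gold", "Class": "Element", "Number": 79, "Symbol": "Au"}}
LOCATIONS = {"qilian": {"Base": "Qilian", "Country": "China", "Region": "Asia", "Subregion": "East"}}

def extract_entities(sentence, lexicon):
    """Look up each token of the sentence in the lexicon (case-insensitive)."""
    tokens = {t.strip(".,;:").lower() for t in sentence.split()}
    return [lexicon[t] for t in sorted(tokens) if t in lexicon]

def extract_facts(sentence):
    """Pair every substance with every location found in the same sentence."""
    facts = []
    for substance in extract_entities(sentence, SUBSTANCES):
        for location in extract_entities(sentence, LOCATIONS):
            # Simplification: co-occurrence in one sentence counts as a fact.
            facts.append({**substance, "Location": location["Base"]})
    return facts

sentence = ("The Red Valley property lies within the Qilian fold belt "
            "which is host to gold deposits.")
print(extract_facts(sentence))
```

The resulting record matches the "Substances x Locations" fact on the slide: the Gold entity gains a `Location` attribute pointing at Qilian.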
Intelligent Answers from Text
Internal/external text sources
Courtesy of Linguamatics
Text Analytics and Search • How they’re connected • How they’re not
Beware of Over-hype • The technology will never be perfect • It doesn’t need to be
Applications • Search-Based Applications • Context Matters
Number & complexity of technologies/data sources
Big Data + TA: Applications Spectrum
Time Frame
eCommerce
Sentiment extraction
Smarter Planet
eDiscovery
Decision support
Alerting
Answer Machines (IBM Watson)
Predictions
Historic
Relationship Detection
Pattern Detection
Find influencers
Brand management
Climate Modeling And Prediction
Investment Trend Detection
Reputation management
Voice of Customer
Gov’t Intelligence Apps
Log Analysis
Future(Predict)
Ad targeting
Churn detection
Find drug interactions
Fraud Detection
Current (Monitor)
Role of Text Analytics
Interface
Smart Connectors, Schemas, XML tagged data
Information Infrastructure
Knowledge Bases
Workflow
Access and Analysis
InfoApps
Content and Data Sources
Data Prep: • Normalize • Extract • Tag • Parse
Analyze: • Relationships • Cause and Effect • Trends
Question Analysis: • Disambiguate • Expand
Context Matters
Information Overload
Relevancy Overload
What’s important to me right now
Audience-specific search experiences
User context
Information context
Application context
Social context
Renee Lo Engineering Contoso Consulting ”What should I know about implementing ERP?”
Alan Brewer Sales Manager Contoso Consulting ”What should I know about selling ERP consulting?”
Username & Group Memberships Location Languages
Business Unit Department
Team Time of Day
Preferred Sites SharePoint Audiences
Interests & Current Projects Context of Current Task
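The idea that the same question should return different results for different users can be sketched as a context-aware re-ranking step. This is a minimal sketch under stated assumptions: each result carries a base relevance score and an `audience` tag, and a match with the user's department doubles the score. The boost value, field names, and sample results are hypothetical.

```python
def rerank(results, user):
    """Re-rank results, boosting those tagged for the user's department."""
    def score(doc):
        boost = 1.0
        if doc.get("audience") == user.get("department"):
            boost += 1.0  # assumed boost; a real system would tune this
        return doc["relevance"] * boost
    return sorted(results, key=score, reverse=True)

# Hypothetical results for a query about ERP.
results = [
    {"id": "erp-implementation-guide", "relevance": 0.6, "audience": "Engineering"},
    {"id": "erp-sales-playbook", "relevance": 0.7, "audience": "Sales"},
]
renee = {"name": "Renee Lo", "department": "Engineering"}
alan = {"name": "Alan Brewer", "department": "Sales"}

print([d["id"] for d in rerank(results, renee)])
print([d["id"] for d in rerank(results, alan)])
```

Other context signals from the list above (location, language, current task) could feed the same scoring function as additional boosts.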
End-User Trend: New Class of Apps
Consumer Apps Search for Product, Restaurant, Travel
Enterprise Apps Search for Experts, Projects, Customers, Vendors, Parts,
✓ Intuitive ✓ Unified View ✓ Intuitive ✓ Targeted ✓ Context-aware ✓ Device-independent ✓ Actionable
New Class of Enterprise Apps, Powered by Search
TotalView for Customer Service
New Class of Enterprise Apps, Powered by Search
TotalView for Sales
New Class of Enterprise Apps, Powered by Search
TotalView for Professional Services
New Class of Enterprise Apps, Powered by Search
TotalView for Research
New Class of Enterprise Apps, Powered by Search
TotalView for Intelligence
Search-Based Applications
Low Cost
Quick to Deploy
$
Products Portal
Associate Portal
Tax Portal
Sales Portal
Tax Sales/Marketing Products HR
Agile Information Integration
No Data Movement. No code. Leverage Search-based unified store.
Search-Based Information Warehouse (shared service)
Search Platform (Microsoft, Lucene/Solr)
Presentation Services Metadata Enrichment Engine
Connector Framework
Search-Based Information Layer
Data Silos
• Email & Messaging
• ERP
• CRM
• ECM, Search, Collaboration
• Structured Data (databases)
• Unstructured Data (file shares)
• Public Web Sites
• Cloud/Office 365 Services
Client 360 Portal
Case Study: Pharma R&D
Top 3 reasons for 56% effort duplication:
1) Research done in separate groups • Seemingly unrelated research projects • Later in lifecycle (mfg, reg/test)
2) Data not accessible • Isolated content source • Restricted / limited access • Source not searchable • Special knowledge required
3) Data not linked • Various names/changes leave data disconnected • People not connected to data (experts) • Data managed in many unconnected systems
Case Study: Pharma R&D
1. Documentum Image 2. SharePoint Doc 3. Regulatory Record 4. MEDLINE article
Multiple Sources One Search
Search: amgen 655
Relationships Discovered: Antibodies: mAb Receptors: DR5, IGF-1R Labs: Oncology 1 People: David Chang
The Anatomy of Search
Source: http://searchpatterns.org
Summary
Text Analytics and Search continue to be kissing cousins
Let’s conspire to avoid more over-hype
Applications are where the action is; a new class of “mainstream” search-based applications is emerging