Upload
lucenerevolution
View
965
Download
0
Tags:
Embed Size (px)
DESCRIPTION
Presented by Ted Dunning, Chief Application Architect, MapR & Grant Ingersoll, Chief Technology Officer, LucidWorks Search has quickly evolved from being an extension of the data warehouse to being run as a real time decision processing system. Search is increasingly being used to gather intelligence on multi-structured data leveraging distributed platforms such as Hadoop in the background. This session will provide details on how search engines can be abused to use not text, but mathematically derived tokens to build models that implement reflected intelligence. In such a system, intelligent or trend-setting behavior of some users is reflected back at other users. More importantly, the mathematics of evaluating these models can be hidden in a conventional search engine like SolR, making the system easy to build and deploy. The session will describe how to integrate Apache Solr/Lucene with Hadoop. Then we will show how crowd-sourced search behavior can be looped back into analysis and how constantly self-correcting models can be created and deployed. Finally, we will show how these models can respond with intelligent behavior in realtime.
Citation preview
1 ©MapR Technologies - Confidential
Crowd Sourcing Reflected Intelligence Using Search and Big Data Ted Dunning, MapR Grant Ingersoll, LucidWorks
2 ©MapR Technologies - Confidential
Is Search Enough?
• Keyword search is a commodity
• Holistic view of the data and the user interactions with that data are critical
• Search, Discovery and Analytics are the key to unlocking this view of users and data
Search
MapR and LucidWorks
3 ©MapR Technologies - Confidential
Grant’s Background
Co-founder: – LucidWorks – CTO
– Apache Mahout
Long time Lucene/Solr committer
Author: Taming Text
Background in IR and NLP – Built CLIR, QA and a variety of other search-based apps
4 ©MapR Technologies - Confidential
Ted’s Background
Academia, Startups – MapR, MusicMatch, ID Analytics, Veoh Networks
– Big data since before big
Open source – since the dark ages before the internet
– Mahout, Zookeeper, Drill
– bought the beer at first HUG
Co-Author – Mahout in Action
MapR – Chief Application Architect
Founding member of Apache Drill
5 ©MapR Technologies - Confidential
Agenda
Intro
Search (R)evolution
Reflected Intelligence Use Cases
Building a Next Generation Search and Discovery Platform – MapR
– LucidWorks
1+1=3
6 ©MapR Technologies - Confidential
User Interactions With Big Data
Data
Data
Data
DFS
Key Value Store
Index
Command Line
Query Language
Keyword Search
System Administrator
Engineer
End User
7 ©MapR Technologies - Confidential
User Interactions With Big Data
Data
Data
Data
DFS
Key Value Store
Index
Command Line
Query Language
Keyword Search
System Administrator
Engineer
End User
Reflected Intelligence
8 ©MapR Technologies - Confidential
Search (R)evolution
Search use leads to search abuse – denormalization frees your mind – scoring is just a sparse matrix multiply
Lucene/Solr evolution – non free text usages abound – many DB-like features – noSQL before NoSQL was cool – flexible indexing – finite State Transducers FTW!
Scale
“This ain’t your father’s relevance anymore”
9 ©MapR Technologies - Confidential
Search, Discovery and Analytics
Large-scale analysis is key to reflected intelligence – correlation analysis
• based on queries, clicks, mouse tracks,
even explicit feedback
• produce clusters, trends, topics, SIP’s
– start with engineered knowledge,
refine with user feedback
Large-scale discovery features
encourage experimentation
Always test, always enrich!
Search
Discovery Analytics
10 ©MapR Technologies - Confidential
Social Media Analysis in Telecom
Goal – Detect flash-mob traffic events
– Provision additional resources before failures
Method: Correlate mobile traffic analysis with social media analysis – events cause traffic micro-bursts
– participants tweet the events ahead of time
– tweet locations converge on burst location
Deploy operations faster to predict outages and better handle emergency situations – high cost bandwidth augmentation can be marshaled as the traffic appears
– anticipation beats reaction
11 ©MapR Technologies - Confidential
Provenance is 80% of Value
Problem – Broadcasters don’t know what audiences really like at a micro level
Method: – Analysis of social media to determine advertising reach and response
– Time resolution of social traffic can provide detailed response metrics
Results: – In one case the untargeted advertising was worth 5x more if with
supporting response data
12 ©MapR Technologies - Confidential
Claims Analysis
Goal – Insurance claims processing and analysis
– fraud analysis
Method – Combine free text search with metadata analysis to identify high risk
activities across the country
– Integrate with corporate workflows to detect and fix outliers in customer relations
Results – Questions that took 24-48 hours now take seconds to answer
13 ©MapR Technologies - Confidential
Virginia Tech - Help the World
Grab data around crisis
Search immediately
Large-scale analysis enriches data to find ways to improve responses and understanding
http://www.ctrnet.net
14 ©MapR Technologies - Confidential
Can Search Catch the Bad Guys?
Online Drug Counterfeit detection
Identify commonly used language indicating counterfeits – you know it when you see it
– and you know you have seen it
Feed to analyst via search-driven application – enrich based on analysts feedback
15 ©MapR Technologies - Confidential
Veoh - Cross Recommendations
Cross recommendation as search – with search used to build cross recommendation!
Recommend content to people who exhibit certain behaviors (clicks, query terms, other)
(Ab)use of a search engine – but not as a search engine for content
– more like a search engine for behavior
16 ©MapR Technologies - Confidential
Recommendation Basics
History:
User Thing
1 3
2 4
3 4
2 3
3 2
1 1
2 1
17 ©MapR Technologies - Confidential
Recommendation Basics
History as matrix:
t1+t3 cooccur 2 times, t1+t4 once, t2+t4 once
t1 t2 t3 t4
u1 1 0 1 0
u2 1 0 1 1
u3 0 1 0 1
18 ©MapR Technologies - Confidential
Recommendation Basics
Coocurrence
t1 t2 t3 t4
t1 2 0 2 1
t2 0 1 0 1
t3 2 0 1 1
t4 1 1 1 2
t3 not t3
t1 2 1
not t1 1 1
21 ©MapR Technologies - Confidential
Spot the Anomaly
Root LLR is roughly like standard deviations
A not A
B 13 1000
not B 1000 100,000
A not A
B 1 0
not B 0 2
A not A
B 1 0
not B 0 10,000
A not A
B 10 0
not B 0 100,000
0.44 0.98
2.26 7.15
22 ©MapR Technologies - Confidential
Threshold by Score
Coocurrence
t1 t2 t3 t4
t1 2 0 2 1
t2 0 1 0 1
t3 2 0 1 1
t4 1 1 1 2
23 ©MapR Technologies - Confidential
Threshold by Score
Indicators
t1 t2 t3 t4
t1 1 0 0 1
t2 0 1 0 1
t3 0 0 1 1
t4 1 0 0 1
indicator: t2, t4
24 ©MapR Technologies - Confidential
Search Engine for Reflected Intelligence
Map-reduce “big data” part – Logs record user + item occurrence
– Group by user to get rows of occurrence matrix
– Self-join to get cooccurrence
– Log-likelihood test to find anomalies
Search part – Anomalous cooccurrences are indicators
(or use statistical scores to provide fancy boosts)
– Indicator fields and other meta-data are indexed
– Recommendation implemented using a single search
– Boosts, functions, similarity also can reflect learned behavior
25 ©MapR Technologies - Confidential
What Platform Do You Need?
Fast, efficient, scalable search – bulk and near real-time indexing
– handle billions of records with sub-second search and faceting
Large scale, cost effective storage and processing capabilities
NLP and machine learning tools that scale to enhance discovery and analysis
Integrated log analysis workflows that close the loop between the raw data and user interactions
26 ©MapR Technologies - Confidential
Shards
1 2
3 N
Search View
•Documents •Users •Logs
Document Store
Analytic Services
View into numeric/historic data
Classification Recommendation
Personalization & Machine Learning
Services
Classification Models In memory Replicated Multi-tenant
Discovery & Enrichment Clustering, classification, NLP, topic identification, search log analysis, user behavior
Content Acquisition ETL, batch or near real-time
Access APIs
Data • LucidWorks Search
connectors • Push
Reference Architecture
27 ©MapR Technologies - Confidential
MapR
MapR provides the technology leading Hadoop distribution – full eco-system distribution
– integrated data platform
– complete solution for data integrity
MapR clusters also provide tight integration with search technologies like LucidWorks – integration is key for effective ops
28 ©MapR Technologies - Confidential
LucidWorks
LucidWorks provides the leading packaging of Apache Lucene and Solr – build your own, we support
– founded by the most prominent Lucene/Solr experts
LucidWorks Search – “Solr++”
• UI, REST API, MapR connectors, relevance tools, much more
LucidWorks Big Data – Big Data as a Service
– Integrated LucidWorks Search, Hadoop, machine learning with prebuilt workflows for many of these tasks
29 ©MapR Technologies - Confidential
LucidWorks Big Data
Inputs
API
Mgmt Search, Discovery, Analytics
Processing & Storage
Analytics Service Document Service
Big Data LucidWorks Web HDFS
Admin
Service Mgmt
Data Mgmt
Provisioning, Monitoring & Configuration
30 ©MapR Technologies - Confidential
Easy Wins
Analyze logs from application stored in MapR
Seamlessly store search indexes in MapR – and feed to Pig, Mahout and others
– use mirrors + NFS to directly deploy indexes
Snapshots make backups a snap – A lot like version control for data
LucidWorks 2.5.2 easily connects with MapR
31 ©MapR Technologies - Confidential
1 + 1 = 3
32 ©MapR Technologies - Confidential
Learn More
Talk to Ted @ted_dunning
Talk to Grant @gsingers
MapR and Lucid Works http://www.mapr.com
http://www.lucidworks.com
Hash Tags
#mapr #lucenesolr #lucidworks