Confidential © Copyright 2012 Large Scale Search, Discovery and Analysis in Action Grant Ingersoll Chief Scientist September 18, 2012

Large Scale Search, Discovery and Analytics in Action

Page 1: Large Scale Search, Discovery and Analytics in Action

Confidential © Copyright 2012

Large Scale Search, Discovery and Analysis in Action

Grant Ingersoll, Chief Scientist

September 18, 2012

Page 2: Large Scale Search, Discovery and Analytics in Action

Confidential and Proprietary © 2012 LucidWorks

Search is Dead, Long Live Search

• Good keyword search is a commodity and easy to get up and running

• The Bar is Raised

  • Relevance is (always will be?) hard

• Holistic view of the data AND the users is critical

• Search, Discovery and Analytics are the key to unlocking this view of users and data

[Diagram labels: Documents, User Interaction, Access, Content Relationships]

Page 3: Large Scale Search, Discovery and Analytics in Action

Topics

• Background and needs

• Architecture

  • Road Ahead

• SDA In Action

  • Components

  • Challenges and Lessons Learned

• Wrap Up

Page 4: Large Scale Search, Discovery and Analytics in Action

Why Search, Discovery and Analytics (SDA)?

• User Needs

  • Real-time, ad hoc access to content

  • Aggressive Prioritization based on Importance

  • Serendipity

  • Feedback/Learning from past

• Business Needs

  • Deeper insight into users

  • Leverage existing internal knowledge

  • Cost effective

[Diagram labels: Search, Discovery, Analytics]

Page 5: Large Scale Search, Discovery and Analytics in Action

Sample Use Cases

• Claims processing and analysis, including fraud analysis

• Large scale content acquisition and access for:

  • Defense, intelligence and pharmaceutical applications

• Views of data surrounding natural disasters and other tragedies for research, archiving and therapeutic purposes

• Analysis of Website and social media interactions

• Access and processing of genetic information for improved medical treatments

• Log processing and fraud detection in telecommunications

Page 6: Large Scale Search, Discovery and Analytics in Action

In Focus: Personalized Medicine

[Diagram labels: Genetic Variations, Patient DNA, Alignment and other analysis, Search and Faceting, Standard Therapies, Alternative Therapies]

Page 7: Large Scale Search, Discovery and Analytics in Action

In Focus: Log Processing in Telecommunications

• Each year, large sums of money are lost due to fraudulent calls and poor service

• Logs are usually semi-structured and contain vital information about errors and fraud

• Deeper batch analytics can provide insight into patterns across vast amounts of data

• Search of call and network information (via logs) is critical to providing deeper analysis and understanding of these errors and fraudulent activities

Page 8: Large Scale Search, Discovery and Analytics in Action

What Does an SDA Platform Need?

• Fast, efficient, scalable search

  • Bulk and Near Real Time Indexing

  • Handle billions of records with sub-second search and faceting

• Large scale, cost effective storage and processing capabilities

  • Need whole data consumption and analysis

  • Experimentation/Sampling tools

• NLP and machine learning tools that scale to enhance discovery and analysis

Page 9: Large Scale Search, Discovery and Analytics in Action

Architecture

Page 10: Large Scale Search, Discovery and Analytics in Action

Under the Hood

LucidWorks 2.1

• Lucene/Solr 4.0-dev

• Sharded with SolrCloud

  • 1 second (default) soft commits for NRT updates

  • 1 minute (default) hard commits (no searcher reopen)

  • Transaction logs for recovery

  • Solr takes care of leader election, etc. so no more master/slave

SDA Engine

• RESTful services built on Restlet 2.1

• Service Discovery, load balancing, failover enabled via ZooKeeper + Netflix Curator

• Authentication and authorization over SSL (optional)

• Proxies for LucidWorks and WebHDFS API

• Workflow engine coordinates data flow

Page 11: Large Scale Search, Discovery and Analytics in Action

Under the Hood, cont.

• Apache Hadoop

  • Map-Reduce (MR) jobs for ETL and bulk indexing into SolrCloud sharded system

  • Leverage Pig and custom MR jobs for log processing and metric calculation

  • WebHDFS

• Apache Mahout

  • K-Means Clustering

  • Statistically Interesting Phrases

  • Similar Docs

  • More to come

• Apache HBase

  • Key-value and time series of all calculated metrics

  • Document storage

• Apache Pig

  • ETL

  • Log analysis -> HBase

• Apache ZooKeeper

  • Netflix Curator for service discovery and higher level ZK client

• Apache Kafka

  • Pub-sub for collecting logs from LucidWorks into HDFS
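
To make the K-Means Clustering item listed under Apache Mahout more concrete, here is a minimal single-machine sketch in Python. It is not Mahout's MapReduce implementation: the function name `kmeans`, the explicit starting centroids and the `max_iters` cap are all assumptions made for illustration.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def kmeans(points, centroids, max_iters=20):
    """Cluster points (tuples of floats) around the given starting centroids.

    Illustrative only: runs in memory on one machine, unlike Mahout's
    distributed jobs. Returns (centroids, clusters).
    """
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        updated = []
        for old, members in zip(centroids, clusters):
            if members:
                updated.append(tuple(sum(axis) / len(members)
                                     for axis in zip(*members)))
            else:
                updated.append(old)  # an empty cluster keeps its centroid
        if updated == centroids:
            break  # converged: no centroid moved
        centroids = updated
    return centroids, clusters


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    print(kmeans(pts, [(0.0, 0.0), (10.0, 10.0)]))
```

Every iteration rescans all points, which hints at why this kind of job is run in batch on Hadoop rather than at query time.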

Page 12: Large Scale Search, Discovery and Analytics in Action

The Road Ahead

• Our approach is from search and discovery outwards to analytics

  • Analytics in beta are focused around analysis of search logs

• Apache Hive support

• Analytics Themes

  • Relevance

  • Data quality

  • Discovery

  • Experiment Management

• Machine Learning

  • Classification

  • Recommendations

• Natural Language Processing

• Incorporate latest LucidWorks/Solr

Page 13: Large Scale Search, Discovery and Analytics in Action

Computation and Storage

LucidWorks Search/Solr

• Document Index

• Faceting

• SolrCloud makes sharding easy

Hadoop

• Stores Logs, Raw files, intermediate files, etc.

• WebHDFS

• Small files are an unnatural act

HBase

• Metric Storage

• User Histories/Profile

• Document Storage

Challenges

• Who is the authoritative store?

• Real time vs. Batch

• Where should analysis be done?

Page 14: Large Scale Search, Discovery and Analytics in Action

Search In Practice

• Three primary concerns

  • Performance/Scaling

  • Relevance

  • Operations: monitoring, failover, etc.

• Business typically cares more about relevance

• Devs care more about performance at first…

Page 15: Large Scale Search, Discovery and Analytics in Action

Search: Relevance

• Always Be Testing

  • Experiment management is critical

  • Top X + sampling

  • Click Logs

• Track Everything!

  • Queries

  • Clicks

  • Displayed Documents

  • Mouse/Scroll tracking?

• Phrases are your friends
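
As one example of what tracking queries, clicks and displayed documents makes possible, the sketch below computes a per-query click-through rate from logged events. The event schema (`type` and `query` keys) and the function name `click_through_rates` are hypothetical; the slides do not specify a log format.

```python
from collections import defaultdict


def click_through_rates(events):
    """Return {query: clicks / impressions} from a list of event dicts.

    Hypothetical schema: each event has "type" ("impression" or "click")
    and "query". A real log would also carry document ids, ranks, etc.
    """
    impressions = defaultdict(int)
    clicks = defaultdict(int)
    for event in events:
        if event["type"] == "impression":
            impressions[event["query"]] += 1
        elif event["type"] == "click":
            clicks[event["query"]] += 1
    # Only queries whose results were actually displayed get a rate.
    return {q: clicks[q] / n for q, n in impressions.items()}


if __name__ == "__main__":
    log = [
        {"type": "impression", "query": "solr"},
        {"type": "click", "query": "solr"},
        {"type": "impression", "query": "solr"},
        {"type": "impression", "query": "hadoop"},
    ]
    print(click_through_rates(log))
```

Queries with many impressions and a low rate are natural candidates for relevance experiments.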

Page 16: Large Scale Search, Discovery and Analytics in Action

Discovery Components

Serendipity

• Trends

• Topics

• Recommendations

• Related Items

• More Like This

• Did you mean?

• Statistically Interesting Phrases

Organization

• Importance

• Clustering

• Classification

• Named Entities

• Time Factors

• Faceting

Data Quality

• Document factor Distributions

  • Length

  • Boosts

• Duplicates

Challenges

• Many of these are intense calculations or iterative

• Many are subjective and require a lot of experimentation
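
For the Duplicates item under Data Quality, one common approach (an assumption here, since the slides do not say which technique is used) is to compare documents by the Jaccard overlap of their word shingles. The names `shingles`, `jaccard` and `near_duplicates`, the shingle size and the threshold are illustrative choices.

```python
def shingles(text, size=3):
    """Return the set of overlapping word n-grams (shingles) in text."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(docs, threshold=0.8):
    """Return sorted id pairs whose shingle sets overlap at least threshold.

    docs maps a document id to its text. This compares all pairs, so it
    only suits small samples; it is a sketch, not a large scale method.
    """
    ids = sorted(docs)
    sigs = {doc_id: shingles(docs[doc_id]) for doc_id in ids}
    pairs = []
    for x in range(len(ids)):
        for y in range(x + 1, len(ids)):
            if jaccard(sigs[ids[x]], sigs[ids[y]]) >= threshold:
                pairs.append((ids[x], ids[y]))
    return pairs


if __name__ == "__main__":
    sample = {
        "a": "the quick brown fox jumps",
        "b": "the quick brown fox jumps",
        "c": "completely different text here now",
    }
    print(near_duplicates(sample))
```

The all-pairs loop is quadratic in the number of documents, which is one illustration of why these calculations are called intense above.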

Page 17: Large Scale Search, Discovery and Analytics in Action

Discovery with Mahout

• Mahout’s 3 “C”s provide tools for helping across many aspects of discovery

  • Collaborative Filtering

  • Classification

  • Clustering

• Also:

  • Collocations (Statistically Interesting Phrases)

  • Singular Value Decomposition (SVD)

  • Others

• Challenges:

  • High cost to iterative machine learning algorithms

  • Mahout is very command line oriented

  • Some areas less mature
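
To illustrate how collocations (statistically interesting phrases) can be scored, the sketch below ranks bigrams with the log-likelihood ratio test, a statistic commonly used for this purpose. It is a small in-memory illustration, not Mahout's code; the function names `llr` and `score_bigrams` are assumptions.

```python
from collections import Counter
from math import log


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table of counts."""
    n = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    cells = ((k11, row1, col1), (k12, row1, col2),
             (k21, row2, col1), (k22, row2, col2))
    total = 0.0
    for k, row, col in cells:
        if k > 0:  # a zero cell contributes nothing
            total += k * log(k * n / (row * col))
    return 2.0 * total


def score_bigrams(tokens):
    """Return {(first, second): llr score} for every adjacent word pair."""
    bigrams = list(zip(tokens, tokens[1:]))
    pair_counts = Counter(bigrams)
    first_counts = Counter(a for a, _ in bigrams)
    second_counts = Counter(b for _, b in bigrams)
    n = len(bigrams)
    scores = {}
    for (a, b), k11 in pair_counts.items():
        k12 = first_counts[a] - k11    # a followed by something other than b
        k21 = second_counts[b] - k11   # b preceded by something other than a
        k22 = n - k11 - k12 - k21      # bigrams involving neither position
        scores[(a, b)] = llr(k11, k12, k21, k22)
    return scores


if __name__ == "__main__":
    text = "new york is big and new york is busy but boston is calm"
    ranked = sorted(score_bigrams(text.split()).items(),
                    key=lambda item: item[1], reverse=True)
    for pair, score in ranked[:3]:
        print(pair, round(score, 2))
```

A higher score means the two words co-occur more often than their individual frequencies would suggest.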

Page 18: Large Scale Search, Discovery and Analytics in Action

Aside: Experiment Management

• Plan for running experiments from the beginning across Search and Discovery components

  • Your engine should help!

• Types of Experiments to consider

  • Indexing/Analysis

  • Query parsing

  • Scoring formulas

  • Machine Learning Models

  • Recommendations, many more

• Make it easy to do A/B testing across all experiments and compare and contrast the results
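
One building block for A/B testing is assigning each user to a variant in a stable way, so the same user always sees the same treatment within an experiment. The sketch below does this by hashing; the function name `assign_variant`, the experiment label and the choice of SHA-256 are assumptions for illustration, not something the slides prescribe.

```python
import hashlib


def assign_variant(user_id, experiment, variants=("A", "B")):
    """Deterministically map a user to one variant of an experiment.

    Hashing the experiment name together with the user id keeps the
    assignment stable per experiment while letting different experiments
    split the same users differently.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % len(variants)
    return variants[bucket]


if __name__ == "__main__":
    for user in ("user-1", "user-2", "user-3"):
        print(user, assign_variant(user, "scoring-formula-test"))
```

Logging the assigned variant alongside queries and clicks is what allows results to be compared across experiments afterwards.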

Page 19: Large Scale Search, Discovery and Analytics in Action

Analytics in Practice

• Many of the components discussed provide analytical features

  • Leverage existing tools: R, etc.

• Simple Counts:

  • Facets

  • Term and Document frequencies

  • Clicks

• Search and Discovery example metrics

  • Relevance measures like Mean Reciprocal Rank

  • Histograms/Drilldowns around Number of Results

  • Log and navigation analysis

• Data cleanliness analysis is helpful for finding potential issues in content
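
Mean Reciprocal Rank, mentioned above as a relevance measure, averages 1/rank of the first relevant result over a set of queries, counting 0 when no relevant result is returned. A minimal sketch, assuming relevance judgments are available as sets of document ids per query (the data layout and the name `mean_reciprocal_rank` are illustrative):

```python
def mean_reciprocal_rank(ranked_results, relevant):
    """Compute MRR over a set of queries.

    ranked_results: {query: [doc ids in ranked order]}
    relevant:       {query: set of doc ids judged relevant}
    A query with no relevant document in its results contributes 0.
    """
    if not ranked_results:
        return 0.0
    total = 0.0
    for query, results in ranked_results.items():
        judged = relevant.get(query, set())
        for rank, doc_id in enumerate(results, start=1):
            if doc_id in judged:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(ranked_results)


if __name__ == "__main__":
    results = {
        "q1": ["d1", "d2", "d3"],  # first relevant hit at rank 1
        "q2": ["d4", "d5", "d6"],  # first relevant hit at rank 2
        "q3": ["d7", "d8", "d9"],  # no relevant hit
    }
    judgments = {"q1": {"d1"}, "q2": {"d5", "d6"}, "q3": {"d0"}}
    print(mean_reciprocal_rank(results, judgments))  # (1 + 0.5 + 0) / 3 = 0.5
```

In practice the judgments could come from click logs or manual labeling; either source is an assumption the metric itself does not depend on.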

Page 20: Large Scale Search, Discovery and Analytics in Action

Wrap

• Search, Discovery and Analytics, when combined into a single, coherent system, provide powerful insight into both your content and your users

• LucidWorks has combined many of these things into LucidWorks Big Data

  • http://www.lucidworks.com/products/lucidworks-big-data

• Design for the big picture when building search-based applications

Page 21: Large Scale Search, Discovery and Analytics in Action

Discussion and Resources

• Questions?

• http://www.lucidworks.com

• [email protected]

• @gsingers
