DATAWEEK KEYNOTE: LARGE SCALE SEARCH, DISCOVERY AND ANALYSIS IN ACTION


Session Description: Understanding and accessing large volumes of content often requires a multi-faceted approach that goes well beyond simple batch processing jobs. In many cases, one needs both ad hoc, real-time access to the content and the ability to discover interesting information based on a variety of features such as recommendations, summaries and other insights. In this talk, we'll discuss real-world use cases across several industries as well as how to effectively leverage open source tools like Hadoop, Solr, Mahout and others to better enable user access to big data.


Confidential © Copyright 2012

Large Scale Search, Discovery and Analysis in Action

Ivan Provalov
Research Engineer
Office of the Chief Scientist
September 25, 2012

Confidential and Proprietary © 2012 LucidWorks

User Interactions With Big Data

[Diagram: how each kind of user reaches the data]

• System Administrator: command line over the DFS
• Engineer: query language over the key-value store
• End User: keyword search over the index


Search, Discovery and Analytics

Is Search Enough?

• Keyword search is a commodity

• Holistic view of the data and the user interactions with that data

• Search, Discovery and Analytics are the key to unlocking this view of users and data

[Figure: search box with the example query "endeavour shuttle bay area"]


Why Search, Discovery and Analytics?

• User Needs
- real-time, ad hoc access to content
- aggressive prioritization based on importance
- serendipity
- feedback/learning from past
• Business Needs
- deeper insight into users
- leverage existing internal knowledge
- cost effective


Topics

• Background and needs
• Architecture
• Search, Discovery and Analytics in action
• Road map
• Wrap up


Search

• Performance
• Real time
• Relevance and importance
• Presenting results
• Experiment management


Discovery

• Content clustering
• Discovering near duplicate documents
• Finding 'dark data'
• Making recommendations
• Uncovering trends
• Recognizing topics
• More like this
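As an illustration of the near duplicate item above, here is a minimal sketch that compares documents by the Jaccard similarity of their word shingles. It is not the method used in the talk or in Mahout; the sample documents, the shingle size and the 0.5 threshold are assumptions chosen for the example.

```python
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(docs, threshold=0.5, k=3):
    """Yield (id1, id2, similarity) for document pairs at or above threshold."""
    sets = {doc_id: shingles(text, k) for doc_id, text in docs.items()}
    for (id1, s1), (id2, s2) in combinations(sorted(sets.items()), 2):
        sim = jaccard(s1, s2)
        if sim >= threshold:
            yield id1, id2, sim


# Hypothetical sample documents (not from the talk).
docs = {
    "a": "the shuttle endeavour flew over the bay area on friday",
    "b": "the shuttle endeavour flew over the bay area on saturday",
    "c": "solr provides faceted keyword search over an index",
}
pairs = list(near_duplicates(docs))
```

Documents "a" and "b" share 7 of 9 distinct shingles, so they are the only pair reported. The all-pairs comparison is quadratic in the number of documents; at the scale discussed here one would normally switch to an approximate technique such as MinHash with locality-sensitive hashing.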


Analytics

• Term frequency
• Facets
• Click analysis
• Relevancy metrics
• Zero results queries
• Hot spots
• Statistically interesting phrases
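Two of the analytics above, term frequency and zero results queries, can be sketched over a query log in a few lines. The log format here (a query string paired with its hit count) and the sample entries are assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical query log: (query string, number of results returned).
query_log = [
    ("endeavour shuttle", 42),
    ("shuttle bay area", 17),
    ("endevour shuttle", 0),
    ("endeavour landing", 8),
    ("endevour shuttle", 0),
]


def term_frequency(log):
    """Count how often each term appears across all logged queries."""
    counts = Counter()
    for query, _ in log:
        counts.update(query.lower().split())
    return counts


def zero_result_queries(log):
    """Count queries that returned no results, most frequent first."""
    counts = Counter(query for query, hits in log if hits == 0)
    return counts.most_common()


tf = term_frequency(query_log)
zero = zero_result_queries(query_log)
```

In this sample the misspelled "endevour shuttle" is the only zero results query, which is the kind of signal that typically feeds spelling suggestions or synonym lists.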


Some Use Cases

• Video streaming
- classification
- recommendations
• Financial, transportation, telecommunications
- fraud detection
• Social media
- trend monitoring
• Information technology
- log monitoring
• Healthcare
- identifying patients for clinical studies


In Focus: Personalized Medicine

[Diagram labels: Patient DNA; Alignment and other analysis; Genetic Variations; Search and Faceting; Standard Therapies; Alternative Therapies]


In Focus: Log Processing in Telecommunications

• Each year, large sums of money are lost due to fraudulent calls and poor service

• Logs are usually semi-structured and contain vital information about errors and fraud

• Deeper batch analytics can provide insight into patterns across vast amounts of data

• Search of call and network information (via logs) is critical to providing deeper analysis and understanding of these errors and fraudulent activities
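To make the points above concrete, the following sketch parses semi-structured call log lines and runs a small batch check over them. The log format, the field names and the "many very short calls from one caller" heuristic are all hypothetical; the talk does not specify a format or a fraud rule.

```python
import re
from collections import Counter

# Hypothetical semi-structured call log: a timestamp followed by key=value pairs.
LOG_LINES = [
    "2012-09-25T10:15:00 caller=4155550100 callee=2125550111 duration=3 status=OK",
    "2012-09-25T10:15:04 caller=4155550100 callee=2125550112 duration=2 status=OK",
    "2012-09-25T10:15:09 caller=4155550100 callee=2125550113 duration=2 status=DROPPED",
    "2012-09-25T10:16:30 caller=6505550199 callee=2125550111 duration=340 status=OK",
    "malformed line without fields",
]

FIELD = re.compile(r"(\w+)=(\S+)")


def parse_line(line):
    """Parse one log line into a dict, or return None if it has no key=value fields."""
    fields = dict(FIELD.findall(line))
    if not fields:
        return None
    fields["timestamp"] = line.split(" ", 1)[0]
    fields["duration"] = int(fields.get("duration", 0))
    return fields


def suspicious_callers(lines, max_duration=5, min_calls=3):
    """Return callers with at least min_calls calls no longer than max_duration seconds."""
    short_calls = Counter()
    for line in lines:
        record = parse_line(line)
        if record and "caller" in record and record["duration"] <= max_duration:
            short_calls[record["caller"]] += 1
    return sorted(c for c, n in short_calls.items() if n >= min_calls)


flagged = suspicious_callers(LOG_LINES)
```

The parsed records are also the natural unit to index for search, so the same fields (caller, callee, status) can serve as facets when investigating a flagged caller.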


What Does a Search, Discovery and Analytics Platform Need?

• Fast, efficient, scalable search
- bulk and near real time indexing
- handle billions of records with sub-second search and faceting
• Large scale, cost effective storage and processing capabilities
- need whole data consumption and analysis
- experimentation/sampling tools

• NLP and machine learning tools that scale to enhance discovery and analysis
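The "search and faceting" requirement above maps onto a single Solr request. The sketch below only builds the request URL; it uses standard Solr query parameters (`q`, `rows`, `wt`, `facet`, `facet.field`, `facet.mincount`), while the host, collection name and field names are assumptions.

```python
from urllib.parse import urlencode


def facet_query_url(base_url, query, facet_fields, rows=10):
    """Build a Solr select URL that returns results plus facet counts."""
    params = [
        ("q", query),
        ("rows", rows),
        ("wt", "json"),
        ("facet", "true"),
        ("facet.mincount", 1),
    ]
    # facet.field may be repeated, once per field to facet on.
    params.extend(("facet.field", field) for field in facet_fields)
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Hypothetical host, collection and field names.
url = facet_query_url(
    "http://localhost:8983/solr/collection1",
    "endeavour shuttle bay area",
    ["source", "year"],
)
```

Fetching this URL against a running Solr instance would return the matching documents together with per-field value counts, which is what drives the facet navigation shown to end users.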


Building a Search, Discovery and Analytics Platform

[Architecture diagram. Blocks shown: Inputs; Bulk & Real Time; API; Management; Search, Discovery, Analytics; Processing & Storage; Provisioning, Monitoring & Configuration]


LucidWorks Big Data

[Architecture diagram. Blocks shown: Inputs; API (Big Data, LucidWorks, WebHDFS); Mgmt (Admin, Service Mgmt, Data Mgmt); Search, Discovery, Analytics (Analytics Service, Document Service); Processing & Storage; Provisioning, Monitoring & Configuration]


Components – LucidWorks Search

Component: Benefit

• LucidWorks Search (2.1.1): Lucene/Solr 4.0-dev, sharded with SolrCloud, near-real-time indexing, transaction logs for recovery.
- connector framework
- security
- user click framework
- business process integration
- administration


Components - Hadoop

Component: Benefit

• Apache Hadoop (1.0.3): Distributed computing and processing for ETL and analytics jobs.
• Apache HBase (0.92): Key-value store allowing fast access to the data.
• Apache Oozie (modified 3.2): Workflow orchestration.


Components - Analysis/ML/NLP

Component: Benefit

• Apache Mahout (trunk): Distributed machine learning processing framework.
- k-means clustering
- statistically interesting phrases
- similar documents
- classification
• Apache UIMA (2.4.0): Text processing and annotations.
• Apache OpenNLP (1.5.2): Machine learning toolkit for natural language processing.
- named entity extraction
• Behemoth (modified trunk): Simplifies M/R data extraction; abstracts annotation frameworks.
• Apache Pig (0.9.2): Helps with writing analytics M/R programs.
- ETL
- log analysis
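For readers unfamiliar with the k-means clustering listed under Mahout above, here is a single-machine sketch of the underlying algorithm (Lloyd's iteration). It is plain Python, not Mahout's distributed implementation; the 2-D points and the "first k points" initialisation are assumptions chosen to keep the example deterministic.

```python
def kmeans(points, k, iterations=10):
    """Cluster points (tuples of floats) into k groups with Lloyd's algorithm."""
    # Deterministic initialisation for the example: use the first k points.
    centroids = [tuple(p) for p in points[:k]]
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


# Hypothetical 2-D document vectors (real document vectors would be high-dimensional).
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (8.0, 9.0), (2.0, 1.0), (9.0, 8.0)]
centroids, assignments = kmeans(points, k=2)
```

With this data the points split into one group near (1, 1) and one near (9, 9). In a distributed setting the assignment step is the map phase and the centroid update is the reduce phase, which is why the algorithm fits M/R well.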


Components - Middleware

Component: Benefit

• Apache ZooKeeper (3.4.3): Service discovery.
- Netflix Curator
• Apache Kafka (0.7): Log consumption and event-based real-time document processing framework.


Components - SDA Engine

• RESTful services (Restlet 2.1)
• ZooKeeper + Netflix Curator
• Authentication and authorization
• Proxies for LucidWorks and WebHDFS API
• Workflow engine
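As a rough idea of what sits behind the WebHDFS proxy mentioned above, the sketch below builds WebHDFS REST URLs of the standard form `/webhdfs/v1/<path>?op=<OPERATION>`. It does not show the SDA Engine's own API, which the slides do not describe; the host, port and paths are assumptions.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS REST URL of the form /webhdfs/v1/<path>?op=<OP>."""
    query = urlencode([("op", op)] + sorted(params.items()))
    return "http://{}:{}/webhdfs/v1{}?{}".format(host, port, quote(path), query)


# Hypothetical host, port and paths.
list_url = webhdfs_url("namenode.example.com", 50070, "/data/logs", "LISTSTATUS")
open_url = webhdfs_url("namenode.example.com", 50070, "/data/logs/calls.log", "OPEN", offset=0)
```

An HTTP GET on the first URL lists a directory and on the second reads a file, so a proxy in front of WebHDFS only needs to add authentication and forward such requests.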


Road Map

• Analytics themes
- relevance
- data quality
- discovery
- integration with other packages (R)
• Machine learning
- NLP
- recommendations

• Experiment management


Conclusions

• Search, Discovery and Analytics, when combined into a single, integrated system, provide powerful insight into both your content and your users

• LucidWorks has combined many of these things into LucidWorks Big Data


LucidWorks Big Data

• Unified development platform for Big Data applications
• Integrated open source stack: Lucene/Solr, Hadoop, Mahout, NLP
• Single, uniform REST API
• Pre-tuned by open source industry experts
• Out of the box provisioning - hosted or on premise


www.lucidworks.com/bigdata

ivan.provalov@lucidworks.com

@iprovalov

Search | Discover | Analyze
