The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

DESCRIPTION

Presented by M.C. Srivas, MapR. See the conference video: http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012 This session addresses the biggest issue facing Big Data: search, discovery and analytics need to be integrated. Creating and maintaining separate Solr and Hadoop clusters is time consuming, error prone, and difficult to keep in sync, yet most Hadoop installations do not integrate with Solr within the same cluster. Find out how to easily integrate these capabilities into a single cluster. The session will also touch on some of the technical aspects of Big Data search, including how to protect against the silent index corruption that permeates large distributed clusters, overcome the shard distribution problem by leveraging Hadoop to ensure accurate distributed search results, and provide real-time indexing for distributed search, including support for streaming data capture. Srivas will also share relevant experiences from his days at Google, where he ran one of the major search infrastructure teams and where GFS, BigTable and MapReduce were used extensively.

Page 1: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

1 ©MapR Technologies - Confidential

The Search Is Over: Integrating Solr and Hadoop to Simplify Big Data Analytics

Page 2: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

Evolution of Search

Documents

•Models

•Feature Selection

User Interaction

•Clicks

•Ratings/Reviews

•Learning to Rank

•Social Graph

Queries

•Phrases

•NLP

Content Relationships

•Page Rank, etc.

•Organization

Page 3: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

Search Discovery and Analytics

Search

Discovery Analytics

Page 4: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

Data Volume Growing 44x

2010: 1.2 Zettabytes

2020: 35.2 Zettabytes

Data is Growing Quickly

Business Analytics Requires a New Approach

Source: IDC Digital Universe Study, sponsored by EMC, May 2010

IDC Digital Universe Study 2011

Data is Growing Faster than Moore’s Law

Page 5: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

MapReduce: A Paradigm Shift

Distributed computing platform

– Large clusters

– Commodity hardware

Pioneered at Google

– Bigtable and Google File System

Commercially available as Hadoop

Page 6: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

Hadoop Explosion

Page 7: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

How does Map/Reduce work?

1. Map

– Spread data across servers based on key/value pairs

– Each node independently scans local data

2. Servers produce Map results

3. Reduce - combine/merge Map results

4. Process complete or Map a new function

Like shuffling multiple decks of playing cards
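The four steps above can be sketched in a few lines of plain Python. This is only an in-memory illustration of the map, shuffle and reduce phases, not Hadoop's API; the function names and the word-count task are assumptions chosen for the example.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit (key, value) pairs for one input record; here (word, 1)."""
    for word in record.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine/merge the map results for one key."""
    return key, sum(values)

def run_job(records):
    """Run map over every record, shuffle, then reduce each key group."""
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

counts = run_job(["solr and hadoop", "hadoop and search"])
print(counts)
```

In a real cluster each call to `map_phase` would run on the node that holds the record, and only the grouped intermediate pairs would cross the network.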

Page 8: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

The Cost of Enterprise Storage

SAN Storage

– $2 - $10/Gigabyte

– $1M gets: 0.5 Petabytes, 200,000 IOPS, 1 Gbyte/sec

NAS Filers

– $1 - $5/Gigabyte

– $1M gets: 1 Petabyte, 400,000 IOPS, 2 Gbyte/sec

Local Storage

– $0.05/Gigabyte

– $1M gets: 20 Petabytes, 10,000,000 IOPS, 800 Gbytes/sec
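The capacity figures on this slide follow from dividing the budget by the price per gigabyte. A minimal check, assuming decimal units (1 Petabyte = 1,000,000 Gigabytes) and the low end of each price range; the function name is hypothetical:

```python
BUDGET_USD = 1_000_000
GB_PER_PB = 1_000_000  # assumption: decimal units, 1 PB = 10^6 GB

def petabytes_for_budget(usd_per_gb, budget_usd=BUDGET_USD):
    """Raw capacity in petabytes that a budget buys at a given price per GB."""
    return budget_usd / usd_per_gb / GB_PER_PB

san_pb = petabytes_for_budget(2.00)    # SAN at $2/GB, low end of the range
nas_pb = petabytes_for_budget(1.00)    # NAS at $1/GB, low end of the range
local_pb = petabytes_for_budget(0.05)  # local disks at $0.05/GB

for name, capacity in [("SAN", san_pb), ("NAS", nas_pb), ("Local", local_pb)]:
    print(f"{name}: about {capacity:.1f} PB per $1M")
```

The IOPS and throughput figures depend on the hardware and cannot be derived from price alone, so they are not modeled here.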

Page 9: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

Deep Object Store

Billions and Billions of Files

For some use cases it’s not the storage capacity, it’s the number of objects

– Messages

– Attachments

– Images

– Recordings

Provides a deep storage pool that is analytic ready

– Store it until you need it

– Derive secondary value from analytic processing

It makes more sense to run analytics where the data is stored and send only the results over the network

Page 10: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

Problems with Integrating Solr with Hadoop

Simple to integrate with Hadoop as a data source

Difficult to integrate distributed search and scale

SolrCloud simplifies Sharding and Replication coordination

Integration limitations based on capabilities of large scale storage

– High availability

– Data protection

– Ease of Access

Page 11: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

Sharded text Indexing

[Diagram labels: Input documents, Map, Reducer, Local disk, Clustered index storage, Local disk, Search Engine]

Assign documents to shards

Index text to local disk and then copy index to distributed file store

Copy to local disk typically required before index can be loaded

Page 12: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

Problems with Solr and Hadoop

[Diagram labels: Input documents, Map, Reducer, Local disk, Clustered index storage, Local disk, Search Engine]

Failure of a reducer causes garbage to accumulate on the local disk

Failure of the search engine requires another download of the index from clustered storage

Page 13: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

Limitations of HDFS

[Diagram labels: NAS appliance, NameNode, A, B, nine DataNodes]

HDFS is Append Only

Data Access is through the HDFS API

High Availability is a challenge

Single points of failure

Limited to 50-200 million files

Performance bottleneck

Page 14: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

Logs: Flume aggregates incoming events to Solr

– Requires a multi-step batch process

[Diagram labels: Hadoop Cluster, three Application Servers]

Page 15: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

What’s Required for Search, Discovery and Analytics (SDA)?

Ease of Data Access through Open Standards

Large Scale, Reliable Storage

Ease of Integration

– Management (REST)

– Security (LDAP, NIS, Linux PAM…)

– Analytics (NFS, ODBC, HDFS)

Search

Discovery Analytics

Page 16: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

Ease of Data Access

ENTERPRISE NFS Access

HDFS API

Page 17: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

Multiple Architectures Possible

Export to the world

– NFS gateway runs on selected gateway hosts

Local server

– NFS gateway runs on local host

– Enables local compression and checksumming

Export to self

– NFS gateway runs on all data nodes, mounted from localhost

Page 18: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

Data Access through Standard Protocols

[Diagram labels: NFS Client, four NFS Servers]

Page 19: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

NFS Access through a Local server

[Diagram labels: Client, Application, NFS Server, Cluster Nodes]

Page 20: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

Universal export to self

[Diagram labels: Cluster Node, Task, NFS Server, Cluster Nodes]

Page 21: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

Nodes are identical

[Diagram: three Cluster Nodes, each labeled with an NFS Server and a Task]

Page 22: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

Simplifies Solr Hadoop Integration

[Diagram labels: Input documents, Map, Reducer, Clustered index storage, Search Engine]

Failure of a reducer is cleaned up by map-reduce framework

Search engine reads mirrored index directly.

Page 23: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

How Does this Integration Happen?

Elegantly simple

Direct integration is a result of leveraging the existing architectures

Data in the Hadoop cluster is written to a Volume

Solr Crawler discovers content being entered into Hadoop

Accesses the data in the cluster through NFS

Builds Search Index

Users query Solr to find data directly in Hadoop
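Because the cluster is reachable through NFS, a crawler can use ordinary file APIs to discover content. The sketch below is an illustration only, not the Solr crawler the slide refers to: the field names `id` and `text` are a hypothetical schema, and the JSON array it produces is merely shaped like a batch of documents for a Solr update handler. Nothing is sent over the network.

```python
import json
import os

def crawl(root):
    """Walk a directory tree (for example an NFS mount of the cluster)
    and yield one document per file, keyed by its path."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                # "id" and "text" are assumed field names, not a real schema.
                yield {"id": path, "text": handle.read()}

def to_update_payload(docs):
    """Serialize documents as a JSON array, one object per document."""
    return json.dumps(list(docs))
```

A caller would pass the mount point of a volume, for instance `to_update_payload(crawl("/mnt/cluster/volume"))`, where that path is a placeholder for wherever the cluster is mounted.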

Page 24: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

Distributed Shard Indexing

Pipeline stages: Input, Map, Combine, Shuffle and sort, Reduce, Output

Input: doc1 doc2 doc3

Map output: shard#1,doc1 shard#2,doc2 shard#1,doc3 shard#3,doc4 shard#3,doc5 …

Shuffle and sort output: shard#1,[doc3,doc1] shard#2,[doc2] shard#3,[doc5] …

Reduce output: index/s1 index/s2 index/s3 …

Page 25: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

How Does this Work at Scale with Distributed Indices?

MapReduce jobs analyze distributed, disparate data in a cluster

In distributed indexing, the input is split arbitrarily into chunks and each chunk is handled separately. There can be many more chunks than there are shards to be created.

Mapper assigns document to shard

– Shard is usually hash of document id

Reducer indexes all documents for a shard

– Indexes created on local disk

– On success, copy index to DFS

ZooKeeper is used to manage Solr instances

A large Solr Search is distributed across multiple shards
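The mapper and reducer roles described above can be sketched as follows. This is a plain-Python stand-in, assuming three shards and an MD5-based hash of the document id; a dictionary takes the place of a real Lucene index, and no Hadoop or Solr API is used.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 3  # assumption for the example

def shard_for(doc_id, num_shards=NUM_SHARDS):
    """Mapper rule: the shard is a stable hash of the document id."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def map_docs(docs):
    """Map: emit (shard, document) for every input document."""
    for doc_id, text in docs:
        yield shard_for(doc_id), (doc_id, text)

def reduce_shards(pairs):
    """Reduce: collect all documents of a shard into one per-shard 'index'."""
    shards = defaultdict(dict)
    for shard, (doc_id, text) in pairs:
        shards[shard][doc_id] = text
    return dict(shards)

docs = [("doc1", "solr"), ("doc2", "hadoop"), ("doc3", "search"),
        ("doc4", "index"), ("doc5", "shard")]
index = reduce_shards(map_docs(docs))
```

Using a stable hash rather than Python's built-in `hash()` matters here: the same document id must land in the same shard on every node and on every run.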

Page 26: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

What about HA and Data Protection?

Reliable Compute

– Automated re-replication

– Self-healing from HW and SW failures

– Load balancing

– Rolling upgrades

– No lost jobs or data

– 99.999% uptime

Dependable Storage

– Business continuity with snapshots and mirrors

– Recover to a point in time

– End-to-end checksumming

– Strong consistency

– Mirror across sites to meet Recovery Time Objectives

Cluster Capabilities can Extend to Integrated Search and Discovery

Page 27: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

MapReduce failure to write the Index

Highly available JobTracker and TaskTracker ensure that failed jobs are recovered with their state and run to completion

MapReduce will clean up partially written indexes

No administrator intervention required

Page 28: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

Solr Node Fails

Other Solr nodes start serving shards that were being served by failed node

Page 29: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

Node Containing the Index Fails

Data is already replicated across the cluster

ZooKeeper assigns the Solr instance on a node holding a replica to serve the replicated shard
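The reassignment logic can be illustrated with a small simulation. This is a sketch under stated assumptions, not ZooKeeper's or Solr's API: the two dictionaries, the node names and the "first surviving replica wins" rule are all hypothetical.

```python
def reassign(assignments, replicas, failed_node):
    """Return new shard -> node assignments after failed_node goes down.

    assignments: shard -> node currently serving that shard
    replicas:    shard -> nodes that hold a copy of the shard's index
    """
    updated = {}
    for shard, node in assignments.items():
        if node != failed_node:
            updated[shard] = node  # unaffected shard keeps its server
            continue
        survivors = [n for n in replicas[shard] if n != failed_node]
        if not survivors:
            raise RuntimeError(f"no surviving replica for {shard}")
        updated[shard] = survivors[0]  # assumed policy: first survivor
    return updated

assignments = {"s1": "node-a", "s2": "node-b"}
replicas = {"s1": ["node-a", "node-c"], "s2": ["node-b", "node-a"]}
after_failure = reassign(assignments, replicas, "node-a")
```

Because the index data is already replicated, the new server can start serving without copying the index first.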

Page 30: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

Additional High Availability and Replication

Snapshots are available

Administrator sets frequency at the Volume

Snapshots with automatic de-duplication

Saves space by sharing blocks

Redirect on write, fast with no performance or storage penalty

Zero performance loss on writing to original

Scheduled, or on-demand

Easy recovery with drag and drop

Page 31: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

Mirroring Support in Hadoop Cluster

Business Continuity and Efficiency

Efficient design

– Differential deltas are updated

– Compressed and check-summed

Easy to manage

– Scheduled or on-demand

– WAN, Remote Seeding

– Consistent point-in-time

[Diagram labels: Production, Research, Datacenter 1, Datacenter 2, EC2, WAN]

Page 32: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

Simplified NFS data flows for Distributed Search

[Diagram labels: Input documents, Map, Reducer, Mirrors, two Search Engines]

Mirroring allows exact placement of index data

Arbitrary levels of replication also possible

Page 33: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

Improving Search Relevancy

Requires a continuous Feedback Loop

– The quality of the search is influenced by the end-user selections

– Fully automated process that improves with use

– Does not require manual tags or classification

Search

Discovery Analytics

Page 34: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

Recommendations

Often referred to as collaborative filtering

Actors interact with items

– observe successful interaction

We want to suggest additional successful interactions

Observations inherently very sparse

Page 35: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

Examples

Customers buying books (Linden et al)

Web visitors rating music (Shardanand and Maes) or movies (Riedl, et al), (Netflix)

Internet radio listeners not skipping songs (Musicmatch)

Internet video watchers watching >30 s

Page 36: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

Examples

Query for Friends results in links to Seinfeld

Search for kittens, get results for baby otters

Page 37: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

Dyadic Structure

Functional

– Interaction: actor -> item*

Relational

– Interaction ⊆ Actors x Items

Matrix

– Rows indexed by actor, columns by item

– Value is count of interactions

Predict missing observations

Page 38: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

Fundamental Algorithmics

Co-occurrence

K = A’A, where A is actors x items and K is items x items

Product has general shape of matrix

K tells us “users who interacted with x also interacted with y”
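A small worked example of the co-occurrence product, using plain Python lists so it runs without any library. The 3 x 3 interaction matrix is made-up data for illustration, and the helper names are hypothetical.

```python
def transpose(matrix):
    """Swap rows and columns."""
    return [list(column) for column in zip(*matrix)]

def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    y_columns = transpose(y)
    return [[sum(a * b for a, b in zip(row, column)) for column in y_columns]
            for row in x]

# Rows are actors, columns are items; 1 means the actor interacted with the item.
A = [
    [1, 1, 0],
    [0, 1, 1],
    [1, 1, 0],
]

# Co-occurrence: K[x][y] counts the actors who interacted with both item x and item y.
K = matmul(transpose(A), A)
for row in K:
    print(row)
```

The diagonal of `K` holds each item's total interaction count, and the off-diagonal entries are the "also interacted with" counts that drive recommendations.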

Page 39: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

Why not Expand it?

Users enter queries (A)

– (actor = user, item=query)

Users view videos (B)

– (actor = user, item=video)

A’A gives query recommendation

– “did you mean to ask for”

B’B gives video recommendation

– “you might like these videos”

Page 40: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

The punch-line

B’A recommends videos in response to a query

– (isn’t that a search engine?)

– (not quite, it doesn’t look at content or meta-data)
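The cross product can be sketched the same way. In this made-up example A is users x queries and B is users x videos, so B’A is videos x queries and each entry counts the users who both entered the query and viewed the video. The data, the function names and the ranking rule are assumptions for illustration.

```python
def cross_occurrence(b, a):
    """Compute B'A for matrices given as lists of rows (one row per user)."""
    videos, queries = len(b[0]), len(a[0])
    return [[sum(b_user[v] * a_user[q] for b_user, a_user in zip(b, a))
             for q in range(queries)]
            for v in range(videos)]

def recommend(scores, query, top_n=2):
    """Rank videos for one query by their cross-occurrence score."""
    ranked = sorted(range(len(scores)), key=lambda v: scores[v][query], reverse=True)
    return [v for v in ranked if scores[v][query] > 0][:top_n]

A = [[1, 0], [1, 0], [0, 1]]           # users x queries
B = [[1, 0, 0], [1, 1, 0], [0, 0, 1]]  # users x videos

BA = cross_occurrence(B, A)
videos_for_query_0 = recommend(BA, 0)
videos_for_query_1 = recommend(BA, 1)
```

No video title or description is consulted anywhere, which is the point of the slide: the ranking comes purely from observed behavior.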

Page 41: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

Real-life example

Query: “Paco de Lucia”

Conventional meta-data search results:

– “hombres del paco” times 400

– not much else

Recommendation based search:

– Flamenco guitar and dancers

– Spanish and classical guitar

– Van Halen doing a classical/flamenco riff

Page 42: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

Real-life example

Page 43: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics

The Search for Relevancy

Updating Search to Reflect Relevancy

– Big MapReduce jobs can use behavioral traces in logs to improve results and identify importance

The power of this virtuous loop depends on frictionless data access, high availability, and performance

Search

Discovery Analytics

Page 44: The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simplify Big Data Analytics