
Solr Compute Cloud – An Elastic Solr Infrastructure: Presented by Nitin Sharma, BloomReach


DESCRIPTION

Presented at Lucene/Solr Revolution 2014


Page 1: Solr Compute Cloud – An Elastic Solr Infrastructure: Presented by Nitin Sharma, BloomReach
Page 2

Solr Compute Cloud – An Elastic Solr Infrastructure

Nitin Sharma - Member of technical staff, BloomReach - [email protected]

Page 3

Abstract

Scaling search platforms is an extremely hard problem:
•  Serving hundreds of millions of documents
•  Low latency
•  High-throughput workloads
•  Optimized cost

At BloomReach, we have implemented SC2, an elastic Solr infrastructure for big data applications that:
•  Supports heterogeneous workloads while hosted in the cloud.
•  Dynamically grows and shrinks search servers.
•  Provides application- and pipeline-level isolation, NRT search, and indexing.
•  Offers latency guarantees and application-specific performance tuning.
•  Provides high-availability features such as cluster replacement, cross-data-center support, and disaster recovery.

Page 4

About Us

BloomReach
BloomReach has developed a personalized discovery platform whose applications analyze big data to make our customers’ digital content more discoverable, relevant, and profitable.

Myself
I work on search platform scaling for BloomReach’s big data. My background includes scaling real-time services for latency-sensitive applications and building performance and search-quality metrics infrastructure for personalization platforms.

Page 5

The BloomReach Personalized Discovery Platform

Page 6

BloomReach’s Applications

Content understanding

Organic Search
What it does: Content optimization, management and measurement
Benefit: Enhanced discoverability and customer acquisition in organic search

SNAP
What it does: Personalized onsite search and navigation across devices
Benefit: Relevant and consistent onsite experiences for new and known users

Compass
What it does: Merchandising tool that understands products and identifies opportunities
Benefit: Prioritize and optimize online merchandising

Page 7

Agenda

•  BloomReach search use cases and architecture
•  Old architecture and issues
•  Scaling challenges
•  Elastic SolrCloud architecture and benefits
•  Lessons learned

Page 8

BloomReach Search Use Cases

1.  Front-end (serving) queries – uptime and latency sensitive
2.  Batch search pipelines – throughput sensitive
3.  Time-bound indexing requirements – customer specific
4.  Time-bound Solr config updates

Page 9

BloomReach Search Architecture

[Diagram: a single Solr cluster with a ZooKeeper ensemble. MapReduce read pipelines 1 to n, indexing pipelines 1 to n, and public API search traffic all hit the same cluster. Legend: heavy load, moderate load, light load.]

Page 10

Throughput Issues…

[Diagram: shared Solr cluster and ZooKeeper ensemble serving pipelines 1 to n, indexing jobs 1 to n, and public API search traffic.]

●  Heterogeneous read workload

●  Same collection - different pipelines, different query patterns, different schedule

●  Cache tuning is virtually impossible

●  Larger pipeline starving the small ones

●  Machine utilization determines throughput and stability of a pipeline at any point

●  No isolation among jobs

Page 11

Stability and Uptime Issues…

[Diagram: shared Solr cluster and ZooKeeper ensemble serving pipelines 1 to n, indexing jobs 1 to n, and public API search traffic.]

●  Bad clients – bring down the cluster/degrade performance

●  Bad queries (with heavy load) – render nodes unresponsive

●  Garbage collection issues

●  ZK stability issues (as we scale collections)

●  CPU/load issues

●  Higher number of concurrent pipelines, higher number of issues

Page 12

Indexing Issues…

[Diagram: shared Solr cluster and ZooKeeper ensemble serving pipelines 1 to n, indexing jobs 1 to n, and public API search traffic.]

●  Commit frequencies vary with indexer types

●  Indexers running during another pipeline degrade performance

●  Indexer client leaks

●  Too many stored fields

●  Non-batch updates
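
To illustrate the "non-batch updates" issue, here is a minimal sketch of client-side batching. It assumes documents are plain dicts and relies on the fact that Solr's JSON update handler accepts a JSON array of documents in one request. The helper names `chunked` and `update_payloads` and the batch size of 500 are hypothetical choices, not taken from the talk.

```python
import json
from typing import Iterable, Iterator, List


def chunked(docs: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield lists of at most batch_size documents, preserving order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[dict] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch


def update_payloads(docs: Iterable[dict], batch_size: int = 500) -> Iterator[str]:
    """Serialize each batch as one JSON array (one update request per batch)."""
    for batch in chunked(docs, batch_size):
        yield json.dumps(batch)


if __name__ == "__main__":
    docs = [{"id": str(i), "title": f"product {i}"} for i in range(1200)]
    payloads = list(update_payloads(docs, batch_size=500))
    print(len(payloads))  # 3 requests instead of 1200
```

Each payload would be sent as the body of a single update request, so 1200 documents cost three round trips rather than 1200.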

Page 13

Rethinking…

•  A shared cluster for pipelines does not scale.

•  Guaranteeing 99.99%+ uptime is nontrivial.

•  Every job runs great in isolation. When you put them together, they fail.

•  Running index-heavy and read-heavy loads together causes cluster performance issues.

•  Any direct access to the production cluster puts cluster stability at risk (client leaks, bad queries, etc.).

What if every pipeline had its own cluster?

Page 14

Solr Compute Cloud (SC2)

•  Elastic Infrastructure – Provision Solr clusters on demand, on the fly.

•  Create, Use, Terminate Model – Create a temporary cluster with the necessary data, use it, and throw it away.

•  Technologies behind SC2 (built in-house):
    Cluster Management API – Dynamic cluster provisioning and resource allocation.
    Solr HAFT – High availability and data management library for SolrCloud.

•  Isolation – Pipelines get their own cluster. One cannot disrupt another.

•  Dynamic Scaling – Every pipeline can state its own replication requirements.

•  Production Safeguard – No direct access. Safeguards from bad clients and access patterns.

•  Cost Saving – Provision for the average; withstand peak with elastic growth.

Page 15

Solr Compute Cloud

[Diagram: Pipeline 1 sends Request: {Collection: A, Replica: 6} to the Solr Compute Cloud API, which provisions a Solr cluster (Collection A, Replicas: 6). The Solr HAFT Service reads from the production Solr cluster (with its ZooKeeper ensemble) and replicates to the provisioned cluster.]

1.  Read pipeline requests collection and desired replicas from SC2 API.

2.  SC2 API provisions cluster dynamically with needed setup (and streams Solr data).

3.  SC2 calls HAFT service to replicate data from production to provisioned cluster.

4.  Pipeline uses this cluster to run job.

Page 16

Solr Compute Cloud…

[Diagram: Pipeline 1 sends Terminate: {Cluster} to the Solr Compute Cloud API, which terminates the provisioned Solr cluster (Collection A, Replicas: 6). Also shown: the production Solr cluster, its ZooKeeper ensemble, and the Solr HAFT Service.]

1.  Pipeline finishes running the job.

2.  Pipeline calls SC2 API to terminate the cluster.

3.  SC2 terminates the cluster.
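
The create, use, terminate model from the last two slides can be sketched as a context manager that always terminates the cluster, even when the job fails. This is only an illustration: `FakeSC2Client`, `provision`, `terminate`, and `temporary_cluster` are hypothetical names, and the client is an in-memory stand-in because the slides do not show the real SC2 API.

```python
import itertools
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Dict, Iterator


@dataclass
class Cluster:
    cluster_id: str
    collection: str
    replicas: int


@dataclass
class FakeSC2Client:
    """In-memory stand-in for the SC2 API (hypothetical interface)."""
    clusters: Dict[str, Cluster] = field(default_factory=dict)
    _ids: Iterator[int] = field(default_factory=itertools.count)

    def provision(self, collection: str, replicas: int) -> Cluster:
        cluster = Cluster(f"cluster-{next(self._ids)}", collection, replicas)
        self.clusters[cluster.cluster_id] = cluster
        return cluster

    def terminate(self, cluster_id: str) -> None:
        self.clusters.pop(cluster_id, None)


@contextmanager
def temporary_cluster(client: FakeSC2Client, collection: str, replicas: int):
    """Provision a cluster, hand it to the job, and terminate it afterwards."""
    cluster = client.provision(collection, replicas)
    try:
        yield cluster
    finally:
        # Runs on success and on failure, so no cluster is left behind.
        client.terminate(cluster.cluster_id)


if __name__ == "__main__":
    client = FakeSC2Client()
    with temporary_cluster(client, "A", 6) as cluster:
        print("running job on", cluster.cluster_id)
    print("clusters left:", len(client.clusters))
```

Tying termination to scope exit is one way to keep a crashed pipeline from leaving a cluster running; the slides do not say how SC2 itself handles that case.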

Page 17

Solr Compute Cloud – Read Pipeline View

[Diagram: each read pipeline requests its own cluster from the Solr Compute Cloud API, and the Solr HAFT Service replicates data from the production Solr cluster (with its ZooKeeper ensemble) into each one.]

Pipeline 1 – Request: {Collection: A, Replica: 6} – Solr cluster with Collection A, Replicas: 6

Pipeline 2 – Request: {Collection: B, Replica: 2} – Solr cluster with Collection B, Replicas: 2

Pipeline n – Request: {Collection: C, Replica: 1} – Solr cluster with Collection C, Replicas: 1

Page 18

Solr Compute Cloud – Indexing

[Diagram: the indexing pipeline sends Request: {Collection: A, Replica: 2} to the Solr Compute Cloud API, which provisions a Solr cluster (Collection A, Replicas: 6). The Solr HAFT Service reads from that cluster and replicates to the production Solr cluster and its ZooKeeper ensemble.]

1.  Indexing pipeline requests collection and desired replicas from SC2 API.

2.  SC2 API provisions cluster dynamically with needed setup (and streams Solr data).

3.  Indexer uses this cluster to index the data.

4.  Indexer calls HAFT service to replicate the index from dynamic cluster to production.

5.  HAFT service reads data from dynamic cluster and replicates to production Solr.
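
The indexing steps above can be sketched in a few lines to show why production stays untouched until the index is complete. Everything here is a stand-in: indexes are plain dicts, the function names are hypothetical, and the sketch assumes the replicated index fully replaces the production copy, which the slides do not state.

```python
from typing import Dict, Iterable

Index = Dict[str, dict]  # document id -> document


def index_on_dynamic_cluster(docs: Iterable[dict]) -> Index:
    """Step 3: build the index away from production (here, a plain dict)."""
    index: Index = {}
    for doc in docs:
        if "id" not in doc:
            raise ValueError(f"document without id: {doc!r}")
        index[doc["id"]] = doc
    return index


def replicate_to_production(dynamic: Index, production: Index) -> int:
    """Steps 4-5: copy the finished index into production; return docs copied.

    Assumption: the replicated index fully replaces the production copy.
    """
    production.clear()
    production.update(dynamic)
    return len(dynamic)


def run_indexing_job(docs: Iterable[dict], production: Index) -> int:
    # A failure while indexing raises before production is modified.
    dynamic = index_on_dynamic_cluster(docs)
    return replicate_to_production(dynamic, production)


if __name__ == "__main__":
    production: Index = {"1": {"id": "1", "title": "old"}}
    new_docs = [{"id": "1", "title": "new"}, {"id": "2", "title": "added"}]
    copied = run_indexing_job(new_docs, production)
    print(copied, sorted(production))
```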

Page 19

Solr Compute Cloud – Global View

[Diagram: read pipelines 1 to n and indexing pipelines 1 to n send Provision: {Cluster} and Terminate: {Cluster} requests to the Solr Compute Cloud API and run their jobs on elastic clusters. The Solr HAFT Service replicates indexes between the elastic clusters and the production Solr cluster, which has its own ZooKeeper ensemble.]

Page 20

Solr Compute Cloud API

1.  API to provision clusters on demand.

2.  Dynamic cluster and resource allocation (includes cost optimization)

3.  Track request state, cluster performance and cost.

4.  Terminate long-running, runaway clusters.
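
Item 4 can be illustrated with a small check that flags clusters running past a time budget. This is a sketch under assumptions: the `ClusterRecord` fields and the per-cluster `max_runtime_s` limit are hypothetical, since the slides do not describe how SC2 decides that a cluster is a runaway.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ClusterRecord:
    cluster_id: str
    started_at: float      # seconds since the epoch
    max_runtime_s: float   # hypothetical per-cluster time budget


def find_runaway_clusters(clusters: List[ClusterRecord], now: float) -> List[str]:
    """Return ids of clusters that have been running longer than their budget."""
    return [c.cluster_id for c in clusters if now - c.started_at > c.max_runtime_s]


if __name__ == "__main__":
    now = 10_000.0
    clusters = [
        ClusterRecord("read-1", started_at=9_000.0, max_runtime_s=3_600.0),
        ClusterRecord("index-7", started_at=1_000.0, max_runtime_s=3_600.0),
    ]
    for cluster_id in find_runaway_clusters(clusters, now):
        print("would terminate", cluster_id)
```

Passing `now` in explicitly keeps the check deterministic and easy to test; a scheduler would call it periodically with the current time.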

Page 21

Solr HAFT Service

1.  High availability and fault tolerance
2.  Home-grown technology
3.  Open source :) (work in progress)
4.  Features
    •  One-push disaster recovery
    •  High availability operations
        •  Replace node
        •  Add replicas
        •  Repair collection
        •  Collection versioning
    •  Cluster backup operations
        •  Dynamic replica creation
        •  Cluster clone
        •  Cluster swap
        •  Cluster state reconstruction

Page 22

Solr HAFT Service – Functional View

[Diagram: functional blocks of the Solr HAFT Service.]

Groups: Index Management Actions, High Availability Actions, Cluster Backup Operations

Operations: Clone Alias, Clone Collections, Custom Commit, Node Replacement, Node Repair, Clone Cluster, Collection Versioning, Black Box Recording, Lucene Segment Optimize, Dynamic Replica Creation, Cluster Clone, Cluster Swap, Cluster State Reconstruction

Also shown: Solr Metadata, ZooKeeper Metadata, Verification, Monitoring
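
The slides do not say how HAFT implements collection versioning. One common SolrCloud technique for switching readers between collection versions is an alias, repointed with the Collections API `CREATEALIAS` action. The sketch below only builds that request URL; the function name, base URL, and the collection names `products` and `products_v2` are hypothetical.

```python
from urllib.parse import urlencode


def create_alias_url(solr_base_url: str, alias: str, collection: str) -> str:
    """Build a Collections API URL that points `alias` at `collection`."""
    params = urlencode(
        {"action": "CREATEALIAS", "name": alias, "collections": collection}
    )
    return f"{solr_base_url.rstrip('/')}/admin/collections?{params}"


if __name__ == "__main__":
    # Repoint the read alias from the old version to the new one.
    print(create_alias_url("http://localhost:8983/solr", "products", "products_v2"))
```

Because clients query the alias rather than a concrete collection, issuing this request again with a different collection name moves traffic without any client change.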

Page 23

Disaster Recovery in New Architecture

[Diagram: an old production Solr cluster and a new Solr cluster, each with its own ZooKeeper ensemble; the Solr HAFT Service with a push-button recovery control; a "Brave Soul on Pager Duty"; and DNS.]

1.  The person on pager duty clicks the recovery button.
2.  Solr HAFT Service triggers:
    •  Cluster Setup
    •  State Reconstruction
    •  Cluster Clone
    •  Cluster Swap
3.  Production DNS points to the new cluster.
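
The recovery sequence above can be sketched as an ordered list of steps that stops at the first failure, so DNS is never switched to a cluster that was not fully rebuilt. The step names come from the slide; `run_recovery`, `RecoveryFailed`, and the no-op actions are hypothetical placeholders for the real HAFT operations.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


class RecoveryFailed(Exception):
    """Raised when a recovery step fails; records what had already completed."""

    def __init__(self, failed_step: str, completed: List[str]):
        super().__init__(f"recovery stopped at step '{failed_step}'")
        self.failed_step = failed_step
        self.completed = completed


def run_recovery(steps: List[Step]) -> List[str]:
    """Run steps in order; abort on the first failure so later steps never run."""
    completed: List[str] = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            raise RecoveryFailed(name, completed) from exc
        completed.append(name)
    return completed


if __name__ == "__main__":
    def noop() -> None:
        """Placeholder for a real HAFT operation."""

    steps: List[Step] = [
        ("cluster setup", noop),
        ("state reconstruction", noop),
        ("cluster clone", noop),
        ("cluster swap", noop),
        ("switch production DNS", noop),
    ]
    print(run_recovery(steps))
```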

Page 24

SC2 vs Non-SC2 (Stability Features)

Property   Non-SC2   SC2

Linear Scalability for Heterogeneous Workload

Pipeline-Level Isolation

Dynamic Collection Scaling

Prevention from Bad Clients

Pipeline-Specific Performance

No Direct Access to Production Cluster

Can Sleep at Night? :)

Page 25

SC2 vs Non-SC2 (Availability Features)

Property   Non-SC2   SC2

Cross-Data-Center Support

Cluster Cloning

Collection Versioning

One-Push Disaster Recovery

Repair API for Nodes/Collections

Node Replacement

Page 26

Lessons Learned

1. Solr is a search platform. Do not use it as a database (for scans and lookups). Evaluate your stored fields.

2. Understand access patterns, QPS and queries in detail. Be careful when tuning caches.

3. Have access control for large-scale jobs that directly talk to your cluster. (Internal DDoS attacks are hard to track.)

4. Instrument every piece of infrastructure and collect metrics.

5. Build automated disaster recovery. You will need it. :)

Page 27

Questions?

Thank You!

Nitin Sharma  [email protected]  https://www.linkedin.com/in/knitinsharma