Bloomreach - BloomStore Compute Cloud Infrastructure

Nitin Sharma - Data Infrastructure Engineer

Jorge Rodriguez - Data Infrastructure Engineer

Data Infrastructure Scaling at BloomReach

Abstract

Scaling data platforms for serving hundreds of millions of documents with low latency and high throughput workloads at an optimized cost is an extremely hard problem.

At BloomReach, we have implemented BC2, an elastic infrastructure for big data applications that

1. Supports heterogeneous workloads while hosted in the cloud.

2. Dynamically grows/shrinks search servers to provide application and pipeline level isolation, NRT search and indexing.

3. Offers latency guarantees and application-specific performance tuning.

4. Provides high-availability features like cluster replacement, cross-data center support, disaster recovery etc.

Agenda
• Data Infrastructure V1
• Scaling Challenges
  – Cassandra
  – Solr
• Elastic Data Infrastructure
  – Cassandra
  – Solr
• Questions?

Nitin works on search platform scaling for BloomReach’s big data. His relevant experience and background include scaling real-time services for latency-sensitive applications and building performance and search-quality metrics infrastructure for personalization platforms.

BloomReach has developed a personalized discovery platform that features applications that analyze big data to make our customers’ digital content more discoverable, relevant and profitable.

Jorge works on Cassandra database platform scaling for BloomReach’s big data. Previously he worked on our organic search applications and customer integration infrastructure. Prior to BloomReach, Jorge also worked on an eCommerce platform.

About Us

The BloomReach Personalized Discovery Platform

BloomReach’s Applications

Organic Search

Content understanding

What it does

Content optimization, management and measurement

Benefit

Enhanced discoverability and customer acquisition in organic search

What it does

Personalized onsite search and navigation across devices

Benefit

Relevant and consistent onsite experiences for new and known users

What it does

Merchandising tool that understands products and identifies opportunities

Benefit

Prioritize and optimize online merchandising

SNAP

Compass

Data Infrastructure
• Cassandra Database
• SOLR Reverse Index
• Write Heavy MapReduce Jobs
• Read/Scan Heavy MapReduce Jobs (Analytics and ETL)
• Large Scale Indexers
• RT APIs

Data Infrastructure V1

[Diagram: V1 architecture. Components shown: SOLR, C* Frontend DC, C* Backend DC, write pipelines, read pipelines, and APIs.]

Cassandra

Cassandra: How we started

[Diagram: a single Cassandra cluster with a Frontend DC and a Backend DC. Frontend applications sit on the Frontend DC side and EMR jobs on the Backend DC side.]

Fixed Resource Issue

[Diagram: many EMR jobs hit the Backend DC of the Cassandra cluster at the same time. Label: Frontend DC spillover reads.]

Starvation Issue

[Diagram: in the Backend DC, large EMR jobs with a relaxed SLA run alongside a small EMR job with a tighter SLA.]

Frontend Latencies vs Replication Load

[Diagram: Cassandra cluster with frontend applications on the Frontend DC and EMR jobs on the Backend DC.]

Stabilizing Cassandra: Rate Limiter

[Diagram: Cassandra cluster with frontend applications on the Frontend DC and EMR jobs on the Backend DC. A Token Server (Redis) is added on the EMR side.]
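The rate limiter slide names a Redis-backed token server but gives no implementation details. As a minimal sketch of the general idea, assuming a token-bucket scheme, the in-memory class below hands out a limited number of tokens per second; the class name, parameters, and rates are hypothetical, and a shared deployment would keep the counters in Redis rather than in process memory.

```python
import time


class TokenBucket:
    """In-memory token bucket; a stand-in for a shared token server (assumption)."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec      # tokens added per second (hypothetical value)
        self.capacity = capacity      # maximum burst size
        self.clock = clock            # injectable clock, which eases testing
        self.tokens = float(capacity)
        self.last = clock()

    def try_acquire(self, n=1):
        """Consume n tokens and return True if available, else return False."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

A job would call `try_acquire()` before each batch of Cassandra requests and back off when it returns False, which caps the aggregate load regardless of how many jobs are running.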

Cost of Rate Limiter
• We converted EMR from an elastic resource to a fixed resource
• To scale EMR we have to scale Cassandra
• Adding capacity to Cassandra cluster is not trivial
• Adding capacity under heavy load is harder
• Auto scaling and reducing under heavy load is even harder

SOLR

BloomReach Search Architecture

[Diagram: one shared Solr cluster with a Zookeeper ensemble. Map Reduce pipelines (reads) 1 to n, indexing pipelines 1 to n, and Public API search traffic all hit the same cluster. Legend: heavy load, moderate load, light load.]

Throughput Issues…

[Diagram: the shared Solr cluster and Zookeeper ensemble serving Pipelines 1 to n, Indexing 1 to n, and Public API search traffic.]

● Heterogeneous read workload

● Same collection - different pipelines, different query patterns

● Cache tuning is virtually impossible

● Larger pipeline starving the small ones

● Machine utilization determines throughput and stability of a pipeline at any point

● No isolation among jobs

Stability and Uptime Issues…

[Diagram: the shared Solr cluster and Zookeeper ensemble serving Pipelines 1 to n, Indexing 1 to n, and Public API search traffic.]

● Bad clients – bring down the cluster/degrade performance

● Bad queries (with heavy load) – render nodes unresponsive

● Garbage collection issues

● ZK stability issues (as we scale collections)

● Higher number of concurrent pipelines, higher number of issues

Indexing Issues…

[Diagram: the shared Solr cluster and Zookeeper ensemble serving Pipelines 1 to n, Indexing 1 to n, and Public API search traffic.]

● Commit frequencies vary with indexer types

● Indexer run during another pipeline – performance

● Indexer client leaks

● Too many stored fields

● Non-batch updates

Rethinking…

• Shared cluster for pipelines does not scale.

• Every job runs great in isolation. When you put them together, they choke.

• Running index-heavy load and read-heavy load simultaneously - cluster performance issues.

• Any direct access to production cluster – cluster stability (client leaks, bad queries etc.).

• Dynamic way of scaling collections (SOLR) – increase/decrease replicas on the fly to help pipelines finish faster.

• What if every pipeline had its own cluster?

• Elastic Infrastructure – Provision Clusters on demand, on-the-fly.

• Create, Use, Terminate Model - Create a temporary cluster with necessary data, use it and throw it away.

• Technologies behind BC2 (built in House)

• Cluster Management - Dynamic cluster provisioning and resource allocation.

• Solr HAFT – High availability and data management library for SolrCloud.

• Cassandra Replication Service – Replicating Cassandra Data to elastic clusters on demand.

• Isolation - Pipelines get their own cluster. One cannot disrupt another.

• Dynamic Scaling – Every pipeline can state its own replication requirements.

• Production Safeguard - No direct access. Safeguards from bad clients/access patterns.

• Cost Saving – Provision for the average; withstand peak with elastic growth.

BloomStore Compute Cloud (BC2)

SOLR Scaling with BC2

SOLR on BC2

[Diagram: Pipeline 1 sends Request: {Collection: A, Replica: 6} to the BC2 API (1); a Solr cluster with Collection A, Replicas: 6 is provisioned (2); the Solr HAFT Service reads from the production Solr cluster with its Zookeeper ensemble and replicates to the provisioned cluster (3); Pipeline 1 runs against the provisioned cluster (4).]

1. Read pipeline requests collection and desired replicas from SC2 API.
2. SC2 API provisions cluster dynamically with needed setup (and streams Solr data).
3. SC2 calls HAFT service to replicate data from production to provisioned cluster.
4. Pipeline uses this cluster to run job.

SOLR on BC2 …

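[Diagram: Pipeline 1 finishes its job (1) and sends Terminate: {Cluster} to the BC2 API (2); the provisioned cluster is terminated (3). Also shown: the Solr cluster, Zookeeper ensemble, and Solr HAFT Service.]

1. Pipeline finishes running the job.
2. Pipeline calls SC2 API to terminate the cluster.
3. SC2 terminates the cluster.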
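The provision and terminate steps on these two slides follow the "Create, Use, Terminate" model. One way to make sure a cluster is always released, even when a job fails, is to wrap the lifecycle in a context manager. The sketch below is illustrative only: `FakeClusterApi`, `provision`, and `terminate` are hypothetical names standing in for the real API, whose interface the slides do not show.

```python
from contextlib import contextmanager


class FakeClusterApi:
    """Hypothetical stand-in for the cluster API; records clusters in memory."""

    def __init__(self):
        self.live = {}
        self.next_id = 1

    def provision(self, collection, replicas):
        cluster_id = f"cluster-{self.next_id}"
        self.next_id += 1
        self.live[cluster_id] = {"collection": collection, "replicas": replicas}
        return cluster_id

    def terminate(self, cluster_id):
        del self.live[cluster_id]


@contextmanager
def ephemeral_cluster(api, collection, replicas):
    """Provision a cluster, hand it to the job, and always terminate it."""
    cluster_id = api.provision(collection, replicas)
    try:
        yield cluster_id
    finally:
        # Runs even if the job raises, so clusters are not leaked.
        api.terminate(cluster_id)
```

A pipeline would then run its job inside `with ephemeral_cluster(api, "A", 6) as cluster_id:` and never call terminate directly.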

SOLR on BC2 – Read View

[Diagram: Pipelines 1 to n each request their own Solr cluster through the BC2 API, for example Request: {Collection: B, Replica: 2} for Pipeline 2 (Collection B, Replicas: 2) and Request: {Collection: C, Replica: 1} for Pipeline n (Collection C, Replicas: 1). Also shown: the Solr HAFT Service, the production Solr cluster, and the Zookeeper ensemble.]

SOLR on BC2 – Indexing

[Diagram: the indexer requests a cluster from the BC2 API (1); the cluster is provisioned (2); the indexer indexes into it (3) and calls the Solr HAFT Service (4), which reads from the dynamic cluster and replicates to production (5). Also shown: the Zookeeper ensemble.]

1. Read pipeline requests collection and desired replicas from SC2 API.
2. SC2 API provisions cluster dynamically with needed setup (and streams Solr data).
3. Indexer uses this cluster to index the data.
4. Indexer calls HAFT service to replicate the index from dynamic cluster to production.
5. HAFT service reads data from dynamic cluster and replicates to production Solr.

SOLR on BC2 – Global View

[Diagram: indexing pipelines 1 to n and read pipelines 1 to n use the BC2 API to Provision: {Cluster} and Terminate: {Cluster}, and run their jobs on elastic clusters. The Solr HAFT Service handles the two Replicate Index flows. Also shown: the Zookeeper ensemble.]

Cassandra Scaling with BC2

Cassandra BC2 Diagram

[Diagram: a source cluster, the BC2 API, and several on-demand clusters. EMR jobs run against the on-demand clusters.]

How Cassandra Replication Works

[Diagram: source cluster and destination cluster. Steps labeled: SSTable file copy and SSTable split computation.]

Cassandra Gains from BC2
• Very high throughput in moving raw data from source to destination cluster (10x increase in network usage compared to normal)
• Little CPU/memory load on the source cluster
• Time to scale varies from 10 to 40 minutes
• API driven, so it automatically scales up and down with demand
• Application agnostic
• Allows use of AWS spot instances and optimizing instance choice around current spot instance pricing
• Removes scan/read load from backend cluster
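The slide does not say how instance choice is optimized around spot pricing. As a rough sketch of one possible approach, the function below picks the instance type that covers a required capacity at the lowest hourly cost. The instance names, prices, and the notion of "capacity units" are all hypothetical, and real prices would come from the cloud provider at request time.

```python
def cheapest_fleet(options, required_capacity):
    """Pick the instance type with the lowest total cost for the needed capacity.

    options: dict of instance type -> (spot_price_per_hour, capacity_units_per_node)
    required_capacity: integer number of capacity units the job needs
    Returns (instance_type, node_count, hourly_cost), or None if options is empty.
    """
    best = None
    for name, (price, capacity) in options.items():
        nodes = -(-required_capacity // capacity)  # ceiling division
        cost = nodes * price
        if best is None or cost < best[2]:
            best = (name, nodes, cost)
    return best
```

For example, with hypothetical options `{"type-a": (0.10, 4), "type-b": (0.25, 16)}` and a requirement of 32 units, the larger type wins because 2 nodes at 0.25 cost less than 8 nodes at 0.10.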

Write Throughput
• Write capacity still defined by frontend latencies
  – Compute delta changes, as most of our data does not change.
  – Add more frontend nodes
  – Experimental changes:
    • Prioritize reads over writes in frontend DC.
    • Column level replication – filter mutations to frontend DC by removing columns not needed in frontend view.
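Two of the ideas above, writing only delta changes and filtering columns before they reach the frontend DC, can be illustrated with plain dictionaries. This is a sketch under simplifying assumptions: rows are modeled as Python dicts keyed by a row key, the function names are hypothetical, and deletions are not handled.

```python
def compute_delta(previous, current):
    """Return only the rows that are new or whose contents changed.

    Rows removed from `current` are ignored in this sketch.
    """
    return {key: row for key, row in current.items() if previous.get(key) != row}


def filter_columns(mutation, frontend_columns):
    """Drop columns that the frontend view does not need."""
    return {col: val for col, val in mutation.items() if col in frontend_columns}
```

Writing only the output of `compute_delta` reduces write volume when most rows are unchanged, and applying `filter_columns` to each mutation shrinks what has to be replicated to the frontend DC.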

BC2 vs Non-BC2

Property Non-BC2 BC2

Linear Scalability for Heterogeneous Workload

Pipeline Level Isolation

Dynamic Collection Scaling

Prevention from Bad Clients

Pipeline Specific Performance

No Direct Access to Production Cluster

Can Sleep at night?

SOLR HAFT Service
1. High availability and fault tolerance
2. Home-grown technology
3. Features
  • One push disaster recovery
  • High availability operations
    • Replace node
    • Add replicas
    • Repair collection
    • Collection versioning
  • Cluster backup operations
    • Dynamic replica creation
    • Cluster clone
    • Cluster swap
    • Cluster state reconstruction

Solr HAFT Service – Functional View

[Diagram: functional blocks of the Solr HAFT Service. Index Management Actions, High Availability Actions, and Cluster Backup Operations group the following labels: Clone Alias, Clone Collections, Custom Commit, Node Replacement, Node Repair, Clone Cluster, Collection Versioning, Black Box Recording, Lucene Segment Optimize, Dynamic Replica Creation, Cluster Clone, Cluster Swap, and Cluster State Reconstruction. Also shown: Solr Metadata, Zookeeper Metadata, Verification, and Monitoring.]

Solr Disaster Recovery in New Architecture

[Diagram: the "Brave Soul on Pager Duty" presses Push Button Recovery (1); the Solr HAFT Service builds a new Solr cluster and Zookeeper ensemble from the old production Solr cluster and its Zookeeper ensemble (2); DNS moves to the new cluster (3).]

1. The engineer on pager duty clicks the recovery button.
2. Solr HAFT Service triggers:
  • Cluster Setup
  • State Reconstruction
  • Cluster Clone
  • Cluster Swap
3. Production DNS points to the new cluster.
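The recovery flow is an ordered sequence in which each step should only run if the previous one succeeded. A minimal sketch of such a runner is shown below; the step names mirror the slide, but the function and the idea of passing callables are assumptions made for illustration, not the HAFT service's actual interface.

```python
def run_recovery(steps):
    """Run recovery steps in order and stop at the first failure.

    steps: list of (name, action) pairs; each action returns True on success.
    Returns the names of the steps that completed.
    """
    completed = []
    for name, action in steps:
        if not action():
            break  # Do not swap clusters or move DNS after a failed step.
        completed.append(name)
    return completed
```

Stopping at the first failure matters here because a DNS switch to a half-built cluster would be worse than staying on the degraded one.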

BC2 vs Non-BC2 (Availability Features)

Property Non-BC2 BC2

Cross Data-Center Support

Cluster Cloning

Collection Versioning

One-Push Disaster Recovery

Repair API for Nodes/Collections

Solr Node Replacements

V2 Architecture

[Diagram: V2 architecture. Components shown: SOLR, C* Frontend DC, C* Backend DC, write pipelines, read pipeline, APIs, on-demand clusters, HAFT Service, BC2 API, and Rate Limiter, with Write-Back and Replication flows.]

Questions?

Thank You!

Nitin Sharma: https://www.linkedin.com/in/knitinsharma

Jorge Rodriguez: https://www.linkedin.com/pub/jorge-rodriguez/5/559/12b
