
Storing eBay's Media Metadata on MongoDB, by Yuri Finkelstein, Architect, eBay


DESCRIPTION

This session will be a case study of eBay’s experience running MongoDB for project Zoom, in which eBay stores all media metadata for the site. This includes references to pictures of every item for sale on eBay. This cluster is eBay's first MongoDB installation on the platform and is a mission critical application. Yuri Finkelstein, an Enterprise Architect on the team, will provide a technical overview of the project and its underlying architecture.


Page 1

Storing eBay's Media Metadata on MongoDB

Yuri Finkelstein, Lead Platform Services Architect, yfinkelstein@ebay.com

John Feibusch, Lead DBA Engineer, jfeibusc@ebay.com

May 2013

Page 2

About eBay Platform Services

Platform Services is an org within the larger eBay Platform org, responsible for developing and operating common services used by Web Applications running on the eBay Platform:

• Media Storage platform services: image blob and metadata

• Unified Monitoring platform: logs and metrics

• User Behavior Tracking

• Ad Content management and analytics

• Messaging and other middleware services

Page 3

Platform Services and Media Metadata Service Requirements

Platform Services is a DevOps organization

• We develop, we test, we deploy, we operate, we monitor

• Whatever we are responsible for, we own and understand at the depth of the entire stack

• Therefore, we require transparency of the components we build on

• Transparency at the level of source code visibility is ideal

Page 4

Key Requirements

Key requirements of Media Metadata Service

• 99.999% availability

• Strictly defined invocation latency at the 95th percentile

• Simultaneous operation in multiple data centers with short replication latency

• Reliable writes: synchronous writes to at least 2 nodes

• Read-write workload with a read/write ratio of ~10:1

• Agility and fluid metadata content; constantly changing business requirements

• Terabyte scale: billions of small entities to store and query

• Extreme scalability: the number of pictures on eBay is constantly growing

Page 5

Enter MongoDB

We have been operating MongoDB in this project for over a year now

Sharded cluster in 2 data centers

Service nodes are built in Java and use Morphia and the Mongo driver

MongoS runs on the service nodes

In the first year we matured the cluster on a write-only workload; this year we are taking reads

Reads are from the user-facing web applications with strong SLA requirements

For reads, the client first sets SlaveOK=true; if the required document is not found, it flips to SlaveOK=false and reads from the primary (a code sketch follows the diagram below)

[Diagram: the sharded cluster spanning data centers DC1 and DC2; shards run left to right, replica set members top to bottom, with primaries (P) and hidden members (H) in each shard. Service instances (S) run the Metadata Service Node stack: Service Layer, Morphia, Mongo Driver, and a co-located MongoS. S = service instance; P = primary mongod; H = hidden member.]
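For illustration, a minimal sketch of this secondary-then-primary fallback as it could look in a Java service layer. It uses the modern driver's ReadPreference API rather than the old SlaveOK flag, and the class and collection names are hypothetical:

  import com.mongodb.ReadPreference;
  import com.mongodb.client.MongoCollection;
  import com.mongodb.client.model.Filters;
  import org.bson.Document;

  public final class FallbackReader {
      private final MongoCollection<Document> images; // hypothetical collection

      public FallbackReader(MongoCollection<Document> images) {
          this.images = images;
      }

      // Try a secondary first (the SlaveOK=true case); if the document has
      // not replicated yet, retry against the primary (SlaveOK=false).
      public Document findById(Object id) {
          Document doc = images
                  .withReadPreference(ReadPreference.secondaryPreferred())
                  .find(Filters.eq("_id", id)).first();
          if (doc == null) {
              doc = images
                      .withReadPreference(ReadPreference.primary())
                      .find(Filters.eq("_id", id)).first();
          }
          return doc;
      }
  }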

Page 6

Centralized MongoDB configuration store

Our MongoDB deployment package is based on a custom-built RPM and contains heavy customization scripts. One of them is responsible for fetching the configuration for the node it's running on from a remote configuration repository at start-up time.

Benefits:
• Can change MongoDB configuration instantly on an arbitrarily large number of nodes
• Can change local system settings affecting MongoDB: read-ahead settings on block devices and the IO scheduler
• Can relocate replica set members across machines (subject to data migration)
• Consistent inventory tracking; visibility into config settings on any Mongo machine

[Diagram: at startup, each mongod (P) pulls its configuration from the central MongoDB config repository.]
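For illustration only, the start-up fetch could be as simple as the following sketch; the URL, paths, and naming are assumptions, since the actual scripts ship inside the custom RPM:

  # Pull this node's configuration from the central repository, then start mongod.
  curl -fsS "http://config-repo.internal/mongodb/$(hostname)/mongod.conf" -o /etc/mongod.conf
  mongod --config /etc/mongod.conf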

Page 7

Upstart

Upstart is a replacement for init.d; developed for Ubuntu, it is also used in RHEL 6.

Can automatically start our monitoring agent whenever mongod starts.

Handles multiple mongod instances well

Example:

sudo start mongod interface=0
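For illustration, a hypothetical job file that would support the example above, using standard Upstart multi-instance syntax; the file layout and paths are assumptions, not eBay's actual configuration:

  # /etc/init/mongod.conf
  description "mongod bound to one virtual interface"

  # 'instance' makes this a multi-instance job: 'start mongod interface=0'
  # and 'start mongod interface=1' run side by side.
  instance $interface

  start on runlevel [2345]
  stop on runlevel [016]
  respawn

  # Each instance reads its own config file (hypothetical naming scheme).
  exec /usr/bin/mongod --config /etc/mongod-$interface.conf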

Future: Upstart can be controlled by Puppet.

Page 8

Run multiple MongoD instances on the same machine

We are starting to run multiple mongod processes on one node.

Instead of using multiple ports, we create multiple virtual interfaces on a single host and register them in DNS as if they were real IP addresses.

MongoD supports bind_ip, which makes it possible to bind to a specific virtual interface.

Why virtual interfaces? So that DB hosts can be moved with just a DNS change (see the sketch below).

Why do we want to run multiple MongoD on a single host?
• On large machines with lots of disk IO and storage capacity, a single mongod cannot utilize all IO resources
• Running multiple shards on the same machine reduces data granularity and reduces the scope of each write lock
• This works well only when the multiple MongoD on the same machine have similar workloads
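A minimal sketch of this setup with hypothetical addresses and paths (in practice the provisioning is handled by the configuration scripts described earlier):

  # One virtual interface (alias) per mongod instance; DNS names point at
  # the alias IPs, so an instance can move hosts with only a DNS change.
  ip addr add 10.0.0.11/24 dev eth0 label eth0:0
  ip addr add 10.0.0.12/24 dev eth0 label eth0:1

  # /etc/mongod-0.conf (one file per instance; bind_ip pins the instance
  # to its alias, so every instance can keep the default port)
  #   bind_ip = 10.0.0.11
  #   port = 27017
  #   dbpath = /data/mongod-0

  mongod --config /etc/mongod-0.conf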

Page 9

Home-grown MongoDB monitoring system

A home-grown agent runs on each MongoDB host and collects very specific metrics that are not available in MMS:
• Per block-device disk write latency and disk IOPS
• Details of per-collection MongoDB metrics

Can overlay multiple graphs from RS members on the same chart.

GLE latency is very important, since we are doing getLastError({w:2}).
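In today's Java driver the getLastError({w:2}) pattern is expressed as a write concern on the collection; a minimal sketch, assuming the modern driver API:

  import com.mongodb.WriteConcern;
  import com.mongodb.client.MongoCollection;
  import org.bson.Document;

  public final class ReliableWrites {
      // Insert acknowledged by at least 2 replica set members: the modern
      // equivalent of an insert followed by getLastError({w:2}).
      public static void insertW2(MongoCollection<Document> coll, Document doc) {
          coll.withWriteConcern(WriteConcern.W2).insertOne(doc);
      }
  }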

Page 10

Media Metadata Service: Data Model

2 main collections: Item and Image
• An Item references multiple Images

Item represents an eBay item:
• _id in Item is the external ID of the item in the eBay site DB
• These IDs are already sharded and balanced across N logical DB hosts using ID ranges
• We use MongoDB pre-split points for the initial mapping of our N site DB shards to M MongoDB shards
• This ensures good balance between the shards

Image represents a picture attached to an Item:
• _id in Image is based on a modified ObjectID of Mongo
• This ensures good distribution across any number of shards

Our choice of document IDs in both collections ensures good balance across Mongo shards.
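As an illustration, the pre-split could look like this in the mongo shell; the namespace and boundary values are made up, while the real split points mirror the site DB's ID ranges:

  sh.shardCollection("media.item", { _id: 1 })
  // One split point per site-DB ID range boundary (values are hypothetical):
  sh.splitAt("media.item", { _id: NumberLong("1000000000") })
  sh.splitAt("media.item", { _id: NumberLong("2000000000") })
  // ... repeat for each of the N site shard boundaries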

Page 11

Problem #1: What should be the ID for the documents?

ObjectId is not a good shard key for a sharded collection, because the timestamp occupies its first 4 bytes.

Problem: how should the app generate the ID when this is required?

Requirements:
• Even distribution across shards, both long term and short term
• Locality of the indexed _id values in the B-Tree: minimize the chance of a page fault on an index page, and increase the chance that dirty pages collate in the page cache, to reduce the amount of random IO when flushing pages to disk
• Compactness in size is always good, to preserve space

One possible solution: a 6-byte ID laid out as follows
• 1 byte: rotating sequence ID, incremented by each writer on every document
• 1 byte: writer ID; assuming the number of writers < 256
• 4 bytes: timestamp in seconds

Works with the limitation that each writer cannot insert more than 256 documents per second (see the generator sketch below).

MongoDB ObjectId(): Timestamp (4 bytes) | MachineID (4 bytes) | SequenceNo (4 bytes)

Shard-Friendly ID: SequenceNo (1 byte) | WriterID (1 byte) | Timestamp (4 bytes)
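A minimal sketch of such a generator in Java, assuming writer IDs are assigned externally and are unique per writer:

  import java.nio.ByteBuffer;

  public final class ShardFriendlyId {
      private final byte writerId; // unique per writer; assumes < 256 writers
      private int seq = 0;         // rotating sequence, wraps at 256

      public ShardFriendlyId(int writerId) {
          this.writerId = (byte) writerId;
      }

      // Layout: SequenceNo (1 byte) | WriterID (1 byte) | Timestamp seconds (4 bytes).
      // Per the limitation above, a writer inserting more than 256 docs in one
      // second wraps the sequence within that second and risks duplicate IDs.
      public synchronized byte[] next() {
          ByteBuffer buf = ByteBuffer.allocate(6);
          buf.put((byte) (seq++ & 0xFF));
          buf.put(writerId);
          buf.putInt((int) (System.currentTimeMillis() / 1000L));
          return buf.array();
      }
  }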

Page 12

Shard-Friendly ID details

[Diagram: the 6-byte ID space, from 00… through 0f…, 55…, aa…, up to ff…, partitioned by the leading sequence byte (Seq=0 … Seq=255). During the N-th minute each sequence value receives 20 contiguous ranges, one per writer; the pattern repeats in minute N+1.]

Let's say we have 20 writers and 3 shards.

Number of contiguous intervals in each shard: 256/3 × 20 ≈ 1,700

Worst-case scenario: each contiguous range requires a separate IO; at 200 IOPS, that is roughly 8.5 sec to flush. In reality it's much better because of 4K pages.

Rate of writes: 256 docs/sec per writer

Number of dirty locations over 1 minute: 256 × 60 × 20 = 307,200. So, if _id were md5 or some other random value with a near-perfect distribution, this would require roughly 180 times more IOPS.

Page 13

Problem #2: md5 lookup problem

md5 is a digest of the image content, used for de-duplication.

Requirement: find image documents with a given md5 value.

Option 1: a secondary index on the image documents. Does not work, because:
• Large DB; random reads cause disk IO
• The Image collection is sharded by image ID, so we are forced to query all shards

Option 2: a stand-alone replica set (a cache)
• Works, since the data is compact and fits in RAM; no disk IO
• How do we store the md5 -> image ID mapping in Mongo?
• Option 2.1: as an array. Does not work well: as refs are added, documents grow and relocate.
• Option 2.2: a single binary value packed into the _id. Works; the lookup is a prefix search on a covering index.

Option 2.1: { _id: Binary(md5), ref: [ref1, ref2, ref3, …] }

Option 2.2: { _id: Binary(md5|ref) }

Query: db.coll.find({ _id: { $gt: Binary(md5|0x0000), $lt: Binary(md5|0xffff) } }, { _id: 1 })
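A sketch of the option 2.2 lookup as it could look from the Java service layer. The 16-byte md5 and fixed 6-byte refs packed into a 22-byte _id are assumptions for the example; BSON compares binaries of equal length and subtype bytewise, which is what makes the prefix range work:

  import com.mongodb.client.MongoCollection;
  import com.mongodb.client.model.Filters;
  import org.bson.Document;
  import org.bson.types.Binary;
  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.List;

  public final class Md5Lookup {
      // Covered prefix-range scan on _id: every matching _id starts with the md5.
      public static List<byte[]> findRefs(MongoCollection<Document> coll, byte[] md5) {
          byte[] low = Arrays.copyOf(md5, 22);     // md5 followed by 0x00 padding
          byte[] high = Arrays.copyOf(md5, 22);
          Arrays.fill(high, 16, 22, (byte) 0xFF);  // md5 followed by 0xff padding
          List<byte[]> refs = new ArrayList<>();
          for (Document d : coll.find(Filters.and(
                          Filters.gte("_id", new Binary(low)),
                          Filters.lte("_id", new Binary(high))))
                  .projection(new Document("_id", 1))) {
              byte[] id = d.get("_id", Binary.class).getData();
              refs.add(Arrays.copyOfRange(id, 16, id.length)); // strip md5 prefix
          }
          return refs;
      }
  }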

Page 14

Problem #3: Item’s main picture size lookup

An Image document carries the image dimensions: width and height.

An Item document references N pictures; one of them is the main picture.

Problem: look up the image dimensions of the main picture for 50 item documents at once, with a latency SLA of < 20 msec.

It's a variation of Problem #2, except it's worse: the item ID and the image dimensions are in different documents, and 50 lookups at once are required.

Again we need a dedicated replica set.

Option 1: prefix search with $or and $and
Option 2: just query by _id
Option 3: query by _id, but against another compound index: {_id: 1, wh: 1}

The winner is option #3! Hint: covering index.

Option 1 document: { _id: Binary(item|WxH) }
Query:
  db.coll.find({ $or: [
    { _id: { $gt: Binary(id1|0x0000), $lt: Binary(id1|0xffff) } },
    { _id: { $gt: Binary(id2|0x0000), $lt: Binary(id2|0xffff) } },
    … ] })

Option 2 document: { _id: item, wh: WxH }
Query:
  db.coll.find({ _id: { $in: [item1, item2, …] } })

Option 3 document: { _id: item, wh: WxH }
Query (with a projection so the {_id: 1, wh: 1} index covers the query):
  db.coll.find({ _id: { $in: [item1, item2, …] } }, { _id: 1, wh: 1 }).hint({ _id: 1, wh: 1 })
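The same covered lookup sketched from Java, assuming a driver version that supports FindIterable.hint:

  import com.mongodb.client.MongoCollection;
  import com.mongodb.client.model.Filters;
  import com.mongodb.client.model.Indexes;
  import com.mongodb.client.model.Projections;
  import org.bson.Document;
  import java.util.ArrayList;
  import java.util.List;

  public final class MainPictureSizes {
      // Covered batch lookup (option 3): the filter, projection and hint all
      // stay inside the {_id: 1, wh: 1} index, so no documents are fetched.
      public static List<Document> lookup(MongoCollection<Document> items,
                                          List<Object> itemIds) {
          return items.find(Filters.in("_id", itemIds))
                      .projection(Projections.include("_id", "wh"))
                      .hint(Indexes.ascending("_id", "wh"))
                      .into(new ArrayList<>());
      }
  }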

Page 15

Problem #4: Periodic export to Hadoop

Problem: a daily copy of new or updated documents to Hadoop

Option 1: the service does 2 writes, one to Mongo and one to Hadoop
• Does not work, since Hadoop is not an online system

Option 2: a secondary index on lastUpdated (a date), then query on lastUpdated > T
• Does not work well, since updating the indexed lastUpdated field is costly; also, consuming a large number of docs from a live cluster is disruptive to latency SLAs

Option 3: OpLog replication
• Winner: decouples the export from site activity and makes the lastUpdated index unnecessary

[Diagram: an OpLog Listener tails the oplog on each shard's replica set (P) and forwards new and updated documents to Hadoop.]
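A minimal sketch of such an oplog listener with the modern Java driver; the host name and namespace are hypothetical, and a production listener would also checkpoint the oplog timestamp (ts) so it can resume:

  import com.mongodb.CursorType;
  import com.mongodb.client.MongoClient;
  import com.mongodb.client.MongoClients;
  import com.mongodb.client.MongoCollection;
  import com.mongodb.client.model.Filters;
  import org.bson.Document;

  public final class OplogListener {
      public static void main(String[] args) {
          // Connect directly to a replica set member (e.g., the hidden one),
          // not through MongoS; the oplog is per replica set.
          MongoClient client = MongoClients.create("mongodb://shard1-hidden:27017");
          MongoCollection<Document> oplog =
                  client.getDatabase("local").getCollection("oplog.rs");
          // Tail only inserts ("i") and updates ("u") on our namespace.
          for (Document op : oplog.find(Filters.and(
                          Filters.in("op", "i", "u"),
                          Filters.eq("ns", "media.image")))
                  .cursorType(CursorType.TailableAwait)) {
              exportToHadoop(op);
          }
      }

      private static void exportToHadoop(Document op) {
          System.out.println(op.toJson()); // stand-in for the real export pipeline
      }
  }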

Page 16

Problem #5: What’s the fastest way to perform a full scan?

Problem: you have a huge database/collection, with terabytes of data and billions of documents

You need to perform a form of batch processing on all the documents and you want the fastest pipe out of mongo

Option 1: do it on a live node as it's serving traffic
• Does not work well when the node is busy
• Also, data consistency may be an issue
OK, so we need to take the node off-line.

Option 2: execute a natural-order scan with a single natural-order cursor
• Works, but slow; lots of synchronization between the two sides

Option 3: N cursors using a range query on _id or any other indexed field
• Slow in the general case, when the order of indexed values in the B-Tree and the order on disk do not match

Option 4: N natural-order cursors (queries below)

One cursor:
  db.collection.find().sort({ $natural: 1 })

N cursors (i = cursor index, N = documents per cursor):
  db.collection.find().sort({ $natural: 1 }).skip(i * N).limit(N)
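A sketch of the N-cursor scan from Java, intended for a node taken out of rotation; one pass over workers × chunk documents is shown, and a full scan would advance the windows in a loop:

  import com.mongodb.client.MongoCollection;
  import org.bson.Document;

  public final class ParallelScan {
      // Each worker opens its own natural-order cursor over a disjoint window.
      public static void scan(MongoCollection<Document> coll, int workers, int chunk)
              throws InterruptedException {
          Thread[] threads = new Thread[workers];
          for (int w = 0; w < workers; w++) {
              final int i = w;
              threads[w] = new Thread(() -> {
                  for (Document d : coll.find()
                          .sort(new Document("$natural", 1)) // disk order, no index
                          .skip(i * chunk)
                          .limit(chunk)) {
                      process(d);
                  }
              });
              threads[w].start();
          }
          for (Thread t : threads) t.join();
      }

      private static void process(Document doc) {
          // placeholder for the batch-processing step
      }
  }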

Page 17

Summary

We are running MongoDB in a demanding environment, where it is exposed to business-sensitive online applications.

It has proven reliable, and this is what matters.

It has lots of features and gives the user lots of options to choose from.

It is the user's depth of understanding of the product, and the desire to have visibility into every aspect of its performance, that will determine whether a particular use case is a success or not.

Page 18

Questions?

Thank you!

Btw, if any of this sounds interesting, we have lots of similar challenges to work on. So, you know the drill: yfinkelstein at ebay dot com