Retail Reference Architecture with MongoDB Antoine Girbal Principal Solutions Engineer, MongoDB Inc. @antoinegirbal


Page 1: Retail Reference Architecture

Retail Reference Architecture with MongoDB

Antoine Girbal, Principal Solutions Engineer, MongoDB Inc., @antoinegirbal

Page 2: Retail Reference Architecture

Introduction

Page 3: Retail Reference Architecture

4

• it is way too broad to tackle with one solution

• data maps so well to the document model

• needs for agility, performance and scaling

• Many (e)retailers are already using MongoDB

• Let's define the best ways and places for it!

Retail solution

Page 4: Retail Reference Architecture

5

• Holds complex JSON structures

• Dynamic Schema for Agility

• complex querying and in-place updating

• Secondary, compound and geo indexing

• full consistency, durability, atomic operations

• Near linear scaling via sharding

• Overall, MongoDB is a unique fit!

MongoDB is a great fit

Page 5: Retail Reference Architecture

6

MongoDB Strategic Advantages

Horizontally Scalable - Sharding

Agile, Flexible

High Performance & Strong Consistency

Application

Highly Available - Replica Sets

{ customer: "roger", date: new Date(), comment: "Spirited Away", tags: ["Tezuka", "Manga"] }

Page 6: Retail Reference Architecture

7

build your data to fit your application

Relational:

CustomerID  First Name  Last Name  City
0           John        Doe        New York
1           Mark        Smith      San Francisco
2           Jay         Black      Newark
3           Meagan      White      London
4           Edward      Danields   Boston

Order Number  Store ID  Product       Customer ID
10            100       Tablet        0
11            101       Smartphone    0
12            101       Dishwasher    0
13            200       Sofa          1
14            200       Coffee table  1
15            201       Suit          2

MongoDB:

{ customer_id : 1,
  name : "Mark Smith",
  city : "San Francisco",
  orders: [
    { order_number : 13,
      store_id : 10,
      date: "2014-01-03",
      products: [
        { SKU: 24578234, Qty: 3, Unit_price: 350 },
        { SKU: 98762345, Qty: 1, Unit_price: 110 }
      ]
    },
    { <...> }
  ]
}

Page 7: Retail Reference Architecture

8

Notions

RDBMS MongoDB

Database Database

Table Collection

Row Document

Column Field

Page 8: Retail Reference Architecture

Retail Components Overview

Page 9: Retail Reference Architecture

10

Information Management

Merchandising

Content

Inventory

Customer

Channel

Sales & Fulfillment

Insight

Social

Architecture Overview

Customer

Channels: Amazon, Ebay, …

Stores: POS, Kiosk

Mobile: Smartphone, Tablet

Website

Contact Center

API: Data and Service Integration

Social: Facebook, Twitter, …

Data Warehouse

Analytics

Supply Chain Management

System

Suppliers

3rd Party

In Network

Web Servers

Application Servers

Page 10: Retail Reference Architecture

11

Commerce Functional Components

Information Layer

Look & Feel

Navigation

Customization

Personalization

Branding

Promotions

Chat

Ads

Customer's Perspective

Research: Browse, Search

Select: Shopping Cart

Purchase: Checkout

Receive: Track

Use: Feedback, Maintain

Dialog: Assist

Market / Offer

Guide

Offer

Semantic Search

Recommend

Rule-based Decisions

Pricing

Coupons

Sell / Fulfill

Orders

Payments

Fraud Detection

Fulfillment

Business Rules

Insight: Session Capture, Activity Monitoring

Customer Enterprise

Information Management

Merchandising

Content

Inventory

Customer

Channel

Sales & Fulfillment

Insight

Social

Page 11: Retail Reference Architecture

Merchandising

Page 12: Retail Reference Architecture

13

Merchandising

Merchandising

MongoDB

Variant

Hierarchy

Pricing

Promotions

Ratings & Reviews

Calendar

Semantic Search

Item

Localization

Page 13: Retail Reference Architecture

14

• Single view of a product, one central catalog service

• Read volume high and sustained, 100k reads / s

• Write volume spikes up during catalog update

• Advanced indexing and querying

• Geographical distribution and low latency

• No need for a cache layer, CDN for assets

Merchandising - principles

Page 14: Retail Reference Architecture

15

Merchandising - requirements

Requirement / Example challenge / MongoDB

• Single view of product
– Example challenge: Blended description and hierarchy of product to ensure availability on all channels
– MongoDB: Flexible document-oriented storage

• High sustained read volume with low latency
– Example challenge: Constant querying from online users and sales associates, requiring immediate response
– MongoDB: Fast indexed querying, replication allows local copy of catalog, sharding for scaling

• Spiky and real-time write volume
– Example challenge: Bulk update of full catalog without impacting production, real-time touch update
– MongoDB: Fast in-place updating, real-time indexing, sharding for scaling

• Advanced querying
– Example challenge: Find product based on color, size, description
– MongoDB: Ad-hoc querying on any field, advanced secondary and compound indexing

Page 15: Retail Reference Architecture

16

Merchandising - Product Page

Product images

General Information

List of Variants

External Information

Localized Description

Page 16: Retail Reference Architecture

17

> db.item.findOne()

{ _id: "301671", // main item id

department: "Shoes",

category: "Shoes/Women/Pumps",

brand: "Guess",

thumbnail: "http://cdn…/pump.jpg",

image: "http://cdn…/pump1.jpg", // larger version of thumbnail

title: "Evening Platform Pumps",

description: "These evening platform pumps put the perfect finishing touches on your most glamorous night-on-the-town outfit",

shortDescription: "Evening Platform Pumps",

style: "Designer",

type: "Platform",

rating: 4.5, // user rating

lastUpdated: ISODate("2014-04-01T00:00:00Z"), // last update time

… }

Merchandising - Item Model

Page 17: Retail Reference Architecture

18

• Get item by id

db.item.findOne( { _id: "301671" } )

• Get items from a list of item ids

db.item.find( { _id: { $in: ["301671", "301672" ] } } )

• Get items by department

db.item.find({ department: "Shoes" })

• Get items by category prefix

db.item.find( { category: /^Shoes\/Women/ } )

• Indices

_id (the item id, indexed by default), department, category, lastUpdated

Merchandising - Item Definition

Page 18: Retail Reference Architecture

19

> db.variant.findOne()

{

_id: "730223104376", // the sku

itemId: "301671", // references item id

thumbnail: "http://cdn…/pump-red.jpg", // variant specific

image: "http://cdn…/pump-red.jpg",

size: 6.0,

color: "Red",

width: "B",

heelHeight: 5.0,

lastUpdated: ISODate("2014-04-01T00:00:00Z"), // last update time

}

Merchandising – Variant Model

Page 19: Retail Reference Architecture

20

• Get variant from SKU

db.variant.find( { _id: "730223104376" } )

• Get all variants for an item, sorted by SKU

db.variant.find( { itemId: "301671" } ).sort( { _id: 1 } )

• Indices

itemId, lastUpdated

Merchandising – Variant Model

Page 20: Retail Reference Architecture

22

Per store Pricing could result in billions of documents,

unless you build it in a modular way

Price: {

_id: "sku730223104376_store123",

currency: "USD",

price: 89.95,

lastUpdated: ISODate("2014-04-01T00:00:00Z"), // last update time

}

_id: concatenation of item and store.

Item: can be an item id or sku

Store: can be a store group or store id.

Indices: lastUpdated

Merchandising – per store Pricing

Page 21: Retail Reference Architecture

23

• Get all prices for a given item

db.prices.find( { _id: /^p301671_/ } )

• Get all prices for a given sku (price could be at item level)

db.prices.find( { _id: { $in: [ /^sku730223104376_/, /^p301671_/ ] } } )

• Get minimum and maximum prices for a sku

db.prices.aggregate( [ { $match: { _id: /^sku730223104376_/ } },
  { $group: { _id: 1, min: { $min: "$price" }, max: { $max: "$price" } } } ] )

• Get price for a sku and store id (returns up to 4 prices)

db.prices.find( { _id: { $in: [ "sku730223104376_store1234",
                                "sku730223104376_sgroup0",
                                "p301671_store1234",
                                "p301671_sgroup0" ] } }, { price: 1 } )

Merchandising – per store Pricing
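
The "up to 4 prices" lookup can be sketched in plain JavaScript. This is a minimal sketch, not the deck's implementation: the helper names `priceIdCandidates` and `pickPrice` are hypothetical, and the precedence order (SKU before item, store before store group) is an assumption, since the slides only state that up to four documents may come back.

```javascript
// Build the candidate _id values for a price lookup, most specific first.
// The "sku"/"p" prefixes and the "store"/"sgroup" suffixes follow the sample ids
// on the slide; the function itself is a hypothetical helper.
function priceIdCandidates(sku, itemId, storeId, storeGroup) {
  return [
    "sku" + sku + "_store" + storeId,
    "sku" + sku + "_sgroup" + storeGroup,
    "p" + itemId + "_store" + storeId,
    "p" + itemId + "_sgroup" + storeGroup,
  ];
}

// Given the documents returned by the $in query, pick the most specific one.
// Assumed precedence: the order of the candidate list.
function pickPrice(candidates, docs) {
  const byId = new Map(docs.map((d) => [d._id, d]));
  for (const id of candidates) {
    if (byId.has(id)) return byId.get(id);
  }
  return null; // no price defined at any level
}

const ids = priceIdCandidates("730223104376", "301671", "1234", "0");
// Pretend the database only holds a store-group price for the SKU and for the item.
const found = [
  { _id: "p301671_sgroup0", price: 99.95 },
  { _id: "sku730223104376_sgroup0", price: 89.95 },
];
const best = pickPrice(ids, found);
console.log(best._id, best.price);
```

In the shell, the `ids` array would be passed as the `$in` list of the last query above, and `pickPrice` applied to the resulting documents.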

Page 22: Retail Reference Architecture

26

Merchandising – Browse and Search products

Browse by category

Special Lists

Filter by attributes

Lists hundreds of item summaries

Ideally a single query is issued to the database to obtain all items and metadata to display

Page 23: Retail Reference Architecture

27

The previous page presents many challenges:

• Response within milliseconds for hundreds of items

• Faceted search on many attributes: category, brand, …

• Attributes at the variant level: color, size, etc, and the variation's image should be shown

• thousands of variants for an item, need to de-duplicate

• Efficient sorting on several attributes: price, popularity

• Pagination feature which requires deterministic ordering

Merchandising – Browse and Search products

Page 24: Retail Reference Architecture

28

Merchandising – Browse and Search products

Hundreds of sizes

One Item

Dozens of colors

A single item may have thousands of variants

Page 25: Retail Reference Architecture

29

Merchandising – Browse and Search products

Images of the matching variants are displayed

HierarchySort

parameter

Faceted Search

Page 26: Retail Reference Architecture

30

Merchandising – Traditional Architecture

Relational DB (System of Records)

Full Text Search Engine

Indexing

#1 obtain search results IDs

Application

Cache

#2 obtain objects by ID

Pre-joined into objects

Page 27: Retail Reference Architecture

31

The traditional architecture issues:

• 3 different systems to maintain: RDBMS, Search engine, Caching layer

• search returns a list of IDs to be looked up in the cache, increases latency of response

• RDBMS schema is complex and static

• The search index is expensive to update

• Setup does not allow efficient pagination

Merchandising – Traditional Architecture

Page 28: Retail Reference Architecture

32

MongoDB Data Store

Merchandising - Architecture

Summaries, Items, Pricing

Promotions, Variants, Ratings & Reviews

#1 Obtain results

Page 29: Retail Reference Architecture

33

The summary relies on the following parameters:

• department e.g. "Shoes"

• An indexed attribute

– Category path, e.g. "Shoes/Women/Pumps"

– Price range

– List of Item Attributes, e.g. Brand = Guess

– List of Variant Attributes, e.g. Color = red

• A non-indexed attribute

– List of Item Secondary Attributes, e.g. Style = Designer

– List of Variant Secondary Attributes, e.g. heel height = 4.0

• Sorting, e.g. Price Low to High

Merchandising – Summary Model

Page 30: Retail Reference Architecture

34

> db.summaries.findOne()

{ "_id": "p39",

"title": "Evening Platform Pumps 39",

"department": "Shoes", "category": "Shoes/Women/Pumps",

"thumbnail": "http://cdn…/pump-small-39.jpg", "image": "http://cdn…/pump-39.jpg",

"price": 145.99,

"rating": 0.95,

"attrs": [ { "brand" : "Guess"}, … ],

"sattrs": [ { "style" : "Designer"} , { "type" : "Platform"}, …],

"vars": [

{ "sku": "sku2441",

"thumbnail": "http://cdn…/pump-small-39.jpg.Blue",

"image": "http://cdn…/pump-39.jpg.Blue",

"attrs": [ { "size": 6.0 }, { "color": "Blue" }, …],

"sattrs": [ { "width" : "B"} , { "heelHeight" : 5.0 }, …],

}, … Many more skus …

] }

Merchandising – Summary Model

Page 31: Retail Reference Architecture

35

• Get summary from item id

db.summaries.find({ _id: "p301671" })

• Get summary's specific variation from SKU

db.summaries.find( { "vars.sku": "730223104376" }, { "vars.$": 1 } )

• Get summary by department, sorted by rating

db.summaries.find( { department: "Shoes" } ).sort( { rating: 1 } )

• Get summary with mix of parameters

db.summaries.find( { department : "Shoes",
  "vars.attrs" : { "color" : "Gray" },
  "category" : /^Shoes\/Women/,
  "price" : { "$gte" : 65.99, "$lte" : 180.99 } } )

Merchandising - Summary Model

Page 32: Retail Reference Architecture

36

Merchandising – Summary Model

• The following indices are used:
– department + attrs + category + _id
– department + vars.attrs + category + _id
– department + category + _id
– department + price + _id
– department + rating + _id

• _id used for pagination

• Can take advantage of index intersection

• With several attributes specified (e.g. color=red and size=6), which one is looked up?
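
The slides do not show how `_id` is used for pagination. One common approach, sketched here as an assumption, is range-based (keyset) pagination: sort on the chosen field plus `_id` as a tie-breaker, and start the next page strictly after the last document already shown. The function name `pageQuery` is hypothetical.

```javascript
// Build the filter and sort for one page of summaries.
// base:      the user's filter, e.g. { department: "Shoes" }
// sortField: the field the page is ordered by, e.g. "price"
// last:      the last document of the previous page, or null for the first page
function pageQuery(base, sortField, last) {
  const sort = { [sortField]: 1, _id: 1 }; // _id makes the ordering deterministic
  if (!last) return { filter: base, sort };
  return {
    filter: {
      ...base,
      $or: [
        // strictly after the last value seen...
        { [sortField]: { $gt: last[sortField] } },
        // ...or the same value with a higher _id
        { [sortField]: last[sortField], _id: { $gt: last._id } },
      ],
    },
    sort,
  };
}

const firstPage = pageQuery({ department: "Shoes" }, "price", null);
const nextPage = pageQuery({ department: "Shoes" }, "price", { _id: "p39", price: 145.99 });
console.log(JSON.stringify(nextPage.filter));
```

In the shell this would be used as `db.summaries.find(q.filter).sort(q.sort).limit(pageSize)`, where `pageSize` is whatever the page displays; the sort matches the "department + price + _id" index listed above.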

Page 33: Retail Reference Architecture

37

Facet samples:

{ "_id" : "Accessory Type=Hosiery" , "count" : 14}

{ "_id" : "Ladder Material=Steel" , "count" : 2}

{ "_id" : "Gold Karat=14k" , "count" : 10138}

{ "_id" : "Stone Color=Clear" , "count" : 1648}

{ "_id" : "Metal=White gold" , "count" : 10852}

Single operations to insert / update:

db.facet.update( { _id: "Accessory Type=Hosiery" },

{ $inc: { count: 1 } }, true, false)

The facet with lowest count is the most restrictive…

It should come first in the query!

Merchandising – Facet
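
The "lowest count first" rule can be sketched as a small sort. This is an illustrative sketch: `orderFacets` is a hypothetical helper, and treating a facet with no recorded count as least restrictive is an assumption.

```javascript
// Order the requested facets so that the most restrictive one (lowest count) comes first.
// counts maps a facet _id ("Name=Value", as in the samples above) to its count.
function orderFacets(requested, counts) {
  // Unknown facets are assumed to be the least restrictive and go last.
  const countOf = (id) => (id in counts ? counts[id] : Number.MAX_SAFE_INTEGER);
  return [...requested].sort((a, b) => countOf(a) - countOf(b));
}

const facetCounts = {
  "Gold Karat=14k": 10138,
  "Stone Color=Clear": 1648,
  "Metal=White gold": 10852,
};
const ordered = orderFacets(
  ["Metal=White gold", "Gold Karat=14k", "Stone Color=Clear"],
  facetCounts
);
console.log(ordered);
```

The first element of `ordered` is the attribute to look up through the index; the remaining attributes then filter an already small result set.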

Page 34: Retail Reference Architecture

38

Merchandising – Query stats

Department  Category  Price  Primary attribute  Avg time (ms)  90th (ms)  95th (ms)
1           0         0      0                  2              3          3
1           1         0      0                  1              2          2
1           0         1      0                  1              2          3
1           1         1      0                  1              2          2
1           0         0      1                  0              1          2
1           1         0      1                  0              1          1
1           0         1      1                  1              2          2
1           1         1      1                  0              1          1
1           0         0      2                  1              3          3
1           1         0      2                  0              2          2
1           0         1      2                  10             20         35
1           1         1      2                  0              1          1

Page 35: Retail Reference Architecture

Inventory

Page 36: Retail Reference Architecture

42

Inventory – Traditional Architecture

Relational DB (System of Records)

Nightly Batches

Analytics, Aggregations, Reports

Caching Layer

Field Inventory

Internal & External Apps

Point-in-time Loads

Page 37: Retail Reference Architecture

43

Opportunities Missed

• Can't reliably detect availability

• Can't redirect purchasers to in-store pickup

• Can’t do intra-day replenishment

• Degraded customer experience

• Higher internal expense

Page 38: Retail Reference Architecture

44

Inventory – Principles

• Single view of the inventory

• Used by most services and channels

• Read dominated workload

• Local, real-time writes

• Bulk writes for refresh

• Geographically distributed

• Horizontally scalable

Page 39: Retail Reference Architecture

45

Inventory – Requirements

Requirement / Challenge / MongoDB

• Single view of inventory
– Challenge: Ensure availability of inventory information on all channels and services
– MongoDB: Developer-friendly, document-oriented storage

• High volume, low latency reads
– Challenge: Anytime, anywhere access to inventory data without overloading the system of record
– MongoDB: Fast, indexed reads; local reads; horizontal scaling

• Bulk updates, intra-day deltas
– Challenge: Provide window-in-time consistency for highly available services
– MongoDB: Bulk writes; fast, in-place updates; horizontal scaling

• Rapid application development cycles
– Challenge: Deliver new services rapidly to capture new opportunities
– MongoDB: Flexible schema; rich query language; agile-friendly iterations

Page 40: Retail Reference Architecture

46

Inventory – Target Architecture

Relational DB (System of Records)

Analytics, Aggregations,

Reports

Field Inventory

Internal & External Apps

Inventory

Assortments

Shipments

Audits

Products

Stores

Point-in-time Loads

Nightly Refresh

Real-time Updates

Page 41: Retail Reference Architecture

47

Horizontal Scaling

Inventory – Technical Decisions

Store

Inventory

Schema

Indexing

Page 42: Retail Reference Architecture

48

Inventory – Collections

Stores, Inventory, Products

Audits, Assortments, Shipments

Page 43: Retail Reference Architecture

49

Stores – Sample Document

> db.stores.findOne()
{
  "_id" : ObjectId("53549fd3e4b0aaf5d6d07f35"),
  "className" : "catalog.Store",
  "storeId" : "store0",
  "name" : "Bessemer store",
  "address" : {
    "addr1" : "1st Main St",
    "city" : "Bessemer",
    "state" : "AL",
    "zip" : "12345",
    "country" : "US"
  },
  "location" : [ -86.95444, 33.40178 ],
  ...
}

Page 44: Retail Reference Architecture

50

Stores – Sample Queries

• Get a store by storeId

db.stores.find({ "storeId" : "store0" })

• Get a store by zip code

db.stores.find({ "address.zip" : "12345" })

Page 45: Retail Reference Architecture

51

What’s near me?

Page 46: Retail Reference Architecture

52

Stores – Sample Geo Queries

• Get nearby stores sorted by distance

db.runCommand({ geoNear : "stores", near : { type : "Point", coordinates : [-82.8006, 40.0908] }, maxDistance : 10000.0, spherical : true })

Page 47: Retail Reference Architecture

53

Stores – Sample Geo Queries

• Get the five nearest stores within 10 km

db.stores.find({ location : { $near : { $geometry : { type : "Point", coordinates : [-82.80, 40.09] }, $maxDistance : 10000 } } }).limit(5)
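
To make the units concrete: with GeoJSON points, `$maxDistance` is in meters and `location` is stored as [longitude, latitude]. The plain JavaScript below sketches what "the five nearest stores within 10 km" computes, using the haversine formula on a sphere. It is an illustration only, not how the server evaluates the query; the store data, the helper names and the Earth radius constant are assumptions.

```javascript
const EARTH_RADIUS_M = 6371000; // mean Earth radius in meters (approximation)

// Great-circle distance in meters between two [longitude, latitude] points.
function haversineMeters([lon1, lat1], [lon2, lat2]) {
  const toRad = (deg) => (deg * Math.PI) / 180;
  const dLat = toRad(lat2 - lat1);
  const dLon = toRad(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a));
}

// Keep stores within maxDistance meters of the point, nearest first, at most `limit`.
function nearestStores(stores, point, maxDistance, limit) {
  return stores
    .map((s) => ({ ...s, distance: haversineMeters(point, s.location) }))
    .filter((s) => s.distance <= maxDistance)
    .sort((a, b) => a.distance - b.distance)
    .slice(0, limit);
}

// Hypothetical sample data: one store about 1 km away, one several hundred km away.
const sampleStores = [
  { storeId: "storeA", location: [-82.8, 40.1] },
  { storeId: "store0", location: [-86.95444, 33.40178] },
];
const nearby = nearestStores(sampleStores, [-82.8, 40.09], 10000, 5);
console.log(nearby.map((s) => s.storeId));
```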

Page 48: Retail Reference Architecture

54

Stores – Indices

• { "storeId" : 1 }

• { "name" : 1 }

• { "address.zip" : 1 }

• { "location" : "2dsphere" }

Page 49: Retail Reference Architecture

55

Inventory – Sample Document

> db.inventory.findOne()
{
  "_id": "5354869f300487d20b2b011d",
  "storeId": "store0",
  "location": [-86.95444, 33.40178],
  "productId": "p0",
  "vars": [
    { "sku": "sku1", "q": 14 },
    { "sku": "sku3", "q": 7 },
    { "sku": "sku7", "q": 32 },
    { "sku": "sku14", "q": 65 },
    ...
  ]
}

Page 50: Retail Reference Architecture

56

Inventory – Sample Queries

• Get all items in a store

db.inventory.find({ storeId : "store100" })

• Get quantity for an item at a store

db.inventory.find({ "storeId" : "store100", "productId" : "p200" })

Page 51: Retail Reference Architecture

57

Inventory – Sample Queries

• Get quantity for a sku at a store

db.inventory.find( { "storeId" : "store100", "productId" : "p200", "vars.sku" : "sku11736" }, { "vars.$" : 1 } )

Page 52: Retail Reference Architecture

58

Inventory – Sample Update

• Increment / decrement inventory for an item at a store

db.inventory.update( { "storeId" : "store100", "productId" : "p200", "vars.sku" : "sku11736" }, { "$inc" : { "vars.$.q" : 20 } } )
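
When decrementing, a natural refinement is to refuse the update if the stock would go negative. The sketch below builds the filter and update documents for such a guarded decrement; it is an assumption about how one might extend the slide's update, not something the deck shows. `buildDecrement` is a hypothetical helper; `$elemMatch`, `$gte`, `$inc` and the positional `$` are standard MongoDB operators.

```javascript
// Build a guarded decrement: the filter only matches when the SKU has at least
// `qty` units, so the $inc can never push the quantity below zero.
function buildDecrement(storeId, productId, sku, qty) {
  if (!Number.isInteger(qty) || qty <= 0) {
    throw new RangeError("qty must be a positive integer");
  }
  return {
    filter: {
      storeId: storeId,
      productId: productId,
      // the same array element must satisfy both conditions
      vars: { $elemMatch: { sku: sku, q: { $gte: qty } } },
    },
    // the positional $ refers to the element matched by $elemMatch
    update: { $inc: { "vars.$.q": -qty } },
  };
}

const op = buildDecrement("store100", "p200", "sku11736", 2);
console.log(JSON.stringify(op));
```

In the shell this would be run as `db.inventory.update(op.filter, op.update)`; if no document matches, there was not enough stock and the application can react accordingly.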

Page 53: Retail Reference Architecture

59

Inventory – Sample Aggregations

• Aggregate total quantity for a product

db.inventory.aggregate( [ { $match : { productId : "p200" } }, { $unwind : "$vars" }, { $group : { _id : "result", count : { $sum : "$vars.q" } } } ] )

{ "_id" : "result", "count" : 101752 }

Page 54: Retail Reference Architecture

60

Inventory – Sample Aggregations

• Count the SKUs in stock at a store

db.inventory.aggregate( [ { $match : { storeId : "store100" } }, { $unwind : "$vars" }, { $match : { "vars.q" : { $gt : 0 } } }, { $group : { _id : "result", count : { $sum : 1 } } } ] )

{ "_id" : "result", "count" : 29347 }

Page 55: Retail Reference Architecture

61

Inventory – Sample Aggregations

• Aggregate total quantity for a store

db.inventory.aggregate( [ { $match : { storeId : "store100" } }, { $unwind : "$vars" }, { $group : { _id : "result", count : { $sum : "$vars.q" } } } ] )

{ "_id" : "result", "count" : 29347 }
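
To see what the `$match` / `$unwind` / `$group` pipelines above compute, here is the same logic in plain JavaScript over a few hypothetical inventory documents. It only illustrates the semantics; the sample data and function names are assumptions, and real workloads would run the pipeline on the server.

```javascript
// Hypothetical sample documents shaped like the inventory collection.
const inventory = [
  { storeId: "store100", productId: "p200", vars: [{ sku: "sku1", q: 14 }, { sku: "sku3", q: 0 }] },
  { storeId: "store100", productId: "p201", vars: [{ sku: "sku7", q: 32 }] },
  { storeId: "store101", productId: "p200", vars: [{ sku: "sku1", q: 5 }] },
];

// $match followed by $unwind: "$vars" yields one entry per variant of the matching documents.
function unwindVars(docs, match) {
  return docs.filter(match).flatMap((d) => d.vars);
}

// $group with { $sum: "$vars.q" }: total quantity.
function totalQuantity(docs, match) {
  return unwindVars(docs, match).reduce((sum, v) => sum + v.q, 0);
}

// $match on "vars.q" > 0, then $group with { $sum: 1 }: number of SKUs in stock.
function skusInStock(docs, match) {
  return unwindVars(docs, match).filter((v) => v.q > 0).length;
}

const productTotal = totalQuantity(inventory, (d) => d.productId === "p200");
const storeTotal = totalQuantity(inventory, (d) => d.storeId === "store100");
const storeSkus = skusInStock(inventory, (d) => d.storeId === "store100");
console.log(productTotal, storeTotal, storeSkus);
```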

Page 56: Retail Reference Architecture

63

Page 57: Retail Reference Architecture

64

Inventory – Sample Geo-Query

• Get inventory for an item near a point

db.runCommand( { geoNear : "inventory", near : { type : "Point", coordinates : [-82.8006, 40.0908] }, maxDistance : 10000.0, spherical : true, limit : 10, query : { "productId" : "p200", "vars.sku" : "sku11736" } } )

Page 58: Retail Reference Architecture

65

Inventory – Sample Geo-Query

• Get closest store with available sku

db.runCommand( { geoNear : "inventory", near : { type : "Point", coordinates : [-82.800672, 40.090844] }, maxDistance : 10000.0, spherical : true, limit : 1, query : { productId : "p200", vars : { $elemMatch : { sku : "sku11736", q : { $gt : 0 } } } } } )

Page 59: Retail Reference Architecture

66

Inventory – Sample Geo-Aggregation

• Get count of inventory for an item near a point

db.inventory.aggregate( [
  { $geoNear: {
      near : { type : "Point", coordinates : [-82.800672, 40.090844] },
      distanceField: "distance",
      maxDistance: 10000.0,
      spherical : true,
      query: { productId : "p200", vars : { $elemMatch : { sku : "sku11736", q : { $gt : 0 } } } },
      includeLocs: "dist.location",
      num: 5 } },
  { $unwind: "$vars" },
  { $match: { "vars.sku" : "sku11736" } },
  { $group: { _id: "result", count: { $sum: "$vars.q" } } }
] )

Page 60: Retail Reference Architecture

67

Inventory – Sample Indices

• { storeId : 1 }

• { productId : 1, storeId : 1 }

• Why not "vars.sku"?
– { productId : 1, storeId : 1, "vars.sku" : 1 }

• { productId : 1, location : "2dsphere" }

Page 61: Retail Reference Architecture

68

Horizontal Scaling

Inventory – Technical Decisions

Store

Inventory

Schema

Indexing

Page 62: Retail Reference Architecture

69

Shard

East

Shard

Central

Shard

West

East DC

Inventory – Sharding Topology

West DC

Central DC

Legacy Inventory

Primary

Primary

Primary

Page 63: Retail Reference Architecture

70

Inventory – Shard Key

• Choose shard key
– { productId : 1, storeId : 1 }

• Set up sharding
– sh.enableSharding("inventoryDB")
– sh.shardCollection( "inventoryDB.inventory", { productId : 1, storeId : 1 } )

Page 64: Retail Reference Architecture

71

Inventory – Shard Tags

• Set up shard tags
– sh.addShardTag("shard0000", "west")
– sh.addShardTag("shard0001", "central")
– sh.addShardTag("shard0002", "east")

• Set up tag ranges
– Add new field: region (tag ranges are defined on shard key fields, so region has to lead the shard key)
– sh.addTagRange("inventoryDB.inventory", { region : 0 }, { region : 100 }, "west" )
– sh.addTagRange("inventoryDB.inventory", { region : 100 }, { region : 200 }, "central" )
– sh.addTagRange("inventoryDB.inventory", { region : 200 }, { region : 300 }, "east" )
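
The tag ranges above treat the lower bound as inclusive and the upper bound as exclusive. A minimal sketch of the resulting mapping from a document's `region` value to a shard tag, useful when deciding which region number to assign to a store; the `TAG_RANGES` table mirrors the three ranges on the slide and `tagForRegion` is a hypothetical helper.

```javascript
// Same bounds as the sh.addTagRange calls: min is inclusive, max is exclusive.
const TAG_RANGES = [
  { min: 0, max: 100, tag: "west" },
  { min: 100, max: 200, tag: "central" },
  { min: 200, max: 300, tag: "east" },
];

// Return the tag whose range contains the region, or null when no range covers it
// (such documents are not pinned to any tagged shard).
function tagForRegion(region) {
  const range = TAG_RANGES.find((r) => region >= r.min && region < r.max);
  return range ? range.tag : null;
}

console.log(tagForRegion(42), tagForRegion(100), tagForRegion(250));
```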

Page 65: Retail Reference Architecture

Insight

Page 66: Retail Reference Architecture

87

Insight

Insight

MongoDB

Advertising metrics

Clickstream

Recommendations

Session Capture

Activity Logging

Geo Tracking

Product Analytics

Customer Insight

Application Logs

Page 67: Retail Reference Architecture

88

Many user activities can be of interest:

• Search

• Product view, like or wish

• Shopping cart add / remove

• Sharing on social network

• Ad impression, Clickstream

Activity Logging – Data of interest

Page 68: Retail Reference Architecture

89

Will be used to compute:

• Product Map (relationships, etc)

• User Preferences

• Recommendations

• Trends …

Activity Logging – Data of interest

Page 69: Retail Reference Architecture

90

Activity logging - Architecture

MongoDB

HVDF API

Activity Logging

User History

External Analytics: Hadoop, Spark, Storm

User Preferences

Recommendations

Trends

Product Map

Apps

Internal Analytics: Aggregation, MR

All user activity is recorded

MongoDB – Hadoop Connector

Personalization

Page 70: Retail Reference Architecture

91

Activity Logging

Page 71: Retail Reference Architecture

92

• Store and manage an incoming stream of data samples
– High arrival rate of data from many sources
– Variable schema of arriving data
– Control retention period of data

• Compute derivative data sets based on these samples
– Aggregations and statistics based on data
– Roll-up data into pre-computed reports and summaries

• Low latency access to up-to-date data (user history)
– Flexible indexing of raw and derived data sets
– Rich querying based on time + meta-data fields in samples

Activity Logging – Problem statement

Page 72: Retail Reference Architecture

93

Activity logging - Requirements

Requirement / MongoDB

• Ingestion of 100ks of writes / sec
– Fast C++ process, multi-threads, multi-locks. Horizontal scaling via sharding. Sequential IO via time partitioning.

• Flexible schema
– Dynamic schema, each document is independent. Data is stored in the same format and size as it is inserted.

• Fast querying on varied fields, sorting
– Secondary B-tree indexes can look up and sort the data in milliseconds.

• Easy clean up of old data
– Deletes are typically as expensive as inserts. Time partitioning gives free deletes.

Page 73: Retail Reference Architecture

94

Activity Logging using HVDF

HVDF (High Volume Data Feed):

• Open source reference implementation of high volume writing with MongoDB https://github.com/10gen-labs/hvdf

• REST API server written in Java with the most popular libraries

• Public project, issues can be logged https://jira.mongodb.org/browse/HVDF

• Can be run as-is, or customized as needed

Page 74: Retail Reference Architecture

95

Feed

High volume data feed architecture

Channel

Sample Sample Sample Sample

Source

Source

Processor

Inline Processing

Batch Processing

Stream Processing

Grouping by Feed and Channel

Sources send samples

Processors generate derivative Channels

Page 75: Retail Reference Architecture

96

HVDF -- High Volume Data Feed engine

HVDF – Reference implementation

REST Service API

Processor Plugins

Inline

Batch

Stream

Channel Data Storage

Raw Channel

Data

Aggregated Rollup T1

Aggregated Rollup T2

Query Processor Streaming spout

Custom Stream Processing Logic

Incoming Sample Stream

POST /feed/channel/data

GET /feed/channel/data?time=XXX&range=YYY

Real-time Queries

Page 76: Retail Reference Architecture

97

{ _id: ObjectId(),

geoCode: 1, // used to localize write operations

sessionId: "2373BB…",

device: { id: "1234",

type: "mobile/iphone",

userAgent: "Chrome/34.0.1847.131"

},

userId: "u123",

type: "VIEW|CART_ADD|CART_REMOVE|ORDER|…", // type of activity

itemId: "301671",

sku: "730223104376",

order: { id: "12520185",

… },

location: [ -86.95444, 33.40178 ],

tags: [ "smartphone", "iphone", … ], // associated tags

timeStamp: new Date("2014/04/01 …")

}

User Activity - Model

Page 77: Retail Reference Architecture

98

Dynamic schema for sample data

Sample 1 { deviceId: XXXX, time: Date(…), type: "VIEW", … }

Channel

Sample 2 { deviceId: XXXX, time: Date(…), type: "CART_ADD", cartId: 123, … }

Sample 3 { deviceId: XXXX, time: Date(…), type: "FB_LIKE" }

Each sample can have

variable fields

Page 78: Retail Reference Architecture

99

Channels are sharded

Shard

Shard

Shard

Shard

Shard

Shard Key: Customer_id

Sample { customer_id: XXXX, time: Date(…), type: "VIEW" }

Channel

You choose how to partition samples

Samples can have dynamic schema

Scale horizontally by adding shards

Each shard is highly available

Page 79: Retail Reference Architecture

100

Channels are time partitioned

Channel

Sample Sample Sample Sample Sample Sample Sample Sample

- 2 days - 1 Day Today

Partitioning keeps indexes manageable

This is where all of the writes

happen

Older partitions are read only for

best possible concurrency

Queries are routed only to needed

partitions

Partition 1 Partition 2 Partition N

Each partition is a separate collection

Efficient and space reclaiming

purging of old data
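
Time partitioning can be sketched as a mapping from a sample's timestamp to a per-day collection name. The naming scheme (`<channel>_<YYYYMMDD>`, UTC days) and both helper names are hypothetical, since the slides do not state how HVDF names its partitions; the point is that a time-bound query only needs the collections for the days it covers, and purging a day is a collection drop.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Name of the partition (collection) holding samples for the UTC day of `date`.
// The "<channel>_<YYYYMMDD>" scheme is an assumption made for this sketch.
function partitionName(channel, date) {
  const day = date.toISOString().slice(0, 10).replace(/-/g, "");
  return channel + "_" + day;
}

// All partitions a query over [from, to] has to visit, oldest first.
function partitionsForRange(channel, from, to) {
  const names = [];
  // round `from` down to the start of its UTC day, then step one day at a time
  let t = Math.floor(from.getTime() / DAY_MS) * DAY_MS;
  for (; t <= to.getTime(); t += DAY_MS) {
    names.push(partitionName(channel, new Date(t)));
  }
  return names;
}

const writeTarget = partitionName("activity", new Date("2014-04-01T10:00:00Z"));
const readTargets = partitionsForRange(
  "activity",
  new Date("2014-03-30T23:00:00Z"),
  new Date("2014-04-01T01:00:00Z")
);
console.log(writeTarget, readTargets);
```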

Page 80: Retail Reference Architecture

101

Dynamic queries on Channels

Channel

Sample Sample Sample Sample

AppApp

App

Indexes

Queries Pipelines Map-Reduce

Create custom indexes on Channels

Use full mongodb query language to access samples

Use mongodb aggregation pipelines to

access samples

Use mongodb inline map-reduce to access samples

Full access to field, text, and geo

indexing

Page 81: Retail Reference Architecture

102

North America - West

North America - East

Europe

Geographically distributed system

Channel

Sample Sample Sample Sample

Source

Source

Source

Source

Source

Source

Sample

Sample

Sample

Sample

Geo shards per location

Clients write to local nodes

Single view of channel available

globally

Page 82: Retail Reference Architecture

103

Insight

Page 83: Retail Reference Architecture

104

Insight – Useful Data

Useful data for better shopping:

• User history (e.g. recently seen products)

• User statistics (e.g. total purchases, visits)

• User interests (e.g. likes videogames and SciFi)

• User social network

Page 84: Retail Reference Architecture

105

Insight – Useful Data

Useful data for selling more:

• Cross-selling: people who bought this item tended to buy these other items (e.g. bought an iPhone, then an iPhone case)

• Up-selling: people who looked at this item eventually bought these items (alternative products that may be better)

Page 85: Retail Reference Architecture

106

• Get the recent activity for a user, to populate the "recently viewed" list

db.activities.find({ userId: "u123", time: { $gt: DATE }}).

sort({ time: -1 }).limit(100)

• Get the recent activity for a product, to populate the "N users bought this in the past N hours" list

db.activities.find({ itemId: "301671", time: { $gt: DATE }}).

sort({ time: -1 }).limit(100)

• Indices: time, userId + time, deviceId + time, itemId + time

• All queries should be time bound, since this is a lot of data!

Insight – User History

Page 86: Retail Reference Architecture

107

• Get the recent number of views, purchases, etc for a user

db.activities.aggregate([
  { $match: { userId: "u123", time: { $gt: DATE } } },
  { $group: { _id: "$type", count: { $sum: 1 } } } ])

• Get the total recent sales for a user

db.activities.aggregate([
  { $match: { userId: "u123", time: { $gt: DATE }, type: "ORDER" } },
  { $group: { _id: "result", count: { $sum: "$totalPrice" } } } ])

• Get the recent number of views, purchases, etc for an item

db.activities.aggregate([
  { $match: { itemId: "301671", time: { $gt: DATE } } },
  { $group: { _id: "$type", count: { $sum: 1 } } } ])

• Those aggregations are very fast, real-time

Insight – User Stats

Page 87: Retail Reference Architecture

108

• Number of activities per unique visitor for the past hour. Calculation of uniques is hard for any system!

db.activities.aggregate([
  { $match: { time: { $gt: NOW-1H } } },
  { $group: { _id: "$userId", count: { $sum: 1 } } } ],
  { allowDiskUse: true })

• The aggregation above can have issues (final grouping on a single shard, result not persisted). Map-Reduce is a better alternative here

var map = function() { emit(this.userId, 1); }
var reduce = function(key, values) { return Array.sum(values); }
db.activities.mapReduce(map, reduce,
  { query: { time: { $gt: NOW-1H } }, out: { replace: "lastHourUniques", sharded: true } })

db.lastHourUniques.find({ _id: "u123" }) // number of activities for a user
db.lastHourUniques.count() // total uniques

Insight – User Stats

Page 88: Retail Reference Architecture

109

User Activity – Items bought together

Time to cross-sell!

Page 89: Retail Reference Architecture

110

Let's simplify each activity recorded as the following:

{ userId: "u123", type: "order", itemId: 2, time: DATE }

{ userId: "u123", type: "order", itemId: 3, time: DATE }

{ userId: "u234", type: "order", itemId: 7, time: DATE }

Calculate items bought by a user with Map Reduce:

- Match activities of type "order" for the past 2 weeks

- map: emit the document by userId

- reduce: push all itemId in a list

- Output looks like { _id: "u123", items: [2, 3, 8] }

User Activity – Items bought together

Page 90: Retail Reference Architecture

111

Then run a 2nd mapreduce job from the previous output to compute the number of occurrences of each item combination:

- query: go over all documents (1 document per userId)

- map: emit every combination of 2 items, starting with lowest itemId

- reduce: sum up the total.

- output looks like { _id: { a: 2, b: 3 } , count: 36 }
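
The two map-reduce passes can be simulated in plain JavaScript to check the logic on a small sample. This is a sketch of the idea only, not the jobs themselves: the sample activities, the function names and the `"a:b"` string key (standing in for the `{ a, b }` compound `_id`) are assumptions.

```javascript
// Hypothetical simplified activities, as in the example documents above.
const activities = [
  { userId: "u123", type: "order", itemId: 2 },
  { userId: "u123", type: "order", itemId: 3 },
  { userId: "u123", type: "order", itemId: 8 },
  { userId: "u234", type: "order", itemId: 3 },
  { userId: "u234", type: "order", itemId: 2 },
  { userId: "u345", type: "order", itemId: 7 },
];

// Pass 1: group ordered items by user (map emits by userId, reduce collects itemIds).
function itemsByUser(docs) {
  const byUser = new Map();
  for (const d of docs) {
    if (d.type !== "order") continue;
    if (!byUser.has(d.userId)) byUser.set(d.userId, new Set());
    byUser.get(d.userId).add(d.itemId);
  }
  return byUser;
}

// Pass 2: for each user, emit every pair of items with the lowest itemId first,
// then sum the occurrences of each pair.
function pairCounts(byUser) {
  const counts = new Map();
  for (const items of byUser.values()) {
    const sorted = [...items].sort((x, y) => x - y);
    for (let i = 0; i < sorted.length; i++) {
      for (let j = i + 1; j < sorted.length; j++) {
        const key = sorted[i] + ":" + sorted[j];
        counts.set(key, (counts.get(key) || 0) + 1);
      }
    }
  }
  return counts;
}

const combos = pairCounts(itemsByUser(activities));
console.log([...combos.entries()]);
```

Ordering each pair with the lowest itemId first is what keeps (2, 3) and (3, 2) from being counted as two different combinations.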

User Activity – Items bought together

Page 91: Retail Reference Architecture

112

Then obtain the most popular combinations per item:

- Indexes created on { "_id.a": 1, count: 1 } and { "_id.b": 1, count: 1 }

- Query with a threshold (the item can appear on either side of the pair):

- db.combinations.find( { "_id.a": 2, count: { $gt: 10 } } ).sort({ count: -1 })

- db.combinations.find( { "_id.b": 2, count: { $gt: 10 } } ).sort({ count: -1 })

Later we can create a more compact recommendation collection that includes popular combinations with weights, like:

{ itemId: 2, recom: [ { itemId: 32, weight: 36},

{ itemId: 158, weight: 23}, … ] }

User Activity – Items bought together

Page 92: Retail Reference Architecture

113

User Activity – Hadoop integration

EDW

Management & Monitoring

Security & Auditing

RDBMS

CRM, ERP, Collaboration, Mobile, BI

OS & Virtualization, Compute, Storage, Network

RDBMS

Applications

Infrastructure

Data Management

Operational Analytical

Page 93: Retail Reference Architecture

114

Commerce

Applications powered by

• Products & Inventory
• Recommended products
• Customer profile
• Session management

Analysis powered by

• Elastic pricing
• Recommendation models
• Predictive analytics
• Clickstream history

MongoDB Connector for Hadoop

Page 94: Retail Reference Architecture

115

Connector Overview

Data

Read/Write MongoDB

Read/Write BSON

Tools

MapReduce

Pig

Hive

Spark

Platforms

Apache Hadoop

Cloudera CDH

Hortonworks HDP

Amazon EMR

Page 95: Retail Reference Architecture

116

Connector Features and Functionality

• Open-source on github https://github.com/mongodb/mongo-hadoop

• Computes splits to read data
– Single Node, Replica Sets, Sharded Clusters

• Mappings for Pig and Hive
– MongoDB as a standard data source/destination

• Support for
– Filtering data with MongoDB queries
– Authentication
– Reading from Replica Set tags
– Appending to existing collections

Page 96: Retail Reference Architecture

117

MapReduce Configuration

• MongoDB input

– mongo.job.input.format = com.mongodb.hadoop.MongoInputFormat

– mongo.input.uri = mongodb://mydb:27017/db1.collection1

• MongoDB output

– mongo.job.output.format = com.mongodb.hadoop.MongoOutputFormat

– mongo.output.uri = mongodb://mydb:27017/db1.collection2

• BSON input/output

– mongo.job.input.format = com.mongodb.hadoop.BSONFileInputFormat

– mapred.input.dir = hdfs:///tmp/database.bson

– mongo.job.output.format = com.mongodb.hadoop.BSONFileOutputFormat

– mapred.output.dir = hdfs:///tmp/output.bson

Page 97: Retail Reference Architecture

118

Pig Mappings

• Input: BSONLoader and MongoLoader

data = LOAD 'mongodb://mydb:27017/db.collection' USING com.mongodb.hadoop.pig.MongoLoader;

• Output: BSONStorage and MongoInsertStorage

STORE records INTO 'hdfs:///output.bson' USING com.mongodb.hadoop.pig.BSONStorage;

Page 98: Retail Reference Architecture

119

Hive Support

CREATE TABLE mongo_users (id int, name string, age int)
STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
WITH SERDEPROPERTIES("mongo.columns.mapping" = "_id,name,age")
TBLPROPERTIES("mongo.uri" = "mongodb://host:27017/test.users")

• Access collections as Hive tables

• Use with MongoStorageHandler or BSONStorageHandler

Page 99: Retail Reference Architecture

Thank You!

Antoine Girbal, Principal Solutions Engineer, MongoDB Inc., @antoinegirbal

Page 100: Retail Reference Architecture