Retail Reference Architecture

Retail Reference Architecture with MongoDB

Antoine Girbal, Principal Solutions Engineer, MongoDB Inc. (@antoinegirbal)

Introduction

4

• Retail is too broad to tackle with one solution

• Retail data maps well to the document model

• Retail needs agility, performance and scaling

• Many (e)retailers are already using MongoDB

• Let's define the best ways and places for it!

Retail solution

5

• Holds complex JSON structures

• Dynamic Schema for Agility

• complex querying and in-place updating

• Secondary, compound and geo indexing

• full consistency, durability, atomic operations

• Near linear scaling via sharding

• Overall, MongoDB is a unique fit!

MongoDB is a great fit

6

MongoDB Strategic Advantages

Horizontally Scalable - Sharding

Agile, Flexible

High Performance & Strong Consistency

Application

Highly Available - Replica Sets

{ customer: "roger", date: new Date(), comment: "Spirited Away", tags: ["Tezuka", "Manga"] }

7

build your data to fit your application

Relational:

CustomerID  First Name  Last Name  City
0           John        Doe        New York
1           Mark        Smith      San Francisco
2           Jay         Black      Newark
3           Meagan      White      London
4           Edward      Danields   Boston

Order Number  Store ID  Product       Customer ID
10            100       Tablet        0
11            101       Smartphone    0
12            101       Dishwasher    0
13            200       Sofa          1
14            200       Coffee table  1
15            201       Suit          2

MongoDB:

{ customer_id: 1,
  name: "Mark Smith",
  city: "San Francisco",
  orders: [
    { order_number: 13,
      store_id: 10,
      date: "2014-01-03",
      products: [
        { SKU: 24578234, Qty: 3, Unit_price: 350 },
        { SKU: 98762345, Qty: 1, Unit_price: 110 }
      ]
    },
    { <...> }
  ]
}

8

Notions

RDBMS MongoDB

Database Database

Table Collection

Row Document

Column Field

Retail Components Overview

10

Information Management

Merchandising

Content

Inventory

Customer

Channel

Sales & Fulfillment

Insight

Social

Architecture Overview

Customer

Channels: Amazon, Ebay, …

Stores: POS, Kiosk

Mobile: Smartphone, Tablet

Website

Contact Center

API

Data and Service Integration

Social: Facebook, Twitter, …

Data Warehouse

Analytics

Supply Chain Management System

Suppliers: 3rd Party, In Network

Web Servers

Application Servers

11

Commerce Functional Components

Information Layer

Look & Feel

Navigation

Customization

Personalization

Branding

Promotions

Chat

Ads

Customer's Perspective

Research: Browse, Search

Select: Shopping Cart

Purchase: Checkout

Receive: Track

Use: Feedback, Maintain

Dialog: Assist

Market / Offer

Guide

Offer

Semantic Search

Recommend

Rule-based Decisions

Pricing

Coupons

Sell / Fulfill

Orders

Payments

Fraud Detection

Fulfillment

Business Rules

Insight: Session Capture, Activity Monitoring

Customer Enterprise

Information Management

Merchandising

Content

Inventory

Customer

Channel

Sales & Fulfillment

Insight

Social

Merchandising

13

Merchandising (on MongoDB): Variant, Hierarchy, Pricing, Promotions, Ratings & Reviews, Calendar, Semantic Search, Item, Localization

14

• Single view of a product, one central catalog service

• Read volume high and sustained, 100k reads / s

• Write volume spikes up during catalog update

• Advanced indexing and querying

• Geographical distribution and low latency

• No need for a cache layer, CDN for assets

Merchandising - principles

15

Merchandising - requirements

Requirement | Example Challenge | MongoDB

Single view of product | Blended description and hierarchy of product to ensure availability on all channels | Flexible document-oriented storage

High sustained read volume with low latency | Constant querying from online users and sales associates, requiring immediate response | Fast indexed querying, replication allows local copy of catalog, sharding for scaling

Spiky and real-time write volume | Bulk update of full catalog without impacting production, real-time touch update | Fast in-place updating, real-time indexing, sharding for scaling

Advanced querying | Find product based on color, size, description | Ad-hoc querying on any field, advanced secondary and compound indexing

16

Merchandising - Product Page

Product images

General Information

List of Variants

External Information

Localized Description

17

> db.item.findOne()

{ _id: "301671", // main item id

department: "Shoes",

category: "Shoes/Women/Pumps",

brand: "Guess",

thumbnail: "http://cdn…/pump.jpg",

image: "http://cdn…/pump1.jpg", // larger version of thumbnail

title: "Evening Platform Pumps",

description: "Those evening platform pumps put the perfect finishing touches on your most glamourous night-on-the-town outfit",

shortDescription: "Evening Platform Pumps",

style: "Designer",

type: "Platform",

rating: 4.5, // user rating

lastUpdated: ISODate("2014-04-01T00:00:00Z"), // last update time

… }

Merchandising - Item Model

18

• Get item by id

db.item.findOne( { _id: "301671" } )

• Get items from product ids

db.item.find( { _id: { $in: ["301671", "301672"] } } )

• Get items by department

db.item.find( { department: "Shoes" } )

• Get items by category prefix

db.item.find( { category: /^Shoes\/Women/ } )

• Indices

_id (default), department, category, lastUpdated

Merchandising - Item Definition
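
The category prefix query above depends on an anchored regular expression; a case-sensitive regex that starts with ^ lets MongoDB use the index on category. A minimal sketch of building such a filter safely from a user-supplied path follows. The helper name categoryPrefixRegex is hypothetical and not part of the original slides.

```javascript
// Hypothetical helper (not from the slides): build an anchored prefix regex
// for a category path, escaping any regex metacharacters in the path.
function categoryPrefixRegex(path) {
  const escaped = path.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp("^" + escaped);
}

// The resulting filter could be passed to db.item.find(...) in the shell.
const categoryFilter = { category: categoryPrefixRegex("Shoes/Women") };
console.log(categoryFilter.category.test("Shoes/Women/Pumps")); // true
console.log(categoryFilter.category.test("Kids/Shoes/Women"));  // false
```

Escaping matters when category names contain characters such as "." or "+", which would otherwise be read as regex operators.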

19

> db.variant.findOne()

{

_id: "730223104376", // the sku

itemId: "301671", // references item id

thumbnail: "http://cdn…/pump-red.jpg", // variant specific

image: "http://cdn…/pump-red.jpg",

size: 6.0,

color: "Red",

width: "B",

heelHeight: 5.0,

lastUpdated: ISODate("2014-04-01T00:00:00Z"), // last update time

}

Merchandising – Variant Model

20

• Get variant from SKU

db.variant.find( { _id: "730223104376" } )

• Get all variants for an item, sorted by SKU

db.variant.find( { itemId: "301671" } ).sort( { _id: 1 } )

• Indices

itemId, lastUpdated

Merchandising – Variant Model

22

Per-store pricing could result in billions of documents, unless you build it in a modular way

Price: {

_id: "sku730223104376_store123",

currency: "USD",

price: 89.95,

lastUpdated: ISODate("2014-04-01T00:00:00Z"), // last update time

}

_id: concatenation of item and store.

Item: can be an item id or sku

Store: can be a store group or store id.

Indices: lastUpdated

Merchandising – per store Pricing

23

• Get all prices for a given item

db.prices.find( { _id: /^p301671_/ } )

• Get all prices for a given sku (price could be at item level)

db.prices.find( { _id: { $in: [ /^sku730223104376_/, /^p301671_/ ] } } )

• Get minimum and maximum prices for a sku

db.prices.aggregate( [ { $match: { _id: /^sku730223104376_/ } },
  { $group: { _id: 1, min: { $min: "$price" }, max: { $max: "$price" } } } ] )

• Get price for a sku and store id (returns up to 4 prices)

db.prices.find( { _id: { $in: [ "sku730223104376_store1234",
  "sku730223104376_sgroup0",
  "p301671_store1234",
  "p301671_sgroup0" ] } }, { price: 1 } )

Merchandising – per store Pricing
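
Because a price can live at the sku or item level and at the store or store-group level, the application has to decide which of the (up to four) returned prices applies. The sketch below builds the candidate _id list and picks the first one that exists. The precedence order (sku before item, store before store group) and the helper names are assumptions for illustration; the slides do not state them.

```javascript
// Hypothetical helper: candidate price _ids for one sku at one store,
// ordered from most specific to least specific (assumed precedence).
function priceIdCandidates(sku, itemId, storeId, storeGroup) {
  return [
    "sku" + sku + "_" + storeId,
    "sku" + sku + "_" + storeGroup,
    "p" + itemId + "_" + storeId,
    "p" + itemId + "_" + storeGroup,
  ];
}

// Pick the most specific price among the documents returned by the $in query.
function pickPrice(candidates, docs) {
  const byId = new Map(docs.map((d) => [d._id, d]));
  for (const id of candidates) {
    if (byId.has(id)) return byId.get(id);
  }
  return null; // no price defined at any level
}

const candidates = priceIdCandidates("730223104376", "301671", "store1234", "sgroup0");
// In the shell: db.prices.find({ _id: { $in: candidates } }, { price: 1 })
const returned = [
  { _id: "p301671_sgroup0", price: 99.95 },
  { _id: "sku730223104376_sgroup0", price: 89.95 },
];
console.log(pickPrice(candidates, returned)); // the sku-level group price
```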

26

Merchandising – Browse and Search products

Browse by category

Special Lists

Filter by attributes

Lists hundreds of item summaries

Ideally a single query is issued to the database to obtain all items and metadata to display

27

The previous page presents many challenges:

• Response within milliseconds for hundreds of items

• Faceted search on many attributes: category, brand, …

• Attributes at the variant level: color, size, etc, and the variation's image should be shown

• thousands of variants for an item, need to de-duplicate

• Efficient sorting on several attributes: price, popularity

• Pagination feature which requires deterministic ordering

Merchandising – Browse and Search products

28

Merchandising – Browse and Search products

Hundreds of sizes

One Item

Dozens of colors

A single item may have thousands of variants

29

Merchandising – Browse and Search products

Images of the matching variants are displayed

Hierarchy

Sort parameter

Faceted Search

30

Merchandising – Traditional Architecture

Relational DB (System of Records)

Full Text Search Engine

Indexing

#1 obtain search results IDs

Application, Cache

#2 obtain objects by ID

Pre-joined into objects

31

Issues with the traditional architecture:

• 3 different systems to maintain: RDBMS, Search engine, Caching layer

• search returns a list of IDs to be looked up in the cache, increases latency of response

• RDBMS schema is complex and static

• The search index is expensive to update

• Setup does not allow efficient pagination

Merchandising – Traditional Architecture

32

MongoDB Data Store

Merchandising - Architecture

Summaries, Items, Pricing, Promotions, Variants, Ratings & Reviews

#1 Obtain results

33

The summary relies on the following parameters:

• department e.g. "Shoes"

• An indexed attribute

– Category path, e.g. "Shoes/Women/Pumps"

– Price range

– List of Item Attributes, e.g. Brand = Guess

– List of Variant Attributes, e.g. Color = red

• A non-indexed attribute

– List of Item Secondary Attributes, e.g. Style = Designer

– List of Variant Secondary Attributes, e.g. heel height = 4.0

• Sorting, e.g. Price Low to High

Merchandising – Summary Model

34

> db.summaries.findOne()

{ "_id": "p39",

"title": "Evening Platform Pumps 39",

"department": "Shoes", "category": "Shoes/Women/Pumps",

"thumbnail": "http://cdn…/pump-small-39.jpg", "image": "http://cdn…/pump-39.jpg",

"price": 145.99,

"rating": 0.95,

"attrs": [ { "brand" : "Guess"}, … ],

"sattrs": [ { "style" : "Designer"} , { "type" : "Platform"}, …],

"vars": [

{ "sku": "sku2441",

"thumbnail": "http://cdn…/pump-small-39.jpg.Blue",

"image": "http://cdn…/pump-39.jpg.Blue",

"attrs": [ { "size": 6.0 }, { "color": "Blue" }, …],

"sattrs": [ { "width" : "B"} , { "heelHeight" : 5.0 }, …],

}, … Many more skus …

] }

Merchandising – Summary Model

35

• Get summary from item id

db.summaries.find( { _id: "p301671" } )

• Get summary's specific variation from SKU

db.summaries.find( { "vars.sku": "730223104376" }, { "vars.$": 1 } )

• Get summary by department, sorted by rating

db.summaries.find( { department: "Shoes" } ).sort( { rating: 1 } )

• Get summary with mix of parameters

db.summaries.find( { department: "Shoes",
  "vars.attrs": { color: "Gray" },
  category: /^Shoes\/Women/,
  price: { $gte: 65.99, $lte: 180.99 } } )

Merchandising - Summary Model

36

Merchandising – Summary Model

• The following indices are used:

– department + attrs + category + _id

– department + vars.attrs + category + _id

– department + category + _id

– department + price + _id

– department + rating + _id

• _id used for pagination

• Can take advantage of index intersection

• With several attributes specified (e.g. color=red and size=6), which one is looked up?
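
Since every index ends with _id, pagination can be made deterministic by sorting on the chosen field and then on _id, and by resuming after the last document seen instead of using skip. The following is a minimal sketch of that idea. The function nextPageQuery and its parameters are hypothetical, and it assumes the base filter does not already contain a top-level $or.

```javascript
// Hypothetical helper: build the filter and sort for the page that follows
// the last summary seen, using (sortField, _id) as a deterministic order.
function nextPageQuery(baseFilter, sortField, last) {
  const filter = Object.assign({}, baseFilter);
  if (last) {
    filter.$or = [
      // strictly after the last value of the sort field...
      { [sortField]: { $gt: last[sortField] } },
      // ...or same value, with _id as the tie-breaker
      { [sortField]: last[sortField], _id: { $gt: last._id } },
    ];
  }
  return { filter: filter, sort: { [sortField]: 1, _id: 1 } };
}

const page2 = nextPageQuery({ department: "Shoes" }, "price", { _id: "p39", price: 145.99 });
// In the shell: db.summaries.find(page2.filter).sort(page2.sort).limit(100)
console.log(JSON.stringify(page2));
```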

37

Facet samples:

{ "_id" : "Accessory Type=Hosiery" , "count" : 14}

{ "_id" : "Ladder Material=Steel" , "count" : 2}

{ "_id" : "Gold Karat=14k" , "count" : 10138}

{ "_id" : "Stone Color=Clear" , "count" : 1648}

{ "_id" : "Metal=White gold" , "count" : 10852}

Single operations to insert / update:

db.facet.update( { _id: "Accessory Type=Hosiery" },

{ $inc: { count: 1 } }, true, false)

The facet with lowest count is the most restrictive…

It should come first in the query!

Merchandising – Facet
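
The rule that the lowest-count facet should come first can be sketched as a small sort over the requested facets. The function name orderBySelectivity is hypothetical, and treating unknown facets as least restrictive is an assumption; the counts reuse the samples shown on the facet slide.

```javascript
// Hypothetical helper: order requested facets so that the most restrictive
// one (lowest count in the facet collection) comes first.
function orderBySelectivity(requested, facetCounts) {
  const countOf = (f) => (facetCounts.has(f) ? facetCounts.get(f) : Infinity);
  return requested.slice().sort((a, b) => {
    const ca = countOf(a);
    const cb = countOf(b);
    return ca < cb ? -1 : ca > cb ? 1 : 0;
  });
}

// Counts as they might be loaded from db.facet
const facetCounts = new Map([
  ["Gold Karat=14k", 10138],
  ["Stone Color=Clear", 1648],
  ["Metal=White gold", 10852],
]);
console.log(orderBySelectivity(
  ["Metal=White gold", "Gold Karat=14k", "Stone Color=Clear"], facetCounts));
// [ 'Stone Color=Clear', 'Gold Karat=14k', 'Metal=White gold' ]
```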

38

Merchandising – Query stats

Department  Category  Price  Primary attribute  Average time (ms)  90th (ms)  95th (ms)
1           0         0      0                  2                  3          3
1           1         0      0                  1                  2          2
1           0         1      0                  1                  2          3
1           1         1      0                  1                  2          2
1           0         0      1                  0                  1          2
1           1         0      1                  0                  1          1
1           0         1      1                  1                  2          2
1           1         1      1                  0                  1          1
1           0         0      2                  1                  3          3
1           1         0      2                  0                  2          2
1           0         1      2                  10                 20         35
1           1         1      2                  0                  1          1

Inventory

42

Inventory – Traditional Architecture

Relational DB (System of Records)

Nightly Batches

Analytics, Aggregations, Reports

Caching Layer

Field Inventory

Internal & External Apps

Point-in-time Loads

43

Opportunities Missed

• Can't reliably detect availability

• Can't redirect purchasers to in-store pickup

• Can’t do intra-day replenishment

• Degraded customer experience

• Higher internal expense

44

Inventory – Principles

• Single view of the inventory

• Used by most services and channels

• Read dominated workload

• Local, real-time writes

• Bulk writes for refresh

• Geographically distributed

• Horizontally scalable

45

Inventory – Requirements

Requirement | Challenge | MongoDB

Single view of inventory | Ensure availability of inventory information on all channels and services | Developer-friendly, document-oriented storage

High volume, low latency reads | Anytime, anywhere access to inventory data without overloading the system of record | Fast, indexed reads; local reads; horizontal scaling

Bulk updates, intra-day deltas | Provide window-in-time consistency for highly available services | Bulk writes; fast, in-place updates; horizontal scaling

Rapid application development cycles | Deliver new services rapidly to capture new opportunities | Flexible schema; rich query language; agile-friendly iterations

46

Inventory – Target Architecture

Relational DB (System of Records)

Analytics, Aggregations, Reports

Field Inventory

Internal & External Apps

Collections: Inventory, Assortments, Shipments, Audits, Products, Stores

Point-in-time Loads

Nightly Refresh

Real-time Updates

47

Horizontal Scaling

Inventory – Technical Decisions

Store

Inventory

Schema

Indexing

48

Inventory – Collections

Stores, Inventory, Products, Audits, Assortments, Shipments

49

Stores – Sample Document

> db.stores.findOne()
{
  "_id" : ObjectId("53549fd3e4b0aaf5d6d07f35"),
  "className" : "catalog.Store",
  "storeId" : "store0",
  "name" : "Bessemer store",
  "address" : {
    "addr1" : "1st Main St",
    "city" : "Bessemer",
    "state" : "AL",
    "zip" : "12345",
    "country" : "US"
  },
  "location" : [ -86.95444, 33.40178 ],
  ...
}

50

Stores – Sample Queries

• Get a store by storeId

db.stores.find({ "storeId" : "store0" })

• Get a store by zip code

db.stores.find({ "address.zip" : "12345" })

51

What’s near me?

52

Stores – Sample Geo Queries

• Get nearby stores sorted by distance

db.runCommand({ geoNear : "stores", near : { type : "Point", coordinates : [-82.8006, 40.0908] }, maxDistance : 10000.0, spherical : true })

53

Stores – Sample Geo Queries

• Get the five nearest stores within 10 km

db.stores.find({ location : { $near : { $geometry : { type : "Point", coordinates : [-82.80, 40.09] }, $maxDistance : 10000 } } }).limit(5)
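
To make the semantics of the $near query concrete, here is a client-side sketch of the same computation: great-circle distance from a point, a 10 km cutoff, nearest first, at most five results. It is an illustration only, since MongoDB evaluates this server-side with the 2dsphere index. The haversine formula and the mean Earth radius are approximations, and the sample stores are invented.

```javascript
const EARTH_RADIUS_M = 6371000; // mean Earth radius in meters (approximation)

// Great-circle distance between two [longitude, latitude] pairs (haversine).
function distanceMeters([lng1, lat1], [lng2, lat2]) {
  const rad = (deg) => (deg * Math.PI) / 180;
  const dLat = rad(lat2 - lat1);
  const dLng = rad(lng2 - lng1);
  const a = Math.sin(dLat / 2) ** 2 +
    Math.cos(rad(lat1)) * Math.cos(rad(lat2)) * Math.sin(dLng / 2) ** 2;
  return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a));
}

// Stores within maxDistance meters of point, nearest first, at most limit.
function nearestStores(stores, point, maxDistance, limit) {
  return stores
    .map((s) => ({ store: s, distance: distanceMeters(point, s.location) }))
    .filter((r) => r.distance <= maxDistance)
    .sort((a, b) => a.distance - b.distance)
    .slice(0, limit);
}

// Invented sample data; note the [longitude, latitude] order used by MongoDB.
const sampleStores = [
  { storeId: "storeA", location: [-82.80, 40.10] },
  { storeId: "storeB", location: [-82.80, 40.09] },
  { storeId: "storeC", location: [-86.95444, 33.40178] },
];
console.log(nearestStores(sampleStores, [-82.80, 40.09], 10000, 5)
  .map((r) => r.store.storeId));
```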

54

Stores – Indices

• { "storeId" : 1 }

• { "name" : 1 }

• { "address.zip" : 1 }

• { "location" : "2dsphere" }

55

Inventory – Sample Document

> db.inventory.findOne()
{
  "_id": "5354869f300487d20b2b011d",
  "storeId": "store0",
  "location": [-86.95444, 33.40178],
  "productId": "p0",
  "vars": [
    { "sku": "sku1", "q": 14 },
    { "sku": "sku3", "q": 7 },
    { "sku": "sku7", "q": 32 },
    { "sku": "sku14", "q": 65 },
    ...
  ]
}

56

Inventory – Sample Queries

• Get all items in a store

db.inventory.find({ storeId : "store100" })

• Get quantity for an item at a store

db.inventory.find({ "storeId" : "store100", "productId" : "p200" })

57

Inventory – Sample Queries

• Get quantity for a sku at a store

db.inventory.find( { "storeId" : "store100", "productId" : "p200", "vars.sku" : "sku11736" }, { "vars.$" : 1 } )

58

Inventory – Sample Update

• Increment / decrement inventory for an item at a store

db.inventory.update( { "storeId" : "store100", "productId" : "p200", "vars.sku" : "sku11736" }, { "$inc" : { "vars.$.q" : 20 } } )
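
When decrementing, the same single-document update can also guard against selling more than is in stock by requiring enough quantity in the filter. This is a common pattern rather than something the slides prescribe; the helper name decrementStock is hypothetical.

```javascript
// Hypothetical helper: build a guarded decrement. The filter only matches
// when the sku has at least qty units, so the quantity cannot go negative.
function decrementStock(storeId, productId, sku, qty) {
  return {
    filter: {
      storeId: storeId,
      productId: productId,
      vars: { $elemMatch: { sku: sku, q: { $gte: qty } } },
    },
    // "$" refers to the array element matched by $elemMatch above
    update: { $inc: { "vars.$.q": -qty } },
  };
}

const op = decrementStock("store100", "p200", "sku11736", 2);
// In the shell: db.inventory.update(op.filter, op.update)
// If no document was modified, the stock was insufficient.
console.log(JSON.stringify(op));
```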

59

Inventory – Sample Aggregations

• Aggregate total quantity for a product

db.inventory.aggregate( [ { $match : { productId : "p200" } }, { $unwind : "$vars" }, { $group : { _id : "result", count : { $sum : "$vars.q" } } } ] )

{ "_id" : "result", "count" : 101752 }

60

Inventory – Sample Aggregations

• Count the SKUs in stock at a store

db.inventory.aggregate( [ { $match : { storeId : "store100" } }, { $unwind : "$vars" }, { $match : { "vars.q" : { $gt : 0 } } }, { $group : { _id : "result", count : { $sum : 1 } } } ] )

{ "_id" : "result", "count" : 29347 }

61

Inventory – Sample Aggregations

• Aggregate total quantity for a store

db.inventory.aggregate( [ { $match : { storeId : "store100" } }, { $unwind : "$vars" }, { $group : { _id : "result", count : { $sum : "$vars.q" } } } ] )

{ "_id" : "result", "count" : 29347 }

63

64

Inventory – Sample Geo-Query

• Get inventory for an item near a point

db.runCommand( { geoNear : "inventory", near : { type : "Point", coordinates : [-82.8006, 40.0908] }, maxDistance : 10000.0, spherical : true, limit : 10, query : { "productId" : "p200", "vars.sku" : "sku11736" } } )

65

Inventory – Sample Geo-Query

• Get closest store with available sku

db.runCommand( { geoNear : "inventory", near : { type : "Point", coordinates : [-82.800672, 40.090844] }, maxDistance : 10000.0, spherical : true, limit : 1, query : { productId : "p200", vars : { $elemMatch : { sku : "sku11736", q : { $gt : 0 } } } } } )

66

Inventory – Sample Geo-Aggregation

• Get count of inventory for an item near a point

db.inventory.aggregate( [
  { $geoNear: {
      near : { type : "Point", coordinates : [-82.800672, 40.090844] },
      distanceField: "distance",
      maxDistance: 10000.0,
      spherical : true,
      query: { productId : "p200", vars : { $elemMatch : { sku : "sku11736", q : { $gt : 0 } } } },
      includeLocs: "dist.location",
      num: 5 } },
  { $unwind: "$vars" },
  { $match: { "vars.sku" : "sku11736" } },
  { $group: { _id: "result", count: { $sum: "$vars.q" } } } ] )

67

Inventory – Sample Indices

• { storeId : 1 }

• { productId : 1, storeId : 1 }

• Why not "vars.sku"?

– { productId : 1, storeId : 1, "vars.sku" : 1 }

• { productId : 1, location : "2dsphere" }

68

Horizontal Scaling

Inventory – Technical Decisions

Store

Inventory

Schema

Indexing

69

Inventory – Sharding Topology

West DC: Shard West (primary)

Central DC: Shard Central (primary)

East DC: Shard East (primary)

Legacy Inventory

70

Inventory – Shard Key

• Choose shard key

– { productId : 1, storeId : 1 }

• Set up sharding

– sh.enableSharding("inventoryDB")

– sh.shardCollection( "inventoryDB.inventory", { productId : 1, storeId : 1 } )

71

Inventory – Shard Tags

• Set up shard tags

– sh.addShardTag("shard0000", "west")

– sh.addShardTag("shard0001", "central")

– sh.addShardTag("shard0002", "east")

• Set up tag ranges

– Add new field: region

– sh.addTagRange("inventoryDB.inventory", { region : 0 }, { region : 100 }, "west")

– sh.addTagRange("inventoryDB.inventory", { region : 100 }, { region : 200 }, "central")

– sh.addTagRange("inventoryDB.inventory", { region : 200 }, { region : 300 }, "east")
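
Tag ranges include their lower bound and exclude their upper bound, so region 100 belongs to "central", not "west". Tag ranges are also defined over shard key values, which suggests the new region field is meant to lead the shard key; the slides do not spell this out. The sketch below mirrors the three ranges in plain JavaScript to show which tag a given region value falls under; tagForRegion is a hypothetical helper.

```javascript
// The three tag ranges from the slide: min is inclusive, max is exclusive.
const tagRanges = [
  { min: 0, max: 100, tag: "west" },
  { min: 100, max: 200, tag: "central" },
  { min: 200, max: 300, tag: "east" },
];

// Hypothetical helper: which tag (if any) covers a given region value.
function tagForRegion(region) {
  const range = tagRanges.find((r) => region >= r.min && region < r.max);
  return range ? range.tag : null; // null: not covered by any tag range
}

console.log(tagForRegion(42));  // "west"
console.log(tagForRegion(100)); // "central"
console.log(tagForRegion(300)); // null
```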

Insight

87

Insight (on MongoDB): Advertising metrics, Clickstream, Recommendations, Session Capture, Activity Logging, Geo Tracking, Product Analytics, Customer Insight, Application Logs

88

Many user activities can be of interest:

• Search

• Product view, like or wish

• Shopping cart add / remove

• Sharing on social network

• Ad impression, Clickstream

Activity Logging – Data of interest

89

Will be used to compute:

• Product Map (relationships, etc)

• User Preferences

• Recommendations

• Trends …

Activity Logging – Data of interest

90

Activity logging - Architecture

Apps, API, HVDF, MongoDB

Activity Logging, User History

All user activity is recorded

Internal Analytics: Aggregation, MR

External Analytics: Hadoop, Spark, Storm, …

MongoDB – Hadoop Connector

User Preferences, Recommendations, Trends, Product Map

Personalization

91

Activity Logging

92

• Store and manage an incoming stream of data samples

– High arrival rate of data from many sources

– Variable schema of arriving data

– Control retention period of data

• Compute derivative data sets based on these samples

– Aggregations and statistics based on data

– Roll-up data into pre-computed reports and summaries

• Low latency access to up-to-date data (user history)

– Flexible indexing of raw and derived data sets

– Rich querying based on time + meta-data fields in samples

Activity Logging – Problem statement

93

Activity logging - Requirements

Requirement | MongoDB

Ingestion of 100ks of writes / sec | Fast C++ process, multi-threads, multi-locks. Horizontal scaling via sharding. Sequential IO via time partitioning.

Flexible schema | Dynamic schema, each document is independent. Data is stored in the same format and size as it is inserted.

Fast querying on varied fields, sorting | Secondary B-tree indexes can look up and sort the data in milliseconds.

Easy clean up of old data | Deletes are typically as expensive as inserts. Time partitioning makes deletes free.

94

Activity Logging using HVDF

HVDF (High Volume Data Feed):

• Open source reference implementation of high volume writing with MongoDB https://github.com/10gen-labs/hvdf

• REST API server written in Java with most popular libraries

• Public project, issues can be logged https://jira.mongodb.org/browse/HVDF

• Can be run as-is, or customized as needed

95

High volume data feed architecture

Feed, Channel, Sample: grouping by Feed and Channel

Source: sources send samples

Processor (Inline, Batch and Stream Processing): processors generate derivative Channels

96

HVDF -- High Volume Data Feed engine

HVDF – Reference implementation

REST Service API

Processor Plugins

Inline

Batch

Stream

Channel Data Storage

Raw Channel

Data

Aggregated Rollup T1

Aggregated Rollup T2

Query Processor Streaming spout

Custom Stream Processing Logic

Incoming Sample Stream

POST /feed/channel/data

GET /feed/channel/data?time=XXX&range=YYY

Real-time Queries

97

{ _id: ObjectId(),

geoCode: 1, // used to localize write operations

sessionId: "2373BB…",

device: { id: "1234",

type: "mobile/iphone",

userAgent: "Chrome/34.0.1847.131"

},

userId: "u123",

type: "VIEW|CART_ADD|CART_REMOVE|ORDER|…", // type of activity

itemId: "301671",

sku: "730223104376",

order: { id: "12520185",

… },

location: [ -86.95444, 33.40178 ],

tags: [ "smartphone", "iphone", … ], // associated tags

timeStamp: Date("2014/04/01 …")

}

User Activity - Model

98

Dynamic schema for sample data

Sample 1: { deviceId: XXXX, time: Date(…), type: "VIEW", … }

Sample 2: { deviceId: XXXX, time: Date(…), type: "CART_ADD", cartId: 123, … }

Sample 3: { deviceId: XXXX, time: Date(…), type: "FB_LIKE" }

Each sample in a Channel can have variable fields

99

Channels are sharded

Sample: { customer_id: XXXX, time: Date(…), type: "VIEW" }

Shard Key: customer_id

• You choose how to partition samples

• Samples can have dynamic schema

• Scale horizontally by adding shards

• Each shard is highly available

100

Channels are time partitioned

Channel samples are split by time (- 2 days, - 1 day, today) into Partition 1, Partition 2, … Partition N

• Partitioning keeps indexes manageable

• All of the writes happen in the current partition

• Older partitions are read only for best possible concurrency

• Queries are routed only to needed partitions

• Each partition is a separate collection

• Efficient, space-reclaiming purging of old data
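
A minimal sketch of how time partitioning might map samples and queries to collections, assuming one collection per UTC day. The naming scheme (channel_YYYYMMDD) and both helper functions are hypothetical and are not taken from HVDF.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical naming scheme: one collection per channel per UTC day.
function partitionName(channel, date) {
  const day = date.toISOString().slice(0, 10).replace(/-/g, "");
  return channel + "_" + day;
}

// Collections that a query over [from, to] needs to touch, oldest first.
function partitionsForRange(channel, from, to) {
  const names = [];
  let t = Date.UTC(from.getUTCFullYear(), from.getUTCMonth(), from.getUTCDate());
  for (; t <= to.getTime(); t += DAY_MS) {
    names.push(partitionName(channel, new Date(t)));
  }
  return names;
}

// Writes go to the current partition only.
console.log(partitionName("activity", new Date("2014-04-01T10:00:00Z")));
// A query spanning three days is routed to three collections.
console.log(partitionsForRange("activity",
  new Date("2014-04-01T22:00:00Z"), new Date("2014-04-03T01:00:00Z")));
```

Under this scheme, purging a day of data is a single collection drop, which is what makes deletes cheap and reclaims the space.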

101

Dynamic queries on Channels

Apps access Channel samples through indexes, queries, pipelines and map-reduce:

• Create custom indexes on Channels

• Use the full MongoDB query language to access samples

• Use MongoDB aggregation pipelines to access samples

• Use MongoDB inline map-reduce to access samples

• Full access to field, text, and geo indexing

102

Geographically distributed system

Regions: North America - West, North America - East, Europe

• Geo shards per location

• Clients write to local nodes

• Single view of channel available globally

103

Insight

104

Insight – Useful Data

Useful data for better shopping:

• User history (e.g. recently seen products)

• User statistics (e.g. total purchases, visits)

• User interests (e.g. likes videogames and SciFi)

• User social network

105

Insight – Useful Data

Useful data for selling more:

• Cross-selling: people who bought this item tended to buy these other items (e.g. bought an iPhone, then bought an iPhone case)

• Up-selling: people who looked at this item eventually bought these other items (alternative products that may be better)

106

• Get the recent activity for a user, to populate the "recently viewed" list

db.activities.find({ userId: "u123", time: { $gt: DATE }}).

sort({ time: -1 }).limit(100)

• Get the recent activity for a product, to populate the "N users bought this in the past N hours" list

db.activities.find({ itemId: "301671", time: { $gt: DATE }}).

sort({ time: -1 }).limit(100)

• Indices: time, userId + time, deviceId + time, itemId + time

• All queries should be time bound, since this is a lot of data!

Insight – User History

107

• Get the recent number of views, purchases, etc. for a user

db.activities.aggregate([
  { $match: { userId: "u123", time: { $gt: DATE } } },
  { $group: { _id: "$type", count: { $sum: 1 } } } ])

• Get the total recent sales for a user

db.activities.aggregate([
  { $match: { userId: "u123", time: { $gt: DATE }, type: "ORDER" } },
  { $group: { _id: "result", count: { $sum: "$totalPrice" } } } ])

• Get the recent number of views, purchases, etc. for an item

db.activities.aggregate([
  { $match: { itemId: "301671", time: { $gt: DATE } } },
  { $group: { _id: "$type", count: { $sum: 1 } } } ])

• Those aggregations are very fast, real-time

Insight – User Stats

108

• Number of activities per unique visitor for the past hour. Calculating uniques is hard for any system!

var oneHourAgo = new Date(Date.now() - 60 * 60 * 1000)

db.activities.aggregate([
  { $match: { time: { $gt: oneHourAgo } } },
  { $group: { _id: "$userId", count: { $sum: 1 } } } ],
  { allowDiskUse: true })

• The aggregation above can have issues (single shard final grouping, result not persisted). Map Reduce is a better alternative here

var map = function() { emit(this.userId, 1); }
var reduce = function(key, values) { return Array.sum(values); }
db.activities.mapReduce(map, reduce,
  { query: { time: { $gt: oneHourAgo } }, out: { replace: "lastHourUniques", sharded: true } })

db.lastHourUniques.find({ _id: "u123" }) // number of activities for a user
db.lastHourUniques.count() // total uniques

Insight – User Stats

109

User Activity – Items bought together

Time to cross-sell!

110

Let's simplify each activity recorded as the following:

{ userId: "u123", type: "order", itemId: 2, time: DATE }

{ userId: "u123", type: "order", itemId: 3, time: DATE }

{ userId: "u234", type: "order", itemId: 7, time: DATE }

Calculate items bought by a user with Map Reduce:

- Match activities of type "order" for the past 2 weeks

- map: emit the document by userId

- reduce: push all itemId in a list

- Output looks like { _id: "u123", items: [2, 3, 8] }

User Activity – Items bought together

111

Then run a 2nd mapreduce job from the previous output to compute the number of occurrences of each item combination:

- query: go over all documents (1 document per userId)

- map: emit every combination of 2 items, starting with lowest itemId

- reduce: sum up the total.

- output looks like { _id: { a: 2, b: 3 } , count: 36 }

User Activity – Items bought together
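
The two map-reduce passes described above can be sketched in plain JavaScript over an in-memory array, which makes the logic easy to check before running it on a cluster. This is an illustration of the approach, not the MongoDB job itself. The function names, the "a:b" key format and the de-duplication of repeated purchases per user are assumptions.

```javascript
// Pass 1: collect the set of items ordered by each user.
function itemsByUser(activities) {
  const byUser = new Map();
  for (const a of activities) {
    if (a.type !== "order") continue;
    if (!byUser.has(a.userId)) byUser.set(a.userId, new Set());
    byUser.get(a.userId).add(a.itemId);
  }
  return byUser;
}

// Pass 2: count every pair of items bought by the same user,
// always keyed with the lowest itemId first.
function pairCounts(byUser) {
  const counts = new Map();
  for (const items of byUser.values()) {
    const sorted = Array.from(items).sort((x, y) => x - y);
    for (let i = 0; i < sorted.length; i++) {
      for (let j = i + 1; j < sorted.length; j++) {
        const key = sorted[i] + ":" + sorted[j];
        counts.set(key, (counts.get(key) || 0) + 1);
      }
    }
  }
  return counts;
}

// Invented sample activities in the simplified shape shown above.
const sampleActivities = [
  { userId: "u123", type: "order", itemId: 3 },
  { userId: "u123", type: "order", itemId: 2 },
  { userId: "u123", type: "order", itemId: 8 },
  { userId: "u234", type: "order", itemId: 2 },
  { userId: "u234", type: "order", itemId: 3 },
  { userId: "u234", type: "view", itemId: 8 },
];
console.log(pairCounts(itemsByUser(sampleActivities)));
```

Emitting each pair with the lowest itemId first ensures that (2, 3) and (3, 2) are counted as the same combination.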

112

Then obtain the most popular combinations per item:

- Index created on { "_id.a": 1, count: 1 } and { "_id.b": 1, count: 1 }

- Query with a threshold:

- db.combinations.find( { "_id.a": 2, count: { $gt: 10 } } ).sort({ count: -1 })

- db.combinations.find( { "_id.b": 2, count: { $gt: 10 } } ).sort({ count: -1 })

Later we can create a more compact recommendation collection that includes popular combinations with weights, like:

{ itemId: 2, recom: [ { itemId: 32, weight: 36},

{ itemId: 158, weight: 23}, … ] }

User Activity – Items bought together

113

User Activity – Hadoop integration

Applications: CRM, ERP, Collaboration, Mobile, BI

Data Management (Operational, Analytical): RDBMS, EDW

Infrastructure: OS & Virtualization, Compute, Storage, Network

Management & Monitoring

Security & Auditing

114

Commerce

Applications powered by MongoDB

• Products & Inventory

• Recommended products

• Customer profile

• Session management

Analysis powered by Hadoop

• Elastic pricing

• Recommendation models

• Predictive analytics

• Clickstream history

MongoDB Connector for Hadoop

115

Connector Overview

Data

Read/Write MongoDB

Read/Write BSON

Tools

MapReduce

Pig

Hive

Spark

Platforms

Apache Hadoop

Cloudera CDH

Hortonworks HDP

Amazon EMR

116

Connector Features and Functionality

• Open-source on github https://github.com/mongodb/mongo-hadoop

• Computes splits to read data

– Single Node, Replica Sets, Sharded Clusters

• Mappings for Pig and Hive

– MongoDB as a standard data source/destination

• Support for

– Filtering data with MongoDB queries

– Authentication

– Reading from Replica Set tags

– Appending to existing collections

117

MapReduce Configuration

• MongoDB input

– mongo.job.input.format = com.mongodb.hadoop.MongoInputFormat

– mongo.input.uri = mongodb://mydb:27017/db1.collection1

• MongoDB output

– mongo.job.output.format = com.mongodb.hadoop.MongoOutputFormat

– mongo.output.uri = mongodb://mydb:27017/db1.collection2

• BSON input/output

– mongo.job.input.format = com.mongodb.hadoop.BSONFileInputFormat

– mapred.input.dir = hdfs:///tmp/database.bson

– mongo.job.output.format = com.mongodb.hadoop.BSONFileOutputFormat

– mapred.output.dir = hdfs:///tmp/output.bson

118

Pig Mappings

• Input: BSONLoader and MongoLoader

data = LOAD 'mongodb://mydb:27017/db.collection' USING com.mongodb.hadoop.pig.MongoLoader;

• Output: BSONStorage and MongoInsertStorage

STORE records INTO 'hdfs:///output.bson' USING com.mongodb.hadoop.pig.BSONStorage;

119

Hive Support

CREATE TABLE mongo_users (id int, name string, age int)
STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
WITH SERDEPROPERTIES("mongo.columns.mapping" = "_id,name,age")
TBLPROPERTIES("mongo.uri" = "mongodb://host:27017/test.users")

• Access collections as Hive tables

• Use with MongoStorageHandler or BSONStorageHandler

Thank You!
