Retail Reference Architecture

Retail Reference Architecture with MongoDB

Antoine Girbal, Principal Solutions Engineer, MongoDB Inc., @antoinegirbal

Introduction

• Retail is too broad to tackle with one solution

• Retail data maps well to the document model

• Retail needs agility, performance and scaling

• Many (e)retailers are already using MongoDB

• Let's define the best ways and places for it!

Retail solution

• Holds complex JSON structures

• Dynamic Schema for Agility

• complex querying and in-place updating

• Secondary, compound and geo indexing

• full consistency, durability, atomic operations

• Near linear scaling via sharding

• Overall, MongoDB is a unique fit!

MongoDB is a great fit

MongoDB Strategic Advantages

Horizontally Scalable - Sharding

Agile, Flexible

High Performance & Strong Consistency

Application

Highly Available - Replica Sets

{ customer: "roger", date: new Date(), comment: "Spirited Away", tags: ["Tezuka", "Manga"] }

build your data to fit your application

Relational vs. MongoDB

MongoDB document:

{ customer_id : 1,
  name : "Mark Smith",
  city : "San Francisco",
  orders: [ {
    order_number : 13,
    store_id : 10,
    date: "2014-01-03",
    products: [
      { SKU: 24578234, Qty: 3, Unit_price: 350 },
      { SKU: 98762345, Qty: 1, Unit_Price: <...> },
      { <...> }

Relational tables:

CustomerID | First Name | Last Name | City
0 | John | Doe | New York
1 | Mark | Smith | San Francisco
2 | Jay | Black | Newark
3 | Meagan | White | London
4 | Edward | Danields | Boston

Order Number | Store ID | Product | Customer ID
10 | 100 | Tablet | 0
11 | 101 | Smartphone | 0
12 | 101 | Dishwasher | 0
13 | 200 | Sofa | 1
14 | 200 | Coffee table | 1
15 | 201 | Suit | 2

Notions

RDBMS MongoDB

Database Database

Table Collection

Row Document

Column Field

Retail Components Overview

Information Management

Merchandising

Content

Inventory

Customer

Channel

Sales & Fulfillment

Insight

Social

Architecture Overview

Customer

Channels: Amazon, Ebay, …

Stores: POS, Kiosk

Mobile: Smartphone, Tablet

Website

Contact Center

API

Data and Service Integration

Social: Facebook, Twitter, …

Data Warehouse

Analytics

Supply Chain Management

System

Suppliers

3rd Party

In Network

Web Servers

Application Servers

Commerce Functional Components

Information Layer

Look & Feel

Navigation

Customization

Personalization

Branding

Promotions

Customer's Perspective

Research: Browse, Search

Select: Shopping Cart

Purchase: Checkout

Receive: Track

Use: Feedback, Maintain

Dialog: Assist

Market / Offer

Semantic Search

Recommend

Rule-based Decisions

Pricing

Coupons

Sell / Fulfill

Orders

Payments

Fraud Detection

Fulfillment

Business Rules

Insight: Session Capture, Activity Monitoring

Customer Enterprise


Merchandising

MongoDB

Variant

Hierarchy

Pricing

Promotions

Ratings & Reviews

Calendar

Semantic Search

Localization

• Single view of a product, one central catalog service

• Read volume high and sustained, 100k reads / s

• Write volume spikes up during catalog update

• Advanced indexing and querying

• Geographical distribution and low latency

• No need for a cache layer, CDN for assets

Merchandising - principles

Merchandising - requirements

Requirement | Example Challenge | MongoDB

Single view of product | Blended description and hierarchy of product to ensure availability on all channels | Flexible document-oriented storage

High sustained read volume with low latency | Constant querying from online users and sales associates, requiring immediate response | Fast indexed querying, replication allows local copy of catalog, sharding for scaling

Spiky and real-time write volume | Bulk update of full catalog without impacting production, real-time touch update | Fast in-place updating, real-time indexing, sharding for scaling

Advanced querying | Find product based on color, size, description | Ad-hoc querying on any field, advanced secondary and compound indexing

Merchandising - Product Page

Product images

General Information

List of Variants

External Information

Localized Description

> db.item.findOne()

{ _id: "301671", // main item id

department: "Shoes",

category: "Shoes/Women/Pumps",

brand: "Guess",

thumbnail: "http://cdn…/pump.jpg",

image: "http://cdn…/pump1.jpg", // larger version of thumbnail

title: "Evening Platform Pumps",

description: "Those evening platform pumps put the perfect finishing touches on your most glamourous night-on-the-town outfit",

shortDescription: "Evening Platform Pumps",

style: "Designer",

type: "Platform",

rating: 4.5, // user rating

lastUpdated: Date("2014/04/01"), // last update time

Merchandising - Item Model

• Get item by id

db.definition.findOne( { _id: "301671" } )

• Get item from Product Ids

db.definition.find( { _id: { $in: ["301671", "301672" ] } } )

• Get items by department

db.definition.find({ department: "Shoes" })

• Get items by category prefix

db.definition.find( { category: /^Shoes\/Women/ } )

• Indices

productId, department, category, lastUpdated
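The category-prefix query above relies on a regular expression anchored with `^`, which MongoDB can serve from the `category` index. A minimal sketch of how an application might build that filter from user input, in plain JavaScript; the helper names `escapeRegExp` and `categoryPrefixFilter` are hypothetical and not part of the original deck.

```javascript
// Escape regex metacharacters so the category path is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a filter document matching every category that starts with `path`.
// The leading ^ anchors the match at the start of the string (prefix match).
function categoryPrefixFilter(path) {
  return { category: new RegExp("^" + escapeRegExp(path)) };
}

const filter = categoryPrefixFilter("Shoes/Women");
console.log(filter.category.test("Shoes/Women/Pumps"));
console.log(filter.category.test("Men/Shoes/Women"));
```

The resulting object can be passed to `find()` as-is. Appending a trailing "/" to the path avoids matching sibling categories that merely share the same leading characters.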

Merchandising - Item Definition

> db.variant.findOne()

{ _id: "730223104376", // the sku

itemId: "301671", // references item id

thumbnail: "http://cdn…/pump-red.jpg", // variant specific

image: "http://cdn…/pump-red.jpg",

size: 6.0,

color: "Red",

width: "B",

heelHeight: 5.0,

Merchandising – Variant Model

• Get variant from SKU

db.variation.find( { _id: "730223104376" } )

• Get all variants for a product, sorted by SKU

db.variation.find( { itemId: "301671" } ).sort( { _id: 1 } )

• Indices

itemId, lastUpdated

Merchandising – Variant Model

Per store Pricing could result in billions of documents,

unless you build it in a modular way

Price: {

_id: "sku730223104376_store123",

currency: "USD",

price: 89.95,

_id: concatenation of item and store.

Item: can be an item id or sku

Store: can be a store group or store id.

Indices: lastUpdated

Merchandising – per store Pricing

• Get all prices for a given item

db.prices.find( { _id: /^p301671_/ } )

• Get all prices for a given sku (price could be at item level)

db.prices.find( { _id: { $in: [ /^sku730223104376_/, /^p301671_/ ] } } )

• Get minimum and maximum prices for a sku

db.prices.aggregate( [ { $match: { _id: { $in: [ /^sku730223104376_/, /^p301671_/ ] } } }, { $group: { _id: 1, min: { $min: "$price" }, max: { $max: "$price" } } } ] )

• Get price for a sku and store id (returns up to 4 prices)

db.prices.find( { _id: { $in: [ "sku730223104376_store1234", "sku730223104376_sgroup0", "p301671_store1234", "p301671_sgroup0" ] } }, { price: 1 } )

Merchandising – per store Pricing
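The lookup by sku and store id can be generated instead of written by hand. A minimal sketch in plain JavaScript, assuming the `sku…`/`p…` and `store…`/`sgroup…` id conventions from the sample queries. The helper names `candidatePriceIds` and `pickPrice` are hypothetical, and the precedence order (sku before item, store before store group) is an assumption the slides do not state.

```javascript
// Build the candidate price _ids for one sku at one store, most specific first.
// Assumed precedence: sku+store, sku+group, item+store, item+group.
function candidatePriceIds(sku, itemId, storeId, storeGroup) {
  const items = ["sku" + sku, "p" + itemId];
  const stores = ["store" + storeId, "sgroup" + storeGroup];
  const ids = [];
  for (const item of items) {
    for (const store of stores) {
      ids.push(item + "_" + store);
    }
  }
  return ids;
}

// Given the documents returned by db.prices.find({ _id: { $in: ids } }),
// return the most specific one, or null if none matched.
function pickPrice(ids, docs) {
  const byId = new Map(docs.map((d) => [d._id, d]));
  for (const id of ids) {
    if (byId.has(id)) return byId.get(id);
  }
  return null;
}

const ids = candidatePriceIds("730223104376", "301671", "1234", "0");
// Illustrative documents only; the prices are made up for the example.
const found = pickPrice(ids, [
  { _id: "p301671_sgroup0", price: 99.95 },
  { _id: "sku730223104376_sgroup0", price: 89.95 },
]);
console.log(ids, found);
```

Keeping the precedence in application code means the database only has to answer a single `$in` query on `_id`.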

Merchandising – Browse and Search products

Browse by category

Special Lists

Filter by attributes

Lists hundreds of item summaries

Ideally a single query is issued to the database to obtain all items and metadata to display

The previous page presents many challenges:

• Response within milliseconds for hundreds of items

• Faceted search on many attributes: category, brand, …

• Attributes at the variant level: color, size, etc, and the variation's image should be shown

• thousands of variants for an item, need to de-duplicate

• Efficient sorting on several attributes: price, popularity

• Pagination feature which requires deterministic ordering

Hundreds of sizes

One Item

Dozens of colors

A single item may have thousands of variants

Images of the matching variants are displayed

Hierarchy

Sort parameter

Faceted Search

Merchandising – Traditional Architecture

Relational DB: System of Records

Full Text Search Engine

Indexing

#1 obtain search results IDs

Application

Cache

#2 obtain objects by ID

Pre-joined into objects

The traditional architecture issues:

• 3 different systems to maintain: RDBMS, Search engine, Caching layer

• search returns a list of IDs to be looked up in the cache, increases latency of response

• RDBMS schema is complex and static

• The search index is expensive to update

• Setup does not allow efficient pagination

Merchandising – Traditional Architecture

MongoDB Data Store

Merchandising - Architecture

Summaries, Items, Pricing

Promotions, Variants, Ratings & Reviews

#1 Obtain results

The summary relies on the following parameters:

• department e.g. "Shoes"

• An indexed attribute

– Category path, e.g. "Shoes/Women/Pumps"

– Price range

– List of Item Attributes, e.g. Brand = Guess

– List of Variant Attributes, e.g. Color = red

• A non-indexed attribute

– List of Item Secondary Attributes, e.g. Style = Designer

– List of Variant Secondary Attributes, e.g. heel height = 4.0

• Sorting, e.g. Price Low to High

Merchandising – Summary Model

> db.summaries.findOne()

{ "_id": "p39",

"title": "Evening Platform Pumps 39",

"department": "Shoes", "category": "Shoes/Women/Pumps",

"thumbnail": "http://cdn…/pump-small-39.jpg", "image": "http://cdn…/pump-39.jpg",

"price": 145.99,

"rating": 0.95,

"attrs": [ { "brand" : "Guess"}, … ],

"sattrs": [ { "style" : "Designer"} , { "type" : "Platform"}, …],

"vars": [

{ "sku": "sku2441",

"thumbnail": "http://cdn…/pump-small-39.jpg.Blue",

"image": "http://cdn…/pump-39.jpg.Blue",

"attrs": [ { "size": 6.0 }, { "color": "Blue" }, …],

"sattrs": [ { "width" : "B"} , { "heelHeight" : 5.0 }, …],

}, … Many more skus …

• Get summary from item id

db.summaries.find({ _id: "p301671" })

• Get summary's specific variation from SKU

db.summaries.find( { "vars.sku": "730223104376" }, { "vars.$": 1 } )

• Get summary by department, sorted by rating

db.summaries.find( { department: "Shoes" } ).sort( { rating: 1 } )

• Get summary with mix of parameters

db.summaries.find( { department : "Shoes" , "vars.attrs" : { "color" : "Gray"} , "category" : /^Shoes\/Women/ , "price" : { "$gte" : 65.99 , "$lte" : 180.99 } } )

Merchandising - Summary Model

• The following indices are used:

– department + attrs + category + _id

– department + vars.attrs + category + _id

– department + category + _id

– department + price + _id

– department + rating + _id

• _id used for pagination

• Can take advantage of index intersection

• With several attributes specified (e.g. color=red and size=6), which one is looked up?
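Using `_id` as the last index field gives every sort a deterministic tie-breaker, which is what makes pagination stable. The sketch below illustrates the idea in memory with plain JavaScript; `compareByPriceThenId` and `nextPage` are hypothetical names, and the sample summaries are made up for the example.

```javascript
// Compare two summaries by (price, _id); _id breaks ties so the order is total.
function compareByPriceThenId(a, b) {
  if (a.price !== b.price) return a.price - b.price;
  return a._id < b._id ? -1 : a._id > b._id ? 1 : 0;
}

// Return the next page after the last item the client has seen.
// `last` is null for the first page.
function nextPage(summaries, last, pageSize) {
  const sorted = [...summaries].sort(compareByPriceThenId);
  const remaining = last
    ? sorted.filter((s) => compareByPriceThenId(s, last) > 0)
    : sorted;
  return remaining.slice(0, pageSize);
}

const summaries = [
  { _id: "p3", price: 65.99 },
  { _id: "p1", price: 65.99 },
  { _id: "p2", price: 145.99 },
  { _id: "p4", price: 89.0 },
];
const page1 = nextPage(summaries, null, 2);
const page2 = nextPage(summaries, page1[page1.length - 1], 2);
console.log(page1.map((s) => s._id), page2.map((s) => s._id));
```

Against the database, the same "strictly after the last seen item" condition could be expressed as a filter such as `{ $or: [ { price: { $gt: last.price } }, { price: last.price, _id: { $gt: last._id } } ] }` combined with `.sort({ price: 1, _id: 1 }).limit(pageSize)`.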

Facet samples:

{ "_id" : "Accessory Type=Hosiery" , "count" : 14}

{ "_id" : "Ladder Material=Steel" , "count" : 2}

{ "_id" : "Gold Karat=14k" , "count" : 10138}

{ "_id" : "Stone Color=Clear" , "count" : 1648}

{ "_id" : "Metal=White gold" , "count" : 10852}

Single operations to insert / update:

db.facet.update( { _id: "Accessory Type=Hosiery" }, { $inc: { count: 1 } }, true, false)

The facet with lowest count is the most restrictive…

It should come first in the query!
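One way to apply this rule is to look up the facet counts for the requested attributes and order them before building the query. A minimal in-memory sketch in plain JavaScript, using counts from the facet samples; the function name `orderByRestrictiveness` is hypothetical, and treating an unknown facet as least restrictive is an assumption.

```javascript
// Order requested facets so the one with the lowest count comes first.
// `facetCounts` maps a facet _id such as "Metal=White gold" to its count.
// Facets without a known count are treated as least restrictive (assumption).
function orderByRestrictiveness(requested, facetCounts) {
  const countOf = (facet) =>
    facetCounts.has(facet) ? facetCounts.get(facet) : Number.MAX_SAFE_INTEGER;
  return [...requested].sort((a, b) => countOf(a) - countOf(b));
}

const facetCounts = new Map([
  ["Gold Karat=14k", 10138],
  ["Stone Color=Clear", 1648],
  ["Metal=White gold", 10852],
]);
const ordered = orderByRestrictiveness(
  ["Metal=White gold", "Gold Karat=14k", "Stone Color=Clear"],
  facetCounts
);
console.log(ordered);
```

The first element of the result is the attribute to put in the indexed part of the query; the remaining attributes can be applied as additional filters.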

Merchandising – Facet

Merchandising – Query stats

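Department | Category | Price | Primary attribute | Time Average (ms) | 90th (ms) | 95th (ms)
1 | 0 | 0 | 0 | 2 | 3 | 3
1 | 1 | 0 | 0 | 1 | 2 | 2
1 | 0 | 1 | 0 | 1 | 2 | 3
1 | 1 | 1 | 0 | 1 | 2 | 2
1 | 0 | 0 | 1 | 0 | 1 | 2
1 | 1 | 0 | 1 | 0 | 1 | 1
1 | 0 | 1 | 1 | 1 | 2 | 2
1 | 1 | 1 | 1 | 0 | 1 | 1
1 | 0 | 0 | 2 | 1 | 3 | 3
1 | 1 | 0 | 2 | 0 | 2 | 2
1 | 0 | 1 | 2 | 10 | 20 | 35
1 | 1 | 1 | 2 | 0 | 1 | 1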

Inventory

Inventory – Traditional Architecture

Nightly Batches

Analytics, Aggregations,

Reports

Caching Layer

Field Inventory

Internal & External Apps

Point-in-time Loads

Opportunities Missed

• Can’t reliably detect availability

• Can't redirect purchasers to in-store pickup

• Can’t do intra-day replenishment

• Degraded customer experience

• Higher internal expense

Inventory – Principles

• Single view of the inventory

• Used by most services and channels

• Read dominated workload

• Local, real-time writes

• Bulk writes for refresh

• Geographically distributed

• Horizontally scalable

Inventory – Requirements

Requirement | Challenge | MongoDB

Single view of inventory | Ensure availability of inventory information on all channels and services | Developer-friendly, document-oriented storage

High volume, low latency reads | Anytime, anywhere access to inventory data without overloading the system of record | Fast, indexed reads; local reads; horizontal scaling

Bulk updates, intra-day deltas | Provide window-in-time consistency for highly available services | Bulk writes; fast, in-place updates; horizontal scaling

Rapid application development cycles | Deliver new services rapidly to capture new opportunities | Flexible schema; rich query language; agile-friendly iterations

Inventory – Target Architecture

Analytics, Aggregations,

Reports

Field Inventory

Internal & External Apps

Inventory

Assortments

Shipments

Audits

Products

Stores

Point-in-time Loads

Nightly Refresh

Real-time Updates

Horizontal Scaling

Inventory – Technical Decisions

Inventory

Schema

Indexing

Inventory – Collections

Stores, Inventory, Products

Audits, Assortments, Shipments

Stores – Sample Document

> db.stores.findOne()
{
  "_id" : ObjectId("53549fd3e4b0aaf5d6d07f35"),
  "className" : "catalog.Store",
  "storeId" : "store0",
  "name" : "Bessemer store",
  "address" : {
    "addr1" : "1st Main St",
    "city" : "Bessemer",
    "state" : "AL",
    "zip" : "12345",
    "country" : "US"
  },
  "location" : [ -86.95444, 33.40178 ],
  ...
}

Stores – Sample Queries

• Get a store by storeId

db.stores.find({ "storeId" : "store0" })

• Get a store by zip code

db.stores.find({ "address.zip" : "12345" })

What’s near me?

Stores – Sample Geo Queries

• Get nearby stores sorted by distance

db.runCommand({ geoNear : "stores", near : { type : "Point", coordinates : [-82.8006, 40.0908] }, maxDistance : 10000.0, spherical : true })

Stores – Sample Geo Queries

• Get the five nearest stores within 10 km

db.stores.find({ location : { $near : { $geometry : { type : "Point", coordinates : [-82.80, 40.09] }, $maxDistance : 10000 } } }).limit(5)
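With a GeoJSON point, `$maxDistance` is expressed in meters, so 10000 corresponds to 10 km. The sketch below reproduces the "five nearest stores within 10 km" logic in memory, in plain JavaScript, to make the semantics concrete. It uses the haversine formula with a mean Earth radius, which is an approximation and not necessarily what the server computes; the store data and the names `haversineMeters` and `nearestStores` are hypothetical.

```javascript
const EARTH_RADIUS_METERS = 6371000; // mean radius, an approximation

function toRadians(degrees) {
  return (degrees * Math.PI) / 180;
}

// Great-circle distance in meters between two [longitude, latitude] pairs.
function haversineMeters([lon1, lat1], [lon2, lat2]) {
  const dLat = toRadians(lat2 - lat1);
  const dLon = toRadians(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRadians(lat1)) * Math.cos(toRadians(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_METERS * Math.asin(Math.sqrt(a));
}

// Keep stores within maxMeters of the point, nearest first, at most `limit`.
function nearestStores(stores, point, maxMeters, limit) {
  return stores
    .map((s) => ({ storeId: s.storeId, meters: haversineMeters(point, s.location) }))
    .filter((s) => s.meters <= maxMeters)
    .sort((a, b) => a.meters - b.meters)
    .slice(0, limit);
}

const stores = [
  { storeId: "storeB", location: [-82.8, 40.13] },
  { storeId: "storeA", location: [-82.81, 40.09] },
  { storeId: "store0", location: [-86.95444, 33.40178] },
];
const nearby = nearestStores(stores, [-82.8, 40.09], 10000, 5);
console.log(nearby);
```

Note that coordinates are ordered longitude first, then latitude, matching the `location` field in the sample documents.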

Stores – Indices

• { "storeId" : 1 }

• { "name" : 1 }

• { "address.zip" : 1 }

• { "location" : "2dsphere" }

Inventory – Sample Document

> db.inventory.findOne()
{
  "_id": "5354869f300487d20b2b011d",
  "storeId": "store0",
  "location": [-86.95444, 33.40178],
  "productId": "p0",
  "vars": [
    { "sku": "sku1", "q": 14 },
    { "sku": "sku3", "q": 7 },
    { "sku": "sku7", "q": 32 },
    { "sku": "sku14", "q": 65 },
    ...
  ]
}

Inventory – Sample Queries

• Get all items in a store

db.inventory.find({ storeId : "store100" })

• Get quantity for an item at a store

db.inventory.find({ "storeId" : "store100", "productId" : "p200" })

Inventory – Sample Queries

• Get quantity for a sku at a store

db.inventory.find( { "storeId" : "store100", "productId" : "p200", "vars.sku" : "sku11736" }, { "vars.$" : 1 } )

Inventory – Sample Update

• Increment / decrement inventory for an item at a store

db.inventory.update( { "storeId" : "store100", "productId" : "p200", "vars.sku" : "sku11736" }, { "$inc" : { "vars.$.q" : 20 } } )

Inventory – Sample Aggregations

• Aggregate total quantity for a product

db.inventory.aggregate( [ { $match : { productId : "p200" } }, { $unwind : "$vars" }, { $group : { _id : "result", count : { $sum : "$vars.q" } } } ] )

{ "_id" : "result", "count" : 101752 }

• Aggregate number of SKUs in stock for a store

db.inventory.aggregate( [ { $match : { storeId : "store100" } }, { $unwind : "$vars" }, { $match : { "vars.q" : { $gt : 0 } } }, { $group : { _id : "result", count : { $sum : 1 } } } ] )

{ "_id" : "result", "count" : 29347 }

• Aggregate total quantity for a store

db.inventory.aggregate( [ { $match : { storeId : "store100" } }, { $unwind : "$vars" }, { $group : { _id : "result", count : { $sum : "$vars.q" } } } ] )

{ "_id" : "result", "count" : 29347 }
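To make the pipeline stages concrete, here is an in-memory analogue in plain JavaScript: filtering plays the role of `$match`, `flatMap` plays the role of `$unwind`, and `reduce` plays the role of `$group` with `$sum`. This is only an illustration of the logic, not how the server executes it; the function names and the three sample documents are hypothetical.

```javascript
// Analogue of: $match by productId, $unwind "$vars", $group with $sum "$vars.q".
function totalQuantityForProduct(inventoryDocs, productId) {
  return inventoryDocs
    .filter((doc) => doc.productId === productId) // $match
    .flatMap((doc) => doc.vars)                   // $unwind
    .reduce((sum, v) => sum + v.q, 0);            // $group + $sum
}

// Analogue of: $match by storeId, $unwind, $match q > 0, $group with $sum 1.
function skusInStockForStore(inventoryDocs, storeId) {
  return inventoryDocs
    .filter((doc) => doc.storeId === storeId)
    .flatMap((doc) => doc.vars)
    .filter((v) => v.q > 0).length;
}

const inventoryDocs = [
  { storeId: "store0", productId: "p0", vars: [{ sku: "sku1", q: 14 }, { sku: "sku3", q: 7 }] },
  { storeId: "store1", productId: "p0", vars: [{ sku: "sku1", q: 0 }, { sku: "sku7", q: 32 }] },
  { storeId: "store0", productId: "p1", vars: [{ sku: "sku9", q: 5 }] },
];
console.log(totalQuantityForProduct(inventoryDocs, "p0"));
console.log(skusInStockForStore(inventoryDocs, "store1"));
```

The difference between summing `"$vars.q"` and summing `1` is the difference between total units and the number of distinct SKUs counted.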

Inventory – Sample Geo-Query

• Get inventory for an item near a point

db.runCommand( { geoNear : "inventory", near : { type : "Point", coordinates : [-82.8006, 40.0908] }, maxDistance : 10000.0, spherical : true, limit : 10, query : { "productId" : "p200", "vars.sku" : "sku11736" } } )

Inventory – Sample Geo-Query

• Get closest store with available sku

db.runCommand( { geoNear : "inventory", near : { type : "Point", coordinates : [-82.800672, 40.090844] }, maxDistance : 10000.0, spherical : true, limit : 1, query : { productId : "p200", vars : { $elemMatch : { sku : "sku11736", q : { $gt : 0 } } } } } )

Inventory – Sample Geo-Aggregation

• Get count of inventory for an item near a point

db.inventory.aggregate( [
  { $geoNear: {
      near : { type : "Point", coordinates : [-82.800672, 40.090844] },
      distanceField: "distance",
      maxDistance: 10000.0,
      spherical : true,
      query: { productId : "p200", vars : { $elemMatch : { sku : "sku11736", q : {$gt : 0} } } },
      includeLocs: "dist.location",
      num: 5 } },
  { $unwind: "$vars" },
  { $match: { "vars.sku" : "sku11736" } },
  { $group: { _id: "result", count: {$sum: "$vars.q"} } }
])

Inventory – Sample Indices

• { storeId : 1 }

• { productId : 1, storeId : 1 }

• Why not "vars.sku"?

– { productId : 1, storeId : 1, "vars.sku" : 1 }

• { productId : 1, location : "2dsphere" }

Horizontal Scaling

Inventory – Technical Decisions

Inventory

Schema

Indexing

Central

East DC

Inventory – Sharding Topology

West DC, Central DC, Legacy

Inventory

Primary

Inventory – Shard Key

• Choose shard key

– { productId : 1, storeId : 1 }

• Set up sharding

– sh.enableSharding("inventoryDB")

– sh.shardCollection( "inventoryDB.inventory", { productId : 1, storeId : 1 } )

Inventory – Shard Tags

• Set up shard tags

– sh.addShardTag("shard0000", "west")

– sh.addShardTag("shard0001", "central")

– sh.addShardTag("shard0002", "east")

• Set up tag ranges

– Add new field: region

– sh.addTagRange("inventoryDB.inventory", { region : 0 }, { region : 100 }, "west" )

– sh.addTagRange("inventoryDB.inventory", { region : 100 }, { region : 200 }, "central" )

– sh.addTagRange("inventoryDB.inventory", { region : 200 }, { region : 300 }, "east" )
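Tag ranges include their lower bound and exclude their upper bound, so a document with `region: 100` belongs to "central", not "west". A minimal sketch of that rule in plain JavaScript, using the three ranges above; the `tagForRegion` helper is hypothetical and exists only to illustrate the boundary behavior.

```javascript
// The tag ranges from the slide: [min, max) maps to a shard tag.
const tagRanges = [
  { min: 0, max: 100, tag: "west" },
  { min: 100, max: 200, tag: "central" },
  { min: 200, max: 300, tag: "east" },
];

// Return the tag whose range contains `region`, or null if no range matches.
// The lower bound is inclusive and the upper bound is exclusive.
function tagForRegion(region, ranges) {
  const match = ranges.find((r) => region >= r.min && region < r.max);
  return match ? match.tag : null;
}

console.log(tagForRegion(99, tagRanges));
console.log(tagForRegion(100, tagRanges));
console.log(tagForRegion(300, tagRanges));
```

Documents whose region falls outside every range are not pinned to a tagged shard, so an application would want its region codes to stay within the configured ranges.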

Insight

MongoDB

Advertising metrics

Clickstream

Recommendations

Session Capture

Activity Logging

Geo Tracking

Product Analytics

Customer Insight

Application Logs

Many user activities can be of interest:

• Search

• Product view, like or wish

• Shopping cart add / remove

• Sharing on social network

• Ad impression, Clickstream

Activity Logging – Data of interest

Will be used to compute:

• Product Map (relationships, etc)

• User Preferences

• Recommendations

• Trends …

Activity Logging – Data of interest

Activity logging - Architecture

MongoDB

HVDF API

Activity Logging, User History

External Analytics: Hadoop, Spark, Storm, …

User Preferences

Recommendations

Trends

Product Map

Apps

Internal Analytics: Aggregation, MR

All user activity is recorded

MongoDB – Hadoop Connector

Personalization

Activity Logging

• Store and manage an incoming stream of data samples

– High arrival rate of data from many sources

– Variable schema of arriving data

– Control retention period of data

• Compute derivative data sets based on these samples

– Aggregations and statistics based on data

– Roll-up data into pre-computed reports and summaries

• Low latency access to up-to-date data (user history)

– Flexible indexing of raw and derived data sets

– Rich querying based on time + meta-data fields in samples

Activity Logging – Problem statement

Activity logging - Requirements

Requirement | MongoDB

Ingestion of 100ks of writes / sec | Fast C++ process, multi-threads, multi-locks. Horizontal scaling via sharding. Sequential IO via time partitioning.

Flexible schema | Dynamic schema, each document is independent. Data is stored in the same format and size as it is inserted.

Fast querying on varied fields, sorting | Secondary Btree indexes can look up and sort the data in milliseconds.

Easy clean up of old data | Deletes are typically as expensive as inserts. Getting free deletes via time partitioning.

Activity Logging using HVDF

HVDF (High Volume Data Feed):

• Open source reference implementation of high volume writing with MongoDB https://github.com/10gen-labs/hvdf

• REST API server written in Java with most popular libraries

• Public project, issues can be logged https://jira.mongodb.org/browse/HVDF

• Can be run as-is, or customized as needed

High volume data feed architecture

Channel

Sample Sample Sample Sample

Source

Processor

Inline Processing

Batch Processing

Stream Processing

Grouping by Feed and Channel

Sources send samples

Processors generate derivative Channels

HVDF -- High Volume Data Feed engine

HVDF – Reference implementation

REST Service API

Processor Plugins

Inline

Stream

Channel Data Storage

Raw Channel

Aggregated Rollup T1

Aggregated Rollup T2

Query Processor Streaming spout

Custom Stream Processing Logic

Incoming Sample Stream

POST /feed/channel/data

GET /feed/channel/data?time=XXX&range=YYY

Real-time Queries

{ _id: ObjectId(),

geoCode: 1, // used to localize write operations

sessionId: "2373BB…",

device: { id: "1234",

type: "mobile/iphone",

userAgent: "Chrome/34.0.1847.131" },

userId: "u123",

type: "VIEW|CART_ADD|CART_REMOVE|ORDER|…", // type of activity

itemId: "301671",

sku: "730223104376",

order: { id: "12520185",

… },

location: [ -86.95444, 33.40178 ],

tags: [ "smartphone", "iphone", … ], // associated tags

timeStamp: Date("2014/04/01 …")

User Activity - Model

Dynamic schema for sample data

Sample 1: { deviceId: XXXX, time: Date(…), type: "VIEW", … }

Channel

Sample 2: { deviceId: XXXX, time: Date(…), type: "CART_ADD", cartId: 123, … }

Sample 3: { deviceId: XXXX, time: Date(…), type: "FB_LIKE" }

Each sample can have

variable fields

Channels are sharded

Shard Key: Customer_id

Sample: { customer_id: XXXX, time: Date(…), type: "VIEW" }

Channel

You choose how to partition samples

Samples can have dynamic schema

Scale horizontally by adding shards

Each shard is highly available

Channels are time partitioned

Channel

Sample Sample Sample Sample Sample Sample Sample Sample

- 2 days - 1 Day Today

Partitioning keeps indexes manageable

This is where all of the writes

happen

Older partitions are read only for

best possible concurrency

Queries are routed only to needed

partitions

Partition 1 Partition 2 Partition N

Each partition is a separate collection

Efficient and space reclaiming

purging of old data
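Because each partition is a separate collection, routing a query and purging old data both reduce to working out collection names from timestamps. The sketch below shows one way this could be done, in plain JavaScript. It assumes one collection per UTC day named `<channel>_<YYYYMMDD>`; that naming scheme and the function names are hypothetical and are not taken from HVDF.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical naming scheme: "<channel>_<YYYYMMDD>" in UTC, one collection per day.
function partitionName(channel, date) {
  const y = date.getUTCFullYear();
  const m = String(date.getUTCMonth() + 1).padStart(2, "0");
  const d = String(date.getUTCDate()).padStart(2, "0");
  return `${channel}_${y}${m}${d}`;
}

// Names of the partitions a query for the time range [from, to] has to touch.
function partitionsForRange(channel, from, to) {
  const names = [];
  const start = Date.UTC(from.getUTCFullYear(), from.getUTCMonth(), from.getUTCDate());
  for (let t = start; t <= to.getTime(); t += DAY_MS) {
    names.push(partitionName(channel, new Date(t)));
  }
  return names;
}

// Partitions older than the retention window; each can be dropped as a whole.
// String comparison is safe here because all names share a prefix and a
// fixed-width date suffix.
function expiredPartitions(existing, channel, now, retentionDays) {
  const cutoff = partitionName(channel, new Date(now.getTime() - retentionDays * DAY_MS));
  return existing.filter((name) => name < cutoff);
}

const touched = partitionsForRange(
  "activity",
  new Date("2014-04-01T22:00:00Z"),
  new Date("2014-04-03T01:00:00Z")
);
const expired = expiredPartitions(
  ["activity_20140330", "activity_20140331", "activity_20140401", "activity_20140402"],
  "activity",
  new Date("2014-04-03T12:00:00Z"),
  2
);
console.log(touched, expired);
```

Dropping a whole collection avoids the per-document delete cost, which is the "free deletes" benefit of time partitioning.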

Dynamic queries on Channels

Channel

AppApp

Indexes

Queries Pipelines Map-Reduce

Create custom indexes on Channels

Use full MongoDB query language to access samples

Use MongoDB aggregation pipelines to access samples

Use MongoDB inline map-reduce to access samples

Full access to field, text, and geo indexing

North America - West

North America - East

Europe

Geographically distributed system

Channel

Source

Sample

Geo shards per location

Clients write local nodes

Single view of channel available

globally

Insight

Insight – Useful Data

Useful data for better shopping:

• User history (e.g. recently seen products)

• User statistics (e.g. total purchases, visits)

• User interests (e.g. likes videogames and SciFi)

• User social network

Insight – Useful Data

Useful data for selling more:

• Cross-selling: people who bought this item had tendency to buy those other items (e.g. iPhone, then bought iPhone case)

• Up-selling: people who looked at this item eventually bought those items (alternative product that may be better)

• Get the recent activity for a user, to populate the "recently viewed" list

db.activities.find({ userId: "u123", time: { $gt: DATE }}).

sort({ time: -1 }).limit(100)

• Get the recent activity for a product, to populate the "N users bought this in the past N hours" list

db.activities.find({ itemId: "301671", time: { $gt: DATE }}).

sort({ time: -1 }).limit(100)

• Indices: time, userId + time, deviceId + time, itemId + time

• All queries should be time bound, since this is a lot of data!

Insight – User History

• Get the recent number of views, purchases, etc for a user

db.activities.aggregate([ { $match: { userId: "u123", time: { $gt: DATE } }}, { $group: { _id: "$type", count: {$sum: 1} } }])

• Get the total recent sales for a user

db.activities.aggregate([ { $match: { userId: "u123", time: { $gt: DATE }, type: "ORDER" }}, { $group: { _id: "result", count: {$sum: "$totalPrice"} } }])

• Get the recent number of views, purchases, etc for an item

db.activities.aggregate([ { $match: { itemId: "301671", time: { $gt: DATE } }}, { $group: { _id: "$type", count: {$sum: 1} } }])

• Those aggregations are very fast, real-time

Insight – User Stats

• Number of activities per unique visitor for the past hour. Calculating uniques is hard for any system!

db.activities.aggregate([ { $match: { time: { $gt: NOW-1H } }}, { $group: { _id: "$userId", count: {$sum: 1} } }], { allowDiskUse: true })

• The aggregation above can have issues (final grouping on a single shard, result not persisted). Map Reduce is a better alternative here

var map = function() { emit(this.userId, 1); }
var reduce = function(key, values) { return Array.sum(values); }
db.activities.mapReduce(map, reduce, { query: { time: { $gt: NOW-1H } }, out: { replace: "lastHourUniques", sharded: true } })

db.lastHourUniques.find({ _id: "u123" }) // number of activities for a user
db.lastHourUniques.count() // total uniques

Insight – User Stats

User Activity – Items bought together

Time to cross-sell!

Let's simplify each activity recorded as the following:

{ userId: "u123", type: "order", itemId: 2, time: DATE }

Calculate items bought by a user with Map Reduce:

- Match activities of type "order" for the past 2 weeks

- map: emit the document by userId

- reduce: push all itemId in a list

- Output looks like { _id: "u123", items: [2, 3, 8] }

Then run a 2nd mapreduce job from the previous output to compute the number of occurrences of each item combination:

- query: go over all documents (1 document per userId)

- map: emit every combination of 2 items, starting with lowest itemId

- reduce: sum up the total.

- output looks like { _id: { a: 2, b: 3 } , count: 36 }
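The two jobs described above can be illustrated in memory with plain JavaScript. This is a sketch of the logic only, not the Map Reduce code itself: the function names and sample activities are hypothetical, and collecting each user's items into a set (so repeat purchases count once) is an assumption.

```javascript
// Job 1: collect the ordered itemIds per user
// (map: emit by userId, reduce: push itemIds into a list).
function itemsByUser(activities) {
  const result = new Map();
  for (const a of activities) {
    if (a.type !== "order") continue;
    if (!result.has(a.userId)) result.set(a.userId, new Set());
    result.get(a.userId).add(a.itemId); // assumption: repeat purchases count once
  }
  return result;
}

// Job 2: emit every pair of items per user, lowest itemId first, and sum the counts.
function countPairs(userItems) {
  const counts = new Map();
  for (const items of userItems.values()) {
    const sorted = [...items].sort((x, y) => x - y);
    for (let i = 0; i < sorted.length; i++) {
      for (let j = i + 1; j < sorted.length; j++) {
        const key = `${sorted[i]}|${sorted[j]}`;
        counts.set(key, (counts.get(key) || 0) + 1);
      }
    }
  }
  return counts;
}

const activities = [
  { userId: "u1", type: "order", itemId: 2 },
  { userId: "u1", type: "order", itemId: 3 },
  { userId: "u1", type: "order", itemId: 8 },
  { userId: "u2", type: "order", itemId: 3 },
  { userId: "u2", type: "order", itemId: 2 },
  { userId: "u3", type: "order", itemId: 8 },
  { userId: "u3", type: "view", itemId: 2 },
];
const pairCounts = countPairs(itemsByUser(activities));
console.log([...pairCounts.entries()]);
```

Emitting each pair with the lowest itemId first guarantees that (2, 3) and (3, 2) land on the same key, so their counts are summed together.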

Then obtain the most popular combinations per item:

- Index created on { "_id.a": 1, count: 1 } and { "_id.b": 1, count: 1 }

- Query with a threshold, for a given itemId (here 2):

- db.combinations.find( { "_id.a": 2, count: { $gt: 10 } } ).sort({ count: -1 })

- db.combinations.find( { "_id.b": 2, count: { $gt: 10 } } ).sort({ count: -1 })

Later we can create a more compact recommendation collection that includes popular combinations with weights, like:

{ itemId: 2, recom: [ { itemId: 32, weight: 36},

{ itemId: 158, weight: 23}, … ] }
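A compact recommendation document of this shape could be derived from the combination counts roughly as follows. This is a minimal in-memory sketch in plain JavaScript; the function name `buildRecommendations`, the threshold, the per-item cap and the sample counts other than those shown above are assumptions made for illustration.

```javascript
// `combinations` has the shape of the second job's output: { _id: { a, b }, count }.
// Keeps pairs with count above minCount and at most maxPerItem entries per item.
function buildRecommendations(combinations, minCount, maxPerItem) {
  const byItem = new Map();
  const add = (itemId, otherId, weight) => {
    if (!byItem.has(itemId)) byItem.set(itemId, []);
    byItem.get(itemId).push({ itemId: otherId, weight });
  };
  for (const c of combinations) {
    if (c.count <= minCount) continue; // same idea as count: { $gt: 10 }
    add(c._id.a, c._id.b, c.count);    // a pair recommends in both directions
    add(c._id.b, c._id.a, c.count);
  }
  return [...byItem.entries()].map(([itemId, recom]) => ({
    itemId,
    recom: recom.sort((x, y) => y.weight - x.weight).slice(0, maxPerItem),
  }));
}

const combinations = [
  { _id: { a: 2, b: 158 }, count: 23 },
  { _id: { a: 2, b: 32 }, count: 36 },
  { _id: { a: 2, b: 7 }, count: 4 },
];
const recommendations = buildRecommendations(combinations, 10, 5);
console.log(JSON.stringify(recommendations.find((r) => r.itemId === 2)));
```

Storing the result as one document per item means the product page can fetch its recommendations with a single lookup by `itemId`.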

User Activity – Hadoop integration

[Figure: enterprise stack. Applications: CRM, ERP, Collaboration, Mobile, BI. Data Management: Operational, Analytical. Infrastructure: OS & Virtualization, Compute, Storage, Network.]

Commerce

Applications:

• Products & Inventory

• Recommended products

• Customer profile

• Session management

Analysis:

• Elastic pricing

• Recommendation models

• Predictive analytics

• Clickstream history

MongoDB Connector for Hadoop

Connector Overview

Read/Write MongoDB

Read/Write BSON

MapReduce

Platforms

Apache Hadoop

Cloudera CDH

Hortonworks HDP

Amazon EMR

Connector Features and Functionality

• Open-source on github https://github.com/mongodb/mongo-hadoop

• Computes splits to read data

– Single Node, Replica Sets, Sharded Clusters

• Mappings for Pig and Hive

– MongoDB as a standard data source/destination

• Support for

– Filtering data with MongoDB queries

– Authentication

– Reading from Replica Set tags

– Appending to existing collections

MapReduce Configuration

• MongoDB input

– mongo.job.input.format = com.mongodb.hadoop.MongoInputFormat

– mongo.input.uri = mongodb://mydb:27017/db1.collection1

• MongoDB output

– mongo.job.output.format = com.mongodb.hadoop.MongoOutputFormat

– mongo.output.uri = mongodb://mydb:27017/db1.collection2

• BSON input/output

– mongo.job.input.format = com.mongodb.hadoop.BSONFileInputFormat

– mapred.input.dir = hdfs:///tmp/database.bson

– mongo.job.output.format = com.mongodb.hadoop.BSONFileOutputFormat

– mapred.output.dir = hdfs:///tmp/output.bson

Pig Mappings

• Input: BSONLoader and MongoLoader

data = LOAD 'mongodb://mydb:27017/db.collection' using com.mongodb.hadoop.pig.MongoLoader

• Output: BSONStorage and MongoInsertStorage

STORE records INTO 'hdfs:///output.bson' using com.mongodb.hadoop.pig.BSONStorage

Hive Support

CREATE TABLE mongo_users (id int, name string, age int)
STORED BY "com.mongodb.hadoop.hive.MongoStorageHandler"
WITH SERDEPROPERTIES("mongo.columns.mapping" = "_id,name,age")
TBLPROPERTIES("mongo.uri" = "mongodb://host:27017/test.users")

• Access collections as Hive tables

• Use with MongoStorageHandler or BSONStorageHandler

Thank You!

Antoine Girbal, Principal Solutions Engineer, MongoDB Inc., @antoinegirbal
