
Using Data Mining and Machine Learning in Retail

Omeid Seide, Senior Manager, Big Data Solutions, Sears Holdings

Bharat Prasad, Big Data Solution Architect, Sears Holdings

The Challenge

Shortened processing windows

Escalating costs

Hitting scalability ceilings

Demanding business requirements

ETL complexity

Latency in data

Tight IT budgets

Growing data volumes

Over a Century of Innovation

A Fortune 100 company, nearly $40 billion in annual revenue.

The nation’s fourth largest broad line retailer with almost 2,500 full-line and specialty retail stores in the US and Canada.

A front runner in Big Data efforts including driving personalized marketing and generating savings from legacy migration.

Running one of the biggest rewards programs that captures and analyzes a very large number of customer transactions quickly.


What is Big Data?

Big Data can no longer be defined by the amount of data, but by the type, speed, and storage capacity needed to compute and analyze that data.

Data, Data, and More Data

We are creating so much data, so quickly, that 90% of the data in the world today has been created in the last two years.

The Problem with Large-Scale Data Processing

With traditional computer processing, it can be difficult to compute everything because of limits on storage space, processing time, and cost.

This typically leads to incomplete computations, data latency, and an overall lack of quality analysis.

Hadoop brings horizontal scalability, very large storage capacity, and fast parallel data processing.

Enter Hadoop

Apache Hadoop is a framework that:

Runs applications on a large cluster built of commodity hardware.

Provides reliability and data motion to applications.

Implements a computational paradigm named MapReduce.
• Applications are divided into small fragments of work, each of which can be executed or re-executed on any node in the cluster.

Provides a distributed file system (HDFS) that stores data on the compute nodes, resulting in high aggregate bandwidth across the cluster.

Handles node failures automatically, in both MapReduce and the distributed file system.
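The MapReduce idea described above can be sketched in a few lines of plain Python. This is only a single-process illustration of the map, shuffle, and reduce steps, not Hadoop's actual API; the record layout (`store_id,product,quantity`), the sample data, and all function names are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit (product, quantity) pairs from one transaction line."""
    _store_id, product, quantity = record.split(",")
    yield product, int(quantity)


def shuffle(pairs):
    """Shuffle: group all emitted values by key between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    # Each record is mapped independently of the others, which is what lets
    # a real cluster run (or re-run after a failure) a fragment on any node.
    mapped = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Hypothetical sample data: store_id,product,quantity
SAMPLE = ["s1,wrench,2", "s2,drill,1", "s1,drill,3", "s3,wrench,1"]

if __name__ == "__main__":
    print(run_job(SAMPLE))
```

Because the map step never looks at more than one record, the same job could in principle be split across many machines, with only the grouped keys moving between them.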

Why Use Hadoop?

Scalability: Hadoop is “horizontally scalable.”
• Easily stores and processes petabytes of data, just by adding hardware.

Economical: Uses commodity hardware.

Efficient: Extremely powerful processing ability.

Reliability: Data is replicated three times by default, in different locations; failed tasks are rerun.

Storage Space & Capacity: Central repository; keep everything forever.

Big Data Analytics in Retail

How can I better manage my inventory?

How can I better understand my customers’ buying habits?

How can I detect fraudulent activity?

How can I create better-targeted interactions with my customers?

How do I get customers to purchase more products?

The Evolution of Data Analysis

What is Mahout?

Top-level Apache Software Foundation project

Uses scalable machine learning algorithms

Collection of pre-built data-mining libraries

Primary focus on collaborative filtering, clustering, and classification

Includes a Java-based math library for common math operations

Uses the MapReduce paradigm

Examples of Data Mining & Machine Learning


3 Primary Algorithms

Clustering

Recommendation Systems

Market Basket Analysis

Clustering

A process of grouping things so that ‘like items’ end up together with the items they most closely resemble.
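To make the idea of grouping ‘like items’ concrete, here is a minimal sketch using k-means. The slides do not name a specific clustering algorithm, so k-means is an assumption chosen for illustration, as are the customer features and the naive "first k points" initialization. It needs Python 3.8+ for `math.dist`.

```python
import math


def kmeans(points, k, iterations=10):
    """Tiny k-means: returns (centroids, cluster index for each point)."""
    centroids = [list(p) for p in points[:k]]  # naive init: first k points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignment


# Hypothetical customers: (store visits per month, average basket in dollars)
CUSTOMERS = [(1, 20), (10, 200), (2, 25), (9, 180), (1, 30), (11, 210)]

if __name__ == "__main__":
    centers, labels = kmeans(CUSTOMERS, k=2)
    print(labels)
    print(centers)
```

In practice the features would usually be scaled first, since here the dollar amounts dominate the distance calculation.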

Motivation behind Clustering

Why use clustering?

To better understand a customer’s buying behavior

To develop targeted marketing campaigns

To understand interest, motivation, and lifestyle, in order to move merchandise in and out of stores more effectively

Recommendation Systems

An information filtering system that is used to predict a user’s rating or preference, typically using a collaborative, content-based, or hybrid approach to recommendations.

Collaborative Filtering

Framework that filters and recommends items based on user behavior, preferences, and activities.

Recommendations are based on a user’s similarities to others.

Recommenders
• User-based
• Item-based

Online and offline support
• Can utilize Hadoop

Uses numerous similarity measurements, such as cosine, log-likelihood ratio (LLR), Tanimoto, Pearson, and more.
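Two of the similarity measurements listed above, cosine and Tanimoto, can be sketched as follows, together with a simple user-based recommender built on them. This is plain Python for illustration and does not reflect Mahout's API; the ratings data and the function names are hypothetical.

```python
import math

# Hypothetical ratings: user -> {item: rating}
RATINGS = {
    "ann": {"drill": 5, "saw": 4, "grill": 1},
    "bob": {"drill": 4, "saw": 5, "ladder": 4},
    "cat": {"grill": 5, "patio set": 4},
}


def cosine(a, b):
    """Cosine similarity between two sparse rating vectors."""
    dot = sum(a[i] * b[i] for i in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity: overlap of rated items, ignoring values."""
    union = a.keys() | b.keys()
    return len(a.keys() & b.keys()) / len(union) if union else 0.0


def recommend(user, ratings, similarity=cosine):
    """User-based filtering: score unseen items by similarity-weighted ratings."""
    mine = ratings[user]
    scores = {}
    for other, theirs in ratings.items():
        if other == user:
            continue
        weight = similarity(mine, theirs)
        for item, rating in theirs.items():
            if item not in mine:
                scores[item] = scores.get(item, 0.0) + weight * rating
    # Highest-scoring items first.
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    print(recommend("ann", RATINGS))
    print(recommend("ann", RATINGS, similarity=tanimoto))
```

An item-based recommender follows the same pattern with the roles swapped: similarities are computed between items (columns) rather than between users (rows).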

Content-Based Filtering

Looks at the items and the user’s preferences in order to provide a recommendation.

Allows for highly precise recommendations.

Has difficulty making recommendations across different categories of service, which limits its use for cross-selling.

[Figure: users A, B, and C and their ratings of content used in the past form a user profile of feature values; contents X, Y, and Z form content profiles of feature values. The user profile is matched against the content profiles, and content with similar feature values is recommended.]
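The matching step in content-based filtering can be sketched as below: a user profile is built from the feature values of content used in the past, then compared with each content profile. The feature names, the values, and the choice of cosine similarity for matching are all assumptions made for illustration.

```python
import math

# Hypothetical content profiles: item -> feature values
# (features: outdoor, power tool, kitchen)
CONTENT = {
    "X": [1.0, 0.0, 0.0],
    "Y": [0.9, 0.1, 0.0],
    "Z": [0.0, 0.0, 1.0],
}


def build_user_profile(used_items, content):
    """Average the feature values of the content the user consumed before."""
    vectors = [content[i] for i in used_items]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two dense feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(used_items, content):
    """Rank unseen content by how closely its profile matches the user profile."""
    profile = build_user_profile(used_items, content)
    unseen = [i for i in content if i not in used_items]
    return sorted(unseen, key=lambda i: cosine(profile, content[i]), reverse=True)


if __name__ == "__main__":
    print(recommend(["X"], CONTENT))
```

Because the ranking depends only on feature overlap, an item from an unrelated category scores near zero, which is one way to see why this approach struggles with cross-selling.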

Market-Basket Model

A model used to describe a common form of many-to-many relationship between two kinds of objects: items and baskets.

Items: anything that is purchased

Basket: a set of items

The number of items in a basket is typically small, and the number of baskets is typically large

Market Basket Models

A list of purchasers
• Additional “purchaser” data can be useful (but is not needed)

A list of transactions

Seek to identify purchasing patterns
• What items are normally purchased together?
• What is the purchasing sequence?
• Is there a seasonality effect to purchasing?

Categorize buying behavior

Translate buying behavior into actionable insight
• Targeted promotions
• Inventory placement
• Store layout
• Cross-selling

Frequent Itemsets

Any set of items that appears regularly within multiple baskets

Originally used to analyze a physical “supermarket basket”

Best used to link pairs of items that are commonly bought together but often have no obvious relationship to each other

Example: Diapers & Beer

A major store chain discovered that diapers and beer were regularly appearing in baskets together. The theory was that if you buy diapers, you are likely to have a baby at home. With a baby at home, you are less likely to go to a bar to drink and more likely to have a beer at home.
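Finding frequent pairs such as diapers and beer comes down to counting how often items share a basket. The sketch below counts pair support directly and also computes a simple confidence value; the baskets, the threshold, and the function names are hypothetical, and this brute-force count stands in for whatever algorithm a production system would use.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets; each basket is a set of items.
BASKETS = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"diapers", "beer", "bread"},
    {"milk", "diapers"},
]


def frequent_pairs(baskets, min_support):
    """Return item pairs that appear together in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # Sorting makes (beer, diapers) and (diapers, beer) the same key.
        counts.update(combinations(sorted(basket), 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}


def confidence(baskets, antecedent, consequent):
    """Fraction of baskets containing `antecedent` that also contain `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)


if __name__ == "__main__":
    print(frequent_pairs(BASKETS, min_support=3))
    print(confidence(BASKETS, "diapers", "beer"))
```

Counting every pair works for small baskets, which fits the model's assumption that baskets hold few items even when the number of baskets is large.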

Applying Market Basket Models

Retail Stores

Showroom floor planning

Catalog layout

Cross-selling

Fraud Analysis

Big Data Stack

[Figure: layered architecture diagram. Layers shown: Integration Layer, Security Layer, Storage Layer, Computation/Access Layer, Semantic Layer, Consumption Layer. Component labels: Data Governance & Integration (ETL/ELT); Security; Metadata; Storage (HDFS, on-premises); Storage (HDFS, cloud); NoSQL DB; Hive/Pig advanced query; Data Analytics; Data Mining; Data Visualization & Reporting. Frequency: real-time streaming, time series, on demand.]

Open vs Closed Stack

[Figure: two stacks compared side by side, each showing the layers Source Layer, Integration Layer, Security Layer, Storage Layer / NoSQL DB, Semantic Layer, and Consumption Layer, along with a Distribution label.]

Questions?
