Using Data Mining and Machine Learning in Retail
Omeid Seide, Senior Manager, Big Data Solutions, Sears Holdings
Bharat Prasad, Big Data Solution Architect, Sears Holdings
The Challenge
Shortened processing windows
Escalating costs
Hitting scalability ceilings
Demanding business requirements
ETL complexity
Latency in data
Tight IT budgets
Growing data volumes
Over a Century of Innovation
A Fortune 100 company, nearly $40 billion in annual revenue.
The nation’s fourth largest broad line retailer with almost 2,500 full-line and specialty retail stores in the US and Canada.
A front runner in Big Data efforts including driving personalized marketing and generating savings from legacy migration.
Running one of the biggest rewards programs that captures and analyzes a very large number of customer transactions quickly.
Big Data can no longer be defined by the amount of data alone, but also by the type, speed, and storage capacity needed to compute and analyze that data.
What is Big Data?
We are creating so much data, so quickly, that 90% of the data in the world today has been created in the last 2 years.
Data, Data, and More Data
With traditional computer processing, it can be difficult to compute everything because of limits on storage space, processing time, and cost.
This typically leads to incomplete computations, data latency, and overall lack of quality analysis.
Hadoop brings horizontal scalability, very large storage capacity, and fast parallel data processing.
The Problem with Large Scale Data Processing
Runs applications on a large cluster built of commodity hardware.
Provides reliability and data motion to applications.
Implements a computational paradigm named MapReduce.
• Applications are divided into small fragments of work for execution or re-execution on any node in the cluster.
Provides a distributed file system (HDFS) that stores data on compute nodes, resulting in high aggregate bandwidth across the cluster. Both MapReduce and HDFS automatically handle node failures.
Apache Hadoop is a framework which:
Enter Hadoop
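The MapReduce idea described on this slide can be sketched in a few lines. This is a single-process illustration only, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample baskets are made up for the example. On a real cluster each map and reduce call would be an independent fragment of work that can run, or be rerun, on any node.

```python
from collections import defaultdict
from itertools import chain


def map_phase(basket):
    """Map: emit an (item, 1) pair for every item in one basket."""
    return [(item, 1) for item in basket]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single total."""
    return key, sum(values)


def run_job(baskets):
    """Run map, shuffle, and reduce in sequence (in-process, for illustration)."""
    mapped = chain.from_iterable(map_phase(b) for b in baskets)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical sample data: each inner list is one transaction.
baskets = [["milk", "bread"], ["milk", "diapers", "beer"], ["bread", "milk"]]
item_counts = run_job(baskets)
```

Because each basket is mapped independently and each key is reduced independently, the same job can be spread across many machines without changing the logic.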
Scalability: Hadoop is “horizontally scalable.”
• Easily stores and processes petabytes of data, just by adding hardware.
Economical: Uses commodity hardware.
Efficient: Extremely powerful processing ability.
Reliability: Data is replicated three times by default, on different nodes; failed tasks are rerun.
Storage space & Capacity: Central Repository; Keep everything forever.
Why Use Hadoop?
How can I better manage my inventory?
How can I better understand my customers’ buying habits?
How can I detect fraudulent activity?
How can I create better targeted interaction with my customer?
How do I get customers to purchase more products?
Big Data Analytics in Retail
The Evolution of Data Analysis
Top Apache Foundation software project
Uses Scalable Machine Learning algorithms
Collection of pre-built data-mining libraries
Primary focus on collaborative filtering, clustering & classification
Includes a Java-based math library for common math operations
Uses MapReduce paradigm
What is Mahout?
Examples of Data Mining & Machine Learning
Clustering
Recommendation Systems
Market Basket Analysis
3 Primary Algorithms
A process of grouping things so that ‘like items’ end up together with the other items they most closely resemble.
Clustering
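One common way to group "like items" is k-means. The sketch below is a plain-Python illustration, not Mahout's implementation; the customer features (visits per month, average basket value), the starting centroids, and the function name `kmeans` are all assumptions chosen for the example.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Minimal k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of its members, until nothing changes."""
    centroids = [tuple(c) for c in centroids]
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: recompute each centroid as the mean of its members.
        new_centroids = []
        for members, old in zip(clusters, centroids):
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, clusters


# Hypothetical customers: (visits per month, average basket value in dollars).
customers = [(1, 20), (2, 25), (1, 22), (8, 90), (9, 95), (10, 100)]
centroids, clusters = kmeans(customers, centroids=[(1, 20), (10, 100)])
```

With this data the occasional low-spend shoppers and the frequent high-spend shoppers fall into separate clusters, which is the kind of grouping a targeted campaign could start from.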
Why use clustering?
To better understand a customer’s buying behavior
To develop targeted marketing campaigns
To understand interest, motivation, and lifestyle, in order to more effectively move merchandise in and out of stores
Motivation behind Clustering
An information filtering system that is used to predict a user's rating or preference, typically using a collaborative, content-based, or hybrid approach to recommendations.
Recommendation Systems
Framework that filters and recommends items based on user behavior, preferences, and activities, and on users' similarities to others.
Recommenders:
• User-based
• Item-based
Online and offline support
Can utilize Hadoop
Uses numerous similarity measurements, such as cosine, log-likelihood ratio (LLR), Tanimoto, Pearson, and more.
Collaborative Filtering
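To make the user-based approach concrete, here is a small sketch in plain Python. It is not Mahout's API: the `ratings` data, the user and item names, and the functions `cosine`, `tanimoto`, and `recommend` are hypothetical. It scores items a user has not rated by the ratings of similar users, weighted by similarity.

```python
import math

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}.
ratings = {
    "alice": {"drill": 5, "hammer": 4, "saw": 1},
    "bob": {"drill": 4, "hammer": 5, "wrench": 4},
    "carol": {"saw": 5, "paint": 4},
}


def cosine(u, v):
    """Cosine similarity between two users' rating vectors."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)


def tanimoto(u, v):
    """Tanimoto (Jaccard) similarity: overlap of rated items, ignoring values."""
    a, b = set(u), set(v)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(user, ratings, similarity=cosine, top_n=3):
    """User-based recommendation: rank unseen items by similarity-weighted ratings."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = similarity(ratings[user], other_ratings)
        if sim <= 0:
            continue
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


suggestions = recommend("alice", ratings)
```

Swapping `similarity=tanimoto` into `recommend` changes only how "similar" is measured, which is why a recommender framework typically treats the similarity measure as a pluggable choice.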
Looks at the item's features and the user's preferences in order to provide a recommendation.
Allows for highly precise recommendations.
Has difficulty making recommendations across different categories of service, which limits its use for cross-selling.
[Figure: Users A, B, and C and their ratings form a user profile (feature values of content used in the past). Contents X, Y, and Z form content profiles (feature values). The two profiles are matched, and content with similar feature values is recommended.]
Content-Based Filtering
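The matching step in the figure can be sketched as follows. This is an illustrative example only: the feature names, the feature values for contents X, Y, and Z, and the functions `build_user_profile` and `recommend_content` are assumptions. The user profile is taken to be the average feature vector of content used in the past, and candidates are ranked by cosine similarity to that profile.

```python
import math

# Hypothetical content profiles: content -> feature values.
content_profiles = {
    "X": {"tools": 1.0, "outdoor": 0.0, "kids": 0.0},
    "Y": {"tools": 0.8, "outdoor": 0.6, "kids": 0.0},
    "Z": {"tools": 0.0, "outdoor": 0.1, "kids": 1.0},
}


def build_user_profile(used, profiles):
    """User profile: average feature values of the content used in the past."""
    features = {f for name in used for f in profiles[name]}
    return {f: sum(profiles[n].get(f, 0.0) for n in used) / len(used)
            for f in features}


def cosine(a, b):
    """Cosine similarity between two sparse feature vectors."""
    features = set(a) | set(b)
    dot = sum(a.get(f, 0.0) * b.get(f, 0.0) for f in features)
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend_content(used, profiles, top_n=2):
    """Rank unused content by how closely its features match the user profile."""
    user_profile = build_user_profile(used, profiles)
    candidates = [name for name in profiles if name not in used]
    return sorted(candidates,
                  key=lambda name: cosine(user_profile, profiles[name]),
                  reverse=True)[:top_n]


ranked = recommend_content(["X"], content_profiles)
```

Because the ranking depends only on feature overlap with past content, a user who has only used "tools" content is unlikely to be shown "kids" content, which illustrates the cross-selling limitation noted on the slide.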
A model used to describe a common form of many-to-many relationship between two kinds of objects: items and baskets.
Items: anything that is purchased
Basket: a set of items
The number of items in a basket is typically small, and the number of baskets is typically large
Market-Basket Model
A list of purchasers
• Additional “purchaser” data can be useful (but is not needed)
A list of transactions
Seek to identify purchasing patterns
• What items are normally purchased together?
• What is the purchasing sequence?
• Is there a seasonality effect to purchasing?
Categorize buying behavior
Translate buying behavior into actionable insight
• Targeted promotions
• Inventory placement
• Store layout
• Cross-selling
Market Basket Models
Any set of items that appears regularly within multiple baskets
Originally used to analyze a physical “supermarket basket”
Best used to link pairs of items that are commonly bought together but often have no obvious relationship to each other
Example: Diapers & Beer
A major store chain discovered that diapers and beer were regularly appearing in baskets together. The theory was that if you buy diapers, you likely have a baby at home; with a baby at home, you are less likely to go to a bar to drink and more likely to have a beer at home.
Frequent Itemsets
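Finding frequent pairs can be sketched by counting how many baskets contain each pair and keeping those that meet a minimum count (the support threshold). The baskets below are invented for illustration, and `frequent_pairs` and `min_support` are hypothetical names; this brute-force count is a sketch of the idea, not an optimized algorithm for large data.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Count every item pair per basket and keep pairs seen in at least
    `min_support` baskets."""
    counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so each pair is counted once per basket
        # and always appears in the same order.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical transactions.
baskets = [
    ["diapers", "beer", "milk"],
    ["diapers", "beer"],
    ["milk", "bread"],
    ["diapers", "beer", "bread"],
]
pairs = frequent_pairs(baskets, min_support=3)
```

Here only the diapers and beer pair appears in three of the four baskets, so it is the only pair that survives the threshold.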
Retail Stores
Showroom floor planning
Catalog layout
Cross-selling
Fraud Analysis
Applying Market Basket Models
Big Data Stack
[Figure: a layered stack diagram. Layers shown: Integration Layer, Security Layer, Storage Layer, Computation/Access Layer, Semantic Layer, Consumption Layer. Components shown: Data Governance & Integration (ETL/ELT); Security; Storage (HDFS, on-premises); Storage (HDFS, cloud); NoSQL DB; Metadata; Hive/Pig; Advanced Query; Data Analytics; Data Mining; Data Visualization & Reporting; Real-Time Streaming; Time Series; On Demand; Frequency.]
[Figure: two stacks shown side by side, each with the same layers: Source Layer, Integration Layer, Security Layer, Storage Layer / NoSQL DB, Computation/Access Layer, Semantic Layer, Consumption Layer, and Distribution.]
Open vs Closed Stack
Questions?