Market Basket Analysis Algorithm with no-SQL DB HBase and Hadoop

Jongwook Woo

HiPIC

CSULA

EDB 2011 (Songdo Park Hotel, Incheon, Korea, Aug. 25-27, 2011)

Seon Ho Kim, PhD @Integrated Media Systems Center, USC

Jongwook Woo (PhD), Siddharth Basopia, Yuhang Xu @CSULA

High-Performance Internet Computing Center (HiPIC)

Computer Information Systems Department

California State University, Los Angeles

Contents

Map/Reduce Brief Introduction

Market Basket Analysis

Map/Reduce Algorithm for MBA

NoSQL HBase

Experimental Result

Conclusion

What is Map/Reduce and NoSQL DB on Cloud Computing

[Figure: Cloudera, HortonWorks, AWS, NoSQL DB]

Big Data for RDBMS

Issues in RDBMS

Hard to scale
– Relation gets broken
• Partitioning for scalability
• Replication for availability

Speed
– The seek times of physical storage
• Slower than N/W speed
• 1TB disk: 10MB/s transfer rate
– 100Ksec => about 27.8 hrs
– Multiple data sources at different places
• 100 10GB disks: each 10MB/s transfer rate, read in parallel
– 1,000 sec => about 16.7 min
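The disk arithmetic above can be checked with a few lines of Python. This is only a sketch: it assumes a sequential rate of 10 megabytes per second per disk (the rate implied by the 100 Ksec figure for 1 TB) and decimal units; the function and variable names are made up for illustration.

```python
# Assumed rate: 10 MB/s per disk, decimal units (1 TB = 10**12 bytes).
RATE_BYTES_PER_SEC = 10 * 10**6


def transfer_seconds(size_bytes, rate=RATE_BYTES_PER_SEC):
    """Seconds needed to stream size_bytes sequentially off one disk."""
    return size_bytes / rate


# One 1 TB disk read from start to end.
single_disk = transfer_seconds(10**12)

# 100 disks of 10 GB each, all read at the same time:
# the wall-clock time is the time for one 10 GB disk.
parallel_disks = transfer_seconds(10 * 10**9)

print(f"1 TB on one disk: {single_disk:,.0f} s ({single_disk / 3600:.1f} h)")
print(f"100 x 10 GB in parallel: {parallel_disks:,.0f} s ({parallel_disks / 60:.1f} min)")
```

Under these assumptions the parallel layout is 100 times faster, which is the motivation for spreading data over many nodes.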

Big Data for RDBMS (Cont’d)

Issues in RDBMS (Cont’d)

Data Integration
– Much unstructured data
• Web data

Solution

Map/Reduce
– (Key, Value) parallel computing
– Apache Hadoop

Big Data => Data Cleansing by Hadoop => Data Repositories (Pig, Hive, Mahout, HBase, Cassandra, MongoDB) => Business Intelligence (Data Mining, OLAP, Data Visualization, Reporting)

What is MapReduce

Functions borrowed from functional programming languages (e.g., Lisp)

Provides a restricted parallel programming model: Hadoop

User implements Map() and Reduce()

Libraries (Hadoop) take care of EVERYTHING else
– Parallelization
– Fault Tolerance
– Data Distribution
– Load Balancing

Useful for huge (peta- or tera-bytes) but non-complicated data

Log files for web companies
New York Times case

Map

Convert input data to (key, value) pairs

map() functions run in parallel, creating different intermediate (key, value) pairs from different input data sets

Reduce

reduce() combines those intermediate values into one or more final values for that same key

reduce() functions also run in parallel, each working on a different output key

Bottleneck: the reduce phase can’t start until the map phase is completely finished.

Example: Sort URLs in the largest hit order

Map()

Input <logFilename, file text>
Parses file and emits <url, hit counts> pairs
– e.g., <http://hello.com, 1>

Reduce()

Sums all values for the same key and emits <url, TotalCount>
– e.g., <http://hello.com, (3 5 2 7)> => <http://hello.com, 17>
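The Map() and Reduce() steps of this example can be imitated on one machine in plain Python. The sketch below is not Hadoop code: the log format (one URL per line) and the names `map_log`, `reduce_hits`, and `run` are assumptions made for illustration.

```python
from collections import defaultdict


def map_log(log_filename, file_text):
    """Map(): emit a (url, 1) pair for every URL in the log text.

    Assumes a hypothetical log format with one URL per line.
    """
    for line in file_text.splitlines():
        url = line.strip()
        if url:
            yield url, 1


def reduce_hits(url, counts):
    """Reduce(): sum all values that share the same key."""
    return url, sum(counts)


def run(logs):
    """Group the map output by key, reduce each key, sort by largest hits."""
    grouped = defaultdict(list)
    for name, text in logs.items():
        for url, count in map_log(name, text):
            grouped[url].append(count)
    totals = [reduce_hits(url, counts) for url, counts in grouped.items()]
    return sorted(totals, key=lambda kv: kv[1], reverse=True)
```

Calling `run` with a dictionary of `{filename: text}` returns the URLs ordered by total hit count; in a real cluster the grouping step is done by the framework between the map and reduce phases.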

Legacy Example

In late 2007, the New York Times wanted to make its entire archive of articles, 11 million in all, dating back to 1851, available over the web.
– The archive was a four-terabyte pile of images in TIFF format.
– The Times needed to translate that four-terabyte pile of TIFFs into more web-friendly PDF files.
– Not a particularly complicated but a large computing chore,
• requiring a whole lot of computer processing time.

A software programmer at the Times, Derek Gottfrid,
– playing around with Amazon Web Services, Elastic Compute Cloud (EC2),
• uploaded the four terabytes of TIFF data into Amazon's Simple Storage Service (S3).
• In less than 24 hours, 11 million PDFs, all stored neatly in S3 and ready to be served up to visitors to the Times site.

The total cost for the computing job? $240
– 10 cents per computer-hour times 100 computers times 24 hours

noSQL DBs

Key/Value, Column, Document

Column-Oriented DB

HBase

Fast Index on large amount of data

Lookup by one or more keys (key/value)

NoSQL normally supports MapReduce

Data Store of noSQL DB

Key/Value store

– (Key, Value)
– Index, versioning, sorting, locking, transaction, replication
– Memcached

Document Store
– Search Engine/Repository
– Multiple indexes to store indexed documents
– No locking, replication, transaction
– MongoDB, CouchDB, ThruDB, SimpleDB

Column-Oriented Stores (Extensible Record Stores)
– Extensible records horizontally and vertically partitioned across nodes
– Rows and Columns are distributed over multiple nodes
– BigTable, HBase, Cassandra, Hypertable

Market Basket Analysis (MBA)

Collect the list of pairs of transaction items that most frequently occur together at a store(s)

Traditional Business Intelligence Analysis
– much better opportunity to make a profit by controlling the order of products and marketing
– control the stocks more intelligently
– arrange items on shelves
– promote items together, etc.

Market Basket Analysis (MBA)

Transactions in Store A: Input data
Transaction 1: cracker, icecream, beer
Transaction 2: chicken, pizza, coke, bread
Transaction 3: baguette, soda, herring, cracker, beer
Transaction 4: bourbon, coke, turkey
Transaction 5: sardines, beer, chicken, coke
Transaction 6: apples, peppers, avocado, steak
Transaction 7: sardines, apples, peppers, avocado, steak
…

Which pair of items do people frequently buy together at Store A?

Map Algorithm

1: Read each transaction of the input file and generate the data set of the items:

(<V1>, <V2>, …, <Vn>) where <Vn>: (vn1, vn2, …, vnm)

2: Loop for each item from vn1 to vnm of <Vn>

2.a: generate the data set <Yn>: (yn1, yn2, …, ynl), where ynl: (unx, uny) and unx ≠ uny

2.b: sort each pair (unx, uny) so that it forms a consistent key; note: (key, value) = (ynl, number of occurrences)

2.c: increment the occurrence of ynl

3: The data set is created as input of the Reducer:

(key, <value>) = (ynl, <number of occurrences>)
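The map steps above can be sketched in Python as follows. This is an illustration rather than the authors' Hadoop code: it assumes each transaction is one comma-separated line, and the function name `map_transaction` is hypothetical. Building the pairs from a sorted item list covers steps 2.a and 2.b together, since every emitted pair is already in key order and never pairs an item with itself.

```python
from itertools import combinations


def map_transaction(line):
    """Emit ((item_a, item_b), 1) for every item pair in one transaction.

    Assumes a comma-separated transaction line. Duplicate items inside a
    transaction are collapsed, which is a design choice of this sketch.
    """
    items = sorted({item.strip() for item in line.split(",") if item.strip()})
    # combinations() over a sorted list yields each unordered pair once,
    # with the smaller item first, so the pair can serve directly as a key.
    for pair in combinations(items, 2):
        yield pair, 1


for key, value in map_transaction("cracker, icecream, beer"):
    print(key, value)
```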

Reduce Algorithm

1: Take (ynl, <number of occurrences>) as input data from multiple Map nodes

2: Add the values for ynl to have (ynl, total number of occurrences) as output
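A minimal Python sketch of the reduce steps, assuming the output of each Map node is available as a list of `(pair, count)` tuples; the name `reduce_counts` and the sample node outputs are hypothetical.

```python
from collections import Counter


def reduce_counts(map_outputs):
    """Add up the occurrences of each item pair across all Map nodes."""
    totals = Counter()
    for node_output in map_outputs:
        for pair, count in node_output:
            totals[pair] += count
    return totals


# Hypothetical outputs of two Map nodes.
node_1 = [(("beer", "cracker"), 1), (("coke", "pizza"), 1)]
node_2 = [(("beer", "cracker"), 1)]

totals = reduce_counts([node_1, node_2])
print(totals.most_common())
```

In Hadoop, each reducer only sees the keys routed to it; this single-process version merges everything in one place for readability.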

Market Basket Analysis (Cont’d)

1. Transactions in Store A

Transaction 1: cracker, icecream, beer
Transaction 2: chicken, pizza, coke, bread
…

2. Distribute Transaction data to Map nodes

3. Pair of Items restructured in each Map node

Transaction 1: < (cracker, icecream), (cracker, beer) , (beer, icecream)>

Transaction 2: < (chicken, pizza), (chicken, coke), (chicken, bread) , (coke, pizza), (bread, pizza), (coke , bread)>

…


Note: the items within each pair should be sorted, as the pair becomes a key

For example, (cracker, icecream) and (icecream, cracker) should both become (cracker, icecream)

3. Pair of Items sorted in MBA

Transaction 1: < (cracker, icecream), (beer, cracker) , (beer, icecream)>

Transaction 2: < (chicken, pizza), (chicken, coke), (bread, chicken) , (coke, pizza), (bread, pizza), (bread, coke)>

…


4. Output of Map node

Pair of Items in (key, value) structure in each Map node

(key, value): (pair of items, number of occurrences)

((cracker, icecream), 1)
((beer, cracker), 1)
((beer, icecream), 1)
((chicken, pizza), 1)
((chicken, coke), 1)
((bread, chicken), 1)
((coke, pizza), 1)
((bread, pizza), 1)
((bread, coke), 1)
…


5. Data Aggregation/Combine

(key, <value>): (pair of items, list of numbers of occurrences)

((cracker, icecream), <1, 1, …, 1>)

((beer, cracker), <1, 1, …, 1>)

((beer, icecream), <1, 1, …, 1>)

((chicken, pizza), <1, 1, …, 1>)

…


6. Reduce nodes

(key, value): (pair of items, total number of occurrences)

((cracker, icecream), 421)

((beer, cracker), 341)

((beer, icecream), 231)

((chicken, pizza), 111)

…
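Steps 1 to 6 of this walk-through can be traced end to end on the seven sample transactions of Store A. The sketch below runs in a single Python process and only imitates the data flow; the function names (`map_phase`, `aggregate`, `reduce_phase`) are assumptions, and the totals it produces are for the seven-line sample, not the illustrative counts shown in step 6.

```python
from collections import defaultdict
from itertools import combinations

TRANSACTIONS = [
    "cracker, icecream, beer",
    "chicken, pizza, coke, bread",
    "baguette, soda, herring, cracker, beer",
    "bourbon, coke, turkey",
    "sardines, beer, chicken, coke",
    "apples, peppers, avocado, steak",
    "sardines, apples, peppers, avocado, steak",
]


def map_phase(transactions):
    """Steps 3-4: emit (sorted item pair, 1) for each transaction."""
    for line in transactions:
        items = sorted({item.strip() for item in line.split(",")})
        for pair in combinations(items, 2):
            yield pair, 1


def aggregate(pairs):
    """Step 5: collect the values of each key into a list <1, 1, ..., 1>."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Step 6: sum the list of each key into a total number of occurrences."""
    return {key: sum(values) for key, values in grouped.items()}


totals = reduce_phase(aggregate(map_phase(TRANSACTIONS)))
# Most frequent pairs first; ties are broken alphabetically.
ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
for pair, count in ranked[:5]:
    print(pair, count)
```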

Map/Reduce for MBA

[Figure: data flow of Map/Reduce for MBA]

Input Trax Data => Map1(), Map2(), …, Mapm()

Map output: ((coke, pizza), 1) ((beer, corn), 1) … and ((ham, juice), 1) ((coke, pizza), 1) …

Data Aggregation/Combine: ((coke, pizza), <1, 1, …, 1>) ((ham, juice), <1, 1, …, 1>)

Reduce1(), Reduce2(), …, Reducel() => ((coke, pizza), 3,421) ((ham, juice), 2,346)

HBase Schema

Input Data: (Key, [Column Family, Column Item])
– (Transaction #, [Items, Items:List])

Key     | Items (column family): Items:List
Trax 1  | cracker, icecream, beer
Trax 2  | chicken, pizza, coke, bread
…       | …

HBase Schema

Output Item Pairs after Map/Reduce computing: (Key, [Column Family, Column Item])
– (Item Pair, [Items, Items:Count])

Key            | Items (column family): Items:Count
(coke, pizza)  | 3,421
(ham, juice)   | 2,346
…              | …
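To make the two schemas concrete, the sketch below builds rows in the `(row key, {b"family:qualifier": value})` shape, which is the form a Python client such as happybase accepts in `Table.put(row, data)`. No connection to HBase is made here; the helper names `input_row` and `output_row` and the exact text encoding of keys and values are assumptions, not the layout used in the experiment.

```python
def input_row(trax_no, items):
    """Input table row: key = transaction #, one cell in column Items:List."""
    row_key = f"Trax {trax_no}".encode()
    return row_key, {b"Items:List": ", ".join(items).encode()}


def output_row(pair, count):
    """Output table row: key = item pair, one cell in column Items:Count."""
    row_key = f"({pair[0]}, {pair[1]})".encode()
    return row_key, {b"Items:Count": str(count).encode()}


print(input_row(1, ["cracker", "icecream", "beer"]))
print(output_row(("coke", "pizza"), 3421))
```

Keeping the item pair as the row key means a lookup of one pair's count is a single-row read, which matches the key-based access pattern of a column-oriented store.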

Experimental Result

4 transaction files for the experiment:
– 64MB (1.1M transactions), 128MB (2.2M transactions), 400MB (6.7M transactions), 800MB (13M transactions)

Run on small instances of AWS EC2
– each node has a 1.0-1.2 GHz 2007 Opteron or Xeon processor
– 1.7GB memory
– 350GB storage on a 32-bit platform

Install and run both Hadoop and HBase with Apache Whirr
– Blog: “Market Basket Analysis Example in Hadoop”, Jongwook Woo, http://dal-cloudcomputing.blogspot.com/2011/03/market-basket-analysis-example-in.html, March 2011

The data are executed on 5, 10, and 15 nodes

Experimental Result on HDFS

[Figure: execution time (msec) vs. number of instances (5, 10, 15) for the 64M, 128M, 400M, and 800M data sets; y-axis from 100000 to 350000 msec]

Experimental Result on HDFS and HBase

[Figure: execution time (msec) vs. number of instances (5, 10, 15) for 64M (HBase), 128M (HBase), 64M (HDFS), and 128M (HDFS); y-axis from 0 to 2000000 msec]

Conclusion

The Market Basket Analysis Algorithm on Map/Reduce and HBase is presented
– a data mining analysis to find the most frequently occurring pair of products in baskets at a store.

The associated items can be paired with the Map/Reduce approach.

Shows the possibility and performance of using HBase for Market Basket Analysis data.

Once we have the paired items, they can be used for more studies by statistically analyzing them, even sequentially, which is beyond this paper
– Support
– Confidence

Parallelism is negatively affected by a bottleneck in distributing, aggregating, and reducing the data set among nodes.
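As a pointer to the follow-up analysis named above, support and confidence can be computed directly from the pair counts. The sketch uses the standard association-rule definitions, which are not part of the presented algorithm; the function names and the example counts (taken from the seven sample transactions of Store A) are for illustration only.

```python
def support(pair_count, n_transactions):
    """Fraction of all transactions that contain both items of the pair."""
    return pair_count / n_transactions


def confidence(pair_count, antecedent_count):
    """Of the transactions containing item A, the fraction that also contain B."""
    return pair_count / antecedent_count


# In the seven sample transactions: beer appears in 3, cracker in 2,
# and the pair (beer, cracker) appears in 2.
n = 7
print(f"support(beer, cracker)     = {support(2, n):.3f}")
print(f"confidence(beer -> cracker) = {confidence(2, 3):.3f}")
print(f"confidence(cracker -> beer) = {confidence(2, 2):.3f}")
```

Confidence needs the count of single items as well as pairs, so an extra single-item counting pass would be required on top of the pair counts.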
