Draft slides for EDB 2011 (Songdo Park Hotel, Incheon, Korea, Aug. 25-27, 2011)
Jongwook Woo
HiPIC
CSULA
Market Basket Analysis Algorithm with no-SQL DB HBase and Hadoop
EDB 2011 (Songdo Park Hotel, Incheon, Korea, Aug. 25-27, 2011)
Seon Ho Kim, PhD @Integrated Media Systems Center, USC
Jongwook Woo (PhD), Siddharth Basopia, Yuhang Xu @CSULA
High-Performance Internet Computing Center (HiPIC)
Computer Information Systems Department
California State University, Los Angeles
Contents
Map/Reduce Brief Introduction
Market Basket Analysis
Map/Reduce Algorithm for MBA
NoSQL HBase
Experimental Result
Conclusion
What is Map/Reduce and NoSQL DB on Cloud Computing
Cloudera, HortonWorks
AWS
NoSQL DB
Big Data for RDBMS
Issues in RDBMS
Hard to scale
– Relation gets broken
• Partitioning for scalability
• Replication for availability
Speed
– The seek times of physical storage
• Slower than network speed
• 1TB disk: 10 MB/s transfer rate
– 100K sec => about 27.8 hrs
– Multiple data sources at different places
• 100 10GB disks: each 10 MB/s transfer rate
– 1,000 sec => about 16.7 min
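The transfer times on this slide come from simple division. The sketch below reproduces that arithmetic, assuming decimal units (1 TB = 10^12 bytes) and reading the quoted rate as 10 megabytes per second, which is the reading under which the 100K sec figure holds. The function name is illustrative only. Under these assumptions, 100 disks of 10GB read in parallel take 1,000 sec, or roughly 16.7 min.

```python
def transfer_seconds(total_bytes: float, disks: int, bytes_per_sec: float) -> float:
    """Time to read total_bytes spread evenly over `disks` disks read in parallel."""
    return total_bytes / disks / bytes_per_sec


TB = 10**12  # assumption: decimal terabyte
MB = 10**6   # assumption: decimal megabyte

# One 1TB disk versus the same data split over 100 disks of 10GB each.
single = transfer_seconds(1 * TB, disks=1, bytes_per_sec=10 * MB)
parallel = transfer_seconds(1 * TB, disks=100, bytes_per_sec=10 * MB)

print(f"1 disk:    {single:,.0f} sec = {single / 3600:.1f} hrs")
print(f"100 disks: {parallel:,.0f} sec = {parallel / 60:.1f} min")
```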
Big Data for RDBMS (Cont’d)
Issues in RDBMS (Cont’d)
Data Integration
– Much unstructured data
• Web data
Solution
Map/Reduce
– (Key, Value) parallel computing
– Apache Hadoop
Big Data => Data Cleansing by Hadoop => Data Repositories (Pig, Hive, Mahout, HBase, Cassandra, MongoDB) => Business Intelligence (Data Mining, OLAP, Data Visualization, Reporting)
What is MapReduce
Functions borrowed from functional programming languages (e.g., Lisp)
Provides a restricted parallel programming model: Hadoop
User implements Map() and Reduce()
Libraries (Hadoop) take care of EVERYTHING else
– Parallelization
– Fault Tolerance
– Data Distribution
– Load Balancing
Useful for huge (petabytes or terabytes) but uncomplicated data
– Log files for web companies
– New York Times case
Map
Convert input data to (key, value) pairs
map() functions run in parallel, creating different intermediate (key, value) values from different input data sets
Reduce
reduce() combines those intermediate values into one or more final values for that same key
reduce() functions also run in parallel, each working on a different output key
Bottleneck: the reduce phase can't start until the map phase is completely finished.
Example: Sort URLs by hit count, largest first
Map()
Input <logFilename, file text>
Parses file and emits <url, hit counts> pairs
– e.g. <http://hello.com, 1>
Reduce()
Sums all values for the same key and emits <url, TotalCount>
– e.g. <http://hello.com, (3 5 2 7)> => <http://hello.com, 17>
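The URL example can be sketched in plain Python without a Hadoop cluster. This is only an illustration under stated assumptions: each log line is taken to hold a single URL, the grouping that Hadoop performs between Map and Reduce is simulated with a dictionary, and the final ordering is a separate sort over the reduced totals. The names `map_hits`, `reduce_hits`, `run`, and the log contents are hypothetical.

```python
from collections import defaultdict


def map_hits(log_filename, file_text):
    """Map: emit a (url, 1) pair for every URL found in the log text."""
    for line in file_text.splitlines():
        url = line.strip()
        if url:
            yield url, 1


def reduce_hits(url, counts):
    """Reduce: sum all values that share the same key."""
    return url, sum(counts)


def run(logs):
    # Simulated shuffle: collect the values emitted for each key.
    grouped = defaultdict(list)
    for name, text in logs.items():
        for url, one in map_hits(name, text):
            grouped[url].append(one)
    totals = [reduce_hits(url, counts) for url, counts in grouped.items()]
    # Order by total hits, largest first.
    return sorted(totals, key=lambda kv: kv[1], reverse=True)


# Hypothetical log files, one URL per line.
logs = {
    "a.log": "http://hello.com\nhttp://world.org\nhttp://hello.com\n",
    "b.log": "http://hello.com\n",
}
ranked = run(logs)
print(ranked)
```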
Legacy Example
In late 2007, the New York Times wanted to make its entire archive of articles, 11 million in all, dating back to 1851, available over the web.
The archive was a four-terabyte pile of images in TIFF format, which needed to be translated into more web-friendly PDF files.
– Not a particularly complicated but large computing chore,
• requiring a whole lot of computer processing time.
A software programmer at the Times, Derek Gottfrid,
– playing around with Amazon Web Services, Elastic Compute Cloud (EC2),
• uploaded the four terabytes of TIFF data into Amazon's Simple Storage Service (S3)
• In less than 24 hours, 11 million PDFs, all stored neatly in S3 and ready to be served up to visitors to the Times site.
The total cost for the computing job? $240
– 10 cents per computer-hour times 100 computers times 24 hours
noSQL DBs
Key/Value, Column, Document
Column-Oriented DB
HBase
Fast index on large amounts of data
Lookup by one or more keys (key/value)
NoSQL normally supports MapReduce
Data Store of noSQL DB
Key/Value store
– (Key, Value)
– Index, versioning, sorting, locking, transaction, replication
– Memcached
Document Store
– Search Engine/Repository
– Multiple indexes to store indexed documents
– No locking, replication, transaction
– MongoDB, CouchDB, ThruDB, SimpleDB
Column-Oriented Stores (Extensible Record Stores)
– Extensible record horizontally and vertically partitioned across nodes
– Rows and columns are distributed over multiple nodes
– BigTable, HBase, Cassandra, Hypertable
Market Basket Analysis (MBA)
Collect the list of pairs of transaction items that most frequently occur together at a store(s)
Traditional Business Intelligence Analysis
Much better opportunity to make a profit by controlling the order of products and marketing
– control the stocks more intelligently
– arrange items on shelves
– promote items together, etc.
Market Basket Analysis (MBA)
Transactions in Store A: Input data
Transaction 1: cracker, icecream, beer
Transaction 2: chicken, pizza, coke, bread
Transaction 3: baguette, soda, herring, cracker, beer
Transaction 4: bourbon, coke, turkey
Transaction 5: sardines, beer, chicken, coke
Transaction 6: apples, peppers, avocado, steak
Transaction 7: sardines, apples, peppers, avocado, steak
…
Which pair of items do people frequently buy together at Store A?
Map Algorithm
1: Read each transaction of the input file and generate the data set of the items:
(<V1>, <V2>, …, <Vn>) where <Vn>: (vn1, vn2, …, vnm)
2: Loop for each item from vn1 to vnm of <Vn>
2.a: generate the data set <Yn>: (yn1, yn2, …, ynl); ynl: (unx, uny) where unx ≠ uny
2.b: sort (unx, uny) so that the pair can serve as a key; note: (key, value) = (ynl, number of occurrences)
2.c: increment the occurrence of ynl
3: The data set is created as input of the Reducer:
(key, <value>) = (ynl, <number of occurrences>)
Reduce Algorithm
1: Take (ynl, <number of occurrences>) as input data from multiple Map nodes
2: Add the values for ynl to have (ynl, total number of occurrences) as output
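The Map and Reduce steps can be sketched together in plain Python. This is an illustration under assumptions, not the authors' Hadoop code: each Map node is modeled as a list of transactions, the mapper counts pairs locally (one reading of step 2.c), and the grouping between Map and Reduce is simulated with a dictionary. The names `mba_map`, `mba_reduce`, `node1`, and `node2` are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def mba_map(transactions):
    """Map: count sorted item pairs across the transactions of one Map node."""
    pair_counts = Counter()
    for items in transactions:
        # Steps 2.a-2.b: every pair of distinct items, in sorted order.
        for pair in combinations(sorted(set(items)), 2):
            pair_counts[pair] += 1  # Step 2.c: increment the occurrence.
    return list(pair_counts.items())  # Step 3: (key, value) for the Reducer.


def mba_reduce(pair, counts):
    """Reduce: add the values received for one pair from all Map nodes."""
    return pair, sum(counts)


# Two simulated Map nodes, each holding one transaction from the example.
node1 = [["cracker", "icecream", "beer"]]
node2 = [["baguette", "soda", "herring", "cracker", "beer"]]

grouped = defaultdict(list)
for node in (node1, node2):
    for pair, n in mba_map(node):
        grouped[pair].append(n)

totals = dict(mba_reduce(pair, counts) for pair, counts in grouped.items())
print(totals[("beer", "cracker")])
```

Counting inside the mapper reduces the number of records sent to the Reducer; emitting a 1 for every pair occurrence instead gives the same totals.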
Market Basket Analysis (Cont’d)
1. Transactions in Store A
Transaction 1: cracker, icecream, beer
Transaction 2: chicken, pizza, coke, bread
…
2. Distribute Transaction data to Map nodes
3. Pair of Items restructured in each Map node
Transaction 1: < (cracker, icecream), (cracker, beer) , (beer, icecream)>
Transaction 2: < (chicken, pizza), (chicken, coke), (chicken, bread) , (coke, pizza), (bread, pizza), (coke , bread)>
…
Market Basket Analysis (Cont’d)
Note: the items in each pair should be sorted, since the pair becomes a key
For example, both (cracker, icecream) and (icecream, cracker) should become (cracker, icecream)
3. Pair of Items sorted in MBA
Transaction 1: < (cracker, icecream), (beer, cracker) , (beer, icecream)>
Transaction 2: < (chicken, pizza), (chicken, coke), (bread, chicken) , (coke, pizza), (bread, pizza), (bread, coke)>
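The sorting step can be shown on its own. The sketch below, with the hypothetical helper `sorted_pairs`, builds every pair from a transaction and puts the two items of each pair in alphabetical order, so that the same two items always produce the same key regardless of the order in which they were bought.

```python
from itertools import combinations


def sorted_pairs(items):
    """Return every pair of distinct items, each pair in alphabetical order."""
    return [tuple(sorted(pair)) for pair in combinations(items, 2)]


transaction1 = ["cracker", "icecream", "beer"]
transaction2 = ["chicken", "pizza", "coke", "bread"]

print(sorted_pairs(transaction1))
print(sorted_pairs(transaction2))
```

Without this step, (beer, cracker) and (cracker, beer) would reach different Reduce keys and their counts would be split.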
…
Market Basket Analysis (Cont’d)
4. Output of Map node
Pair of items in (key, value) structure in each Map node
(key, value): (pair of items, number of occurrences)
((cracker, icecream), 1)
((beer, cracker), 1)
((beer, icecream), 1)
((chicken, pizza), 1)
((chicken, coke), 1)
((bread, chicken), 1)
((coke, pizza), 1)
((bread, pizza), 1)
((bread, coke), 1)
…
Market Basket Analysis (Cont’d)
5. Data Aggregation/Combine
(key, <value>): (pair of items, list of occurrence counts)
((cracker, icecream), <1, 1, …, 1>)
((beer, cracker), <1, 1, …, 1>)
((beer, icecream), <1, 1, …, 1>)
((chicken, pizza), <1, 1, …, 1>)
…
Market Basket Analysis (Cont’d)
6. Reduce nodes
(key, value): (pair of items, total number of occurrences)
((cracker, icecream), 421)
((beer, cracker), 341)
((beer, icecream), 231)
((chicken, pizza), 111)
…
Map/Reduce for MBA
Input Trax Data
=> Map1(), Map2(), …, Mapm()
– ((coke, pizza), 1) ((beer, corn), 1) …
– ((ham, juice), 1) ((coke, pizza), 1) …
=> Data Aggregation/Combine
– ((coke, pizza), <1, 1, …, 1>) ((ham, juice), <1, 1, …, 1>)
=> Reduce1(), Reduce2(), …, Reducel()
– ((coke, pizza), 3,421) ((ham, juice), 2,346)
HBase Schema
Input Data: (Key, [Column Family, Column Item])
– (Transaction #, [Items, Items:List])
Key        Items (column family)
           Items:List
Trax 1     cracker, icecream, beer
Trax 2     chicken, pizza, coke, bread
…          …
HBase Schema
Output item pairs after Map/Reduce computing: (Key, [Column Family, Column Item])
– (Item Pair, [Items, Items:Count])
Key             Items (column family)
                Items:Count
(coke, pizza)   3,421
(ham, juice)    2,346
…               …
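The shape of the input and output tables can be pictured with nested Python dictionaries, mapping a row key to its "family:qualifier" columns. This is only a rough in-memory model under assumptions: no HBase client API is called, the variable names are hypothetical, and the way the pair is written as a row key string is an illustrative choice.

```python
from collections import Counter
from itertools import combinations

# Input table: row key = transaction number, column Items:List = item list.
input_table = {
    "Trax 1": {"Items:List": "cracker, icecream, beer"},
    "Trax 2": {"Items:List": "chicken, pizza, coke, bread"},
}

# Count sorted item pairs over all rows of the input table.
counts = Counter()
for row in input_table.values():
    items = sorted(item.strip() for item in row["Items:List"].split(","))
    counts.update(combinations(items, 2))

# Output table: row key = item pair, column Items:Count = total occurrences.
output_table = {
    f"({a}, {b})": {"Items:Count": n} for (a, b), n in counts.items()
}
print(output_table["(beer, cracker)"])
```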
Experimental Result
4 transaction files for the experiment:
– 64MB (1.1M transactions), 128MB (2.2M transactions), 400MB (6.7M transactions), 800MB (13M transactions)
Run on small instances of AWS EC2
– each node has a 1.0-1.2 GHz 2007 Opteron or Xeon processor
– 1.7GB memory
– 350GB storage on a 32-bit platform
Install and run both Hadoop and HBase with Apache Whirr
– Blog:
• “Market Basket Analysis Example in Hadoop”, Jongwook Woo, http://dal-cloudcomputing.blogspot.com/2011/03/market-basket-analysis-example-in.html, March 2011
The jobs are executed on 5, 10, and 15 nodes
Experimental Result on HDFS
[Figure: execution time (msec, axis from 100,000 to 350,000) vs. number of instances (5, 10, 15) for the 64M, 128M, 400M, and 800M data sets]
Experimental Result on HDFS and HBase
[Figure: execution time (msec, axis from 0 to 2,000,000) vs. number of instances (5, 10, 15) for the 64M and 128M data sets on HBase and on HDFS]
Conclusion
The Market Basket Analysis algorithm on Map/Reduce and HBase is presented
– Data mining analysis to find the most frequently occurring pairs of products in baskets at a store
The associated items can be paired with the Map/Reduce approach
Shows the possibility and performance of using HBase for Market Basket Analysis data
Once we have the paired items, they can be used for further studies by statistically analyzing them, even sequentially, which is beyond this paper
– Support
– Confidence
Parallelism is negatively affected by a bottleneck in distributing, aggregating, and reducing the data set among nodes
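Support and confidence, named above as follow-up measures, can be computed from the pair counts with the standard definitions: support of a pair is its count divided by the number of transactions, and confidence of A => B is the pair count divided by the count of A alone. The sketch below uses hypothetical counts, not results from the experiments, and assumes single-item counts are available from a separate pass, since the pair-counting job does not produce them.

```python
def support(pair_count, n_transactions):
    """Fraction of all transactions that contain both items of the pair."""
    return pair_count / n_transactions


def confidence(pair_count, antecedent_count):
    """Fraction of transactions containing item A that also contain item B."""
    return pair_count / antecedent_count


# Hypothetical counts for illustration only.
n_transactions = 1000
count_beer = 200          # transactions containing beer
count_beer_cracker = 50   # transactions containing both beer and cracker

pair_support = support(count_beer_cracker, n_transactions)
beer_to_cracker = confidence(count_beer_cracker, count_beer)
print(pair_support, beer_to_cracker)
```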