Tim Hsu
• Senior Data Engineer at Yahoo!
• Data modeling, BI application design
• Aims to provide an integrated, easy-to-use BI system for Yahoo! EC users
Neal Lee
• Senior Data Engineer
• Aims to build an easy-to-use self-service BI platform connecting to Hadoop
Johnny Nien
• Senior Data Engineer
• Software developer specializing in large-scale data processing infrastructure and applications
APAC is the strongest region for Yahoo!'s EC business
Major EC properties
› 2001 Auction
› 2004 Shopping Mall
› 2008 Store Market
Yahoo! is the leading e-commerce company in Taiwan
Who Are We?
2011 Taiwan Retail Revenue (bar chart, in MM USD, axis from 0 to 5,000). Retailers shown:
› EHS
› National 3C Chains
› Fubon momo TV shopping
› FarEastern Dept store
› TK 3C
› PC Home
› SOGO Dept Store
› Y!EC
› RT Mart (hyper mart)
› FamiMart
› PxMart (hyper mart)
› Carrefour
› ShinKwan Mitsukoshi
› 7-Eleven
Types of End Users in Yahoo! Taiwan
GM, BU Heads
Business Analysts
Marketers, Data Analysts
Category Managers
Suppliers, Sellers
BI Needs for Different Types of Users
(Chart: user types positioned by Data Scale and by Analytics & Interactivity. Scale labels: Low, High; Sophisticated, Summarized. User types shown: GM, BU Heads; Business Analysts; Marketers, Data Analysts; Category Managers; Suppliers, Sellers.)
Challenges
(Diagram; labeled components:)
› Data sources: ERP (Transactions); Web Logs (Browsing, Purchase); DW/DM
› Reports: Ad hoc Reports; Performance Reports; Management Reports; Traffic Reports
› Tools: PHP, ASP.NET; MicroStrategy; Hyperion; SQL, Stored Procedure, Pig, HiveQL; PHP, Web Services API
Yahoo! Taiwan Needs …
One unified data platform for retrieving information in an easy and efficient way.
Where We Are Going …
Business Intelligence Application
Business Intelligence Platform
Data Storage
Data Process
Data Source
Architecture
(Diagram: data flow from the EC properties to BI users. Labeled components:)
› Sources: Auction, Shopping, Store, each with Instrumentation; Auction Backend, Shopping ERP, Store ERP
› ETL, Oracle RAC: Listing, Member, Revenue, Seller, Sales, Supplier
› Beacon Servers, Data Highway
› ETL, Yahoo! Grid: Page View, Click Event, Session
› Hive, Shark
› MicroStrategy SQL Engine
› Users
Hive Performance Test
Use case: Visitor distribution by demographic and device preference
Source data: 293 TB of web logs over 60 days
Transformed cube: 2.3 GB, 60.5M rows
Test environment
› MicroStrategy Server: 8 cores at 2.5 GHz, 16 GB RAM, v9.2.1
› Hive Server: 4 cores at 2.5 GHz, 4 GB RAM, v0.9
› Hadoop clusters: 300+ nodes, v0.23
Hive Test Cases
Case C1: Cross tab with date slice
Case C2: Dynamic prompt on date
Case C3: Dynamic data grouping (Browser)
Case C4: 80/20 Analysis
Case C5: Data grouping & charting
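The 80/20 analysis in Case C4 can be illustrated with a minimal sketch. It assumes a hypothetical mapping from a grouping key (for example, browser) to visitor counts; the function name and the numbers are made up for illustration and are not taken from the test cube.

```python
def pareto_head(counts, threshold=0.8):
    """Return the top-ranked keys whose cumulative share first reaches threshold."""
    total = sum(counts.values())
    if total == 0:
        return []
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    head, running = [], 0
    for name, value in ranked:
        head.append(name)
        running += value
        if running / total >= threshold:
            break
    return head


# Hypothetical visitor counts per group (illustrative only).
visitors = {"A": 50, "B": 30, "C": 10, "D": 6, "E": 4}
head = pareto_head(visitors)
print(head)
```

In a BI tool this kind of ranking is typically done by the reporting layer; the sketch only shows the underlying cumulative-share logic.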
Hive Performance Test
Average response time stays under 20 seconds with 50 concurrent users (CU) querying 60 days of data.
Avg. response time (sec) by data volume in cube and concurrent users (CU)

           20 Days   40 Days   60 Days
10 CU.       1.8       3.1       4.7
25 CU.       3.5       6.8       9.6
50 CU.       6.1      12.1      19.2
100 CU.     11.9      24.5      36.1

(Chart: Avg. Resp. Time by Data Volume; x-axis: Data Volume in Cube; y-axis: Avg. Response Time (sec), 0 to 40; one series each for 10, 25, 50 and 100 CU.)
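As a quick check on the figures above, the following sketch recomputes how response time grows when the data volume triples from 20 to 60 days, and which concurrency levels stay under 20 seconds on the 60-day cube. The numbers are copied from the slide; the variable and function names are illustrative.

```python
# Average response times (seconds) from the slide, keyed by concurrent
# users (CU), for 20, 40 and 60 days of data respectively.
resp = {
    10: (1.8, 3.1, 4.7),
    25: (3.5, 6.8, 9.6),
    50: (6.1, 12.1, 19.2),
    100: (11.9, 24.5, 36.1),
}


def growth(cu):
    """Ratio of the 60-day response time to the 20-day one for a CU level."""
    t20, _, t60 = resp[cu]
    return round(t60 / t20, 2)


for cu in resp:
    print(f"{cu} CU: {growth(cu)}x response time at 3x the data")

# Concurrency levels that stay under 20 s on the full 60-day cube.
under_20s = [cu for cu, times in resp.items() if times[2] < 20]
print(under_20s)
```

The ratios land close to 3, which suggests response time in this test scaled roughly linearly with data volume.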
Enhance by Using Spark/Shark
Spark is a fast and expressive cluster computing system interoperable
with Apache Hadoop
(Diagram: data sharing between iterations.)
› Map/Reduce: Input → file system read → iter. 1 → file system write → file system read → iter. 2 → file system write → …
› Spark: Input → file system read → iter. 1 → memory write → memory read → iter. 2 → memory write → …
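The difference between the two iteration patterns can be mimicked in plain Python. This is an analogy only, not Spark or Hadoop code: it assumes a small temporary text file as the input and counts how often each approach goes back to the file system. It models reads only; the per-iteration writes shown on the slide are left out.

```python
import os
import tempfile


def iterate_from_disk(path, iterations):
    """MapReduce-style loop: every iteration re-reads its input from disk."""
    reads, total = 0, 0
    for _ in range(iterations):
        with open(path) as f:
            values = [int(line) for line in f]
        reads += 1
        total += sum(values)
    return total, reads


def iterate_in_memory(path, iterations):
    """Spark-style loop: read the input once, then reuse the in-memory copy."""
    with open(path) as f:
        cached = [int(line) for line in f]
    reads, total = 1, 0
    for _ in range(iterations):
        total += sum(cached)
    return total, reads


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "input.txt")
    with open(path, "w") as f:
        f.write("\n".join(str(n) for n in range(1, 101)))
    disk_result = iterate_from_disk(path, 5)
    memory_result = iterate_in_memory(path, 5)

print(disk_result, memory_result)
```

Both loops compute the same total; only the number of file system reads differs, which is the cost Spark avoids for iterative workloads.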
Enhance by Using Spark/Shark
Shark is an analytic query engine built on top of Spark
› 100% compatible with Hive
› Could be 100x faster than Hive
(Diagram: Hive and Shark architectures side by side.)
› Hive: Client (CLI, JDBC); Driver (SQL Parser, Query Optimizer, Physical Plan, Execution); Metastore; MapReduce; HDFS
› Shark: Client (CLI, JDBC); Driver (SQL Parser, Query Optimizer, Physical Plan, Execution, Cache Mgr.); Metastore; Spark; HDFS
Item based recommendation
system by collaborative filtering
Modules implemented
› Viewed-also-viewed (Shopping)
› Bought-also-bought (Shopping)
› Bought-after-viewed (Auction)
Implemented as Pig scripts
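A "viewed-also-viewed" module can be approximated by counting how often two items appear in the same session. The sketch below is a simple co-occurrence form of item-based collaborative filtering in plain Python, not the Pig implementation from the talk, and the similarity measure the authors used may differ. The function name, item IDs and sessions are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def also_viewed(sessions, top_n=2):
    """For each item, rank other items by the number of sessions containing both."""
    co_counts = defaultdict(Counter)
    for items in sessions:
        # De-duplicate within a session so repeat views count once.
        for a, b in combinations(sorted(set(items)), 2):
            co_counts[a][b] += 1
            co_counts[b][a] += 1
    return {
        item: [other for other, _ in counts.most_common(top_n)]
        for item, counts in co_counts.items()
    }


# Hypothetical browsing sessions (item IDs are made up).
sessions = [
    ["camera", "tripod", "sd_card"],
    ["camera", "sd_card"],
    ["camera", "tripod"],
    ["camera", "sd_card"],
    ["phone", "sd_card"],
]
recs = also_viewed(sessions)
print(recs["camera"])
```

The same counting works for "bought-also-bought" by feeding orders instead of sessions; at production scale the pair counting is what a Pig or Spark job would distribute.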
Spark Performance Test: Recommendation by CF
3,616 production machines
10 virtual machines
Yahoo! Grid
Pig vs. Spark: CF Performance Test
Nodes    CPU        RAM     HD
3,616    16 Cores   48GB    16TB
10       2 Cores    4GB     100GB
Put Shark into The Scene
(Diagram; labeled components:)
› Sources: EC Backend, ERP; Web Clickstream
› ETL, Yahoo! TW Grid
› Hive, Shark
› MicroStrategy SQL Engine
› Users
Lessons Learned
Data modeling
› Join operations are extremely expensive in Hive/Shark
Denormalize as much as possible
Modeling in snowflake schema
Data processing (ETL)
› Use partitions to minimize data loading time
› Hive handles partitions well, but Shark does not
Keep partitioned tables for daily refresh
Create and cache non-partitioned tables for MicroStrategy
Shark is not the silver bullet
› Aggregation is still needed for best performance
Aggregation tables for ad-hoc query
Intelligent Cubes for dashboards
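The aggregation-table lesson can be illustrated with a small, self-contained sketch. It uses SQLite through Python's sqlite3 module as a stand-in for Hive/Shark, so it shows only the idea, not engine-specific behavior. The table and column names (fact_orders, agg_daily_category) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical detail-level fact table (names are illustrative).
    CREATE TABLE fact_orders (day_id TEXT, cate_id INTEGER, order_amt REAL);
    INSERT INTO fact_orders VALUES
        ('2013-12-01', 1, 100.0),
        ('2013-12-01', 1, 50.0),
        ('2013-12-01', 2, 30.0),
        ('2013-12-02', 1, 20.0);

    -- Pre-aggregated table built once per refresh, so ad-hoc queries
    -- scan a few summary rows instead of every order row.
    CREATE TABLE agg_daily_category AS
    SELECT day_id, cate_id, SUM(order_amt) AS revenue, COUNT(*) AS order_cnt
    FROM fact_orders
    GROUP BY day_id, cate_id;
""")

detail = conn.execute(
    "SELECT day_id, SUM(order_amt) FROM fact_orders "
    "GROUP BY day_id ORDER BY day_id"
).fetchall()
summary = conn.execute(
    "SELECT day_id, SUM(revenue) FROM agg_daily_category "
    "GROUP BY day_id ORDER BY day_id"
).fetchall()
print(detail == summary, summary)
```

Both queries return the same daily revenue, but the second reads the much smaller summary table, which is the trade-off the slide describes for ad-hoc queries.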
Lessons Learned – expect the unexpected
A (original):
select day_id, count(distinct buyer_id) as buyer_cnt
from fact_table
group by day_id;

B (rewritten):
select day_id, count(buyer_id) as buyer_cnt
from (
    select day_id, buyer_id
    from fact_table
    group by day_id, buyer_id
) tmp
group by day_id;

1. Rewrite SQL
2. Performance improved significantly in Shark 0.8
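That queries A and B return the same result can be checked on a toy dataset. The sketch below runs both forms against SQLite via Python's sqlite3 module, which is an assumption made for runnability; it says nothing about their relative speed in Hive or Shark. The sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_table (day_id TEXT, buyer_id INTEGER);
    INSERT INTO fact_table VALUES
        ('2013-12-01', 1), ('2013-12-01', 1), ('2013-12-01', 2),
        ('2013-12-02', 2), ('2013-12-02', 2);
""")

# Query A: count(distinct ...) directly.
query_a = """
    SELECT day_id, COUNT(DISTINCT buyer_id) AS buyer_cnt
    FROM fact_table
    GROUP BY day_id
    ORDER BY day_id
"""

# Query B: de-duplicate with an inner GROUP BY, then count plain rows.
query_b = """
    SELECT day_id, COUNT(buyer_id) AS buyer_cnt
    FROM (
        SELECT day_id, buyer_id
        FROM fact_table
        GROUP BY day_id, buyer_id
    ) tmp
    GROUP BY day_id
    ORDER BY day_id
"""

rows_a = conn.execute(query_a).fetchall()
rows_b = conn.execute(query_b).fetchall()
print(rows_a, rows_b)
```

The rewrite turns one distinct aggregation into two ordinary group-by stages, which a distributed engine can spread across more workers.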
Lessons Learned – expect the unexpected
A:
select day_id, sum(order_amt) as revenue
from fact_table
where day_id between date_add('2013-12-01', -10)
and date_add('2013-12-01', 0)
and cate_id in (1, 2, 3)
group by day_id;

B:
select day_id, sum(order_amt) as revenue
from fact_table
where cate_id in (1, 2, 3)
and day_id between date_add('2013-12-01', -10)
and date_add('2013-12-01', 0)
group by day_id;

1. Change sequence of filters
2. Write a patch to evaluate date_add() and replace it with constants:
date_add('2013-12-01', -10) becomes '2013-11-21'
date_add('2013-12-01', 0) becomes '2013-12-01'
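The idea behind the second fix, evaluating date_add() once and substituting the literal, can be sketched as a text rewrite applied to the query before it is submitted. This is an illustration only: the authors' patch is not shown in the slides and may work quite differently. The function name fold_date_add and the regular expression are hypothetical, and only literal dates with integer offsets are handled.

```python
import re
from datetime import date, timedelta

# Matches date_add('YYYY-MM-DD', n) with a literal date and integer offset.
DATE_ADD = re.compile(
    r"date_add\(\s*'(\d{4})-(\d{2})-(\d{2})'\s*,\s*(-?\d+)\s*\)",
    re.IGNORECASE,
)


def fold_date_add(sql):
    """Replace each constant date_add(...) call with the date it evaluates to."""
    def evaluate(match):
        year, month, day, offset = (int(g) for g in match.groups())
        folded_date = date(year, month, day) + timedelta(days=offset)
        return "'" + folded_date.isoformat() + "'"
    return DATE_ADD.sub(evaluate, sql)


query = (
    "select day_id, sum(order_amt) as revenue from fact_table "
    "where day_id between date_add('2013-12-01', -10) "
    "and date_add('2013-12-01', 0) "
    "and cate_id in (1, 2, 3) group by day_id"
)
folded = fold_date_add(query)
print(folded)
```

With literals in place, the engine no longer needs to call the function for each row it filters.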
Benefits to Yahoo! Taiwan
One unified data platform for all EC properties, i.e., the EC source of truth
› Access transaction and web traffic data simultaneously and transparently.
Self-service BI reporting
› Users can now create their own reports at the “speed of thought”.
Sophisticated dashboards
› Consolidate different information into one single screen.
Low latency
› Average response time of daily reports decreased by 83%, from 43.6 seconds to 7.4 seconds.