Gap Inc Direct, the online division of Gap Inc., uses HBase to serve, in real time, the apparel catalog for all of its brands' and markets' web sites. This case study will review the business case as well as key decisions regarding schema selection and cluster configuration. We will also discuss implementation challenges and lessons learned.
Gap Inc Direct: Serving Apparel Catalog from HBase for Live Website
HBaseCon 2012
Applications Track – Case Study
Who Are We?
Suraj Varma
Director of Technology Implementation
Gap Inc Direct (GID), San Francisco, CA
IRC: svarma

Gupta Gogula
Director - IT & Domain Architect of Catalog Management & Distribution
Gap Inc Direct (GID), San Francisco, CA
Agenda - Case Study
Problem Domain
HBase Schema Specifics
HBase Cluster Specifics
Learnings & Challenges
[Diagram: brand/market timeline (2005 new site launch, 2007 Piperlime, 2008 Universality, 2009 Athleta, 2010 CA & EU markets); incoming traffic fans out to separate application servers and databases per brand and market (US, CA, EU).]
Problem Domain
Evolution of the GID Apparel Catalog
  2005: three independent brands in the US
  2010: five integrated brands in the US, CA, and EU
Rapid expansion of the apparel catalog
However, each brand/market combination required a separate logical catalog database
What We Wanted …
Single catalog store for all brands/markets
Horizontally scalable over time
Cross-brand business features
Access the data store directly, to take advantage of item inventory awareness
Minimal caching, for optimization only; keeping caches in sync is a problem
Highly Available
Initial Explorations
Sharded RDBMS, Memcached, etc.
  Significant effort required
  Still had scalability limits
Non-relational alternatives considered
HBase POC (early 2010)
  Promising results; decided to move ahead
Why HBase?
Strong consistency model
Server-side filters
Automatic sharding, distribution, failover
Hadoop integration out of the box
General purpose: other use cases outside of Catalog
Strong Community!
Architecture Diagram
[Diagram: incoming requests for catalog data hit backend services, which read from the HBase cluster; near-real-time inventory updates, pricing updates, and item updates flow into the cluster as mutations.]
Cluster Traffic Patterns
Read mostly: website traffic
Write/delete bursts: sync MR jobs, catalog publish (phasing out to near-real-time updates from originating systems)
Continuous writes: MR jobs on the live cluster, inventory updates
Data Model & Access Patterns
Hierarchical data (primarily)
  SKU -> Style lookups (child -> parent)
  Cross-brand sell (sibling <-> sibling)
Rows: ~100 KB average size, 1,000-5,000 columns, sparse
Data access patterns
  Full product graph in one read
  Single path of the graph from root to leaf node
  Search via secondary indices
  Large feed files
Primary Access Patterns
[Diagram: the two primary access patterns: read the full product graph, or read a single path/edge of it.]
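This pair of reads maps directly onto the client API. Below is a minimal sketch against the classic HBase client API of that era; the table name "catalog", the row key, and the qualifier prefix are illustrative assumptions, not details from the deck:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.filter.ColumnPrefixFilter;
  import org.apache.hadoop.hbase.util.Bytes;

  public class CatalogReads {
      public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();
          HTable table = new HTable(conf, "catalog");        // assumed table name

          // Full graph: one Get returns every qualifier of the sparse product row.
          Result fullGraph = table.get(new Get(Bytes.toBytes("KEY_1234")));

          // Single path: same row, but a server-side filter keeps only the
          // qualifiers whose names start with the wanted ancestor-id prefix.
          Get singlePath = new Get(Bytes.toBytes("KEY_1234"));
          singlePath.setFilter(new ColumnPrefixFilter(Bytes.toBytes("VAR_1_SC_0012_")));
          Result pathOnly = table.get(singlePath);

          System.out.println(fullGraph.size() + " vs " + pathOnly.size() + " cells");
          table.close();
      }
  }

Because the filter runs server side, only the matching cells cross the network, which is what makes the single-path read cheap on ~100 KB rows.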
HBase Schema Management
Built a custom "bean to schema mapper"
  POJO graph <-> HBase qualifiers
  Flexibility to shorten column qualifiers
  Flexibility to change schema qualifiers (per environment/developer)
Mapping config excerpt:
  <…>
  <association>one-to-many</association>
  <prefix>SC</prefix>
  <uniqueId>colorCd</uniqueId>
  <beanName>styleColorBean</beanName>
  <…>
Schema Example - Hierarchy
<PP>_<id1>_QQ_<id2>_RR_<id3>_name
  where PP is the parent prefix, QQ the child, RR the grandchild
Examples:
  cf1:VAR_1_SC_0012_colorCd
  cf2:VAR_1_SC_0012_SCIMG_10_path
Pattern: ANCESTOR IDS EMBEDDED IN QUALIFIER NAME
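A rough illustration of how such qualifiers could be assembled and written; the helper method, row key, and cell value below are invented for the example:

  import java.io.IOException;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class QualifierWrite {
      // Assemble <PP>_<id1>_QQ_<id2>_name, e.g. VAR_1_SC_0012_colorCd
      static byte[] qualifier(String pp, String id1, String qq, String id2, String name) {
          return Bytes.toBytes(pp + "_" + id1 + "_" + qq + "_" + id2 + "_" + name);
      }

      public static void main(String[] args) throws IOException {
          HTable table = new HTable(HBaseConfiguration.create(), "catalog");  // assumed
          Put put = new Put(Bytes.toBytes("KEY_1234"));                       // assumed row key
          put.add(Bytes.toBytes("cf1"),
                  qualifier("VAR", "1", "SC", "0012", "colorCd"),  // ancestor ids in the name
                  Bytes.toBytes("true navy"));                     // hypothetical value
          table.put(put);
          table.close();
      }
  }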
Schema – Lookups
Secondary index: <id3> => RR ; QQ ; PP
FilterList with the (RR, QQ, PP) ids to read a thin slice path
[Row layout: secondary-index row KEY_5555 holding the ancestor ids, e.g. 4444, 333, 22]
Pattern: SECONDARY INDEX TO HIERARCHICAL ANCESTORS
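A sketch of the two-step lookup, assuming a separate index table keyed by the leaf id and qualifier prefixes built from the returned ancestor ids; the table names, keys, and prefixes are invented, and the exact filter composition would depend on the real qualifier grammar:

  import java.util.Arrays;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.filter.ColumnPrefixFilter;
  import org.apache.hadoop.hbase.filter.Filter;
  import org.apache.hadoop.hbase.filter.FilterList;
  import org.apache.hadoop.hbase.util.Bytes;

  public class ThinSliceLookup {
      public static void main(String[] args) throws Exception {
          HTable indexTable = new HTable(HBaseConfiguration.create(), "catalog_idx");
          HTable productTable = new HTable(HBaseConfiguration.create(), "catalog");

          // 1. Index lookup: the row for leaf id 5555 carries its ancestor ids.
          Result idx = indexTable.get(new Get(Bytes.toBytes("KEY_5555")));
          // ... extract the PP/QQ/RR ids from idx; hard-coded below for brevity ...

          // 2. Thin-slice read: filter the big product row down to the one path.
          FilterList slice = new FilterList(FilterList.Operator.MUST_PASS_ONE,
              Arrays.<Filter>asList(
                  new ColumnPrefixFilter(Bytes.toBytes("PP_22_")),
                  new ColumnPrefixFilter(Bytes.toBytes("PP_22_QQ_333_")),
                  new ColumnPrefixFilter(Bytes.toBytes("PP_22_QQ_333_RR_4444_"))));
          Get get = new Get(Bytes.toBytes("KEY_1234"));     // product row (assumed)
          get.setFilter(slice);
          Result path = productTable.get(get);

          indexTable.close();
          productTable.close();
      }
  }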
Schema – Future Dates, Large Files
"Publish at midnight"
  Future-dated PUTs
  Get/Scan with a time range
Large feed files
  Sharded into smaller chunks, < 2 MB per cell
[Row layout: KEY_nnnn -> S_1 | S_2 | S_3 | S_4]
Pattern: SHARDED CHUNKS
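Both patterns sketched together, under the same invented names as earlier (table "catalog", family cf1); the timestamp, keys, and feed payload are placeholders:

  import java.io.IOException;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.util.Bytes;

  public class PublishPatterns {
      public static void main(String[] args) throws IOException {
          HTable table = new HTable(HBaseConfiguration.create(), "catalog");
          byte[] cf1 = Bytes.toBytes("cf1");

          // Future-dated PUT: stamp the cell with the go-live time (midnight).
          long midnight = 1735689600000L;                    // hypothetical epoch millis
          Put launch = new Put(Bytes.toBytes("KEY_1234"));
          launch.add(cf1, Bytes.toBytes("VAR_1_SC_0012_colorCd"),
                     midnight, Bytes.toBytes("spring navy"));
          table.put(launch);

          // Readers cap their time range at "now", so the future-dated cell
          // stays invisible until the clock passes midnight.
          Get get = new Get(Bytes.toBytes("KEY_1234"));
          get.setTimeRange(0L, System.currentTimeMillis());
          Result visibleNow = table.get(get);

          // Large feed file: shard the payload into cells S_1, S_2, ... < 2 MB each.
          byte[] feed = new byte[5 * 1024 * 1024];           // stand-in for a real feed
          int chunkSize = 2 * 1024 * 1024;
          Put feedPut = new Put(Bytes.toBytes("KEY_nnnn"));  // slide's placeholder key
          for (int off = 0, n = 1; off < feed.length; off += chunkSize, n++) {
              int len = Math.min(chunkSize, feed.length - off);
              byte[] chunk = new byte[len];
              System.arraycopy(feed, off, chunk, 0, len);
              feedPut.add(cf1, Bytes.toBytes("S_" + n), chunk);
          }
          table.put(feedPut);
          table.close();
      }
  }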
HBase Cluster
16 slave nodes (RS + TT + DN), 8 & 16 GB RAM
3 master nodes (HM, ZK, JT, NN), 8 GB RAM
NN Failover via NFS
Configurations – Block Cache, GC
Block cache
  Maximize the block cache: hfile.block.cache.size = 0.6
Garbage collection
  MSLAB enabled
  CMSInitiatingOccupancyFraction tuned
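Only the 0.6 cache fraction above comes from the deck; the property names are real, but treat the values and layout below as a sketch. These are server-side settings (hbase-site.xml plus the region server's JVM options in hbase-env.sh), shown through the Configuration API for compactness:

  Configuration conf = HBaseConfiguration.create();
  conf.setFloat("hfile.block.cache.size", 0.6f);                  // value from the slide
  conf.setBoolean("hbase.hregion.memstore.mslab.enabled", true);  // MSLAB on

  // CMS tuning goes into the region server JVM options, for example:
  //   -XX:+UseConcMarkSweepGC
  //   -XX:CMSInitiatingOccupancyFraction=70   (threshold is illustrative)
  //   -XX:+UseCMSInitiatingOccupancyOnly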
Configurations – Timeouts
Quick recovery on node failure: the default timeouts are too large
  ZooKeeper: zookeeper.session.timeout
  Region server: hbase.rpc.timeout
  Data node: dfs.heartbeat.recheck.interval (a.k.a. heartbeat.recheck.interval)
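The deck recommends lowering these but gives no numbers, so the values below are purely illustrative; like the cache settings, they belong in the server-side hbase-site.xml and hdfs-site.xml:

  conf.setInt("zookeeper.session.timeout", 30000);        // detect a dead region server sooner
  conf.setInt("hbase.rpc.timeout", 30000);                // fail stuck RPCs faster
  conf.setInt("dfs.heartbeat.recheck.interval", 45000);   // declare dead data nodes faster
  // (the HDFS property name varies across Hadoop versions, hence the two spellings above)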
Learnings – Regions
Block cache size tuning; block cache churn
Hot row scenarios
  Perf tests & phased rollouts
Hot region issues
  Perf tests & pre-split regions (see the sketch below)
Filters
  CPU intensive; profiling needed
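For the pre-splitting mentioned above, a minimal sketch with the admin API of that era; the table name, family, and split points are assumptions chosen to spread KEY_nnnn-style row keys across regions from day one:

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.client.HBaseAdmin;
  import org.apache.hadoop.hbase.util.Bytes;

  public class PreSplit {
      public static void main(String[] args) throws Exception {
          HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
          HTableDescriptor desc = new HTableDescriptor("catalog");   // assumed table
          desc.addFamily(new HColumnDescriptor("cf1"));
          byte[][] splits = {                                        // hypothetical split points
              Bytes.toBytes("KEY_3000"),
              Bytes.toBytes("KEY_6000"),
              Bytes.toBytes("KEY_9000")
          };
          admin.createTable(desc, splits);   // the table starts life as 4 regions
      }
  }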
Learnings – Monitoring, Hardware
Monitoring is crucial
  Layer by layer -> where is the bottleneck?
  Metrics to target optimization & tuning
  Troubleshooting
Non-uniform hardware
  Sub-optimal region distribution
  Hefty boxes lightly loaded
Learnings – Miscellaneous
MR jobs running on the live cluster
  Have an impact, so they cannot run at full throttle; go easy …
Feature enablement: phase in
  Don't turn on several features together
  Easier identification of potential hot regions/rows, overloaded region servers, etc.
Phasing In Features
Enable features individually to measure impact and tune the cluster accordingly.
[Diagram: backend services drive incoming requests plus inventory, pricing, and item updates into the HBase cluster; enabling feature "A" adds "N" req/sec, enabling feature "B" adds a further "K" req/sec.]
Challenges – Search, Transactions
Search
  No out-of-the-box secondary indexes
  Custom solution with Solr
Transactions
  Only row-level atomicity, but … can't pack everything into a single row
  Atomic cross-row Put/Delete and HBASE-5229 look like potential partial solutions (0.94+)
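To make the row-level guarantee concrete: every cell packed into one Put on one row commits atomically, while nothing coordinates Puts across rows (pre-0.94). The qualifiers and values here are invented, and table is an open HTable as in the earlier sketches:

  Put p = new Put(Bytes.toBytes("KEY_1234"));
  p.add(Bytes.toBytes("cf1"), Bytes.toBytes("VAR_1_SC_0012_colorCd"), Bytes.toBytes("navy"));
  p.add(Bytes.toBytes("cf1"), Bytes.toBytes("VAR_1_SC_0012_price"), Bytes.toBytes("49.50"));
  table.put(p);   // readers see both cells or neither; a second row would need its own Put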
Challenges – Optimal Schema
Orthogonal access patterns
  Optimize for the most frequently used pattern
Filters
  May suffice, with early-out configurations
  Impact CPU usage
Duplicating data for every access pattern
  Too drastic; effort to keep all copies in sync
Challenges - Backups
Rebuild from source data
  Takes time … but no data loss
Export/import-based backups
  Faster … but stale; another MR job on the live cluster
Better options in future releases …
Gap Inc Direct
We’re hiring! http://www.gapinc.com