Zero-downtime Hadoop/HBase Cross-datacenter Migration


Scott Miao & Dumbo team members, SPN, Trend Micro
Sep. 19, 2015

Who am I
• Scott Miao
• RD, SPN, Trend Micro
• Worked on the Hadoop ecosystem since 2011
• Expertise in HDFS/MR/HBase
• Contributor to HBase/HDFS
• Speaker at HBaseCon 2014
• @takeshi.miao

Our blog 'Dumbo in TW': http://dumbointaiwan.blogspot.tw/
HBaseCon 2014 sharing:
http://www.slideshare.net/HBaseCon/case-studies-session-6
https://vimeo.com/99679688

Agenda
• What problems we suffered
• IDC migration
• Zero-downtime migration
• Wrap up

What problems did we suffer?

#1 Network bandwidth insufficient

Old IDC Layout

(diagram, devices and servers view: in service since 2008; 41U racks; 1Gb links from servers to the TOR switches and 20Gb uplinks toward the POD core switches; Hadoop + other services already generate ~12Gb of network traffic; no physical space left)
• HD NN: 8 cores, 72GB memory, 4TB disk
• HD DN: 12 cores, 128GB memory, 6TB disk

#2 Data storage capacity insufficient

Est. Data Growth
• ~2x data growth

What options
• Enhance the old IDC
  – Replace the 1Gb network topology with 10Gb
  – Adjust server locations
  – Any chance for more physical space?
• Migrate to a new IDC
  – 10Gb network topology
  – Server locations well defined
  – More physical space

What options
• Migrate to a public cloud
  – Provision on demand
    • Instance type (NIC/CPU/Mem/Disk) and amount
  – Pay as you go
  – Would need to optimize our existing services

Migrate to a new IDC!

http://gdimitriou.eu/?m=200912


IDC Migration

Recap…

Network bandwidth and data storage capacity are insufficient

New IDC Layout

(diagram, devices and servers view: a dedicated SPN Hadoop POD with 41U racks; 10Gb server links, with 40Gb and 160Gb upstream links; room to grow up to 14 racks)
• HD NN: 16 cores, 128GB memory, 10TB disk
• HD DN: 24 cores, 196GB memory, 72TB disk
• Network traffic becomes far less of a problem
• 2~3x total data storage capacity in terms of our data growth

Now what? Don't forget our beloved elephant~

YARN
https://gigaom.com/2013/10/25/cloudera-ceo-were-taking-the-high-profit-road-in-hadoop/
http://www.pragsis.com/blog/how_install_hadoop_3_commands

YARN abstracts the computing frameworks from Hadoop

http://hortonworks.com/hadoop/yarn/

So we are not only doing a migration, but also an upgrade.

TMH6 vs. TMH7

Project   | TMH6         | TMH7   | Highlights
Hadoop    | 2.0.0 (MRv1) | 2.6.0  | YARN + MRv2; YARN + ???
HBase     | 0.94.2       | 0.98.5 | MTTR improvements; stripe compaction
Zookeeper | 3.4.5        | 3.4.6  |
Pig       | 0.10.0       | 0.14.0 | Pig on Tez
Sqoop1    | 1.4.2        | 1.4.5  |
Oozie     | 4.0.1        | 4.0.1  |
JVM       | Java 6       | Java 7 | G1GC support

How do we test our TMH7? How do our services port to and test with TMH7?

Apache Bigtop PMC Evans Ye comes to the rescue in the next session

Something about HW
• CPU
  – More cores
• Memory
  – More memory
• Disk
  – Storage capacity
• Network
  – 10Gb
  – Topology
• # of nodes per rack
  – Do a PoC

http://www.desktopwallpapers4.me/computers/hardware-28528/

Migration + Upgrade
• Span two IDCs -> upgrade -> phase out the old one

(diagram: one cluster spanning the old and new IDCs over a 20Gb link)

Migration + Upgrade
• Build a new one -> migrate -> phase out the old one

(diagram: 1. build the new cluster in the new IDC; 2. migrate over the 20Gb link; 3. phase out the old one)

Are we done? We are not even in the game yet!

SLA for PROD Services

Various data access patterns


Zero downtime migration

Data Access Pattern Analysis

Hadoop/HDFS/MR

(diagram: data flow within the IDC — Internet -> data sourcing services -> log collectors -> message queues -> file compactors -> Hadoop cluster -> application services)

Data in / data proc / data out / service:
1. New files put (within minutes) to HDFS
2. Process files with Pig/MR (hourly/daily) to HDFS
3. Get result files from HDFS, do further processing
4. Serve user requests

Data access patterns for Hadoop/HDFS/MR

• Data in
  – New files put in within a couple of minutes
• Computation
  – Process data hourly or daily
• Data out
  – Result files fetched by services for further processing (a sketch of one cycle follows)
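To make the pattern concrete, a minimal sketch of one cycle, with hypothetical HDFS paths, script names and service directories (the real jobs and paths are our own):

# Data in: a file compactor drops a newly compacted file into HDFS (hypothetical paths)
hdfs dfs -put /data/compacted/access-2015091912.log /user/SPN-data/incoming/2015091912/
# Data proc: an hourly Pig job reads the incoming files and writes results back to HDFS
pig -param INPUT=/user/SPN-data/incoming/2015091912 \
    -param OUTPUT=/user/SPN-data/hourly/2015091912 hourly-agg.pig
# Data out: a downstream service fetches the result files for further processing
hdfs dfs -get /user/SPN-data/hourly/2015091912/part-* /srv/service1/inbox/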


Categorize Data
• Hot data
  – Ingest files in minutes
    • New data files put into Hadoop continuously
    • Digested by Pig/MR for services hourly or daily
  – Needed history data files
    • Usually within a couple of months
  – Sync data by (see the sketch after this list)
    • Replicated data-streaming ingestion (message queues + file compactors)
    • distcp – every few minutes
• Cold data
  – All data except hot
    • Time span of a couple of years of data
    • For monthly/quarterly/yearly report purposes
    • Ad-hoc queries
  – Copy data by
    • distcp – run it and leave it alone
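A minimal sketch of the two copy paths, assuming hypothetical paths, NameNode addresses and script names:

# Hot data: incremental distcp run from cron every few minutes; -update copies only new/changed files
hadoop distcp -update \
  hdfs://tmh6-nn:8020/user/SPN-data/incoming \
  hdfs://tmh7-nn:8020/user/SPN-data/incoming
# (if the two Hadoop versions' RPC were incompatible, a webhdfs:// source URI could be used instead)

# Cold data: one-shot bulk copy of the historical data; kick it off and leave it alone
hadoop distcp -update -m 50 \
  hdfs://tmh6-nn:8020/user/SPN-data/history \
  hdfs://tmh7-nn:8020/user/SPN-data/history

# hypothetical cron entry for the hot-data sync, every 5 minutes
*/5 * * * * /opt/spn/bin/sync-hot-data.sh >> /var/log/sync-hot-data.log 2>&1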


Kerberos federation among our clusters
• Please wait for our next session
  – Multi-Cluster Live Synchronization with Kerberos Federated Hadoop, by Mammi Chang, Dumbo team

(diagram: TMH6 stg/prod and TMH7 stg/prod clusters, federated with Kerberos)

Zero downtime migration for Hadoop/HDFS/MR

(diagram: the old IDC runs Hadoop (tmh6) with Old Service 1/2 and its own log collectors, message queues and file compactors; the new IDC runs Hadoop (tmh7) with New Service 1 and Old Service 1'; hot data is synced by replicating the ingestion stream (log collectors -> message queues -> file compactors) into both clusters, while cold data is copied once over the 20Gb inter-IDC link)

Need services' cooperation
• It seems services have no downtime
• Latency for hot data sync
  – May add latency of a few minutes
  – Because the distcp cron job runs every couple of minutes
• Need services to
  – Adjust their jobs to delay running by a couple of minutes

Seems pretty! So are we done? Don't forget our HBase XD

Data Access Pattern Analysis

HBase

(diagram: the same data flow — Internet -> data sourcing services -> log collectors -> message queues -> file compactors -> Hadoop cluster -> application services — now with HBase as the serving store)

Data in / data proc / data out / service:
1. New files put (within minutes) to HDFS
2. Process files with Pig/MR (hourly/daily) into HBase
3. Random reads from HBase
4. Serve user requests
5. Random writes to HBase

Data access patterns for HBase
• Data in
  – Random writes to HBase
  – Process/write data hourly or daily
• Data out
  – Random reads from HBase
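A minimal hbase shell sketch of the two patterns, with a hypothetical table, rowkey and column family:

hbase shell
# Data in: random write, using the '<key>-<timestamp>' rowkey layout described later
put 'event-table', 'user123-1442620800000', 'd:count', '42'
# Data out: random read by a serving application
get 'event-table', 'user123-1442620800000'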


Considerations for HBase data sync
• What do we want?
  – All HBase data synced between the old and new clusters
  – Clean up useless regions (region merge)
    • Rowkey: '<key>-<timestamp>'
    • hbase.hregion.max.filesize: 1GB to 4GB (see the sketch below)
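For the region-size change, the per-table MAX_FILESIZE attribute is the table-level counterpart of hbase.hregion.max.filesize; a minimal hbase shell sketch with a hypothetical table name:

hbase shell
# raise the region max file size from 1GB to 4GB (value in bytes)
alter 'event-table', MAX_FILESIZE => '4294967296'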

Considerations for HBase data sync
• Incompatible changes between the old & new HBase versions
  – API binary-incompatible
  – HDFS-level folder structure changed
  – HDFS-level metadata file format changed
    • Not including HFileV2

Tools for HBase data sync

Tool                | Impl. tech.                  | API compatible | Service impact                       | Data chunk boundary
CopyTable           | API client call              |                |                                      |
Cluster Replication | API client call              |                |                                      |
Completebulkload    | HFile                        |                | Need to pause writes and flush table | Based on when writes are paused
Export/Import       | SequenceFile + KeyValue + MR |                |                                      | Set start/end timestamp, based on the previous run

http://hbase.apache.org/book.html#tools
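Since the migration flow later in the deck exports and imports data in timestamp-bounded chunks, one pass with the stock Export/Import MapReduce tools might look like the following sketch (hypothetical table name, paths and NameNode addresses):

# export one time window of the table from the old cluster into HDFS
# usage: Export <tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]
hbase org.apache.hadoop.hbase.mapreduce.Export \
  event-table /tmp/export/event-table-1442590000000-1442620800000 1 1442590000000 1442620800000

# copy the exported SequenceFiles to the new cluster, then import them into the pre-split table
hadoop distcp \
  hdfs://tmh6-nn:8020/tmp/export/event-table-1442590000000-1442620800000 \
  hdfs://tmh7-nn:8020/tmp/export/event-table-1442590000000-1442620800000
hbase org.apache.hadoop.hbase.mapreduce.Import \
  event-table /tmp/export/event-table-1442590000000-1442620800000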

Support tools for HBase sync
• Pre-splits generator
  – Run on TMH6
  – Deals with the region-merge issue
  – Generates a pre-splits rowkey file
  – Create the new HTable on TMH7 with this file (see the example splits file below)

gen-htable-presplits.sh /user/SPN-hbase/<table-name>/ <region-size-bytes> <threshold> > /tmp/<table-name>-splits.txt

hbase shell
create '<table-name>', '<column-family-1>', SPLITS_FILE => '/tmp/<table-name>-splits.txt'
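For reference, the SPLITS_FILE consumed by create is just one boundary rowkey per line; a hypothetical example following the '<key>-<timestamp>' rowkey layout:

# /tmp/<table-name>-splits.txt
aaa111-1420070400000
ccc333-1420070400000
fff666-1420070400000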

Support tools for HBase sync
• RowCount with timerange
  – Supported on both TMH6 & TMH7
  – Used to check imported data
  – Timerange counting is not officially supported, so we enhanced the stock tool to make our own

rowCounter.sh <table-name> --time-range=<start-timestamp>,<end-timestamp>
# ... com.trendmicro.spn.hbase.mapreduce.RowCounter$RowCounterMapper$Counters
#       ROWS=10892133
#     File Input Format Counters
#       Bytes Read=0
#     File Output Format Counters
#       Bytes Written=0

Support tools for HBase sync
• Snapshot
  – On TMH7
  – Taken after each pass of the imported-data check
  – Roll back to the previous snapshot if the data check fails (see the sketch below)

hbase shell
snapshot '<table-name>', '<table-name>-<start-timestamp>-<end-timestamp>'
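The rollback itself uses the stock hbase shell snapshot commands; a minimal sketch (the table has to be disabled before restoring):

hbase shell
disable '<table-name>'
restore_snapshot '<table-name>-<start-timestamp>-<end-timestamp>'
enable '<table-name>'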

Support tools for HBase sync
• DateTime <-> Timestamp

# get the current java timestamp (long)
date +%s%N | cut -b1-13
# get the current hour's java timestamp (long)
date --date="$(date +'%Y%m%d %H:00:00')" +%s%N | cut -b1-13
# get the previous hour's java timestamp (long)
date --date="$(date --date='1 hour ago' +'%Y%m%d %H:00:00')" +%s%N | cut -b1-13
# timestamp to date (must be 10 digits, from left to right)
date -d '@1436336202'

Zero downtime migration for HBase

(diagram: staging and production environments in both IDCs — hbase-tmh6/hadoop-tmh6 with ServiceA and ServiceB in the old IDC, hbase-tmh7/hadoop-tmh7 with ServiceB in the new IDC)

1. Confirm KV timestamp with ServiceB
2. Export data to HDFS with that timestamp
3. Generate splits file
4. distcp data to TMH7
5. Create HTable with splits
6. Import data into the HTable
7. Verify data by rowcount with timestamp
8. Create snapshot
9, 11. Sync data through steps #2~8 (skip 3 and 5)
10. ServiceB staging test starts
12. Grant 'RW' on the HTable to ServiceB
13. Install ServiceB in the new IDC
14. Start ServiceB in the new IDC
15. Done

Need services' cooperation
• There will still be a small data gap
  – It may be a few minutes
• Is it sensitive to services?
  – If not: wait for our final data sync
  – If it is: services need to direct their writes to both clusters

(timeline: data sync to HTable -> service starts up and runs -> final data sync to HTable; the data gap sits in between)

Wrap up


Wrap up
• Analyze access patterns
  – Batch? Real time? Streaming?
  – Cold data? Hot data?
• Keep it simple!
  – Use native utils as far as you can
• Rehearsal! Rehearsal! Rehearsal!
• Communicate with your users closely


One day... "How's your migration going?" "I'm all done migrating!"
"I migrated... and now I'm done for."
(The pun: the same Chinese words, punctuated differently, mean either "finished migrating" or "migrated, and finished off.")
Heed this advice and be blessed!

Q & A

Thank You

Backups

What items need to be taken care of
• CPU
  – Use more cores
    • One MR task process uses 1 CPU core
    • Single-core clock rates do not increase much
  – Do the math to compare CPU cores for the old and new machines (see the sanity check below)

(cores-per-old-machine * amount-of-machines * increase-percent) / cores-per-new-machine = amount-of-new-machines

e.g. going from 8-core machines to 24-core machines, with 1.5x higher capacity:
(8 * 10 * 150%) / 24 = 120 / 24 =~ 5

P.S. You could consider enabling hyper-threading[1]; the # of cores then doubles, but 1/3 of the doubled cores needs to be kept for the OS.

1. Hortonworks, Corp., Apache Hadoop Cluster Configuration Guide, Apr. 2013, p. 15.
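A quick sanity check of the slide's example in shell arithmetic (numbers taken from the example above; integer division, rounded up):

old_cores=8; old_machines=10; growth_pct=150; new_cores=24
needed=$(( old_cores * old_machines * growth_pct / 100 ))
echo $(( (needed + new_cores - 1) / new_cores ))   # => 5 new machines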

What items need to be taken care of
• Memory
  – Total memory much higher than in our old cluster
  – Consider next-gen computing frameworks

((per-slot-gigabytes * total-slots + hbase-heap-gigabytes) * 120%-os-mem) * increase-percent / mem-per-new-machine = amount-of-new-machines

e.g. 8 slots with 2GB each per old machine:
(((2GB * 80 + 8GB) * 120%) * 300%) / 192GB = (168GB * 120% * 300%) / 192GB =~ 4
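The same kind of sanity check for the memory formula (slide example numbers; integer arithmetic, rounded up):

slot_gb=2; total_slots=80; hbase_gb=8; os_pct=120; growth_pct=300; new_mem_gb=192
total_gb=$(( (slot_gb * total_slots + hbase_gb) * os_pct / 100 * growth_pct / 100 ))
echo $(( (total_gb + new_mem_gb - 1) / new_mem_gb ))   # => 4 new machines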

What items need to be taken care of
• Disk
  – 2~3x storage capacity to fulfill our BIG data size
  – Hot-swap support
  – One disk/partition versus 2~3 processes (MR tasks)
• Network
  – Network topology changed (as shown previously)
  – 10Gb NICs for Hadoop nodes

total-cores / (disks-per-new-machine * amount-of-new-machines) = amount-of-processes-per-disk
e.g. with 120 total cores: 120 / (12 * 5) =~ 2
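And the disk formula from this slide, with the example numbers:

total_cores=120; disks_per_new_machine=12; new_machines=5
echo $(( total_cores / (disks_per_new_machine * new_machines) ))   # => 2 processes per disk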

What items need to be taken care of
• Rack
  – Power consumption & cooling
  – One rack can support only 15 of our Hadoop nodes, instead of 20
  – Ask your HW vendor for a PoC!!
    • Transactional workload (heavy IO load)
    • Computation workload (100% CPU workload)
    • Memory-intensive workload (full memory usage)

• New Hadoop TMH7
  – Build the new one first -> migrate -> phase out the old one

Need services' cooperation
• Services need to port their code for TMH7
• We released a dev env (all-in-one Hadoop) for services to test in advance
  – VMware image (OVF)
  – Vagrant box
  – Docker image
• A Jira project for users to submit issues, if any
