Transitioning of existing applications to use HDFS
August 2008
ContextWeb: Traffic
• Traffic: up to 10 thousand ad requests per second
• Comscore trend data: [chart not reproduced]
ContextWeb Architecture highlights
Pre-Hadoop aggregation framework:
• Logs are generated on each server and aggregated in memory into 15-minute chunks
• Logs from different servers are aggregated into one log
• Load to DB
• Multi-stage aggregation in DB
• About 20 different jobs end-to-end
• Could take 2 hours to process through all stages
Hadoop Data Set
• Up to 160GB of raw log files per day; 10TB for 60 days
• 30 different aggregated data sets; 25TB total to cover 1 year (uncompressed)
  • For example, the list of URLs with keywords for every impression: 15TB total (uncompressed)
• Multiply by 3 replicas…
• Compression would help: potential compression ratio 1:15 to 1:20
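As a rough back-of-the-envelope check of these figures, the sketch below adds the raw and aggregated volumes, applies the 3x replication, and then the quoted compression ratios. It assumes that the 10TB of raw logs and the 25TB of aggregated data are additive, that the 15TB example is already counted inside the 25TB, and that compression applies uniformly to everything. None of this is stated on the slide, and the class and method names are illustrative only.

```java
// Back-of-the-envelope storage estimate from the slide's figures.
// Assumption: raw (10 TB) and aggregated (25 TB) volumes are additive.
class StorageEstimate {
    // Total on-disk size in TB after replication and compression.
    static double onDiskTb(double uncompressedTb, int replicas, double compressionRatio) {
        return uncompressedTb * replicas / compressionRatio;
    }

    public static void main(String[] args) {
        double uncompressedTb = 10.0 + 25.0; // raw logs (60 days) + aggregates (1 year)
        int replicas = 3;
        System.out.println("No compression: " + onDiskTb(uncompressedTb, replicas, 1.0) + " TB");
        System.out.println("Ratio 1:15:     " + onDiskTb(uncompressedTb, replicas, 15.0) + " TB");
        System.out.println("Ratio 1:20:     " + onDiskTb(uncompressedTb, replicas, 20.0) + " TB");
    }
}
```

Under these assumptions the replicated footprint is about 105TB uncompressed, and roughly 5 to 7TB at the quoted compression ratios.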
Architectural Challenges
• How to organize the data set to keep aggregated data sets fresh
  • Logs are constantly appended to the main data set
  • Reports and aggregated data sets should be refreshed every 15 minutes
• Mix of .NET and Java applications (80%+ .NET, under 20% Java)
  • How to make a .NET application write logs to Hadoop?
• Some 3rd-party applications consume results of MapReduce jobs (e.g. a reporting application)
  • How to make 3rd-party or internal legacy applications read data from Hadoop?
Partitioned Data Set: approach
• Date/time as the dimension for partitioning
• Segregate results of MapReduce jobs into daily files/directories
• Each daily file is regenerated if the input to the MR job contains data for that day
• Use a revision number for each file. This way multi-stage jobs can overlap during processing (HDFS is still write-once, at least for now)
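The revision idea can be sketched in plain Java. The example assumes a hypothetical naming convention `<dataset>_<MMdd>_r<N>`, loosely modeled on labels such as "RawLogD 0214_r4" in the flow diagram; the class and method names are illustrative and not taken from the ContextWeb code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of revisioned daily file names on a write-once file system.
// The "<dataset>_<MMdd>_r<N>" naming is a hypothetical convention.
class DailyRevisions {
    private static final Pattern NAME = Pattern.compile("(.+)_(\\d{4})_r(\\d+)");

    // Highest existing revision for the dataset/day, or 0 if none exists.
    static int latestRevision(List<String> existing, String dataset, String day) {
        int latest = 0;
        for (String name : existing) {
            Matcher m = NAME.matcher(name);
            if (m.matches() && m.group(1).equals(dataset) && m.group(2).equals(day)) {
                latest = Math.max(latest, Integer.parseInt(m.group(3)));
            }
        }
        return latest;
    }

    // Name to write when a day is regenerated: never overwrite, bump the revision.
    static String nextRevisionName(List<String> existing, String dataset, String day) {
        return dataset + "_" + day + "_r" + (latestRevision(existing, dataset, day) + 1);
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("RawLogD_0214_r4", "RawLogD_0215_r4", "RawLogD_0216_r4");
        System.out.println(nextRevisionName(files, "RawLogD", "0214")); // prints RawLogD_0214_r5
    }
}
```

One plausible reading of the slide is that a downstream job keeps reading revision r4 while an upstream job is still writing r5, and switches to the newer revision only once it is complete.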
Partitioned Data Set: processing flow
[Diagram: labels recovered from the slide]
• Input from the Ad Serving Platform: 15-minute log in Hadoop (LogRpt15 yyyy0215_hhmm)
• IncomingMR (MapReduce)
• HDFS, Historic Data (By Day): RawLogD 0214_r4, RawLogD 0215_r4, RawLogD 0216_r4; new revision RawLogD 0214_r5
• AdvMR (MapReduce)
• Aggregated data for Advertisers (By Day): AdvD 0214_r3, AdvD 0215_r4, AdvD 0216_r4; new revision AdvD 0214_r4
• Output to Reporting and Predictions
Partitioned Data Set: Implementation
• Use MultipleOutputFormat to generate daily files/directories
  • Provide your own generateFileNameForKeyValue()
  • Compression is supported out of the box
• Use PartitionerClass to make sure that all rows for the same day go to the same reducer
  • Provide your own getPartition()
  • int partitionID = (dateHash & Integer.MAX_VALUE) % numPartitions;
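The two hooks can be sketched in plain Java, without any Hadoop dependency. In a real job the method bodies would sit inside generateFileNameForKeyValue() of a MultipleOutputFormat subclass and getPartition() of a Partitioner. The key layout (every key starts with an 8-digit yyyyMMdd day), the per-day directory layout, and all class and method names here are assumptions made for illustration.

```java
// Plain-Java sketch of the two hooks named on the slide, with no Hadoop dependency.
// Assumption (hypothetical key layout): every key starts with an 8-digit day, "yyyyMMdd".
class DailyOutputSketch {
    // Extract the day prefix that both hooks work on.
    static String dayOf(String key) {
        return key.substring(0, 8);
    }

    // Logic one might put in generateFileNameForKeyValue(key, value, name):
    // route each record to a per-day directory, keeping the default leaf name.
    static String fileNameFor(String key, String leafName) {
        return dayOf(key) + "/" + leafName;
    }

    // Logic one might put in getPartition(key, value, numPartitions):
    // all rows of one day hash to the same reducer.
    static int partitionFor(String key, int numPartitions) {
        int dateHash = dayOf(key).hashCode();
        // Mask the sign bit: Java's % keeps the sign of a negative hash.
        return (dateHash & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        String a = "20080214\tadvertiser-1";
        String b = "20080214\tadvertiser-2";
        System.out.println(fileNameFor(a, "part-00000"));               // prints 20080214/part-00000
        System.out.println(partitionFor(a, 4) == partitionFor(b, 4));   // prints true
    }
}
```

Because the partition depends only on the day, each daily file ends up written by a single reducer, which fits the one-file-per-day layout described above.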
Getting Data in and out
• Mix of .NET and Java applications (80%+ .NET, under 20% Java)
  • How to make a .NET application write logs to Hadoop?
• Some 3rd-party applications consume results of MapReduce jobs (e.g. a reporting application)
  • How to make 3rd-party or internal legacy applications read data from Hadoop?
Getting Data in and out: distcp
hadoop distcp <src> <trgt>
  <src>: HDFS
  <trgt>: /mnt/abc (network share)
• Easy to start: just allocate storage on a network share
• But…
  • Difficult to maintain if there are more than 10 types of data to copy
  • Needs extra storage, outside of HDFS (oxymoron!)
  • Extra step in processing
  • Clean-up
Getting Data in and out: WebDAV driver
• WebDAV server is part of the Hadoop source code tree
  • Needed some minor clean-up
• WebDAV client is pre-installed on Windows
• Linux: mount modules available from http://dav.sourceforge.net/

[root@pglnx mnt]# cd /mnt
[root@pglnx mnt]# mkdir -p hadoop/prod
[root@pglnx mnt]# mount -t davfs http://cw-grid100.contextweb.prod/ hadoop/prod/
[root@pglnx ~]# mount | grep hadoop
http://cw-grid100.contextweb.prod/ on /mnt/hadoop/prod type davfs(rw,nosuid,nodev,_netdev)
[root@pglnx ~]# cd /mnt/hadoop/prod/
[root@pglnx prod]# ls
geo geo1.txt hadoop home lost+found old_versions rpt system testing tmp user wide
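Once the share is mounted, a data consumer needs no Hadoop client libraries: ordinary file APIs are enough. The sketch below illustrates this with java.nio.file. The default path /mnt/hadoop/prod comes from the session above; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

// Sketch: a data consumer reading HDFS content through a WebDAV mount
// using only standard file APIs (no Hadoop client libraries).
class MountedReader {
    // Sorted names of the entries directly under a mounted directory.
    static List<String> listNames(Path dir) throws IOException {
        List<String> names = new ArrayList<>();
        try (Stream<Path> entries = Files.list(dir)) {
            entries.forEach(p -> names.add(p.getFileName().toString()));
        }
        Collections.sort(names);
        return names;
    }

    // Read a whole text file from the mount, line by line.
    static List<String> readLines(Path file) throws IOException {
        return Files.readAllLines(file);
    }

    public static void main(String[] args) throws IOException {
        // Mount point taken from the session above; override it with an argument.
        Path mount = Paths.get(args.length > 0 ? args[0] : "/mnt/hadoop/prod");
        for (String name : listNames(mount)) {
            System.out.println(name);
        }
    }
}
```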
Getting Data in and out: Running Server on Linux
[Diagram: labels recovered from the slide]
• HADOOP/HDFS: Master, Data Nodes
• WebDAV server, using the HDFS API
• Clients (Windows/Linux): data consumers on top of a WebDAV client
• Traffic shown: List / getProperties requests and Data
Getting Data in and out: Running Server on the same node where client is installed
[Diagram: labels recovered from the slide]
• HADOOP/HDFS: Master, Data Nodes
• Known high-bandwidth clients (Win/Linux): data consumers, WebDAV client and WebDAV server on the same node; the server uses the HDFS API
• Occasional clients: data consumers on top of a WebDAV client
• Traffic shown: List / getProperties requests and Data
WebDAV and compression
• But your results are compressed…
• Options:
  • Decompress files on HDFS: an extra step again
  • Refactor your application to read compressed files…
    • Java: OK
    • .NET: much more difficult. Cannot decompress SequenceFiles
    • 3rd party: not possible
WebDAV and compression
• Solution: extend WebDAV to support compressed SequenceFiles
• The same driver can provide compressed and uncompressed files
  • If a file with the requested name foo.bar exists, return foo.bar as is
  • If a file with the requested name foo.bar does not exist, check if there is a compressed version foo.bar.seq. Uncompress on the fly and return it as if it were foo.bar
• Outstanding issues
  • Temporary files are created on the Windows client side
  • There are no native Hadoop (de)compression codecs on Windows
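The foo.bar / foo.bar.seq fallback rule can be sketched as a small resolution function. This is not the driver's actual code: a set of stored names stands in for an HDFS existence check, the class and method names are hypothetical, and the decompression of the SequenceFile itself is left out because it needs the Hadoop codecs.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch of the name-resolution rule: serve foo.bar if present,
// otherwise fall back to foo.bar.seq and decompress on the fly.
class CompressedFallback {
    // Result of resolving a requested name against the stored files.
    static final class Resolved {
        final String storedName;   // name of the file actually stored
        final boolean decompress;  // true if content must be uncompressed before returning
        Resolved(String storedName, boolean decompress) {
            this.storedName = storedName;
            this.decompress = decompress;
        }
    }

    // Returns null when neither the plain nor the compressed file exists.
    static Resolved resolve(Set<String> storedFiles, String requested) {
        if (storedFiles.contains(requested)) {
            return new Resolved(requested, false);
        }
        String compressed = requested + ".seq";
        if (storedFiles.contains(compressed)) {
            return new Resolved(compressed, true);
        }
        return null;
    }

    public static void main(String[] args) {
        Set<String> stored = new HashSet<>(Arrays.asList("plain.txt", "foo.bar.seq"));
        Resolved r = resolve(stored, "foo.bar");
        System.out.println(r.storedName + " decompress=" + r.decompress); // prints foo.bar.seq decompress=true
    }
}
```

Checking the plain name first means that an uncompressed file, when present, is always served as is, so clients never need to know the .seq suffix.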