Secondary Sort and a Custom Comparator

Page 1: Secondary Sort and a Custom Comparator

Page 2: What is Time Series Data?

• In statistics, signal processing, econometrics, and mathematical finance, a time series is a sequence of data points, typically measured at successive time instants spaced at uniform intervals.

• Examples of time series data are the daily adjusted close price of a stock on the NYSE or sensor readings on a power grid occurring 30 times a second.

• Time series as a general class of problems has typically resided in the scientific and financial domains.

• However, due to the ongoing explosion of available data, time series data is becoming more prevalent across a wider swath of industries.

• Time series sensors are being integrated ubiquitously in places like:

– The power grid, aka “the smart grid”

– Cellular Services

– Military and environmental uses

• Understanding how to refactor traditional approaches to these time series problems for MapReduce can potentially allow us to improve our processing and analysis techniques.

Page 3: Current approaches

• The financial industry has long been interested in time series data and has employed programming languages such as R to help deal with this problem.

• So, why would a sector create a programming language specifically for one class of data when technologies like the RDBMS have existed for decades?

• In reality, current RDBMS technology has limitations when dealing with high-resolution time series data.

• These limiting factors include:

– High-frequency time series data coming from a variety of sources can create huge amounts of data in very little time.

– RDBMSs tend not to handle storing and indexing billions of rows well.

– Non-distributed RDBMSs tend not to scale into the hundreds of GBs, let alone TBs or PBs.

– RDBMSs that can scale into those arenas tend to be very expensive or require large amounts of specialized hardware.

– To process high-resolution time series data with an RDBMS, we’d need to use an analytic aggregate function in tandem with moving-window predicates (e.g., the “OVER” clause), which results in rapidly increasing amounts of work as the granularity of the time series data gets finer.

– Query results are not perfectly commutable, and variable-step sliding windows (e.g., stepping 5 seconds per window move) cannot be done without significant unnecessary intermediate work or non-standard SQL functions.

– Queries on an RDBMS for certain time series techniques can be awkward, and tend to require prematurely subdividing the data and costly reconstruction during processing (e.g., data mining, iSAX decompositions).

– Due to the above factors, RDBMS performance degrades while scaling to large amounts of time series data.

Page 4: Example Problem: Simple Moving Average

• A simple moving average is the series of unweighted averages computed over a subset of time series data points as a sliding window progresses over the time series data set.

• Each time the window is moved we recalculate the average of the points in the window.

• This produces a set of numbers representing the final moving average.

• Typically, the moving average technique is used with time series data to highlight longer-term trends or smooth out short-term noise.

• Moving averages are similar to low-pass filters in signal processing, and are mathematically considered a type of convolution.

• In other words, we take a window and fill it in first-in, first-out (FIFO) order with time series data points until it holds N points.

• We then take the average of these points and add this to our answer list.

• We slide our window forward by M data points and again take the average of the data points in the window.

• This process is repeated until the window can no longer be filled, at which point the calculation is complete.

• For this example, let N = 30 and M = 1.
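
A minimal sketch of this windowing logic in plain Java, outside of MapReduce; the class and method names are illustrative, not from the original code:

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Simple moving average: fill a FIFO window until it holds n points,
// average it, then slide the window forward by m points and repeat.
public class SimpleMovingAverage {

    public static List<Double> compute(double[] points, int n, int m) {
        List<Double> averages = new ArrayList<>();
        Deque<Double> window = new ArrayDeque<>();
        double sum = 0.0;
        int next = 0;
        while (true) {
            // Fill the window until it holds n points (or we run out of data).
            while (window.size() < n && next < points.length) {
                window.addLast(points[next]);
                sum += points[next];
                next++;
            }
            if (window.size() < n) {
                break; // the window can no longer be filled: we're done
            }
            averages.add(sum / n);
            // Slide the window forward by m data points.
            for (int i = 0; i < m && !window.isEmpty(); i++) {
                sum -= window.removeFirst();
            }
        }
        return averages;
    }

    public static void main(String[] args) {
        // Adjusted close values from the first rows of the Data slide.
        double[] closes = {36.6, 38.37, 38.71, 38.0, 38.32};
        // With n = 3 and m = 1 this prints three window averages.
        System.out.println(compute(closes, 3, 1));
    }
}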

Page 5: Data

• /input/movingaverage/NYSE_daily

exchange stock_symbol date open high low close volume adj_close

NYSE AA 3/5/2008 37.01 37.9 36.13 36.6 17752400 36.6

NYSE AA 3/4/2008 38.85 39.28 38.26 38.37 11279900 38.37

NYSE AA 3/3/2008 38.25 39.15 38.1 38.71 11754600 38.71

NYSE AA 3/2/2008 37.9 38.94 37.1 38 15715600 38

NYSE AA 3/1/2008 37.17 38.46 37.13 38.32 13964700 38.32

NYSE AA 2/29/2008 38.77 38.82 36.94 37.14 22611400 37.14

NYSE AA 2/28/2008 38.61 39.29 38.19 39.12 11421700 39.12

NYSE AA 2/27/2008 38.19 39.62 37.75 39.02 14296300 39.02

NYSE AA 2/26/2008 38.59 39.25 38.08 38.5 14417700 38.5

NYSE AA 2/25/2008 36.64 38.95 36.48 38.85 22500100 38.85

NYSE AA 2/24/2008 36.38 36.64 35.58 36.55 12834300 36.55

NYSE AA 2/23/2008 36.88 37.41 36.25 36.3 13078200 36.3

NYSE AA 2/22/2008 35.96 36.85 35.51 36.83 10906600 36.83

NYSE AA 2/21/2008 36.19 36.73 35.84 36.2 12825300 36.2

NYSE AA 2/20/2008 35.16 35.94 35.12 35.72 14082200 35.72

NYSE AA 2/19/2008 36.01 36.43 35.05 35.36 18238800 35.36

NYSE AA 2/18/2008 33.75 35.52 33.63 35.51 21082100 35.51

NYSE AA 2/17/2008 34.33 34.64 33.26 33.49 12418900 33.49

NYSE AA 2/16/2008 33.82 34.25 33.29 34.06 11249800 34.06

NYSE AA 2/15/2008 32.67 33.81 32.37 33.76 10731400 33.76

NYSE AA 2/14/2008 32.24 33.25 31.9 32.78 9058900 32.78

NYSE AA 2/13/2008 32.95 33.37 32.26 32.41 7230300 32.41

NYSE AA 2/12/2008 33.3 33.64 32.52 32.67 11338000 32.5

NYSE AA 2/11/2008 34.57 34.85 33.98 34.08 9528000 33.9

NYSE AA 2/10/2008 33.67 34.45 33.07 34.28 15186100 34.1

NYSE AA 2/9/2008 32.13 33.34 31.95 33.09 9200400 32.92

NYSE AA 2/8/2008 32.58 33.42 32.11 32.7 10241400 32.53

NYSE AA 2/7/2008 31.73 33.13 31.57 32.66 14338500 32.49

NYSE AA 2/6/2008 30.27 31.52 30.06 31.47 8445100 31.31

NYSE AA 2/5/2008 31.16 31.89 30.55 30.69 17567800 30.53

NYSE AA 2/4/2008 37.01 37.9 36.13 36.6 17752400 10.6

NYSE AA 2/3/2008 38.85 39.28 38.26 38.37 11279900 8.37

Page 6: Approach

• In our simple moving average example, however, we don’t operate on a per-value basis, nor do we produce an aggregate across all of the values.

• Our operation in the aggregate sense involves a sliding window, which performs its operations on a subset of the data at each step.

• We also have to consider that the points in our time series data are not guaranteed to arrive at the reducer in order and need to be sorted.

• This is because, with multiple map functions reading multiple sections of the source data, MapReduce does not impose any order on the key/value pairs grouped together under the default partitioning and sorting schemes.

• We want to group all of one stock’s adjusted close values together so we can apply the simple moving average operation over the sorted time series data.

• We want to emit each time series key value pair keyed on a stock symbol to group these values together.

• In the reduce phase we can run an operation, here the simple moving average, over the data.

• Since the data more than likely will not arrive at the reducer in sorted order, we’ll need to sort the data before we can calculate the simple moving average.
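
A minimal sketch of the map step for this naive approach, assuming whitespace-separated input records laid out as on the Data slide; the class name and parsing details are illustrative assumptions, not the original code:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits each record keyed on the stock symbol so all of one stock's
// values are grouped at a single reducer, which must then sort them
// by date itself before computing the moving average.
public class NaiveStockMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Text symbol = new Text();
    private final Text dateAndAdjClose = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Fields: exchange stock_symbol date open high low close volume adj_close
        String[] f = line.toString().split("\\s+");
        if (f.length < 9 || "exchange".equals(f[0])) {
            return; // skip the header row and malformed lines
        }
        symbol.set(f[1]);
        dateAndAdjClose.set(f[2] + "," + f[8]);
        context.write(symbol, dateAndAdjClose);
    }
}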

Page 7: Problem

• We’re limited by our Java Virtual Machine (JVM) child heap size, and we are taking time to manually sort the data ourselves.

• With a few design changes, we can solve both of these issues by taking advantage of some inherent properties of MapReduce.

– First we want to look at the case of sorting the data in memory on each reducer.

– Currently we have to make sure we never send more data to a single reducer than can fit in memory.

– The way we can currently control this is to give each reducer child JVM more heap and/or to further partition our time series data in the map phase.

– In this case we’d partition further by time, breaking our data into smaller windows of time.

• As opposed to further partitioning of the data, another approach to this issue is to allow Hadoop to sort the data for us in what’s called the “shuffle phase” of MapReduce.

• If the data arrives at a reducer already in sorted order

– we can lower our memory footprint and

– reduce the number of loops through the data by only looking at the next N samples for each simple moving average calculation.

Page 8: The shuffle’s “secondary sort” mechanic

• Sorting is something we can let Hadoop do for us, and Hadoop has proven to be quite good at sorting large amounts of data.

• By using the secondary sort mechanic, we can solve both our heap and sort issues fairly simply and efficiently.

• To employ secondary sort in our code, we need to make the key a composite of the natural key and the natural value.

Page 9: Composite Key

• The composite key gives Hadoop the information it needs during the shuffle to sort not only on the “stock symbol” but on the timestamp as well.

• The class that sorts these Composite Keys is called the key comparator.

• The key comparator should order by the composite key, which is the combination of the natural key and the natural value.

• Below we can see an abstract version of secondary sort being performed on a composite key of two integers.

• A more realistic example: a composite key holding a stock symbol string (K1) and a timestamp (K2). The diagram sorts the key/value pairs by both “K1: stock symbol” (the natural key) and “K2: timestamp” (the secondary key).

Page 10: Partitioning by the natural key

• Once we’ve sorted our data on the composite key, we now need to partition the data for the reduce phase.

• Once we’ve partitioned our data the reducers can now start downloading the partition files and begin their merge phase.

• The NaturalKeyGroupingComparator is used to make sure a reduce() call only sees the logically grouped data for its natural key.

Page 11: In short

• To summarize, there is a recipe here to get the effect of sorting by value:

– Make the key a composite of the natural key and the natural value.

– The sort comparator should order by the composite key, that is, the natural key and natural value.

– The partitioner and grouping comparator for the composite key should consider only the natural key for partitioning and grouping.
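
A hypothetical driver showing how this recipe is wired up with the standard Hadoop Job API. TimeseriesMapper and MovingAverageReducer are placeholder names; the key, comparator, and partitioner classes follow the implementation pages below:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MovingAverageJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "simple moving average");
        job.setJarByClass(MovingAverageJob.class);

        job.setMapperClass(TimeseriesMapper.class);       // emits TimeseriesKey
        job.setReducerClass(MovingAverageReducer.class);

        job.setMapOutputKeyClass(TimeseriesKey.class);    // composite key
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // The three pieces of the recipe above:
        job.setSortComparatorClass(CompositeKeyComparator.class);           // sort by composite key
        job.setPartitionerClass(NaturalKeyPartitioner.class);               // partition by natural key
        job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class); // group by natural key

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}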

Page 12: Implementation: NaturalKey

• The natural key is what you would normally use as the key or “group by” operator.

– In this case the natural key is the “group” or “stock symbol”, since we need to group potentially unsorted stock data before we can sort it and calculate the simple moving average.

Page 13: Implementation: Composite Key

• A Key that is a combination of the natural key and the natural value we want to sort by.

– In this case it would be the TimeseriesKey class which has two members:

• String Group

• long Timestamp

– Where the natural key is “Group” and the natural value is the “Timestamp” member.
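
A minimal sketch of what this composite key might look like, assuming the two members named above (the original implementation may differ in details):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Composite key: natural key (group/stock symbol) plus the natural
// value we want to sort by (the timestamp of the data point).
public class TimeseriesKey implements WritableComparable<TimeseriesKey> {

    private String group;
    private long timestamp;

    public TimeseriesKey() { }

    public TimeseriesKey(String group, long timestamp) {
        this.group = group;
        this.timestamp = timestamp;
    }

    public String getGroup() { return group; }
    public long getTimestamp() { return timestamp; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(group);
        out.writeLong(timestamp);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        group = in.readUTF();
        timestamp = in.readLong();
    }

    @Override
    public int compareTo(TimeseriesKey other) {
        int cmp = group.compareTo(other.group);   // natural key first
        return cmp != 0 ? cmp : Long.compare(timestamp, other.timestamp);
    }

    @Override
    public int hashCode() {
        return group.hashCode();                  // natural key only
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof TimeseriesKey)) return false;
        TimeseriesKey k = (TimeseriesKey) o;
        return group.equals(k.group) && timestamp == k.timestamp;
    }
}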

Page 14: Implementation: CompositeKeyComparator

• Compares two composite keys for sorting.

• Should order by composite key.
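
A sketch of one way to write this comparator, assuming the TimeseriesKey sketch from the previous page:

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Orders map output by the full composite key: group first, then
// timestamp, so each group's values arrive at the reducer sorted by time.
public class CompositeKeyComparator extends WritableComparator {

    public CompositeKeyComparator() {
        super(TimeseriesKey.class, true);  // true: deserialize keys for comparison
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        TimeseriesKey k1 = (TimeseriesKey) a;
        TimeseriesKey k2 = (TimeseriesKey) b;
        int cmp = k1.getGroup().compareTo(k2.getGroup());
        return cmp != 0 ? cmp : Long.compare(k1.getTimestamp(), k2.getTimestamp());
    }
}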

Page 15: Implementation: NaturalKeyPartitioner

• Partitioner should only consider the natural key.

• Blocks all data into a logical group, inside which we want the secondary sort to occur on the natural value, or the second half of the composite key.

• A normal hash partitioner would hash the whole composite key, potentially sending key/value pairs with the same natural key to different reducers.
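
A sketch of such a partitioner, again assuming the TimeseriesKey sketch above; only the natural key feeds the hash:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes every pair with the same group (stock symbol) to the same
// reducer, no matter what its timestamp is.
public class NaturalKeyPartitioner extends Partitioner<TimeseriesKey, Text> {

    @Override
    public int getPartition(TimeseriesKey key, Text value, int numPartitions) {
        // Mask the sign bit so the partition index is never negative.
        return (key.getGroup().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}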

Page 16: Implementation: NaturalKeyGroupingComparator

• Should only consider the natural key.

• Inside a partition, a reducer is run on the different groups inside of the partition.

• A custom grouping comparator makes sure that a single reducer sees a custom view of the groups, sometimes grouping values across natural value “borders” in the composite key.
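
A sketch of the grouping comparator under the same TimeseriesKey assumption; it differs from the sort comparator only in ignoring the timestamp:

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Treats two composite keys as equal for grouping when their natural
// keys match, so one reduce() call sees all of a stock's sorted values.
public class NaturalKeyGroupingComparator extends WritableComparator {

    public NaturalKeyGroupingComparator() {
        super(TimeseriesKey.class, true);
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        TimeseriesKey k1 = (TimeseriesKey) a;
        TimeseriesKey k2 = (TimeseriesKey) b;
        return k1.getGroup().compareTo(k2.getGroup());  // timestamp ignored
    }
}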

Page 17: End of session

Day 2: Secondary Sort and a Custom Comparator