SCALING UP TO BIG DATA: ALGORITHMIC ENGINEERING + HPC
Statistical and Computational Analytics for Big Data
June 12, 2015
Andrew Rau-Chaplin
Faculty of Computer Science
Dalhousie University
www.cs.dal.ca/~arc
THE PROBLEM
STEP 1: SOLVE THE PROBLEM ON SMALL DATA
Typical Process
• start by using small data sets!
• focus on identifying those machine learning techniques that are best suited to the problem
STEP 2: SCALE-UP
SCALING UP
1)Asymptotic Analysis
2)Algorithmic Engineering
3)Parallelism & HPC
HOW TO SCALE TO “BIG DATA”
• Sorry, I don’t have a recipe!
• Talk about our experience scaling up analytical techniques
• Highlight approaches and technology choices which we have found helped in multiple settings
• Perhaps they should have been obvious?
EFFICIENT FRONTIER APPROACHES TO TREATY OPTIMIZATION
www.Risk-Analytics-Lab.ca
Joint work with
• Omar Carmona Cortes
• I. Cook and J. Gaiser-Porter
1) Asymptotic Analysis applied to search
and optimization
Simulated Event Losses
CATASTROPHE MODELING
Exposure Event Catalog
Hazard
Vulnerability
Loss
Event Loss Table (ELT)
Cat Model
A Program ≈ multiple layers, over ~15 ELTs, covering ~5 models and ~200K events
A Portfolio ≈ 3-4K Programs, each with multiple layers, with 40K ELTs, over 100 models, covering 1M events
A TREATY OPTIMIZATION PROBLEM
Optimization from a primary insurer's or broker's perspective!
Find a Pareto Frontier
[Figure: Pareto frontier in risk vs. expected return space, with dominated and infeasible regions marked]
MORE FORMALLY
Given: a fixed number of contractual layers and a simulated set of expected loss distributions (one per layer), plus a model of reinsurance market costs
The Task: identify optimal combinations of shares (also called placements) in order to build a Pareto frontier
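As a minimal sketch of what the task asks for, a Pareto frontier over (risk, expected return) points can be extracted with a simple dominance filter. The function and data below are illustrative only, not the talk's optimizer:

```python
def pareto_frontier(points):
    """Return the non-dominated (risk, return) pairs.

    One point dominates another if it has lower-or-equal risk AND
    higher-or-equal return, with the two points not identical.
    """
    frontier = []
    for risk, ret in points:
        dominated = any(
            r2 <= risk and ret2 >= ret and (r2, ret2) != (risk, ret)
            for r2, ret2 in points
        )
        if not dominated:
            frontier.append((risk, ret))
    return sorted(frontier)

# Toy candidates: (risk, expected return) per share placement.
frontier = pareto_frontier([(1, 5), (2, 6), (3, 4), (2, 3)])
# (3, 4) and (2, 3) are dominated by (2, 6) and drop out.
```

This brute-force filter is O(n²) in the number of candidate points; it is only meant to pin down the dominance definition the slides rely on.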
INPUTS/OUTPUTS
Treaty Optimizer
Discretization = 10%, 5%, or 1%
THE APPROACH
• Aggregate the loss data (by location, event, and trial year)
• Discretized search parameters
• Calculate results for all combinations of shares
• Use a big parallel machine
THE PROBLEM
• Works for a small number of layers!
• Results in a large number of computations
• The number of computations increases exponentially with the dimensions:
• Number of layers
• Number of share intervals
Asymptotic Analysis!
• ((# of trials) × (discretization)^(# of layers)) / (# of processors)
• Example: (1,000,000 × 100^15) / 1,000 = 10^33 computations
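The slide's arithmetic can be checked directly. The function below simply restates the cost formula above; it is not code from the talk:

```python
# Brute-force cost model: every share combination is evaluated
# against every simulated trial, then divided across processors.
def brute_force_cost(trials, intervals, layers, processors):
    return trials * intervals ** layers // processors

# The slide's example: 1M trials, 1% discretization (100 share
# intervals per layer), 15 layers, 1,000 processors.
cost = brute_force_cost(1_000_000, 100, 15, 1_000)
print(f"{cost:.2e}")  # 1.00e+33 computations
```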
ROUND 1:
Need a better algorithm!
Use an evolutionary search approach
Population-Based Incremental Learning (Di-PBIL)
Single risk measure (i.e., a 2D Pareto frontier)
Variance
Value At Risk (VaR)
Tail Value at Risk (TVaR)
Prototype in R (with multithreading)
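For readers unfamiliar with PBIL, here is a minimal single-objective sketch over a binary encoding. Everything in it (the OneMax toy objective, the parameters) is illustrative; the talk's Di-PBIL searches over discretized share vectors against a reinsurance risk/return objective:

```python
import random

def pbil(fitness, n_bits, pop_size=50, iters=200, lr=0.1, seed=0):
    """Minimal single-objective PBIL sketch over bit strings.

    Maintains a probability vector over bit positions; each
    generation samples a population, then nudges the vector
    toward the best individual found that generation.
    """
    rng = random.Random(seed)
    probs = [0.5] * n_bits
    best, best_fit = None, float("-inf")
    for _ in range(iters):
        pop = [[1 if rng.random() < p else 0 for p in probs]
               for _ in range(pop_size)]
        gen_best = max(pop, key=fitness)
        if fitness(gen_best) > best_fit:
            best, best_fit = gen_best, fitness(gen_best)
        # Shift the probability vector toward this generation's winner.
        probs = [(1 - lr) * p + lr * bit
                 for p, bit in zip(probs, gen_best)]
    return best, best_fit

# Toy objective: maximize the number of 1-bits (OneMax).
solution, value = pbil(sum, n_bits=20)
```

The key property the talk exploits: cost per iteration is linear in population size and dimension, so unlike exhaustive enumeration it does not blow up exponentially in the number of layers.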
Questions
Quality: How close to the exact method?
Performance: How fast? How big a problem can we now handle?
QUALITY: HOW CLOSE TO THE EXACT METHOD?
Percentage of time DiPBIL finds the same solution as the exact method?
QUALITY: HOW CLOSE TO THE EXACT METHOD?
Average error when DiPBIL does not find the same solution as the exact method?
Error always less than 6/100ths of a percent.
PERFORMANCE: HOW FAST WHEN COMPARED TO THE EXACT METHOD?
Time on a single core to compute a single point on efficient frontier for 7 layers and 5% discretization
Enumeration: weeks
Di-PBIL: 2-15 minutes
PERFORMANCE: HOW BIG A PROBLEM CAN WE NOW SOLVE?
Time on a single core to compute a single point on efficient frontier at 5% discretization
Solution times are no longer exponential in the number of layers!
ROUND 2:
Round 1 → Round 2:
Single risk metric → Multiple risk metrics (e.g., 1-in-100yr TVaR + 1-in-5yr VaR)
2-D Pareto front → 3-D+ Pareto front
Di-PBIL → Mo-PBIL
Prototype in R → Prototype in C++
Advantages
Search for whole front, not point by point
Multiple Risk Metrics
Performance!
ROUND 2: OPTIMIZED MO-PBIL
Mo-PBIL : Complete frontier (60 - 70 points) for 7 layer program and 5% discretization in 16 seconds!
Setup: 500 iterations, 128 population
2 * Xeon E5-2660 processors
SUMMARY
Evolutionary techniques work well for Treaty Optimization!
Can now solve practical problem instances with practical performance.
Compared multiple evolutionary search methods
Single Objective: DE, PSO, GA, PBIL
Multi Objective: VEPSO, MODE, SPEA2, NSGA2
Evaluation Results
All work and can produce high quality solutions
Differences
Ease of use
Performance

Don't compute exactly what you can compute approximately!
Parallelism is great, but it only buys you a constant factor!
OPTIMIZE LAYER STRUCTURE, NOT JUST SHARES !
Treaty Optimizer 2
Aggregate
Simulation Engine
Att
Exh
Discretization = d%
Risk Measure
Premium function
# reinstatements
Aggregate terms
(i.e., 3rd-event cover)
Set of ELTs
100K Year Event Table (YET)
Risk
Expected
Return
Att
Exh
Population Evaluation
Inputs
ACCELERATING NGRAM BASED TEXT ANALYSIS
2) Algorithm engineering techniques
applied to text analytics problems
DOCUMENT RELATEDNESS
Important task in many text mining applications
Represented by a score between 0 and 1
Unsupervised corpus-based methods: Google Trigram Method, Semantic Text Similarity, etc.
Unigram:
apple 6878789
eating 14987879
Trigram:
ceramics collectables fine 130
ceramics collected by 52
ceramics collectible pottery 50
Word Relatedness
Document Relatedness: abstracted as a function of word relatedness
Word Relatedness
Find frequency of w1 and w2 in Unigram; Find co-occurrence of w1 and w2 in Trigram
GTM Distance Function
GOOGLE TRIGRAM METHOD (GTM)
D1: An autograph is the signature of someone famous which is specially written for a fan to keep.
D2: Your signature is your name, written in your own characteristic way, often at the end of a document to indicate that you wrote the document or that you agree with what it says.
* Proposed by Islam, Milios, and Keselj, "Text Relatedness using Google Tri-grams"
GTM EXAMPLE
CHALLENGES IN SCALING UP GTM
Measuring the relatedness between a pair of documents is too slow in existing implementations
The size of Unigram is roughly 200 MB; the size of Trigram is 20 GB.
High complexity of N-to-N pairwise document relatedness computation.
Volume of documents is growing rapidly
WORD RELATEDNESS PRECOMPUTATION
Tokenize! - Assign each word a numeric ID
Precompute! - Compute all word relatedness scores in advance for fast lookups
Build in-memory data structures! - A dictionary structure to store the word relatedness scores in memory
Hashing vs Arrays (207,761,290 pairs of words)
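The hashing-vs-arrays comparison above can be sketched as follows. Once tokenization gives every word a dense integer ID, a symmetric word pair packs into a single array index; the vocabulary size and scores here are made up for illustration:

```python
# Illustrative vocabulary size; the real system stores ~208M pairs.
VOCAB = 1000

def pair_key(i, j, vocab=VOCAB):
    """Pack a symmetric word-ID pair into one integer key."""
    i, j = min(i, j), max(i, j)   # relatedness is symmetric
    return i * vocab + j

# Dense array lookup: O(1), no hashing, compact memory layout.
relatedness = [0.0] * (VOCAB * VOCAB)
relatedness[pair_key(42, 7)] = 0.83

# Equivalent hash-based store: simpler, but more overhead per entry.
relatedness_dict = {pair_key(42, 7): 0.83}

assert relatedness[pair_key(7, 42)] == relatedness_dict[pair_key(42, 7)]
```

At hundreds of millions of pairs, the per-entry overhead of a hash table (buckets, boxed keys) is exactly what the flat-array layout avoids.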
SHARED MEMORY MULTITHREADING
Multithreaded implementation: makes use of the multiple cores of a shared-memory machine.
Amortize I/O costs: each thread, running on a separate core, fetches documents from shared memory and computes the relatedness between them.
Lots of language and library based approaches: OpenMP, …
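A sketch of that shared-memory pattern, using a Python thread pool in place of the OpenMP-style native threads the slide alludes to. The relatedness function here is a Jaccard stand-in, not the real GTM score:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def relatedness(doc_a, doc_b):
    # Stand-in for the GTM score: Jaccard overlap of word sets.
    a, b = set(doc_a.split()), set(doc_b.split())
    return len(a & b) / len(a | b) if a | b else 0.0

def all_pairs_relatedness(docs, workers=4):
    """Score every document pair; docs live once in shared memory,
    and each worker thread pulls a pair of indices to score."""
    pairs = list(combinations(range(len(docs)), 2))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(
            lambda ij: relatedness(docs[ij[0]], docs[ij[1]]), pairs))
    return dict(zip(pairs, scores))

docs = ["big data scaling", "scaling up big data", "pareto frontier search"]
scores = all_pairs_relatedness(docs)
```

The point of the pattern is that the expensive shared structures (documents, word-relatedness dictionary) are loaded once and read by all threads, amortizing the I/O cost across every pair computed.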
MULTITHREADED IMPLEMENTATION PERFORMANCE
The speed-up analysis
Experiments use 2,000 documents from the ACM paper abstracts collection
HORIZONTAL SCALING: HADOOP
• Scaling for free?
• Data parallelism
• Solves problem partitioning
• Solves task mapping
• Solves fault tolerance
• Challenges
• Shared data structures?
• How to amortize I/O costs?
HADOOP
•Shared data structures?
• Use multi-threaded mappers!
• Example: Each multithreaded mapper constructs the word relatedness dictionary and takes input blocks for document relatedness computation.
•How to amortize I/O costs?
•Map over task definitions not the raw data
•Example: Map of blocks of similarity computations
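"Map over task definitions, not the raw data" can be sketched as emitting small tile records that name a block of the pairwise similarity matrix, so each mapper loads only the document blocks it needs. The block size and tiling scheme here are illustrative:

```python
def block_tasks(n_docs, block=500):
    """Yield (row_start, col_start, block) tiles covering the upper
    triangle of the n_docs x n_docs similarity matrix. Each tile is
    one mapper task; the mapper fetches just two document blocks."""
    for r in range(0, n_docs, block):
        for c in range(r, n_docs, block):
            yield (r, c, block)

tasks = list(block_tasks(2000, block=500))
# A 4x4 grid of 500-doc tiles, upper triangle only: 4+3+2+1 = 10 tasks.
```

The win: Hadoop shuffles a handful of tiny tile records instead of the corpus itself, and each mapper's document I/O is amortized over block² similarity computations.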
[Chart: Hadoop size-up performance — time in minutes vs. number of files (2,000-10,000) on 20 instances]
HADOOP IMPLEMENTATION PERFORMANCE
•Hardware: Amazon EC2 m3.xlarge nodes, each with 4 cores and 15 GB of RAM.
•Hadoop: AWS EMR
•Dataset: 10,000 documents from the ACM paper abstracts collection
[Chart: Hadoop scale-up performance — time in minutes vs. number of nodes (1-25)]
HADOOP IMPLEMENTATION PERFORMANCE
The scale-up performance shows the running time of the implementation while keeping the ratio between input size and nodes fixed.
2000 texts per node using between 2-10 nodes.
SUMMARY
• Speed-up of ~10,000,000x from initial research prototype
• GPU implementation: ~10x cost/performance
Algorithm Engineering works!
• Compress your data
• Precompute!
• Exploit in-memory data structures
• Multi-thread
• Scale horizontally
SUMMARY
Asymptotic Analysis
• O( (# of Docs)^2 * (# of words per doc)^2 )
[Chart: runtime in hours as a function of the number of documents (up to 200,000)]
You can't outrun the asymptotic complexity!
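A quick check of the O(D² · W²) cost model from the summary above (D documents, W words per document); the numbers below are illustrative:

```python
# Why you can't outrun the asymptotic complexity: pairwise document
# relatedness grows quadratically in the number of documents.
def pairwise_cost(docs, words_per_doc):
    return docs ** 2 * words_per_doc ** 2

# Doubling the corpus quadruples the work, regardless of hardware.
ratio = pairwise_cost(200_000, 150) / pairwise_cost(100_000, 150)
print(ratio)  # 4.0
```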
VOLAP: A FULLY SCALABLE CLOUD-BASED SYSTEM FOR
REAL-TIME OLAP ON HIGH VELOCITY DATA
3) Applying cloud technologies to accelerate real-time performance
Joint work with
• F. Dehne, D. Robillard
• Q. Kong, N. Burkee
REAL-TIME OLAP ON THE CLOUD
Business Analytics based on OLAP
High dimensional data with hierarchies
Real-time insert and queries
Solution
Multiple server processors
Dynamic load balancing
Strong session serialization
PERFORMANCE
CLOUD BASED SYSTEM ARCHITECTURE
THE DISTRIBUTED PDCR-TREE
The building block of the CR-OLAP system
DC-Tree A fully dynamic index structure which explicitly represents
dimension hierarchies
“The DC-Tree: A Fully Dynamic Index Structure for Data Warehouses”, [Kriegel2000]
PDC-Tree A Multi-threaded DC-Tree
“Parallel Real-Time OLAP on Multi-core Processors”, [Dehne2012]
Distributed PDCR-Tree A new distributed in-memory index structure for the cloud
Adds support for distributed memory
Array-based implementation to support efficient migration
Adds support for both ordered and unordered dimensions
DATA INGESTION PERFORMANCE
LOAD BALANCING PERFORMANCE
SCALE-UP QUERY PERFORMANCE
SUMMARY
Cloud
A model for high-velocity data
There are great tools
New Issues
Design for elasticity
Think about fault-tolerance from the start
Old Issues
Nothing is for free! Still need to worry about:
Compression
Pre-computation
Efficient Data structures
Multi threading
Big Data
HPC: Clusters, Clouds, & GPUs
Algorithm Engineering
SCALING UP TO BIG DATA
Andrew Rau-Chaplin
[email protected]
Risk Analytics Lab,
Faculty of Computer Science
Dalhousie University,
Halifax, Canada
www.Risk-Analytics-Lab.ca