SCALING UP TO BIG DATA: ALGORITHMIC ENGINEERING + HPC
Statistical and Computational Analytics for Big Data
June 12, 2015
Andrew Rau-Chaplin
Faculty of Computer Science
Dalhousie University
www.cs.dal.ca/~arc
THE PROBLEM
STEP 1: SOLVE THE PROBLEM ON SMALL DATA
Typical Process
• start by using small data sets!
• focus on identifying those machine learning techniques that are best suited to the problem
STEP 2: SCALE-UP
SCALING UP
1)Asymptotic Analysis
2)Algorithmic Engineering
3)Parallelism & HPC
HOW TO SCALE TO “BIG DATA”
• Sorry, I don’t have a recipe!
• Talk about our experience scaling up analytical techniques
• Highlight approaches and technology choices which we have found helped in multiple settings
• Perhaps they should have been obvious?
EFFICIENT FRONTIER APPROACHES TO TREATY OPTIMIZATION
www.Risk-Analytics-Lab.ca
Joint work with
• Omar Carmona Cortes
• I. Cook and J. Gaiser-Porter
1) Asymptotic Analysis applied to search
and optimization
Simulated Event Losses
CATASTROPHE MODELING
Exposure Event Catalog
Hazard
Vulnerability
Loss
Event Loss Table (ELT)
Cat Model
A Program ≈ multiple layers, over ~15 ELTs, covering ~5 models and ~200K events
A Portfolio ≈ 3-4K Programs, each with multiple layers, with 40K ELTs, over 100 models, covering 1M events
A TREATY OPTIMIZATION PROBLEM
Optimization from a primary insurer's or broker's perspective!
Find a Pareto Frontier
[Figure: Pareto frontier in risk vs. expected return space, with dominated and infeasible regions marked]
MORE FORMALLY
Given: a fixed number of contractual layers and a simulated set of expected loss distributions (one per layer), plus a model of reinsurance market costs
The Task: identify optimal combinations of shares (also called placements) in order to build a Pareto frontier
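As a minimal sketch of what the task asks for, a Pareto frontier over (risk, expected return) points can be extracted with a simple dominance filter. The function and data below are illustrative only, not the talk's optimizer:

```python
def pareto_frontier(points):
    """Return the non-dominated (risk, return) pairs.

    One point dominates another if it has lower-or-equal risk AND
    higher-or-equal return, with the two points not identical.
    """
    frontier = []
    for risk, ret in points:
        dominated = any(
            r2 <= risk and ret2 >= ret and (r2, ret2) != (risk, ret)
            for r2, ret2 in points
        )
        if not dominated:
            frontier.append((risk, ret))
    return sorted(frontier)

# Toy candidates: (risk, expected return) per share placement.
frontier = pareto_frontier([(1, 5), (2, 6), (3, 4), (2, 3)])
# (3, 4) and (2, 3) are dominated by (2, 6) and drop out.
```

This brute-force filter is O(n²) in the number of candidate points; it is only meant to pin down the dominance definition the slides rely on.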
INPUTS/OUTPUTS
Treaty Optimizer
Discretization = 10%, 5%, or 1%
THE APPROACH
• Aggregate the loss data (by location, event, and trial year)
• Discretized search parameters
• Calculate results for all combinations of shares
• Use a big parallel machine
THE PROBLEM
• Works for a small number of layers!
• Results in a large number of computations
• The number of computations increases exponentially with the dimensions:
• Number of layers
• Number of share intervals
Asymptotic Analysis!
• ((# of trials) × (discretization)^(# of layers)) / (# of processors)
• Example: (1,000,000 × 100^15) / 1,000 = 10^33 computations
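The slide's arithmetic can be checked directly. The function below simply restates the cost formula above; it is not code from the talk:

```python
# Brute-force cost model: every share combination is evaluated
# against every simulated trial, then divided across processors.
def brute_force_cost(trials, intervals, layers, processors):
    return trials * intervals ** layers // processors

# The slide's example: 1M trials, 1% discretization (100 share
# intervals per layer), 15 layers, 1,000 processors.
cost = brute_force_cost(1_000_000, 100, 15, 1_000)
print(f"{cost:.2e}")  # 1.00e+33 computations
```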
ROUND 1:
Need a better algorithm!
Use an evolutionary search approach
Population-Based Incremental Learning (Di-PBIL)
Single risk measure (i.e., a 2D Pareto frontier)
Variance
Value At Risk (VaR)
Tail Value at Risk (TVaR)
Prototype in R (with multithreading)
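For readers unfamiliar with PBIL, here is a minimal single-objective sketch over a binary encoding. Everything in it (the OneMax toy objective, the parameters) is illustrative; the talk's Di-PBIL searches over discretized share vectors against a reinsurance risk/return objective:

```python
import random

def pbil(fitness, n_bits, pop_size=50, iters=200, lr=0.1, seed=0):
    """Minimal single-objective PBIL sketch over bit strings.

    Maintains a probability vector over bit positions; each
    generation samples a population, then nudges the vector
    toward the best individual found that generation.
    """
    rng = random.Random(seed)
    probs = [0.5] * n_bits
    best, best_fit = None, float("-inf")
    for _ in range(iters):
        pop = [[1 if rng.random() < p else 0 for p in probs]
               for _ in range(pop_size)]
        gen_best = max(pop, key=fitness)
        if fitness(gen_best) > best_fit:
            best, best_fit = gen_best, fitness(gen_best)
        # Shift the probability vector toward this generation's winner.
        probs = [(1 - lr) * p + lr * bit
                 for p, bit in zip(probs, gen_best)]
    return best, best_fit

# Toy objective: maximize the number of 1-bits (OneMax).
solution, value = pbil(sum, n_bits=20)
```

The key property the talk exploits: cost per iteration is linear in population size and dimension, so unlike exhaustive enumeration it does not blow up exponentially in the number of layers.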
Questions
Quality: How close to the exact method?
Performance: How fast? How big a problem can we now handle?
QUALITY: HOW CLOSE TO THE EXACT METHOD?
Percentage of time DiPBIL finds the same solution as the exact method?
QUALITY: HOW CLOSE TO THE EXACT METHOD?
Average error when DiPBIL does not find the same solution as the exact method?
Error always less than 6/100ths of a percent.
PERFORMANCE: HOW FAST WHEN COMPARED TO THE EXACT METHOD?
Time on a single core to compute a single point on efficient frontier for 7 layers and 5% discretization
Enumeration: weeks
Di-PBIL: 2-15 minutes
PERFORMANCE: HOW BIG A PROBLEM CAN WE NOW SOLVE?
Time on a single core to compute a single point on efficient frontier at 5% discretization
Solution times are no longer exponential in the number of layers!
ROUND 2:
Round 1 → Round 2:
Single risk metric → Multiple risk metrics (e.g., 1-in-100yr TVaR + 1-in-5yr VaR)
2-D Pareto front → 3-D+ Pareto front
Di-PBIL → Mo-PBIL
Prototype in R → Prototype in C++
Advantages
Search for whole front, not point by point
Multiple Risk Metrics
Performance!
ROUND 2: OPTIMIZED MO-PBIL
Mo-PBIL : Complete frontier (60 - 70 points) for 7 layer program and 5% discretization in 16 seconds!
Setup: 500 iterations, 128 population
2 * Xeon E5-2660 processors
SUMMARY
Evolutionary techniques work well for Treaty Optimization!
Can now solve practical problem instances with practical performance.
Compared multiple evolutionary search methods
Single Objective: DE, PSO, GA, PBIL
Multi Objective: VEPSO, MODE, SPEA2, NSGA2
Evaluation Results
All work and can produce high quality solutions
Differences
Ease of use
Performance

Don't compute exactly what you can compute approximately!
Parallelism is great, but it only buys you a constant factor!
OPTIMIZE LAYER STRUCTURE, NOT JUST SHARES !
Treaty Optimizer 2
Aggregate
Simulation Engine
Att
Exh
Discretization = d%
Risk Measure
Premium function
# reinstatements
Aggregate terms
(i.e., 3rd-event cover)
Set of ELTs
100K Year Event Table (YET)
Risk
Expected
Return
Att
Exh
Population Evaluation
Inputs
ACCELERATING NGRAM BASED TEXT ANALYSIS
2) Algorithm engineering techniques
applied to text analytics problems
DOCUMENT RELATEDNESS
Important task in many text mining applications
Represented by a score between 0 and 1
Unsupervised corpus-based methods: Google Trigram Method, Semantic Text Similarity, etc.
Unigram:
apple 6878789
eating 14987879
Trigram:
ceramics collectables fine 130
ceramics collected by 52
ceramics collectible pottery 50
Word Relatedness
Document Relatedness: abstracted as a function of word relatedness
Word Relatedness
Find frequency of w1 and w2 in Unigram; Find co-occurrence of w1 and w2 in Trigram
GTM Distance Function
GOOGLE TRIGRAM METHOD (GTM)
D1: An autograph is the signature of someone famous which is specially written for a fan to keep.
D2: Your signature is your name, written in your own characteristic way, often at the end of a document to indicate that you wrote the document or that you agree with what it says.
* Proposed by Islam, Milios, and Keselj, "Text Relatedness using Google Tri-grams"
GTM EXAMPLE
CHALLENGES IN SCALING UP GTM
Measuring the relatedness between a pair of documents is too slow in existing implementations
The size of Unigram is roughly 200 MB; the size of Trigram is 20 GB.
High complexity of N-to-N pairwise document relatedness computation.
Volume of documents is growing rapidly
WORD RELATEDNESS PRECOMPUTATION
Tokenize! - Assign each word a numeric ID
Precompute! - Compute all word relatedness scores in advance for fast lookups
Build in-memory data structures! - A dictionary structure to store the word relatedness scores in memory
Hashing vs Arrays (207,761,290 pairs of words)
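The hashing-vs-arrays comparison above can be sketched as follows. Once tokenization gives every word a dense integer ID, a symmetric word pair packs into a single array index; the vocabulary size and scores here are made up for illustration:

```python
# Illustrative vocabulary size; the real system stores ~208M pairs.
VOCAB = 1000

def pair_key(i, j, vocab=VOCAB):
    """Pack a symmetric word-ID pair into one integer key."""
    i, j = min(i, j), max(i, j)   # relatedness is symmetric
    return i * vocab + j

# Dense array lookup: O(1), no hashing, compact memory layout.
relatedness = [0.0] * (VOCAB * VOCAB)
relatedness[pair_key(42, 7)] = 0.83

# Equivalent hash-based store: simpler, but more overhead per entry.
relatedness_dict = {pair_key(42, 7): 0.83}

assert relatedness[pair_key(7, 42)] == relatedness_dict[pair_key(42, 7)]
```

At hundreds of millions of pairs, the per-entry overhead of a hash table (buckets, boxed keys) is exactly what the flat-array layout avoids.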
SHARED MEMORY MULTITHREADING
Multithreaded implementation: makes use of the multiple cores of a shared-memory machine.
Amortize I/O costs: each thread, running on a separate core, fetches documents from shared memory and computes the relatedness between them.
Lots of language and library based approaches: OpenMP, …
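A sketch of that shared-memory pattern, using a Python thread pool in place of the OpenMP-style native threads the slide alludes to. The relatedness function here is a Jaccard stand-in, not the real GTM score:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def relatedness(doc_a, doc_b):
    # Stand-in for the GTM score: Jaccard overlap of word sets.
    a, b = set(doc_a.split()), set(doc_b.split())
    return len(a & b) / len(a | b) if a | b else 0.0

def all_pairs_relatedness(docs, workers=4):
    """Score every document pair; docs live once in shared memory,
    and each worker thread pulls a pair of indices to score."""
    pairs = list(combinations(range(len(docs)), 2))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(
            lambda ij: relatedness(docs[ij[0]], docs[ij[1]]), pairs))
    return dict(zip(pairs, scores))

docs = ["big data scaling", "scaling up big data", "pareto frontier search"]
scores = all_pairs_relatedness(docs)
```

The point of the pattern is that the expensive shared structures (documents, word-relatedness dictionary) are loaded once and read by all threads, amortizing the I/O cost across every pair computed.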
MULTITHREADED IMPLEMENTATION PERFORMANCE
The speed-up analysis
Experiments use 2,000 documents from the ACM paper abstracts collection
HORIZONTAL SCALING: HADOOP
• Scaling for free?
• Data parallelism
• Solves problem partitioning
• Solves task mapping
• Solves fault tolerance
• Challenges
• Shared data structures?
• How to amortize I/O costs?
HADOOP
•Shared data structures?
• Use multi-threaded mappers!
• Example: Each multithreaded mapper constructs the word relatedness dictionary and takes input blocks for document relatedness computation.
•How to amortize I/O costs?
•Map over task definitions not the raw data
•Example: Map of blocks of similarity computations
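"Map over task definitions, not the raw data" can be sketched as emitting small tile records that name a block of the pairwise similarity matrix, so each mapper loads only the document blocks it needs. The block size and tiling scheme here are illustrative:

```python
def block_tasks(n_docs, block=500):
    """Yield (row_start, col_start, block) tiles covering the upper
    triangle of the n_docs x n_docs similarity matrix. Each tile is
    one mapper task; the mapper fetches just two document blocks."""
    for r in range(0, n_docs, block):
        for c in range(r, n_docs, block):
            yield (r, c, block)

tasks = list(block_tasks(2000, block=500))
# A 4x4 grid of 500-doc tiles, upper triangle only: 4+3+2+1 = 10 tasks.
```

The win: Hadoop shuffles a handful of tiny tile records instead of the corpus itself, and each mapper's document I/O is amortized over block² similarity computations.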
[Chart: Hadoop size-up performance — time in minutes vs. number of files (2,000-10,000) on 20 instances]
HADOOP IMPLEMENTATION PERFORMANCE
•Hardware: Amazon EC2 m3.xlarge nodes, each with 4 cores and 15 GB of RAM.
•Hadoop: AWS EMR
•Dataset: 10,000 documents from the ACM paper abstracts collection
[Chart: Hadoop scale-up performance — time in minutes vs. number of nodes (1-25)]
HADOOP IMPLEMENTATION PERFORMANCE
The scale-up performance shows the running time of the implementation while keeping the ratio between input size and nodes fixed.
2000 texts per node using between 2-10 nodes.
SUMMARY
• Speed-up of ~10,000,000x from initial research prototype
• GPU implementation: ~10x cost/performance
Algorithm Engineering works!
• Compress your data
• Precompute!
• Exploit in-memory data structures
• Multi-thread
• Scale horizontally
SUMMARY
Asymptotic Analysis
• O( (# of Docs)^2 * (# of words per doc)^2 )
[Chart: runtime in hours as a function of the number of documents (up to 200,000)]
You can't outrun the asymptotic complexity!
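A quick check of the O(D² · W²) cost model from the summary above (D documents, W words per document); the numbers below are illustrative:

```python
# Why you can't outrun the asymptotic complexity: pairwise document
# relatedness grows quadratically in the number of documents.
def pairwise_cost(docs, words_per_doc):
    return docs ** 2 * words_per_doc ** 2

# Doubling the corpus quadruples the work, regardless of hardware.
ratio = pairwise_cost(200_000, 150) / pairwise_cost(100_000, 150)
print(ratio)  # 4.0
```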
VOLAP: A FULLY SCALABLE CLOUD-BASED SYSTEM FOR
REAL-TIME OLAP ON HIGH VELOCITY DATA
3) Applying cloud technologies to accelerate real-time performance
Joint work with
• F. Dehne, D. Robillard
• Q. Kong, N. Burkee
REAL-TIME OLAP ON THE CLOUD
Business Analytics based on OLAP
High dimensional data with hierarchies
Real-time insert and queries
Solution
Multiple server processors
Dynamic load balancing
Strong session serialization
PERFORMANCE
CLOUD BASED SYSTEM ARCHITECTURE
THE DISTRIBUTED PDCR-TREE
The building block of the CR-OLAP system
DC-Tree A fully dynamic index structure which explicitly represents
dimension hierarchies
“The DC-Tree: A Fully Dynamic Index Structure for Data Warehouses”, [Kriegel2000]
PDC-Tree A Multi-threaded DC-Tree
“Parallel Real-Time OLAP on Multi-core Processors”, [Dehne2012]
Distributed PDCR-Tree A new distributed in-memory index structure for the cloud
Adds support for distributed memory
Array-based implementation to support efficient migration
Adds support for both ordered and unordered dimensions
DATA INGESTION PERFORMANCE
LOAD BALANCING PERFORMANCE
SCALE-UP QUERY PERFORMANCE
SUMMARY
Cloud
A model for high-velocity data
There are great tools
New Issues
Design for elasticity
Think about fault-tolerance from the start
Old Issues
Nothing is for free! Still need to worry about:
Compression
Pre-computation
Efficient Data structures
Multi threading
Big Data
HPC: Clusters, Clouds, & GPUs
Algorithm Engineering
SCALING UP TO BIG DATA
Andrew Rau-Chaplin
[email protected]
Risk Analytics Lab,
Faculty of Computer Science
Dalhousie University,
Halifax, Canada
www.Risk-Analytics-Lab.ca