Self-Learning, Adaptive Computer Systems
Intel Collaborative Research Institute: Computational Intelligence
Yoav Etsion, Technion CS & EE
Dan Tsafrir, Technion CS
Shie Mannor, Technion EE
Assaf Schuster, Technion CS
Adaptive Computer Systems
• Complexity of computer systems keeps growing
• We are moving towards heterogeneous hardware
• Workloads are getting more diverse
• Process variability affects the performance/power of different parts of the system
• Human programmers and administrators cannot handle this complexity
• The goal: adapt to workload and hardware variability
Predicting System Behavior
• When a human observes the workload, she can typically identify cause and effect
• The workload carries inherent semantics; the problem is extracting them automatically
• Key issues with machine learning:
  • Huge datasets (performance counters; execution traces)
  • Extremely fast response times are needed (in most cases)
  • Rigid space constraints for the ML algorithms
Memory + Machine Learning: Current state-of-the-art
• Architectures are tuned for structured data, managed using simple heuristics:
  • Spatial and temporal locality
  • Frequency and recency (ARC)
  • Block and stride prefetchers
• Real data is not well structured
  • The programmer must transform the data
  • Unrealistic for program-agnostic management (swapping, prefetching)
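The stride-prefetcher heuristic mentioned above can be illustrated with a minimal sketch. This is a toy model of the general idea only: real hardware prefetchers track strides per load PC with confidence counters, while this version watches a single access stream.

```python
# Minimal sketch of a stride prefetcher: once two consecutive accesses
# repeat the same stride, predict the next address(es) along that stride.

def stride_prefetch(accesses, degree=1):
    """Return the addresses a simple stride prefetcher would fetch."""
    prefetches = []
    for i in range(2, len(accesses)):
        s1 = accesses[i - 1] - accesses[i - 2]
        s2 = accesses[i] - accesses[i - 1]
        if s1 == s2 and s1 != 0:  # stride confirmed twice in a row
            prefetches.extend(accesses[i] + s1 * d for d in range(1, degree + 1))
    return prefetches

# A strided stream (stride 64, one cache line) is fully covered:
print(stride_prefetch([0, 64, 128, 192]))  # [192, 256]
# An irregular stream defeats the heuristic:
print(stride_prefetch([0, 64, 100]))       # []
```

This is exactly the kind of rigid heuristic the project aims to replace: it captures regular strides but predicts nothing for unstructured access patterns.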
Memory + Machine Learning: Multiple learning opportunities
• Identify patterns using machine learning; bring data to the right place at the right time
• The memory hierarchy forms a pyramid: caches / DRAM, PCM / SSD, HDD
• Different levels require different learning strategies:
  • Top: smaller, faster, costlier [prefetching to caches]
  • Bottom: bigger, slower, cheaper [fetching from disk]
• Both hardware and software support are needed
Research track:
Predicting Latent Faults in Data Centers
Moshe Gabel, Assaf Schuster
• Failures and misconfigurations happen in large datacenters; do they cause performance anomalies?
• A sound statistical framework to detect latent faults
  • Practical: non-intrusive, unsupervised, no domain knowledge required
  • Adaptive: no parameter tuning, robust to system/workload changes
Latent Fault Detection
• Applied to a real-world production service of 4.5K machines
• Over 20% of machine/software failures were preceded by latent faults
  • Slow response times; network errors; long disk access times
• Failures predicted 14 days in advance, with 70% precision and a 2% false-positive rate
• "Latent Fault Detection in Large Scale Services", DSN 2012
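The unsupervised intuition can be sketched as follows. This is only an illustration of the idea, not the DSN 2012 framework's actual statistical test: machines running the same service should behave similarly, so a machine whose counters persistently deviate from the population is suspect. The fleet data and threshold below are hypothetical.

```python
# Toy illustration of unsupervised latent-fault detection: compare each
# machine's counter vector to the population median (scaled by the median
# absolute deviation) and flag outliers. The real framework uses a sound
# statistical test with a controlled false-positive rate.
from statistics import median

def flag_outliers(counters, threshold=3.0):
    """counters: {machine: [metric values]}; returns suspect machines."""
    n_metrics = len(next(iter(counters.values())))
    med = [median(v[i] for v in counters.values()) for i in range(n_metrics)]
    mad = [median(abs(v[i] - med[i]) for v in counters.values()) or 1.0
           for i in range(n_metrics)]
    suspects = []
    for machine, v in counters.items():
        score = max(abs(v[i] - med[i]) / mad[i] for i in range(n_metrics))
        if score > threshold:
            suspects.append(machine)
    return suspects

# Hypothetical fleet: [response time, error rate] per machine.
fleet = {"m1": [10.0, 5.0], "m2": [11.0, 5.2], "m3": [10.5, 4.9],
         "m4": [40.0, 5.1]}  # m4's response time deviates from its peers
print(flag_outliers(fleet))  # ['m4']
```

Note the properties claimed on the slide show up even in this sketch: no labels, no per-service tuning, and no domain knowledge beyond "peers should agree".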
Research track:
Task Differentials: Dynamic, inter-thread predictions
using memory access footsteps
Adi Fuchs, Yoav Etsion, Shie Mannor, Uri Weiser
Motivation
• We are in the age of parallel computing; programming paradigms are shifting towards task-level parallelism
• Tasks are supported by libraries such as TBB and OpenMP
• Implicit forms of task-level parallelism include GPU kernels and parallel loops
• Task behavior tends to be highly regular, making tasks a natural target for learning and adaptation
Taken from the PARSEC fluidanimate TBB implementation:

...
GridLauncher<InitDensitiesAndForcesMTWorker> &id =
    *new (tbb::task::allocate_root())
        GridLauncher<InitDensitiesAndForcesMTWorker>(NUM_TBB_GRIDS);
tbb::task::spawn_root_and_wait(id);
GridLauncher<ComputeDensitiesMTWorker> &cd =
    *new (tbb::task::allocate_root())
        GridLauncher<ComputeDensitiesMTWorker>(NUM_TBB_GRIDS);
tbb::task::spawn_root_and_wait(cd);
...
[Diagram: execution alternates between parallel sections of tasks and synchronization points]
How do things currently work?
• The programmer codes a parallel loop
• Software maps multiple tasks to one thread; hardware sees a sequence of instructions
• Hardware prefetchers try to identify patterns between consecutive memory accesses
• There is no notion of program semantics, i.e. that execution consists of a sequence of tasks, not instructions
[Diagram: tasks A, B, C, ... mapped onto a single hardware instruction stream]
Task Address Set
Given the memory trace of task instance A, the task address set T_A is the unique set of addresses, ordered by access time:

T_A = (a_1, a_2, ..., a_n)

Trace:
START TASK INSTANCE(A)
R 0x7f27bd6df8
R 0x61e630
R 0x6949cc
R 0x7f77b02010
R 0x6949cc
R 0x61e6d0
R 0x61e6e0
W 0x7f77b02010
STOP TASK INSTANCE(A)

T_A:
0x7f27bd6df8
0x61e630
0x6949cc
0x7f77b02010
0x61e6d0
0x61e6e0
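The definition above reproduces directly in a few lines of code (a sketch; the trace format follows the slide's example):

```python
# Build a task address set: unique addresses, ordered by first access.
# Trace lines follow the slide's format: "R <addr>" / "W <addr>".

def task_address_set(trace_lines):
    seen, ta = set(), []
    for line in trace_lines:
        op, addr = line.split()
        addr = int(addr, 16)
        if addr not in seen:    # keep only the first access to each address
            seen.add(addr)
            ta.append(addr)
    return ta

trace = ["R 0x7f27bd6df8", "R 0x61e630", "R 0x6949cc", "R 0x7f77b02010",
         "R 0x6949cc", "R 0x61e6d0", "R 0x61e6e0", "W 0x7f77b02010"]
ta = task_address_set(trace)
print([hex(a) for a in ta])
```

Note how the repeated accesses to 0x6949cc and 0x7f77b02010 are dropped, leaving the six unique addresses of T_A in first-access order.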
Address Differentials
Motivation: task instance address sets are usually meaningless on their own, but the differences between corresponding entries of different instances tend to be compact and regular, and can thus represent state transitions:

T_A:          T_B:          T_C:          difference
7F27BD6DF8    7F27BD6DF8    7F27BD6DF8    +0
61E630        DBFA10        1560DF0       +8000480
6949CC        6A1D0C        6AF04C        +54080
7F77B02010    7F7835F23A    7F78BBC464    +8770090
61E6D0        61E898        61EA60        +456
61E6E0        61DFD0        61D8C0        -1808

(The same per-entry differences map T_A to T_B and T_B to T_C.)
Address Differentials
Given instances A and B, the differential vector D_AB is defined as:

D_AB(i) = b_i - a_i, for each 1 <= i <= |T_A|

Example:
T_A: 10000, 60000, 8000000, 7F00000, FE000
T_B: 10020, 60060, 8000008, 7F00040, FE060
D_AB = (32, 96, 8, 64, 96)

(Addresses are in hex; differential values are in decimal.)
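The definition translates directly to code; running it on the example above yields the same vector:

```python
# Differential vector between two task address sets of equal length:
# D_AB(i) = b_i - a_i for each position i.

def differential(ta, tb):
    assert len(ta) == len(tb)
    return [b - a for a, b in zip(ta, tb)]

TA = [0x10000, 0x60000, 0x8000000, 0x7F00000, 0xFE000]
TB = [0x10020, 0x60060, 0x8000008, 0x7F00040, 0xFE060]
print(differential(TA, TB))  # [32, 96, 8, 64, 96]
```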
Differentials Behavior: Mathematical intuition
• Differentials are beneficial in cases of high redundancy
• An application's distribution function provides intuition on vector repetitions:
  • A non-uniform CDF implies highly regular patterns
  • A uniform CDF implies noisy patterns (differential behavior cannot be exploited)

[Plot: non-uniform vs. uniform CDFs of differential values]
Differentials Behavior: Mathematical intuition
Given N possible vector values, a straightforward dictionary encoding needs R = log2(N) bits. The entropy H is a theoretical lower bound on the representation size, based on the distribution:

H = - Σ_{k=1}^{N} p(k) log2 p(k)

Example, assuming 1000 vector instances with 4 possible values, so R = log2(4) = 2:

Differential value        #instances  p(k)
(20, 8000, 720, 100050)   700         0.7
(16, 8040, -96, 50)       150         0.15
(0, 0, 14420, 100)        50          0.05
(0, 0, 720, 100050)       100         0.1

H = -(0.7 log 0.7 + 0.15 log 0.15 + 0.05 log 0.05 + 0.1 log 0.1) ≈ 1.31

The Differential Entropy Compression Ratio (DECR) is used as the repetition criterion:

Benchmark               Suite    Implementation  Differential repr.  Differential entropy  DECR (%)
FFT.128M                BOTS     OpenMP          19.4                14.4                  25.5
NQUEENS.N=12            BOTS     OpenMP          11.8                8.4                   28.7
SORT.8M                 BOTS     OpenMP          16.4                16.3                  0.1
SGEFA.500x500           LINPACK  OpenMP          14.1                0.9                   93.6
FLUIDANIMATE.SIMSMALL   PARSEC   TBB             16.4                8.0                   51.3
SWAPTIONS.SIMSMALL      PARSEC   TBB             17.9                13.1                  26.6
STREAMCLUSTER.SIMSMALL  PARSEC   TBB             19.6                8.9                   54.4
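The entropy example works out as follows. The DECR formula used here, (R - H)/R, is inferred from the table's numbers (e.g. SGEFA: (14.1 - 0.9)/14.1 ≈ 93.6%) rather than stated explicitly on the slide:

```python
# Entropy of the example differential distribution, and the DECR
# computed as (R - H) / R (formula inferred from the table).
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

probs = [0.7, 0.15, 0.05, 0.1]   # from the 4-value example
H = entropy(probs)               # ~1.31-1.32 bits
R = log2(4)                      # straightforward dictionary: 2 bits
decr = (R - H) / R               # ~34% compression headroom
print(H, decr)
```

The non-uniform distribution (one value covering 70% of instances) is what pushes H well below R, which is exactly the redundancy the differential representation exploits.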
Possible differential application: cache line prefetching
First attempt: a prefix-based predictor. Given a differential prefix, predict the suffix.
Example: A and B have finished running (D_AB is stored); now C is running...
T_A:          D_AB       T_B:          D_BC            T_C:
7F27BD6DF8    0          7F27BD6DF8    0               7F27BD6DF8
61E630        8000480    DBFA10        8000480         1560DF0
6949CC        54080      6A1D0C        54080?          6AF04C?
7F77B02010    8770090    7F7835F23A    8770090?        61EA60? ... 7F78BBC464?
61E6D0        456        61E898        456?            61EA60?
61E6E0        -1808      61DFD0        -1808?          61D8C0?

C's observed differential prefix (0, 8000480) matches the stored D_AB, so the remaining addresses of T_C are predicted from its suffix (marked with "?").
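A minimal sketch of this prefix policy (the data structure is hypothetical; the slides describe the DB as a prefix tree, while a flat scan is used here for brevity): store completed differentials and predict the suffix once the observed prefix matches uniquely.

```python
# Sketch of the prefix-based predictor: completed differentials are
# stored; once a running task's observed differential prefix matches
# exactly one stored vector, the remaining suffix is predicted.

class PrefixPredictor:
    def __init__(self):
        self.vectors = []

    def store(self, diff):
        self.vectors.append(tuple(diff))

    def predict_suffix(self, prefix):
        matches = [v for v in self.vectors if v[:len(prefix)] == tuple(prefix)]
        if len(matches) == 1:    # predict only once the prefix is unique
            return list(matches[0][len(prefix):])
        return None              # ambiguous or unseen: no prediction

p = PrefixPredictor()
p.store([0, 8000480, 54080, 8770090, 456, -1808])   # D_AB from the example
# Task C starts; its first two differences repeat D_AB's prefix:
print(p.predict_suffix([0, 8000480]))  # [54080, 8770090, 456, -1808]
```

The predicted suffix, added to C's last observed address, gives the prefetch addresses for the rest of the task.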
Possible differential application: cache line prefetching
Second attempt: a PHT (pattern history table) predictor. Based on the last X differentials, predict the next differential.
Example: (32 96 8 64 96), (32 96 8 64 96), (10 16 0 16 32), (32 96 8 64 96), (32 96 8 64 96), (10 16 0 16 32), (32 96 8 64 96), (32 96 8 64 96)
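The PHT policy can be sketched as a table mapping the last X differential vectors to the next one; on the repeating sequence above, a history of two is enough for perfect prediction (a toy model, assuming exact vector matches):

```python
# Sketch of the PHT policy: map the last `hist` differential vectors
# to the next one, then predict at task start from the current history.

def train_pht(sequence, hist=2):
    pht = {}
    for i in range(hist, len(sequence)):
        pht[tuple(sequence[i - hist:i])] = sequence[i]
    return pht

V1 = (32, 96, 8, 64, 96)
V2 = (10, 16, 0, 16, 32)
seq = [V1, V1, V2, V1, V1, V2, V1, V1]   # the example sequence

pht = train_pht(seq[:6])        # train on the first six vectors
print(pht[(V1, V1)])            # after two V1s, V2 follows
print(pht[(V2, V1)])            # after V2 then V1, V1 follows
```

On this sequence every length-2 history maps to a single successor, so the table predicts the remaining vectors exactly; this is why the PHT shines on complete vector repetitions.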
Possible differential application: cache line prefetching
• Prefix policy: the differential DB is a prefix tree; a prediction is made once the differential prefix becomes unique
• PHT policy: the differential DB holds the history table; a prediction is made upon task start, based on the history pattern
[Diagram: CPUs execute tasks over the caching hierarchy; on task start/stop, the differential logic compares the current task's addresses against past task addresses in the differential DB, records new differentials, and issues prefetch addresses for new memory requests]
Possible differential application: cache line prefetching
The predictors are compared against two models: Base (no prefetching) and Ideal (a theoretical predictor that accurately predicts every repeating differential).
[Charts: misses per 1K instructions for Base, Prefix, PHT, and Ideal on NQUEENS.N=12, SWAPTIONS, FLUIDANIMATE, and SGEFA.500 (0-6 MPKI scale), and on STREAMCLUSTER, FFT.128M, and SORT.8M (0-70 MPKI scale)]
Cache Miss Elimination (%)
Benchmark       Prefix  PHT    Ideal
NQUEENS.N=12    19.4    11.4   62.1
SWAPTIONS       18.3    0.1    49.2
FLUIDANIMATE    14.9    26.0   46.0
SGEFA.500       0.0     97.6   99.9
STREAMCLUSTER   21.7    36.5   82.3
FFT.128M        45.0    -1.0   87.9
SORT.8M         3.3     0.0    0.1
Future work
• Hybrid policies: which policy to use when? (PHT is better for complete vector repetitions; prefix is better for partial vector repetitions, i.e. suffixes)
• A regular-expression-based policy (for pattern matching beyond the "ideal" model)
• Predicting other functional features using differentials (e.g. branch prediction, PTE prefetching)
Conclusions (so far...)
• When we look at the data, patterns emerge
• There is quite a large headroom for optimizing computer systems
• Existing predictions are based on heuristics:
  • A machine that does not respond within 1s is considered dead
  • Memory prefetchers look for block and strided accesses
• Goal: use ML, not heuristics, to uncover behavioral semantics