Overview of Vector Statistics Functions in Intel® Math Kernel Library (Intel® MKL)
Intel Corporation
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Outline
• Introduction
• Intel® MKL Vector Statistical Library (VSL)
o Random Number Generators
o Summary Statistics
o Convolution/Correlation
• VSL in applications
• Useful links
Introduction
• VSL is one component of Intel® MKL
• Support HPC applications
– Scientific & engineering simulations
– Finance
– Graphics
• Work with vectors rather than scalars
• Good performance, scalability and accuracy
• C & Fortran APIs
VSL Components
Random Number Generators (RNGs)
• Pseudo-random, quasi-random, and non-deterministic generators
• Continuous and discrete distributions of various common distribution types
Summary Statistics (SS)
• Parallelized algorithms for computation of statistical estimates for raw multi-dimensional datasets
Convolution/correlation
• A set of routines intended to perform linear convolution and correlation transformations for single and double precision real and complex data
Random Number Generators (RNGs)
o Large selection of probability distributions
o Double/single precision in continuous generators
o Integer data type in discrete generators
Basic Random Number Generators
• Pseudo-random: Multiplicative Congruential 59-bit, Multiplicative Congruential 31-bit, Multiple Recursive, Generalized Feedback Shift Register, Wichmann-Hill, Mersenne Twister 2203, Mersenne Twister 19937, SIMD-oriented Fast Mersenne Twister 19937
• Quasi-random: Sobol, Niederreiter
• Non-deterministic: RDRAND based (when HW supports)
Distribution Generators
• Continuous: Uniform, Gaussian, GaussianMV, Exponential, Laplace, Weibull, Cauchy, Rayleigh, Lognormal, Gumbel, Gamma, Beta
• Discrete: Uniform, UniformBits, UniformBits32, UniformBits64, Bernoulli, Geometric, Binomial, Hypergeometric, Poisson, PoissonV, NegBinomial
VSL RNGs Performance Summary
• Performance metric: Cycles-per-element (CPE)
- Lower is better
[Figure: Intel® MKL 11.0 Beta uniform distribution generator performance on Intel® Core™ i7-2600. X axis: BRNG (MCG31, MCG59, MRG32K3A, MT19937, MT2203, NIEDERR, R250, SFMT19937, SOBOL, WH); Y axis: CPE (0 to 18); series: single precision, double precision.]
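As a rough illustration of the metric, CPE can be derived from a wall-clock measurement when the core frequency is known. The helper below is only a sketch: the function name is hypothetical (it is not an Intel® MKL routine) and it assumes the core runs at a fixed frequency for the whole measurement.

```c
#include <stddef.h>

/* Cycles-per-element: elapsed CPU cycles divided by the number of
 * generated values. Hypothetical helper, not part of Intel MKL.
 * Assumes a constant core frequency cpu_hz (in Hz). */
double cycles_per_element(double seconds, double cpu_hz, size_t n)
{
    return seconds * cpu_hz / (double)n;
}
```

For instance, generating 10^9 numbers in one second on a 3.4 GHz core corresponds to 3.4 CPE.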
VSL RNGs Usage Model
[Diagram: Init params (seed) → pseudo-random or quasi-random basic generator → uniform distribution (01001101110…) → transformation (transformation method, distribution params)]
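The two-stage model can be sketched in plain C without the library. Everything here is illustrative: the basic generator is the Park-Miller "minimal standard" multiplicative congruential generator, which is not one of the VSL BRNGs, and the type and function names (mcg_stream, rng_uniform, rng_bernoulli) are hypothetical stand-ins for a stream and its distribution generators.

```c
#include <stdint.h>

/* Stand-in basic generator: x <- 48271 * x mod (2^31 - 1).
 * Chosen only for illustration; it is not a VSL BRNG. */
#define MCG_A 48271u
#define MCG_M 2147483647u   /* 2^31 - 1 */

typedef struct { uint32_t x; } mcg_stream;

/* Initialization: the seed plays the role of the "init params". */
void mcg_init(mcg_stream *s, uint32_t seed)
{
    s->x = seed % MCG_M;
    if (s->x == 0) s->x = 1;   /* state 0 would never change, so avoid it */
}

/* Stage 1: advance the integer state and scale it into (0, 1). */
double mcg_uniform01(mcg_stream *s)
{
    s->x = (uint32_t)(((uint64_t)MCG_A * s->x) % MCG_M);
    return (double)s->x / (double)MCG_M;
}

/* Stage 2a: transformation to a continuous uniform distribution,
 * with distribution parameters a and b. */
void rng_uniform(mcg_stream *s, int n, double *r, double a, double b)
{
    for (int i = 0; i < n; i++)
        r[i] = a + (b - a) * mcg_uniform01(s);
}

/* Stage 2b: transformation to a discrete Bernoulli distribution
 * with success probability p. */
void rng_bernoulli(mcg_stream *s, int n, int *r, double p)
{
    for (int i = 0; i < n; i++)
        r[i] = (mcg_uniform01(s) < p) ? 1 : 0;
}
```

The same stream feeds both transformations, which mirrors how one stream can be passed to several distribution generators.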
VSL RNGs Usage Model
• Common usage model
– Initialization
status = vslNewStream(&stream, VSL_BRNG_MT19937, 7777777);
status = vslNewStream(&stream, VSL_BRNG_SOBOL, 10); /* for quasi-random BRNGs the last argument is the dimension */
status = vslNewStreamEx(&stream, VSL_BRNG_SOBOL, nparams, params);
– Generating random numbers
status = vdRngUniform(VSL_METHOD_DUNIFORM_STD, stream, n, r, 0.0, 1.0);
status = vsRngGaussian(VSL_METHOD_SGAUSSIAN_ICDF, stream, n, r, 0.0, 1.0);
– De-initialization
status = vslDeleteStream(&stream);
• Example (VSL RNG examples can be found in Intel® MKL packages)
#include "mkl_vsl.h"
#define N 1000 /* Vector size */
#define SEED 777 /* Seed for BRNG */
#define BRNG VSL_BRNG_MT19937 /* VSL BRNG */
#define METHOD VSL_METHOD_DGAUSSIAN_ICDF /* Generation method */
int main(void)
{
    double r[N], a = 0.0, sigma = 1.0;
    VSLStreamStatePtr stream;
    int errcode;
    errcode = vslNewStream( &stream, BRNG, SEED ); /* Initialize random stream */
    errcode = vdRngGaussian( METHOD, stream, N, r, a, sigma ); /* Call Gaussian generator */
    errcode = vslDeleteStream( &stream ); /* De-initialize random stream */
    …
}
Parallel Computations Supported by VSL RNGs
• Skip-ahead
• Leap-frog
• BRNG set
Parallel Computation with Skip-ahead
• Skip a given number of random numbers in the stream without actually generating them
– vslSkipAheadStream(stream, nskip)
Block 1
vslNewStream(&stream1, VSL_BRNG_MCG59, seed);
/* Generate up to 100000 random numbers */
…
Block 2
vslNewStream(&stream2, VSL_BRNG_MCG59, seed);
vslSkipAheadStream(stream2, 100000);
/* Generate up to 100000 random numbers */
…
Block 3
vslNewStream(&stream3, VSL_BRNG_MCG59, seed);
vslSkipAheadStream(stream3, 200000);
/* Generate up to 100000 random numbers */
…
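Why skipping can be cheap: for a multiplicative congruential generator the state after k steps is a^k times the current state modulo m, so a jump costs O(log k) multiplications. The sketch below shows this on a small 31-bit generator (Park-Miller constants) chosen for illustration; it is not MCG59, the function names are hypothetical, and it makes no claim about how the library implements skip-ahead internally.

```c
#include <stdint.h>

#define MCG_A 48271u
#define MCG_M 2147483647u   /* 2^31 - 1 */

/* (a * b) mod m; the 64-bit product cannot overflow because both
 * operands are below 2^31. */
static uint32_t mulmod(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) % MCG_M);
}

/* One ordinary step of the generator. */
uint32_t mcg_next(uint32_t x)
{
    return mulmod(MCG_A, x);
}

/* Jump nskip steps ahead: x_{n+nskip} = a^nskip * x_n mod m,
 * with a^nskip computed by square-and-multiply. */
uint32_t mcg_skip_ahead(uint32_t x, uint64_t nskip)
{
    uint32_t a_pow = 1, base = MCG_A;
    while (nskip > 0) {
        if (nskip & 1)
            a_pow = mulmod(a_pow, base);
        base = mulmod(base, base);
        nskip >>= 1;
    }
    return mulmod(a_pow, x);
}
```

Starting every block from the same seed and jumping by 0, 100000, and 200000 steps gives three consecutive, non-overlapping blocks of the one original sequence.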
Parallel Computation with Leap-frog
• Split the stream into nthreads interleaved subsequences: thread k gets elements k, k + nthreads, k + 2*nthreads, … of the original sequence
– vslLeapfrogStream(stream, thread_num, nthreads)
Thread 1
vslNewStream(&stream1, VSL_BRNG_MCG59, seed);
vslLeapfrogStream(stream1, 0, 3);
/* Generate random numbers */
…
Thread 2
vslNewStream(&stream2, VSL_BRNG_MCG59, seed);
vslLeapfrogStream(stream2, 1, 3);
/* Generate random numbers */
…
Thread 3
vslNewStream(&stream3, VSL_BRNG_MCG59, seed);
vslLeapfrogStream(stream3, 2, 3);
/* Generate random numbers */
…
Multidimensional uniformity properties of each subsequence deteriorate seriously as nthreads grows. The method may be recommended if nthreads is no larger than a few dozen.
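Leapfrogging can be sketched on a small multiplicative congruential generator: stream k starts at element k of the original sequence and then advances with the multiplier a^nstreams, so it visits every nstreams-th element. The generator constants (Park-Miller), the struct, and the function names are assumptions made for this illustration; it is not MCG59 and not the library's implementation.

```c
#include <stdint.h>

#define MCG_A 48271u
#define MCG_M 2147483647u   /* 2^31 - 1 */

typedef struct {
    uint32_t x;       /* next value this stream will return */
    uint32_t stride;  /* a^nstreams mod m */
} leapfrog_stream;

static uint32_t mulmod(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) % MCG_M);
}

/* Set up stream k (0-based) out of nstreams. The seed must lie in
 * [1, m-1]. The original sequence is y_0 = a*seed, y_1 = a*y_0, ... */
void leapfrog_init(leapfrog_stream *s, uint32_t seed,
                   unsigned k, unsigned nstreams)
{
    s->x = seed;
    for (unsigned i = 0; i <= k; i++)        /* advance to y_k */
        s->x = mulmod(MCG_A, s->x);
    s->stride = 1;
    for (unsigned i = 0; i < nstreams; i++)  /* a^nstreams mod m */
        s->stride = mulmod(s->stride, MCG_A);
}

/* Return y_k, y_{k+nstreams}, y_{k+2*nstreams}, ... on successive calls. */
uint32_t leapfrog_next(leapfrog_stream *s)
{
    uint32_t out = s->x;
    s->x = mulmod(s->stride, s->x);
    return out;
}
```

Interleaving the outputs of streams 0, 1, and 2 in round-robin order reproduces the original single-stream sequence.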
Parallel Computation with BRNG Set
• Wichmann-Hill – set of 273 “independent” BRNGs
• MT2203 – set of 6024 “independent” BRNGs
MT2203+0
vslNewStream(&stream1, VSL_BRNG_MT2203+0, seed);
/* Generate random numbers */
…
MT2203+1
vslNewStream(&stream2, VSL_BRNG_MT2203+1, seed);
/* Generate random numbers */
…
MT2203+2
vslNewStream(&stream3, VSL_BRNG_MT2203+2, seed);
/* Generate random numbers */
…
Parallel Computation with BRNG Set Example
#include "mkl_vsl.h"
#define N 1000 /* Vector size */
#define NSTREAMS 10 /* Number of streams */
#define METHOD VSL_METHOD_DGAUSSIAN_ICDF /* Generation method */
int main(void)
{
    double r[NSTREAMS][N];
    VSLStreamStatePtr stream[NSTREAMS];
    int k, errcode;
    double a = 0.0, sigma = 1.0;
    int seed = 777;
    for ( k = 0; k < NSTREAMS; k++ )
    {
        errcode = vslNewStream( &stream[k], VSL_BRNG_MT2203+k, seed ); /* Initialize random streams */
    }
    /* Each iteration uses its own stream and its own row of r */
    #pragma omp parallel for private(errcode)
    for ( k = 0; k < NSTREAMS; k++ )
    {
        errcode = vdRngGaussian( METHOD, stream[k], N, r[k], a, sigma ); /* Call Gaussian RNG */
    }
    for ( k = 0; k < NSTREAMS; k++ )
    {
        errcode = vslDeleteStream( &stream[k] ); /* De-initialize random streams */
    }
    return 0;
}
VSL RNG Performance Advantages
VSL RNG (MCG31m1) vs. glibc rand()
                                Run time (seconds)    Speedup over glibc rand()
Standard C rand() function      20.44                 1.00
Intel® MKL VSL RNG MCG31m1      3.57                  5.73
OpenMP* version (4 threads)     0.94                  21.74
Configuration: Intel® Core i7-2600 processor, 3.4 GHz, 8MB L3 cache, 4GB memory; OS: Linux* 64-bit; Intel® C++ Composer XE 2011; Intel® MKL 11.0 Beta
VSL RNGs Performance: Continuous vs. Discrete Generators
[Figure: Intel® MKL 11.0 Beta SOBOL quasi-RNG, uniform distribution, on Intel® Core™ i7-2600. X axis: SOBOL dimension (0 to 40); Y axis: CPE (0 to 7); series: single precision, double precision, integer.]
• CPE (Cycles-Per-Element). Lower is faster
VSL RNGs Performance: Scalability
[Figure: Intel® MKL 11.0 Beta multivariate Gaussian generator scalability, full vs. packed matrix storage. Y axis: speedup vs. 1 thread (0 to 3); groups: single precision full, double precision full, single precision packed, double precision packed; series: 2, 4, 8, 16 threads. MT19937 BRNG, vector length 1000, dimension size 100; Sandy Bridge-EP, 16 cores, 3.1 GHz; OS: Linux* 64-bit.]
Summary Statistics
• Algorithms for statistical “raw” analysis of big datasets
• Threaded and vectorized computation routines
• Available estimates
– Raw and central moments up to the fourth order
– Excess kurtosis, skewness and variation
– Minimum, maximum, quantiles/streaming quantiles, and order statistics
– Variance-covariance/correlation matrix
– Pooled/group variance-covariance matrix and mean
– Partial variance-covariance/correlation matrix
– Robust estimators for variance-covariance matrix and mean in presence of outliers
– Detection of outliers
– Handling of missing values in datasets
Summary Statistics Usage Model
• Common usage model
– Create a task
vsldSSNewTask( &task, &p, &n, &xstorage, x, w, indices );
– Modify settings of the task parameters
vsldSSEditTask( task, VSL_SS_ED_MEAN, &mean );
– Compute statistical estimates
vsldSSCompute( task, VSL_SS_MEAN, VSL_SS_METHOD_FAST );
– Destroy the task
vslSSDeleteTask( &task );
• Example (VSL SS examples can be found in Intel® MKL packages)
#include "mkl.h"
#define DIM 3 /* Task dimension */
#define N 10000 /* Number of observations */
int main(void)
{
    VSLSSTaskPtr task;
    MKL_INT dim, n, x_storage, cov_storage, cor_storage;
    double x[N*DIM], cov[DIM*DIM], cor[DIM*DIM], mean[DIM];
    …
    vsldSSNewTask( &task, &dim, &n, &x_storage, x, 0, 0 ); /* Create a task */
    vsldSSEditCovCor( task, mean, cov, &cov_storage, cor, &cor_storage ); /* Modify the task parameters */
    vsldSSCompute( task, VSL_SS_COV|VSL_SS_COR, VSL_SS_METHOD_FAST ); /* Compute statistical estimates */
    vslSSDeleteTask( &task ); /* Destroy the task */
    return 0;
}
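For reference, the estimates requested above can be written out in a few lines of plain C. This is a textbook two-pass sketch, not the library's algorithm; the function name and the data layout (each variable's observations stored contiguously) are assumptions made for the illustration.

```c
#include <stddef.h>

/* x holds dim variables with n observations each: x[i*n + j] is
 * observation j of variable i (layout chosen for this sketch only).
 * Writes the sample means to mean[dim] and the unbiased sample
 * covariance matrix to cov[dim*dim] in full storage. Requires n >= 2. */
void mean_and_cov(const double *x, size_t dim, size_t n,
                  double *mean, double *cov)
{
    /* First pass: means */
    for (size_t i = 0; i < dim; i++) {
        double s = 0.0;
        for (size_t j = 0; j < n; j++)
            s += x[i*n + j];
        mean[i] = s / (double)n;
    }
    /* Second pass: centered cross-products, filled symmetrically */
    for (size_t a = 0; a < dim; a++) {
        for (size_t b = a; b < dim; b++) {
            double s = 0.0;
            for (size_t j = 0; j < n; j++)
                s += (x[a*n + j] - mean[a]) * (x[b*n + j] - mean[b]);
            cov[a*dim + b] = cov[b*dim + a] = s / (double)(n - 1);
        }
    }
}
```

The correlation matrix then follows element-wise as cov[i][j] divided by the square root of cov[i][i] * cov[j][j].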
Summary Statistics Performance: Scalability
[Figure: Intel® MKL 11.0 Beta Update 2 Summary Statistics functions scalability. Y axis: speedup over 1-thread performance (0 to 16); groups: Covariance SP, Covariance DP, Quantiles SP, Quantiles DP; series: 1, 2, 4, 8, 16 threads. Sandy Bridge-EP, 16 cores, 3.1 GHz; OS: Linux* 64-bit.]
Convolution & Correlation
• Linear transformations
• Double/single, real/complex data types
• Direct and Fourier algorithms for one-dimensional and multi-dimensional data (dimensions: 1-7)
• Wrappers are provided (as code samples) for IBM* ESSL* functions
• Fourier algorithms are parallelized through Intel® MKL DFT parallelization
Convolution/Correlation Usage Model
• Common usage model
– Create a new task descriptor or copy a task descriptor
vslsConvNewTask(task, mode, dims, xshape, yshape, zshape);
vslConvCopyTask(newtask, srctask);
– Change the parameter settings
vslConvSetMode(task, newmode);
vslConvSetInternalPrecision(task, precision);
vslConvSetStart(task, start);
vslConvSetDecimation(task, decimation);
– Compute convolution or correlation
vslsConvExec(task, x, xstride, y, ystride, z, zstride);
– Delete task object
vslConvDeleteTask(task);
Convolution/Correlation Example
#include "mkl_vsl.h"
#define XSHAPE 100
#define YSHAPE 1000
#define ZSHAPE ((XSHAPE-1)+(YSHAPE-1)+1) /* Length of the full linear convolution */
…
VSLConvTaskPtr task;
MKL_INT rank=1, mode=VSL_CONV_MODE_DIRECT, xstride=1, ystride=1, zstride=1;
MKL_INT xshape=XSHAPE, yshape=YSHAPE, zshape=ZSHAPE;
MKL_Complex8 x[XSHAPE], y[YSHAPE], z[ZSHAPE];
…
vslcConvNewTask(&task, mode, rank, &xshape, &yshape, &zshape); /* Create a new task */
vslcConvExec(task, x, &xstride, y, &ystride, z, &zstride); /* Compute convolution */
vslConvDeleteTask(&task); /* Delete task object */
• Conv/Corr examples can be found in Intel® MKL packages
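What the direct method computes can be written as a plain double loop. The sketch below uses real single-precision data for brevity, whereas the slide example uses complex data; the function name is hypothetical and no library call is involved. The output length nx + ny - 1 is the same quantity as ZSHAPE in the example.

```c
#include <stddef.h>

/* Direct 1D linear convolution: z[k] = sum over i of x[i] * y[k - i],
 * for k = 0 .. nx + ny - 2. The caller provides z with room for
 * nx + ny - 1 elements. */
void conv_direct_1d(const float *x, size_t nx,
                    const float *y, size_t ny, float *z)
{
    size_t nz = nx + ny - 1;
    for (size_t k = 0; k < nz; k++)
        z[k] = 0.0f;
    /* Every pair (i, j) contributes to output index i + j */
    for (size_t i = 0; i < nx; i++)
        for (size_t j = 0; j < ny; j++)
            z[i + j] += x[i] * y[j];
}
```

The cost is proportional to nx * ny, which is why Fourier-based algorithms become attractive for large inputs.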
Convolution Performance
[Figure: Intel® MKL 11.0 Beta 2D convolution performance, single precision, 1000*1000 array X with array Y, on Intel® Core™ i7-2600. X axis: Y array size; Y axis: GFlops (0 to 18); series: 1T, 2T, 4T, 8T, 16T.]
VSL in Applications: Online Noise Filtration
• Investigating statistical dependencies in dataset
– e.g., in a set of stocks
• Data arrive in chunks and are typically noisy
– e.g., “market noise”
– Historical statistics and the latest block are available
• Filtering statistical “noise” from “signal”
– Incremental filtration
• Filtering is based on the theory of randomized matrices
– H. Kargupta, K. Sivakumar, and S. Ghosh. Dependency Detection in MobiMine and Random Matrices. In Proceedings of PKDD'2002, pp. 250–262. Springer-Verlag Berlin Heidelberg, 2002.
VSL in Applications: Online Noise Filtration
[Diagram: over time, chunks D1 (p x m(t1)) at t1, D2 (p x m(t2)) at t2, …, Dk (p x m(tk)) at tk enter the filter, which outputs a signal component and a noise component for further analysis.]
• Data arrive in chunks; the chunk arriving at time t(i) is a matrix of size p x m(t(i))
Major blocks of the filter
• Update the correlation matrix using the latest data chunk Dk
• Apply PCA(*): compute eigenvalues/eigenvectors of the correlation matrix
• Split the eigenvalues into two sets(**): the 1st set represents signal, the 2nd set represents noise
• Assemble signal and noise correlations from the 2 sets of eigenvalues/eigenvectors
* PCA - Principal Component Analysis
** Split is based on Randomized Matrix theory and the distribution of eigenvalues
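The split-and-assemble steps can be sketched as follows, assuming the eigenvalues and orthonormal eigenvectors of the correlation matrix are already available (for example from an eigensolver) and that the thresholds l1 and l2 are given. The function name, the argument layout, and the choice to rebuild the noise part as a sum of lambda * v * v^T terms are assumptions of this illustration, not the presenters' exact code.

```c
#include <stddef.h>

/* cor:  p x p correlation matrix, row-major.
 * eval: p eigenvalues; evec: eigenvector i stored in evec[i*p .. i*p+p-1].
 * Eigenvalues strictly inside (l1, l2) are treated as noise.
 * Outputs: cor_noise  = sum over noise eigenpairs of lambda * v * v^T,
 *          cor_signal = cor - cor_noise.
 * Returns the number of eigenvalues classified as noise. */
size_t split_signal_noise(const double *cor, const double *eval,
                          const double *evec, size_t p,
                          double l1, double l2,
                          double *cor_signal, double *cor_noise)
{
    size_t n_noise = 0;
    for (size_t k = 0; k < p * p; k++)
        cor_noise[k] = 0.0;
    for (size_t i = 0; i < p; i++) {
        if (eval[i] > l1 && eval[i] < l2) {
            const double *v = evec + i * p;
            for (size_t a = 0; a < p; a++)
                for (size_t b = 0; b < p; b++)
                    cor_noise[a*p + b] += eval[i] * v[a] * v[b];
            n_noise++;
        }
    }
    for (size_t k = 0; k < p * p; k++)
        cor_signal[k] = cor[k] - cor_noise[k];
    return n_noise;
}
```

With a 2 x 2 correlation matrix whose off-diagonal entry is 0.5, the eigenvalues are 1.5 and 0.5; thresholds of 0.4 and 0.6 would classify only the second one as noise.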
VSL in Applications: Online Noise Filtration
Initialization
#define P 450 /* Number of stocks */
#define M 1000 /* Number of observations in block */
…
VSLSSTaskPtr task;
double x[P*M], cor[P*P], mean[P], W[2];
double l1, l2;
MKL_INT p, m, x_storage, cor_storage;
/* Initialize VSL Summary Stats task */
p = P; m = M;
x_storage = VSL_SS_MATRIX_STORAGE_COLS;
vsldSSNewTask( &task, &p, &m, &x_storage, x, 0, 0 );
/* Set up parameters of the task */
/* Specify memory for correlation estimate in task */
cor_storage = VSL_SS_MATRIX_STORAGE_FULL;
vsldSSEditCovCor( task, mean, 0, 0, cor, &cor_storage );
/* Specify the parameter for progressive estimation of correlation */
W[0] = W[1] = 0.0;
vsldSSEditTask( task, VSL_SS_ED_ACCUM_WEIGHT, W );
…
Computation
/* Set thresholds that define the noise component;
   the cast avoids integer division of p by m */
l1 = ( 1.0 - sqrt( (double)p / m ) );
l2 = ( 1.0 + sqrt( (double)p / m ) );
/* Loop over data blocks */
for ( nblock = 0; ; nblock++ )
{
    /* Get the next chunk of size p x m into x */
    GetNextChunck( p, m, x );
    /* Update correlation estimate in cor */
    vsldSSCompute( task, VSL_SS_COR, VSL_SS_METHOD_FAST );
    /* Apply PCA and compute eigenvalues that
       belong to (l1, l2) and define noise */
    dsyevr(…,l1, l2, …);
    /* Assemble correlation matrix of noise */
    ...
    dsyrk( evect_n, ..., cor_n,... );
    /* Compute correlation matrix of signal
       by subtracting cor_n from cor */
}
De-initialization
vslSSDeleteTask( &task );
MKL_Free_Buffers();
…
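The idea behind progressive estimation, where statistics from earlier blocks are combined with the latest block instead of being recomputed, can be sketched for a single variable. The update below is a standard chunk-merging formula for the mean and the sum of squared deviations; the struct and function names are hypothetical, and this is not a description of the library's internals.

```c
#include <stddef.h>

typedef struct {
    double w;     /* accumulated weight: observations seen so far */
    double mean;  /* running mean */
    double m2;    /* running sum of squared deviations from the mean */
} running_stats;

/* Fold a new chunk of m observations into the running statistics. */
void running_update(running_stats *s, const double *chunk, size_t m)
{
    if (m == 0)
        return;
    /* Statistics of the new chunk on its own */
    double cmean = 0.0, cm2 = 0.0;
    for (size_t j = 0; j < m; j++)
        cmean += chunk[j];
    cmean /= (double)m;
    for (size_t j = 0; j < m; j++)
        cm2 += (chunk[j] - cmean) * (chunk[j] - cmean);
    /* Merge with the history accumulated so far */
    double wn = s->w + (double)m;
    double delta = cmean - s->mean;
    s->m2 += cm2 + delta * delta * s->w * (double)m / wn;
    s->mean += delta * (double)m / wn;
    s->w = wn;
}

/* Unbiased variance of everything seen so far (needs w >= 2). */
double running_variance(const running_stats *s)
{
    return s->m2 / (s->w - 1.0);
}
```

Starting from a zero-initialized struct plays the same role as setting the accumulated weights to zero before the first block.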
VSL in Applications: Online Noise Filtration
Online Noise Filtration performance
Intel® Xeon® E5-2690 (16 cores), Intel® MKL 11.0 Beta, S&P500 historic data, block size 450 x 1000
                                               Seconds per block    Speedup vs. Baseline
Baseline implementation (using Netlib)         0.883                1.0
Optimized implementation (using Intel® MKL)    0.031                28.9
Summary
• Vector Statistical Library
o Great collection of Random Number Generators, summary statistics and convolution/correlation functions
o Performance enhancements through vectorization and threading
o Compatible with various compilers
o Native C and Fortran APIs
Combine the power of Intel® Compiler with VSL functions in Intel® MKL to maximize performance of your applications
Useful Links
• VSL RNG performance data
– http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm
• MKL Manual, Chapter 10
– http://software.intel.com/en-us/articles/intel-math-kernel-library-documentation/
• VSL RNG Notes
– http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vslnotes.pdf
• VSL Summary Statistics Notes
– http://software.intel.com/sites/products/documentation/hpc/mkl/ssl/sslnotes.pdf
Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804