Overview of Vector Statistics Functions in Intel® Math ... · Intel® Core i7-2600 processor, 3.4 GHz, 8MB L3 cache, 4GB memory OS: Linux* 64-bit Intel® C++ Composer XE 2011 Intel®

Overview of Vector Statistics Functions in Intel® Math Kernel Library (Intel® MKL)

Intel Corporation

Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Outline

• Introduction

• Intel® MKL Vector Statistical Library (VSL)

o Random Number Generators

o Summary Statistics

o Convolution/Correlation

• VSL in applications

• Useful links



Introduction

• VSL is one component of Intel® MKL

• Supports HPC applications

– Scientific & engineering simulations

– Finance

– Graphics

• Works with vectors rather than scalars

• Good performance, scalability and accuracy

• C & Fortran APIs



VSL Components

Random Number Generators (RNGs)

• Pseudo-random, quasi-random, and non-deterministic generators

• Continuous and discrete distributions of various common distribution types

Summary Statistics (SS)

• Parallelized algorithms for computation of statistical estimates for raw multi-dimensional datasets

Convolution/Correlation

• A set of routines intended to perform linear convolution and correlation transformations for single and double precision real and complex data



Random Number Generators (RNGs)

o Large selection of probability distributions

o Double/single precision in continuous generators

o Integer data type in discrete generators

Random Number Generators

Pseudo-random:

– Multiplicative Congruential 59-bit

– Multiplicative Congruential 31-bit

– Multiple Recursive

– Generalized Feedback Shift Register

– Wichmann-Hill

– Mersenne Twister 2203

– Mersenne Twister 19937

– SIMD-oriented Fast Mersenne Twister 19937

Quasi-random:

– Sobol

– Niederreiter

Non-deterministic:

– RDRAND based (when HW supports)

Distribution Generators

Continuous: Uniform, Gaussian, GaussianMV, Exponential, Laplace, Weibull, Cauchy, Rayleigh, Lognormal, Gumbel, Gamma, Beta

Discrete: Uniform, UniformBits, UniformBits32, UniformBits64, Bernoulli, Geometric, Binomial, Hypergeometric, Poisson, PoissonV, NegBinomial


VSL RNGs Performance Summary

• Performance metric: Cycles-per-element (CPE)

- Lower is better

7/3/2012

[Figure: Intel® MKL 11.0 Beta uniform distribution generator performance on Intel® Core™ i7-2600. x-axis: BRNG (MCG31, MCG59, MRG32K3A, MT19937, MT2203, NIEDERR, R250, SFMT19937, SOBOL, WH); y-axis: CPE, 0 to 18; series: single precision, double precision.]
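As a rough illustration of the metric, CPE can be derived from a wall-clock measurement as elapsed time multiplied by the clock frequency and divided by the number of generated elements. The helper below is only a sketch under that assumption: `cycles_per_element` and its parameters are hypothetical names, and the slides do not state how the published numbers were measured.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of Intel MKL): cycles-per-element from
   elapsed seconds, core clock frequency in Hz, and the number of
   generated elements. Lower is better. */
static double cycles_per_element(double seconds, double hz, size_t n)
{
    assert(n > 0);
    return seconds * hz / (double)n;
}
```

With made-up inputs, generating 10^9 numbers in 1.0 s on a 3.4 GHz core would correspond to about 3.4 CPE.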



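VSL RNGs Usage Model

[Diagram: init params (seed) feed a pseudo-random or quasi-random basic generator, which produces a raw sequence (01001101110…) with uniform distribution; a transformation, selected by the transformation method and distribution params, then maps that sequence to the requested distribution.]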
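The two-stage model can be sketched in plain C without MKL. Everything here is an illustrative assumption rather than VSL's implementation: the `toy_*` names are hypothetical, and the basic generator is a simple multiplicative congruential generator with the Park-Miller constants (a = 48271, m = 2^31 - 1). Stage 1 advances an integer state; stage 2 maps it to a uniform value and then applies a transformation driven by the distribution parameters.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_M 2147483647ULL /* 2^31 - 1 */
#define TOY_A 48271ULL      /* Park-Miller multiplier, for illustration only */

/* Stage 1: basic generator. Advances the integer state (seed in 1..m-1). */
static uint64_t toy_next(uint64_t *state)
{
    *state = (TOY_A * *state) % TOY_M;
    return *state;
}

/* Stage 2a: map the integer output to a uniform double in (0, 1). */
static double toy_uniform01(uint64_t *state)
{
    return (double)toy_next(state) / (double)TOY_M;
}

/* Stage 2b, continuous case: uniform between a and b. */
static void toy_rng_uniform(uint64_t *state, int n, double *r, double a, double b)
{
    for (int i = 0; i < n; i++)
        r[i] = a + (b - a) * toy_uniform01(state);
}

/* Stage 2b, discrete case: Bernoulli with success probability p. */
static void toy_rng_bernoulli(uint64_t *state, int n, int *r, double p)
{
    for (int i = 0; i < n; i++)
        r[i] = toy_uniform01(state) < p ? 1 : 0;
}
```

The same state can be passed to several transformations in turn, which mirrors how a single stream serves different distribution generators in the usage model.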



VSL RNGs Usage Model

• Common usage model

– Initialization

status = vslNewStream(&stream, VSL_BRNG_MT19937, 7777777);
status = vslNewStream(&stream, VSL_BRNG_SOBOL, 10);
status = vslNewStreamEx(&stream, VSL_BRNG_SOBOL, nparams, params);

– Generating random numbers

status = vdRngUniform(VSL_METHOD_DUNIFORM_STD, stream, n, r, 0.0, 1.0);
status = vsRngGaussian(VSL_METHOD_SGAUSSIAN_ICDF, stream, n, r, 0.0, 1.0);

– De-initialization

status = vslDeleteStream(&stream);

• Example (VSL RNG examples can be found in Intel® MKL packages)

#include "mkl_vsl.h"

#define N 1000 /* Vector size */
#define SEED 777 /* Seed for BRNG */
#define BRNG VSL_BRNG_MT19937 /* VSL BRNG */
#define METHOD VSL_METHOD_DGAUSSIAN_ICDF /* Generation method */

int main(void)
{
    double r[N], a = 0.0, sigma = 1.0;
    VSLStreamStatePtr stream;
    int errcode;

    errcode = vslNewStream( &stream, BRNG, SEED ); /* Initialize random stream */
    errcode = vdRngGaussian( METHOD, stream, N, r, a, sigma ); /* Call Gaussian generator */
    errcode = vslDeleteStream( &stream ); /* De-initialize random stream */
    …
    return 0;
}



Parallel Computations Supported by VSL RNGs

• Skip-ahead

• Leap-frog

• BRNG set



Parallel Computation with Skip-ahead

• Skip a given number of random numbers in the stream without actually generating them

– vslSkipAheadStream(stream, nskip)

Block 1:

vslNewStream( &stream1, VSL_BRNG_MCG59, seed );
/* Generate up to 100000 random numbers */
…

Block 2:

vslNewStream( &stream2, VSL_BRNG_MCG59, seed );
vslSkipAheadStream( stream2, 100000 );
/* Generate up to 100000 random numbers */
…

Block 3:

vslNewStream( &stream3, VSL_BRNG_MCG59, seed );
vslSkipAheadStream( stream3, 200000 );
/* Generate up to 100000 random numbers */
…
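Why skipping can be cheap is easiest to see on a multiplicative congruential generator, where advancing n steps equals one multiplication by a^n mod m. The sketch below is an assumption-laden toy, not MKL's MCG59: the `toy_*` names are hypothetical and the constants are the Park-Miller ones (a = 48271, m = 2^31 - 1).

```c
#include <assert.h>
#include <stdint.h>

#define TOY_M 2147483647ULL /* 2^31 - 1 */
#define TOY_A 48271ULL      /* illustrative multiplier, not MKL's */

/* One step of the toy generator: x -> a * x mod m. */
static uint64_t toy_step(uint64_t x)
{
    return (TOY_A * x) % TOY_M;
}

/* Modular exponentiation: base^exp mod m by repeated squaring.
   Operands stay below 2^31, so products fit in 64 bits. */
static uint64_t pow_mod(uint64_t base, uint64_t exp)
{
    uint64_t result = 1;
    base %= TOY_M;
    while (exp > 0) {
        if (exp & 1)
            result = (result * base) % TOY_M;
        base = (base * base) % TOY_M;
        exp >>= 1;
    }
    return result;
}

/* Skip-ahead: the state after nskip steps, computed with O(log nskip)
   multiplications instead of nskip generator steps. */
static uint64_t toy_skip_ahead(uint64_t x, uint64_t nskip)
{
    return (pow_mod(TOY_A, nskip) * x) % TOY_M;
}
```

In the three-block layout, block 2 would start from `toy_skip_ahead(seed, 100000)` and block 3 from `toy_skip_ahead(seed, 200000)`, so the blocks cover consecutive, non-overlapping parts of one sequence.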



Parallel Computation with Leap-frog

• Split the original sequence into nthreads non-overlapping subsequences: stream thread_num gets elements thread_num, thread_num + nthreads, thread_num + 2*nthreads, and so on

– vslLeapfrogStream(stream, thread_num, nthreads)

vslNewStream( &stream1, VSL_BRNG_MCG59, seed );
vslLeapfrogStream( stream1, 0, 3 );
/* Generate random numbers */
…

vslNewStream( &stream2, VSL_BRNG_MCG59, seed );
vslLeapfrogStream( stream2, 1, 3 );
/* Generate random numbers */
…

vslNewStream( &stream3, VSL_BRNG_MCG59, seed );
vslLeapfrogStream( stream3, 2, 3 );
/* Generate random numbers */
…

Multidimensional uniformity properties of each subsequence deteriorate seriously as nthreads grows. The method may be recommended if nthreads is not bigger than a few dozen.
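The interleaving that leap-frog produces can be shown on a toy multiplicative congruential generator: stream k starts at element k and then advances with multiplier a^nstreams mod m. This is a sketch under stated assumptions, not MKL's implementation; the `toy_*` names are hypothetical and the constants are the Park-Miller ones.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_M 2147483647ULL /* 2^31 - 1 */
#define TOY_A 48271ULL      /* illustrative multiplier, not MKL's */

/* base^exp mod m by repeated squaring. */
static uint64_t pow_mod(uint64_t base, uint64_t exp)
{
    uint64_t result = 1;
    base %= TOY_M;
    while (exp > 0) {
        if (exp & 1)
            result = (result * base) % TOY_M;
        base = (base * base) % TOY_M;
        exp >>= 1;
    }
    return result;
}

/* Leap-frog stream: current value plus the stride multiplier a^nstreams mod m. */
typedef struct {
    uint64_t x;
    uint64_t stride_mult;
} toy_leapfrog;

/* Position stream k of nstreams at element k of the original sequence
   s_0 = seed, s_{i+1} = a * s_i mod m. */
static toy_leapfrog toy_leapfrog_init(uint64_t seed, uint64_t k, uint64_t nstreams)
{
    toy_leapfrog s;
    s.x = (pow_mod(TOY_A, k) * seed) % TOY_M;
    s.stride_mult = pow_mod(TOY_A, nstreams);
    return s;
}

/* Return the current element and jump nstreams positions forward. */
static uint64_t toy_leapfrog_next(toy_leapfrog *s)
{
    uint64_t out = s->x;
    s->x = (s->stride_mult * s->x) % TOY_M;
    return out;
}
```

With three streams, stream 0 yields s_0, s_3, s_6, stream 1 yields s_1, s_4, s_7, and stream 2 yields s_2, s_5, s_8, so together they reproduce the original sequence without overlap.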



Parallel Computation with BRNG Set

• Wichmann-Hill – set of 273 “independent” BRNGs

• MT2203 – set of 6024 “independent” BRNGs

MT2203+0:

vslNewStream( &stream1, VSL_BRNG_MT2203+0, seed );
/* Generate random numbers */
…

MT2203+1:

vslNewStream( &stream2, VSL_BRNG_MT2203+1, seed );
/* Generate random numbers */
…

MT2203+2:

vslNewStream( &stream3, VSL_BRNG_MT2203+2, seed );
/* Generate random numbers */
…



Parallel Computation with BRNG Set Example


#include "mkl_vsl.h"

#define N 1000 /* Vector size */
#define NSTREAMS 10 /* Number of streams */
#define METHOD VSL_METHOD_DGAUSSIAN_ICDF /* Generation method */

int main(void)
{
    double r[NSTREAMS][N];
    VSLStreamStatePtr stream[NSTREAMS];
    double a = 0.0, sigma = 1.0;
    int k, seed = 777;

    /* Initialize random streams, one MT2203 generator per stream */
    for ( k = 0; k < NSTREAMS; k++ )
    {
        vslNewStream( &stream[k], VSL_BRNG_MT2203+k, seed );
    }

    /* Each iteration uses its own stream and its own row of r */
    #pragma omp parallel for
    for ( k = 0; k < NSTREAMS; k++ )
    {
        /* errcode is local to the iteration to avoid a data race */
        int errcode = vdRngGaussian( METHOD, stream[k], N, r[k], a, sigma ); /* Call Gaussian RNG */
        (void)errcode;
    }

    /* De-initialize random streams */
    for ( k = 0; k < NSTREAMS; k++ )
    {
        vslDeleteStream( &stream[k] );
    }

    return 0;
}


VSL RNG Performance Advantages

VSL RNG (MCG31m1) vs. glibc rand()

Implementation                   Run time (seconds)   Speedup over glibc rand()
Standard C rand() function       20.44                1.00
Intel® MKL VSL RNG MCG31m1       3.57                 5.73
OpenMP* version (4 threads)      0.94                 21.74

Intel® Core i7-2600 processor, 3.4 GHz, 8MB L3 cache, 4GB memory; OS: Linux* 64-bit; Intel® C++ Composer XE 2011; Intel® MKL 11.0 Beta


VSL RNGs Performance: Continuous vs. Discrete Generators

[Figure: Intel® MKL 11.0 Beta SOBOL quasi-RNG uniform distribution performance on Intel® Core™ i7-2600. x-axis: SOBOL dimension, 0 to 40; y-axis: CPE, 0 to 7; series: single precision, double precision, integer.]

• CPE (Cycles-Per-Element). Lower is faster



VSL RNGs Performance: Scalability

[Figure: Intel® MKL 11.0 Beta multivariate Gaussian generator scalability. Categories: single precision full, double precision full, single precision packed, double precision packed (full vs. packed matrix storage); y-axis: speedup vs. 1 thread, 0 to 3; series: 2, 4, 8, 16 threads. MT19937 BRNG, vector length 1000, dimension size 100; Sandy Bridge-EP, 16 cores, 3.1 GHz; OS: Linux* 64-bit.]



Summary Statistics

• Algorithms for statistical “raw” analysis of big datasets

• Threaded and vectorized computation routines

Available estimates

Raw and central moments up to the fourth order

Excess kurtosis, skewness, and variation coefficient

Minimum, maximum, quantiles/streaming quantiles, and order statistics

Variance-covariance/correlation matrix

Pooled/group variance-covariance matrix and mean

Partial variance-covariance/correlation matrix

Robust estimators for variance-covariance matrix and mean in presence of outliers

Detection of outliers

Handling of missing values in datasets
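To make the first entries of the list concrete, the sketch below computes the mean, central moments up to the fourth order, and excess kurtosis for a one-dimensional sample in plain C. It is not the VSL algorithm: `toy_moments` and `toy_compute_moments` are hypothetical names, the method is a simple two-pass loop, and the moments use 1/n normalization, which may differ from the library's conventions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: population (1/n) moments of a sample. */
typedef struct {
    double mean;            /* first raw moment */
    double m2, m3, m4;      /* central moments of order 2, 3, 4 */
    double excess_kurtosis; /* m4 / m2^2 - 3; set to 0 when m2 == 0 */
} toy_moments;

/* Two-pass computation: first the mean, then sums of powers of deviations. */
static toy_moments toy_compute_moments(const double *x, size_t n)
{
    toy_moments r = {0.0, 0.0, 0.0, 0.0, 0.0};
    size_t i;

    assert(n > 0);
    for (i = 0; i < n; i++)
        r.mean += x[i];
    r.mean /= (double)n;

    for (i = 0; i < n; i++) {
        double d = x[i] - r.mean;
        double d2 = d * d;
        r.m2 += d2;
        r.m3 += d2 * d;
        r.m4 += d2 * d2;
    }
    r.m2 /= (double)n;
    r.m3 /= (double)n;
    r.m4 /= (double)n;

    r.excess_kurtosis = r.m2 > 0.0 ? r.m4 / (r.m2 * r.m2) - 3.0 : 0.0;
    return r;
}
```

For the sample {1, 2, 3, 4, 5} this gives mean 3, m2 = 2, m3 = 0, m4 = 6.8, and excess kurtosis 6.8 / 4 - 3 = -1.3.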



Summary Statistics Usage Model


– Create a task

vsldSSNewTask( &task, &p, &n, &xstorage, x, w, indices );

– Modify settings of the task parameters

vsldSSEditTask( task, VSL_SS_ED_MEAN, &mean );

– Compute statistical estimates

vsldSSCompute( task, VSL_SS_MEAN, VSL_SS_METHOD_FAST );

– Destroy the task

vslSSDeleteTask( &task );

• Example (VSL SS examples can be found in Intel® MKL packages)

#include "mkl.h"

#define DIM 3 /* Task dimension */
#define N 10000 /* Number of observations */

int main()
{
    VSLSSTaskPtr task;
    MKL_INT dim, n, x_storage, cov_storage, cor_storage;
    double x[N*DIM], cov[DIM*DIM], cor[DIM*DIM], mean[DIM];
    …
    vsldSSNewTask( &task, &dim, &n, &x_storage, x, 0, 0 ); /* Create a task */
    vsldSSEditCovCor( task, mean, cov, &cov_storage, cor, &cor_storage ); /* Modify the task parameters */
    vsldSSCompute( task, VSL_SS_COV|VSL_SS_COR, VSL_SS_METHOD_FAST ); /* Compute statistical estimates */
    vslSSDeleteTask( &task ); /* Destroy the task */
    return 0;
}
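For readers who want to see what a mean and variance-covariance estimate amounts to, here is a plain C sketch without MKL. The function name `toy_mean_cov`, the row-major layout with one observation per row, and the unbiased 1/(n-1) normalization are all assumptions made for illustration; they do not describe the VSL storage formats or algorithms.

```c
#include <assert.h>
#include <stddef.h>

/* x:    n observations of dim variables; observation i occupies
         x[i*dim] .. x[i*dim + dim - 1] (hypothetical layout)
   mean: dim outputs
   cov:  dim*dim outputs, full storage, row-major */
static void toy_mean_cov(const double *x, size_t n, size_t dim,
                         double *mean, double *cov)
{
    size_t i, j, k;

    assert(n > 1 && dim > 0);

    /* Per-variable mean */
    for (j = 0; j < dim; j++) {
        mean[j] = 0.0;
        for (i = 0; i < n; i++)
            mean[j] += x[i * dim + j];
        mean[j] /= (double)n;
    }

    /* Unbiased sample covariance of every pair of variables */
    for (j = 0; j < dim; j++) {
        for (k = 0; k < dim; k++) {
            double s = 0.0;
            for (i = 0; i < n; i++)
                s += (x[i * dim + j] - mean[j]) * (x[i * dim + k] - mean[k]);
            cov[j * dim + k] = s / (double)(n - 1);
        }
    }
}
```

For the three observations (1, 2), (2, 4), (3, 6) the result is mean (2, 4) and covariance matrix [[1, 2], [2, 4]].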


Summary Statistics Performance: Scalability

[Figure: Intel® MKL 11.0 Beta Update 2 Summary Statistics functions scalability on Sandy Bridge-EP, 16 cores, 3.1 GHz; OS: Linux* 64-bit. Categories: covariance SP, covariance DP, quantiles SP, quantiles DP; y-axis: speedup over 1-thread performance, 0 to 16; series: 1, 2, 4, 8, 16 threads.]



Convolution & Correlation

• Linear transformations

• Double/single, real/complex data types

• Direct and Fourier algorithms for one-dimensional and multi-dimensional data (dimensions: 1-7)

• Wrappers are provided (as code samples) for IBM* ESSL* functions

• Fourier algorithms are parallelized through Intel® MKL DFT parallelization



Convolution/Correlation Usage Model


– Create a new task descriptor or copy a task descriptor

vslsConvNewTask(task, mode, dims, xshape, yshape, zshape);
vslConvCopyTask(newtask, srctask);

– Change the parameter settings

vslConvSetMode(task, newmode);
vslConvSetInternalPrecision(task, precision);
vslConvSetStart(task, start);
vslConvSetDecimation(task, decimation);

– Compute convolution or correlation

vslsConvExec(task, x, xstride, y, ystride, z, zstride);

– Delete task object

vslConvDeleteTask(task);


Convolution/Correlation Example

#include "mkl.h"

#define XSHAPE 100
#define YSHAPE 1000
#define ZSHAPE ((XSHAPE-1)+(YSHAPE-1)+1) /* Length of the full linear convolution */
…
VSLConvTaskPtr task;
MKL_INT rank=1, mode=VSL_CONV_MODE_DIRECT, xstride=1, ystride=1, zstride=1;
MKL_INT xshape=XSHAPE, yshape=YSHAPE, zshape=ZSHAPE;
MKL_Complex8 x[XSHAPE], y[YSHAPE], z[ZSHAPE];
…
vslcConvNewTask(&task, mode, rank, &xshape, &yshape, &zshape); /* Create a new task */
vslcConvExec(task, x, &xstride, y, &ystride, z, &zstride); /* Compute convolution */
vslConvDeleteTask(&task); /* Delete task object */

• Conv/Corr examples can be found in Intel® MKL packages
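The direct (non-Fourier) algorithm and the output length (XSHAPE-1)+(YSHAPE-1)+1 can be illustrated with a few lines of plain C. This is a sketch for real single-precision data with unit strides, not the MKL implementation; `toy_conv_direct` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Direct 1D linear convolution: z[k] = sum over i of x[i] * y[k - i].
   z must have room for nx + ny - 1 elements. */
static void toy_conv_direct(const float *x, size_t nx,
                            const float *y, size_t ny, float *z)
{
    size_t i, j;

    assert(nx > 0 && ny > 0);

    for (i = 0; i < nx + ny - 1; i++)
        z[i] = 0.0f;

    /* Each pair (i, j) contributes to output element i + j */
    for (i = 0; i < nx; i++)
        for (j = 0; j < ny; j++)
            z[i + j] += x[i] * y[j];
}
```

Convolving x = {1, 2, 3} with y = {1, 1} yields the four values {1, 3, 5, 3}. The cost is proportional to nx * ny, which is why Fourier algorithms become attractive for long inputs.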


Convolution Performance

[Figure: Intel® MKL 11.0 Beta 2D convolution performance on Intel® Core™ i7-2600; single precision, 1000*1000 array X convolved with array Y. x-axis: Y array size; y-axis: GFlops, 0 to 18; series: 1T, 2T, 4T, 8T, 16T.]



VSL in Applications: Online Noise Filtration

• Investigating statistical dependencies in dataset

– e.g., in a set of stocks

• Data arrive in chunks and are typically noisy

– e.g., “market noise”

– Historical statistics and the latest block are available

• Filtering statistical “noise” from “signal”

– Incremental filtration

• Filtering is based on theory of randomized matrices

– H. Kargupta, K. Sivakumar, and S. Ghosh. Dependency Detection in MobiMine and Random Matrices. In Proceedings of PKDD'2002, pp. 250–262. Springer-Verlag Berlin Heidelberg, 2002.


[Diagram: data arrive in chunks over time; chunk Dk, a matrix of size p x m, arrives at time tk (D1 at t1, D2 at t2, …). Each chunk passes through the filter, which separates a signal component from a noise component for further analysis.]

Major blocks of the filter

• Update the correlation matrix using the latest data chunk Dk

• Apply PCA(*): compute eigenvalues/eigenvectors of the correlation matrix

• Split the eigenvalues into two sets(**): the 1st set represents signal, the 2nd set noise

• Assemble signal and noise correlations from the 2 sets of eigenvalues/eigenvectors

* PCA - Principal Component Analysis

** Split is based on random matrix theory and the distribution of eigenvalues


Initialization

#define P 450 /* number of stocks */
#define M 1000 /* number of observations in block */
…
VSLSSTaskPtr task;
double x[P*M], cor[P*P], mean[P], W[2];
double l1, l2, ratio;
MKL_INT p, m, x_storage, cor_storage;

/* Initialize VSL Summary Stats task */
p = P; m = M;
x_storage = VSL_SS_MATRIX_STORAGE_COLS;
vsldSSNewTask( &task, &p, &m, &x_storage, x, 0, 0 );

/* Set up parameters of the task */
/* Specify memory for correlation estimate in task */
cor_storage = VSL_SS_MATRIX_STORAGE_FULL;
vsldSSEditCovCor( task, mean, 0, 0, cor, &cor_storage );

/* Specify the parameter for progressive estimation of correlation */
W[0] = W[1] = 0.0;
vsldSSEditTask( task, VSL_SS_ED_ACCUM_WEIGHT, W );
…

Computation

/* Set thresholds that define the noise component: the random-matrix
   eigenvalue band (1 -/+ sqrt(p/m))^2; cast to double to avoid
   integer division of p by m */
ratio = sqrt( (double)p / (double)m );
l1 = ( 1.0 - ratio ) * ( 1.0 - ratio );
l2 = ( 1.0 + ratio ) * ( 1.0 + ratio );

/* Loop over data blocks */
for ( nblock = 0; ; nblock++ )
{
    /* Get the next chunk of size p x m into x (user-provided function) */
    GetNextChunk( p, m, x );

    /* Update correlation estimate in cor */
    vsldSSCompute( task, VSL_SS_COR, VSL_SS_METHOD_FAST );

    /* Apply PCA and compute eigenvalues that belong to (l1, l2) and define noise */
    dsyevr( …, l1, l2, … );

    /* Assemble correlation matrix of noise */
    ...
    dsyrk( evect_n, ..., cor_n, ... );

    /* Compute correlation matrix of signal by subtracting cor_n from cor */
}

De-initialization

vslSSDeleteTask( &task );
MKL_Free_Buffers();
…



Online Noise Filtration performance

Intel® Xeon® E5-2690 (16 cores); Intel® MKL 11.0 Beta; S&P500 historic data, block size 450 x 1000

Implementation                                 Seconds per block   Speedup vs. baseline
Baseline implementation (using Netlib)         0.883               1.0
Optimized implementation (using Intel® MKL)    0.031               28.9


Summary

• Vector Statistical Library

o Broad collection of random number generators, summary statistics, and convolution/correlation functions

o Performance enhancements through vectorization and threading

o Compatible with various compilers

o Native C and Fortran APIs

Combine the power of the Intel® Compiler with VSL functions in Intel® MKL to maximize the performance of your applications


Useful Links

• VSL RNG performance data

– http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vsl_performance_data.htm

• MKL Manual, Chapter 10

– http://software.intel.com/en-us/articles/intel-math-kernel-library-documentation/

• VSL RNG Notes

– http://software.intel.com/sites/products/documentation/hpc/mkl/vsl/vslnotes.pdf

• VSL Summary Statistics Notes

– http://software.intel.com/sites/products/documentation/hpc/mkl/ssl/sslnotes.pdf



INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Legal Disclaimer & Optimization Notice

