
Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525.

Portability and Scalability of Sparse Tensor Decompositions on CPU/MIC/GPU Architectures

Christopher Forster, Keita Teranishi, Greg Mackey, Daniel Dunlavy and Tamara Kolda

SAND2017-6575 C

Sparse Tensor Decomposition

§ Develop production-quality library software to perform CP factorization with Alternating Poisson Regression on HPC platforms
  § SparTen
    § Supports several HPC platforms
    § Node parallelism (Multicore, Manycore and GPUs)
§ Major Questions
  § Software Design
  § Performance Tuning
§ This talk
  § We are interested in two major variants
    § Multiplicative Updates
    § Projected Damped Newton for Row-subproblems


CP Tensor Decomposition

§ Express the important features of data using a small number of vector outer products

Key references: Hitchcock (1927), Harshman (1970), Carroll and Chang (1970)

CANDECOMP/PARAFAC (CP) Model

Model:
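The model equation here did not survive extraction; for a three-way tensor, the standard CP model from the references above expresses the data as a sum of R rank-one outer products:

```latex
\mathcal{X} \;\approx\; \llbracket \lambda;\, A, B, C \rrbracket
\;=\; \sum_{r=1}^{R} \lambda_r \, a_r \circ b_r \circ c_r
```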

Poisson for Sparse Count Data

Gaussian (typical): the random variable x is a continuous real-valued number.
Poisson: the random variable x is a discrete nonnegative integer.

Model: Poisson distribution (nonnegative factorization)
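For count data, CP-APR fits this model by minimizing the Poisson negative log-likelihood of a nonnegative model tensor M against the data tensor X (Chi and Kolda); writing m_i and x_i for corresponding entries:

```latex
\min_{\mathcal{M} \ge 0} \; f(\mathcal{M}) \;=\; \sum_{i} \bigl( m_i - x_i \log m_i \bigr),
\qquad \mathcal{M} = \llbracket \lambda;\, A^{(1)}, \ldots, A^{(N)} \rrbracket
```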

Sparse Poisson Tensor Factorization

§ Nonconvex problem!
  § Assume R is given
§ Minimization problem with constraint
  § The decomposed vectors must be non-negative
§ Alternating Poisson Regression (Chi and Kolda, 2011)
  § Assume (d-1) factor matrices are known and solve for the remaining one

New Method: Alternating Poisson Regression (CP-APR)

Repeat until converged…

Fix B, C; solve for A
Fix A, C; solve for B
Fix A, B; solve for C

Convergence Theory

Theorem: The CP-APR algorithm will converge to a constrained stationary point if the subproblems are strictly convex and solved exactly at each iteration. (Chi and Kolda, 2011)

CP-APR


Minimization problem is expressed as:

CP-APR


Minimization problem is expressed as:

• 2 major approaches
  • Multiplicative Updates, like Lee & Seung (2000) for matrices, extended by Chi and Kolda (2011) for tensors
  • Newton and Quasi-Newton methods for Row-subproblems by Hansen, Plantenga and Kolda (2014)

Key Elements of MU and PDNR methods

Multiplicative Update (MU)
§ Key computations
  § Khatri-Rao Product
  § Modifier (10+ iterations)
§ Key features
  § Factor matrix is updated all at once
  § Exploits the convexity of row subproblems for global convergence

Projected Damped Newton for Row-subproblems (PDNR)
§ Key computations
  § Khatri-Rao Product
  § Constrained non-linear Newton-based optimization for each row
§ Key features
  § Factor matrix can be updated by rows
  § Exploits the convexity of row-subproblems
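As a concrete illustration of the MU column above, a single multiplicative sweep over one factor matrix might look like the following dense, sequential sketch. This is a hypothetical toy, not SparTen's kernels: B is the I x R factor being updated, Pi the R x J Khatri-Rao product, X the I x J matricized data, and the lambda rescaling step is omitted (the columns of Pi are assumed to sum to 1).

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One multiplicative-update sweep: B(i,r) *= sum_j Pi(r,j) * X(i,j) / M(i,j),
// where M = B * Pi is the current model. Multiplying by a nonnegative factor
// keeps B nonnegative, which is the appeal of the MU scheme.
void mu_update(std::vector<double>& B, const std::vector<double>& X,
               const std::vector<double>& Pi, int I, int J, int R) {
  for (int i = 0; i < I; ++i) {
    std::vector<double> phi(R, 0.0);
    for (int j = 0; j < J; ++j) {
      double m = 0.0;  // model entry M(i,j) = sum_r B(i,r) * Pi(r,j)
      for (int r = 0; r < R; ++r) m += B[i * R + r] * Pi[r * J + j];
      if (m > 0.0)
        for (int r = 0; r < R; ++r) phi[r] += Pi[r * J + j] * X[i * J + j] / m;
    }
    for (int r = 0; r < R; ++r) B[i * R + r] *= phi[r];  // multiplicative step
  }
}
```

Note how the whole factor matrix is updated at once, matching the "all at once" bullet above, whereas PDNR would instead run an independent constrained optimization per row i.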

CP-APR-MU


Key Computations

CP-APR-PDNR


Key Computations

Algorithm 1: CP-APR-PDNR algorithm

Input:  Sparse N-mode tensor X of size I_1 × I_2 × … × I_N and the number of components R
Output: Kruskal tensor M = [λ; A^(1), …, A^(N)]

1  Initialize
2  repeat
3    for n = 1, …, N do
4      Let Π^(n) = (A^(N) ⊙ ⋯ ⊙ A^(n+1) ⊙ A^(n−1) ⊙ ⋯ ⊙ A^(1))^T
5      for i = 1, …, I_n do
6        Find b_i^(n) s.t. min_{b_i^(n) ≥ 0} f_row(b_i^(n), x_i^(n), Π^(n))
7      end
8      λ = e^T B^(n), where B^(n) = [b_1^(n) … b_{I_n}^(n)]^T
9      A^(n) ← B^(n) Λ^(−1), where Λ = diag(λ)
10   end
11 until all mode subproblems converged
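The inner row solve (find b_i^(n) ≥ 0 minimizing f_row) can be illustrated with a hypothetical scalar toy, reducing the row to a single component (R = 1). This is not the SparTen implementation; it only shows the damped-Newton-with-projection idea on the one-dimensional analogue of the row objective.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Scalar (R = 1) analogue of the PDNR row subproblem. The row objective
// reduces to f(b) = b * s - sum_j x_j * log(b * pi_j), with s = sum_j pi_j,
// whose minimizer over b >= 0 is (sum_j x_j) / s. We take Newton steps and
// halve (damp) the step until the iterate stays strictly positive.
double solve_row_1d(const std::vector<double>& x,
                    const std::vector<double>& pi, double b = 1.0) {
  double s = 0.0, xsum = 0.0;
  for (double p : pi) s += p;
  for (double v : x) xsum += v;
  for (int it = 0; it < 100; ++it) {
    double g = s - xsum / b;          // f'(b)
    if (std::fabs(g) < 1e-12) break;  // stationary point reached
    double h = xsum / (b * b);        // f''(b) > 0: the subproblem is convex
    double step = g / h, damp = 1.0;
    while (b - damp * step <= 0.0) damp *= 0.5;  // keep b > 0 (projection)
    b -= damp * step;
  }
  return b;
}
```

In the full algorithm this solve runs independently for every row i of mode n, which is exactly what makes PDNR amenable to the task parallelism described later.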

PARALLEL CP-APR ALGORITHMS

Parallelizing CP-APR
§ Focus on on-node parallelism for multiple architectures
§ Multiple choices for programming
  § OpenMP, OpenACC, CUDA, Pthread …
  § Manage different low-level hardware features (cache, device memory, NUMA …)
§ Our Solution: Use Kokkos for productivity and performance portability
  § Abstraction of parallel loops
  § Abstraction of data layout (row-major, column-major, programmable memory)
  § Same code to support multiple architectures

Kokkos

Supports multiple architectures: Intel Multicore, Intel Manycore, NVIDIA GPU, IBM Power, AMD Multicore/APU, ARM

What is Kokkos?

§ Templated C++ library by Sandia National Labs (Edwards, et al.)
  § Serves as a substrate layer for sparse matrix and vector kernels
  § Supports any machine precision
    § Float
    § Double
    § Quad and half float if needed
§ Kokkos::View() accommodates performance-aware multidimensional array data objects
  § Light-weight C++ class
§ Parallelizing loops using the C++ language standard
  § Lambdas
  § Functors
§ Extensive support of atomics

Parallel Programming with Kokkos

§ Provides parallel loop operations using C++ language features
§ Conceptually, the usage is no more difficult than OpenMP. The annotations just go in different places.

Serial:

for (size_t i = 0; i < N; ++i) {
  /* loop body */
}

OpenMP:

#pragma omp parallel for
for (size_t i = 0; i < N; ++i) {
  /* loop body */
}

Kokkos:

parallel_for(N, [=] (const size_t i) {
  /* loop body */
});

Kokkos information courtesy of Carter Edwards

Why Kokkos?

§ Complies with the C++ language standard!
§ Supports multiple back-ends
  § Pthread, OpenMP, CUDA, Intel TBB and Qthread
§ Supports multiple data layout options
  § Column vs Row Major
  § Device/CPU memory
§ Supports different parallelism
  § Nesting support
  § Vector, threads, Warp, etc.
  § Task parallelism (under development)

Array Access by Kokkos

Kokkos::View<double **, Layout, Space>

[Figure: Row-major (View<double **, Right, Space>) — each thread reads across a row. Column-major (View<double **, Left, Space>) — each thread reads down a column.]

Array Access by Kokkos

Kokkos::View<double **, Layout, Space>

[Figure: Row-major (View<double **, Right, Host>) — contiguous reads per thread. Column-major (View<double **, Left, CUDA>) — coalesced reads within a warp.]
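The layout choice above amounts to a different index-to-offset map. A minimal plain-C++ sketch (no Kokkos dependency; function names are illustrative, not Kokkos API):

```cpp
#include <cassert>
#include <cstddef>

// Offset of element (i, j) in an I x J array for the two layouts Kokkos::View
// selects via LayoutRight/LayoutLeft. With LayoutRight (row-major, the Host
// default) consecutive j are adjacent in memory, so one CPU thread sweeping j
// gets contiguous reads. With LayoutLeft (column-major, the CUDA default)
// consecutive i are adjacent, so GPU threads with adjacent i get coalesced reads.
inline std::size_t offset_right(std::size_t i, std::size_t j,
                                std::size_t I, std::size_t J) {
  (void)I;
  return i * J + j;  // row-major: stride J between rows
}

inline std::size_t offset_left(std::size_t i, std::size_t j,
                               std::size_t I, std::size_t J) {
  (void)J;
  return j * I + i;  // column-major: stride I between columns
}
```

Because the layout is a template parameter of the View, the same loop body compiles to the access pattern appropriate for each architecture — the "same code, multiple architectures" point made earlier.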

Parallel CP-APR-MU

Data Parallel

Parallel CP-APR-PDNR

Data Parallel + Task Parallel

Notes on Data Structure

§ Use Kokkos::View
§ Sparse Tensor
  § Similar to the Coordinate (COO) format in sparse matrix representation
§ Kruskal Tensor & Khatri-Rao Product
  § Provide options for row or column major
  § Kokkos::View provides an option to define the leading dimension
    § Determined at compile time or runtime
§ Avoid Atomics
  § Expensive on CPUs and Manycore
  § Use extra indexing data structure
§ CP-APR-PDNR
  § Creates a pool of tasks
  § A dedicated buffer space (Kokkos::View) is assigned to each individual task
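A COO-style sparse tensor of the kind described above might be laid out as follows. This is a hypothetical sketch, not SparTen's actual classes; in the Kokkos version each array would be a Kokkos::View rather than a std::vector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Coordinate (COO) storage for an N-mode sparse tensor: nonzero k keeps its
// N subscripts in subs[k*nmodes .. k*nmodes + nmodes - 1] and its count
// value in vals[k]. Flat arrays keep the data contiguous, which matters for
// the NUMA first-touch and coalescing behavior discussed later.
struct SparseTensorCOO {
  int nmodes;                 // N
  std::vector<int64_t> dims;  // mode sizes I_1, ..., I_N
  std::vector<int64_t> subs;  // nnz * nmodes subscripts, nonzero-major
  std::vector<double> vals;   // nnz nonzero values
  std::size_t nnz() const { return vals.size(); }
};
```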


PERFORMANCE


Performance Test

§ Strong Scalability
  § Problem size is fixed
§ Random Tensor
  § 3K x 4K x 5K, 10M nonzero entries
  § 100 outer iterations
§ Realistic Problems
  § Count Data (Non-negative)
  § Available at http://frostt.io/
  § 10 outer iterations
§ Double Precision


Data      | Dimensions               | Nonzeros | Rank
LBNL      | 2K x 4K x 2K x 4K x 866K | 1.7M     | 10
NELL-2    | 12K x 9K x 29K           | 77M      | 10
NELL-1    | 3M x 2M x 25M            | 144M     | 10
Delicious | 500K x 17M x 3M x 1K     | 140M     | 10

CP-APR-MU on CPU (Random)

[Figure: execution time (s) vs. core count (1-28), broken down into Pi, Phi, and Update phases. CP-APR-MU method, 100 outer-iterations, (3000 x 4000 x 5000, 10M nonzero entries), R=10, PC cluster, 2 Haswell (14 core) CPUs per node, MKL-11.3.3, HyperThreading disabled.]

Results: CP-APR-MU Scalability

Data      | Haswell 1-core | 2 Haswell, 14 cores | 2 Haswell, 28 cores | KNL 68-core   | NVIDIA P100 GPU
Random    | 1715 s* (1x)   | 279 s (6.14x)       | 165 s (10.39x)      | 20 s (85.74x) | 10 s (171.5x)
LBNL      | 131 s (1x)     | 32 s (4.09x)        | 32 s (4.09x)        |               | 103 s (1.27x)
NELL-2    | 1226 s (1x)    | 159 s (7.77x)       | 92 s (13.32x)       |               | 873 s (1.40x)
NELL-1    | 5410 s (1x)    | 569 s (9.51x)       | 349 s (15.50x)      |               | 1690 s (3.20x)
Delicious | 5761 s (1x)    | 2542 s (2.26x)      | 2524 s (2.28x)      |               |

100 outer iterations for the random problem; 10 outer iterations for realistic problems.
* Pre-Kokkos C++ code on 2 Haswell CPUs: 1-core, 2136 sec; 14-cores, 762 sec; 28-cores, 538 sec.

CP-APR-PDNR on CPU (Random)

[Figure: execution time (s) vs. core count (1-28), broken down into Pi and RowSub phases. CP-APR-PDNR method, 100 outer-iterations, 1831221 inner iterations total, (3000 x 4000 x 5000, 10M nonzero entries), R=10, PC cluster, 2 Haswell (14 core) CPUs per node, MKL-11.3.3, HyperThreading disabled.]

Results: CP-APR-PDNR Scalability

Data      | Haswell 1-core | 2 Haswell, 14 cores | 2 Haswell, 28 cores
Random    | 817 s* (1x)    | 73 s (11.19x)       | 44 s (18.58x)
LBNL      | 441 s (1x)     | 187 s (2.35x)       | 191 s (2.30x)
NELL-2    | 2162 s (1x)    | 326 s (6.63x)       | 319 s (6.77x)
NELL-1    | 17212 s (1x)   | 4241 s (4.05x)      | 3974 s (4.33x)
Delicious | 18992 s (1x)   | 3684 s (5.15x)      | 3138 s (6.05x)

100 outer iterations for the random problem; 10 outer iterations for realistic problems.
* Pre-Kokkos C++ code spends 3270 sec on 1 core.

Performance Issues

§ Our implementation exhibits very good scalability with the random tensor.
  § Similar mode sizes
  § Regular distribution of nonzero entries
  § Some cache effects
  § Kokkos is NUMA-aware for contiguous memory access (first-touch)
§ Some scalability issues with the realistic tensor problems.
  § Irregular nonzero distribution and disparity in mode sizes
  § Task-parallel code may have some memory locality issues accessing the sparse tensor, Kruskal tensor, and Khatri-Rao product
  § Preprocessing could improve the locality
    § Explicit data partitioning (Smith and Karypis)
    § Possible to implement using Kokkos

Memory Bandwidth (Stream Benchmark)

§ All cores deliver approximately an 8x performance improvement over a single thread
§ Hard to scale using all cores with memory-bound code.

[Figure: Stream benchmark on 2x 14-core Intel Haswell CPUs; memory bandwidth (MBytes/sec) vs. number of cores (0-30).]

Conclusion

§ Development of portable on-node parallel CP-APR solvers
  § Data parallelism for the MU method
  § Mixed data/task parallelism for the PDNR method
  § Multiple architecture support using Kokkos
§ Scalable performance for random sparse tensors
§ Future Work
  § Projected Quasi-Newton for Row-subproblems (PQNR)
  § GPU and Manycore support for PDNR and PQNR
  § Performance tuning to handle irregular nonzero distributions and disparity in mode sizes