Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system

..........

.....

.....................................................................

.....

......

.....

.....

.

. . . . . . .Background

. . . . .How to use CUDA to accelerate

. . . .Experiment Q & A

Accelerate Reed-solomon coding forFault-Tolerance in RAID-like system

Shuai YUAN

NTHULSA lab

1 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.




Outline

▶ Background

▶ How to use CUDA to accelerate

▶ Experiment

▶ Q & A

2 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.




Why Reed-Solomon codes?

Why fault-tolerance?

In a RAID-like system, storage is distributed among several devices,the probability of one of these devices failing becomes significant.

3 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.






In a RAID-like system, storage is distributed among several devices,the probability of one of these devices failing becomes significant.MTTF (mean time to failure) of one device is P ,

MTTF of a system of n devices isP

n.

3 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.






In a RAID-like system, storage is distributed among several devices,the probability of one of these devices failing becomes significant.MTTF (mean time to failure) of one device is P ,

MTTF of a system of n devices isP

n.

Thus in such systems, fault-tolerance must be taken into account.

3 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.





Why erasure codes?

Replication is expensive!e.g. To achieve double-fault tolerance, we need 2 replicas.

4 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.





Erasure Codes

Given a message/file of k equal-size blocks/chunks/fragments,encode it into n blocks/chunks/fragments (n > k)

▶ Optimal erasure codes: any k out of the n fragments aresufficient to recover the original message/file.i.e. can tolerate arbitrary (n− k) failures.We call this (n,k) MDS (maximum distance separable)property.e.g. Reed Solomon Codes

▶ Near-optimal erasure codes: require (1 + ε)k codeblocks/chunks to recover (ε > 0).i.e. cannot tolerate arbitrary (n− k) failures.e.g. Local Reconstruction Codes (used in WAS)

5 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.





Replication vs. Erasure Codes

Given a file of size MUsing Reed Solomon Codes(n=4, k=2)

Achieve double fault tolerance

▶ R-S Codes: storage overhead = M

▶ Relication: storage overhead = 2M (2 replicas)

6 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.




Reed-Solomon coding mechanism

Reed-Solomon encoding process

Given file F : divide it into k equal-size native chunks:F = [Fi]i=1,2,...,k. Encode them into (n− k) code chunks:C = [Ci]i=1,2,...,n−k.Use Encoding Matrix EM(n−k)×k to produce code chunks:

CT = EM × F T

Ci is the linear combination of F1, F2, . . . , Fk.

7 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.






In our case,F =

(A B

)EM =

(1 11 2

)CT =

(1 11 2

)×

(AB

)=

(A+BA+ 2B

)

7 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.






Let P = [Pi]i=1,2,...,n = [F1, F2, . . . , Fk, C1, C2, . . . , Cn−k] be the

n chunks in storage, EM ′ =

(I

EM

), Here

I =

1 0 . . . 00 1 . . . 0...

.... . .

...0 0 . . . 1

then

P T = EM ′ × F T

=

(F T

CT

)

7 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.






In our case,

EM ′ =

(I

EM

)=

1 00 11 11 2

P T = EM ′ × F T

=

1 00 11 11 2

×(AB

)

=

AB

A+BA+ 2B

7 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.






EM ′ is the key of MDS property!

7 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.






Theorem (necessary and sufficient condition)

Every possible k × k submatrix obtained by removing (n− k) rowsfrom EM ′ has full rank.equivalent expression of full rank:

▶ rank = k

▶ non-singular...

Alternative view:Consider the linear space ofP = [Pi]i=1,2,...,n = [F1, F2, . . . , Fk, C1, C2, . . . , Cn−k], itsdimension is k, and any k out of n vectors form a basis of thelinear space.

7 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.






Reed-Solomon Codes uses Vandermonde matrix V as EM

V =

1 1 1 . . . 11 2 3 . . . k12 22 32 . . . k2

......

.... . .

...

1(n−k) 2(n−k) 3(n−k) . . . k(n−k)

7 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.






EM ′ =

1 0 0 . . . 00 1 0 . . . 00 0 1 . . . 0...

......

. . ....

0 0 0 . . . 11 1 1 . . . 11 2 3 . . . k12 22 32 . . . k2

......

.... . .

...

1(n−k) 2(n−k) 3(n−k) . . . k(n−k)

7 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.






Remark:

▶ All arithmetic operations in Galois Field GF(2x). Then everynumber is less than 2x.

▶ EM ′ satisfies MDS property.

7 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.





Reed-Solomon decoding process

We randomly select k out of n fragments, sayX = [x1, x2, · · · , xk]T .Then we use the coefficients of each fragments xi to construct ak × k matrix V ′.Original data can be regenerated by multiplying matrix X with theinverse of matrix V ′:

F = V ′−1X

8 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.




Optimization motivation

Optimization motivation

The reason that discourages using Reed-Solomon coding to replacereplication is its high computational complexity:

▶ Galois Field arithmetic

▶ matrix operations

9 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.




Accelerate operations in Galois Field (GF)


GF(2w) field is constructed by finding a primitive polynomial q(x)of degree w over GF(2), and then enumerating the elements(which are polynomials) with the generator x. Addition in this fieldis performed using polynomial addition, and multiplication isperformed using polynomial multiplication and taking the resultmodulo q(x).

10 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.






GF(2w) field is constructed by finding a primitive polynomial q(x)of degree w over GF(2), and then enumerating the elements(which are polynomials) with the generator x. Addition in this fieldis performed using polynomial addition, and multiplication isperformed using polynomial multiplication and taking the resultmodulo q(x).In implementation, we map the elements of GF(2w) to binarywords of size w, so that computation in GF(2w) becomes bitwiseoperations. Let r(x) be a polynomial in GF(2w). Then we canmap r(x) to a binary word b of size w by setting the ith bit of b tothe coefficient of xi in r(x).

10 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.






GF(2w) field is constructed by finding a primitive polynomial q(x)of degree w over GF(2), and then enumerating the elements(which are polynomials) with the generator x. Addition in this fieldis performed using polynomial addition, and multiplication isperformed using polynomial multiplication and taking the resultmodulo q(x).In implementation, we map the elements of GF(2w) to binarywords of size w, so that computation in GF(2w) becomes bitwiseoperations. Let r(x) be a polynomial in GF(2w). Then we canmap r(x) to a binary word b of size w by setting the ith bit of b tothe coefficient of xi in r(x).After mapping, Addition/subtraction of binary elements of GF(2w)can be performed by bitwise exclusive-or.

10 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.






However, multiplication is still costly.Use bitwise operations to emulate polynomial multiplication inGF(2w): a loop of w iterations.e.g. In GF(28), multiplying 2 bytes takes 8 iterations.

11 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.






CPU approach

store all the result in a multiplication table

e.g. In GF(28), multiplying 2 bytes takes 256× 256 = 64KBconsumes a lot of space ⇝ not suitable for GPU

11 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.






trade-off between computation and memory consumption ([1])We build up two smaller tables:▶ exponential table: maps from a binary element b to power j

such that xj is equivalent to b.▶ logarithm table: maps from a power j to its binary element b

such that xj is equivalent to b.

Multiplication in GF(2w):▶ looking each binary number in the logarithm table for its

logarithm (this is equivalent to finding the polynomial)▶ adding the logarithms modulo 2w − 1 (this is equivalent to

multiplying the polynomial modulo q(x))▶ looking up exponential table to convert the result back to a

binary number

Totally three table lookups and a modular addition.11 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.




Accelerate encoding process

Accelerate encoding process

▶ Generating Vandermonde’s Matrix

▶ Matrix multiplication

12 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.




Accelerate decoding process


▶ Computing inverse matrix

▶ Matrix multiplication

13 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.





Accelerate matrix inversion

Guassian/Guass-Jordon elimination in GF(28)

[V ′I] =

1 0 0 0 | 1 0 0 00 1 0 0 | 0 1 0 01 2 3 4 | 0 0 1 01 1 1 1 | 0 0 0 1

14 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.







[V ′I] =

1 0 0 0 | 1 0 0 00 1 0 0 | 0 1 0 00 2 3 4 | 1 0 1 00 1 1 1 | 1 0 0 1

14 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.







[V ′I] =

1 0 0 0 | 1 0 0 00 1 0 0 | 0 1 0 00 0 3 4 | 1 2 1 00 0 1 1 | 1 1 0 1

14 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.







[V ′I] =

1 0 0 0 | 1 0 0 00 1 0 0 | 0 1 0 00 0 1 247 | 244 245 244 00 0 0 246 | 245 244 244 1

14 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.







[V ′I] =

1 0 0 0 | 1 0 0 00 1 0 0 | 0 1 0 00 0 1 0 | 104 187 186 2100 0 0 1 | 105 186 186 211

14 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.







GPU

▶ normalize row

▶ make column into reduced echelon form

14 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.




Experiment Settings

Experiment Settings

▶ Intel(R) Xeon(R) CPU E5620 @ 2.40GHz

▶ nVidia Corporation GF100 [Tesla C2050]

15 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.




GPU Cost Breakdown

Experiment Result (GPU cost breakdown)

k = 4, n = 6

29.45792

173.697983

218.380676

429.164764

576.414185

0

100

200

300

400

500

600

752KB 47MB 161MB 679MB 1.1GB

time(ms)

file size

GPU R-S encoding

Copy code chunks from GPU memory to hostmemory

Encoding file (matrix multiplication)

Copy encoding matrix from GPU memory tohost memory

Generating encoding matrix

Copy data chunks from host memory to GPUmemory

16 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.




GPU Cost Breakdown


k = 4, n = 6

0

50

100

150

200

250

300

350

400

450

0 200000 400000 600000 800000 1000000 1200000

time(ms)

file size(KB)

GPU R-S encoding

Total computation time

Total communication time

16 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.




GPU Cost Breakdown


k = 4, n = 6

31.092863

178.489853

262.113556

614.687073

751.384766

0

100

200

300

400

500

600

700

800

752KB 47MB 161MB 679MB 1.1GB

time(ms)

file size

GPU R-S decoding

Copy data chunks from GPU memory to hostmemory

Decoding file (matrix multiplication)

Generating decoding matrix (inverse matrix)

Copy encoding matrix from host memory toGPU memory

Copy code chunks from host memory to GPUmemory

16 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.




GPU Cost Breakdown


k = 4, n = 6

0

100

200

300

400

500

600

0 200000 400000 600000 800000 1000000 1200000

time(ms)

file size(KB)

GPU R-S decoding

Total computation time

Total communication time

16 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.




GPU vs CPU

Experiment Result (GPU vs CPU)

k = 4, n = 6

9.87 X

25.73 X

58.06 X

64.72 X

0

200

400

600

800

1000

1200

1400

1600

1800

2000

0 200000 400000 600000 800000 1000000 1200000

bandwidth (MB/s)

file size(KB)

RS encode: CPU vs GPU

GPU bandwidth

CPU bandwidth

17 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.




GPU vs CPU

Experiment Result (GPU vs CPU)

k = 4, n = 6

1.47 X

18.12 X

39.19 X

75.4 X

89.99 X

0

200

400

600

800

1000

1200

1400

1600

0 200000 400000 600000 800000 1000000 1200000

bandwidth (MB/s)

file size(KB)

RS decode: CPU vs GPU

GPU bandwidth

CPU bandwidth

17 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.




tuning k

Experiment Result (tuning k)

File size: 1.1GBdouble fault tolerance

576.414185 665.971008 947.589661

1495.114624

2702.535156

5060.778809

0

1000

2000

3000

4000

5000

6000

4 8 16 32 64 128

time(ms)

k

Total GPU encoding time (n-k=2)

Total GPU encoding time

18 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.




tuning k


File size: 1.1GBdouble fault tolerance

751.384766 1028.636963 1210.730591

1856.644165

3167.548828

5958.092285

0

1000

2000

3000

4000

5000

6000

7000

4 8 16 32 64 128

time(ms)

k

Total GPU decoding time (n-k=2)

Total GPU decoding time

18 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.




tuning k


File size: 1.1GBtriple fault tolerance

766.305176 695.003662 934.1427

1509.532715

2674.739502

5053.129883

0

1000

2000

3000

4000

5000

6000

4 8 16 32 64 128

time(ms)

k

Total GPU encoding time (n-k=3)

Total GPU encoding time

18 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.




tuning k


File size: 1.1GBtriple fault tolerance

873.917542 1026.679443 1345.650269

1875.053711

3167.339355

5960.175293

0

1000

2000

3000

4000

5000

6000

7000

4 8 16 32 64 128

time(ms)

k

Total GPU decoding time (n-k=3)

Total GPU decoding time

18 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.




Q & A

Thank You!

19 / 19

..........

.....

.....................................................................

.....

......

.....

.....

.




J.S. Plank et al.A tutorial on reed-solomon coding for fault-tolerance inraid-like systems.Software Practice and Experience, 27(9):995–1012, 1997.

19 / 19

Technology

Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system