Solving Quadratic Equations with XL on Parallel Architectures

Chen-Mou Cheng¹, Tung Chou², Ruben Niederhagen², Bo-Yin Yang²

¹ National Taiwan University, ² Academia Sinica, Taipei, Taiwan

Leuven, Sept. 11, 2012
The XL algorithm

Some cryptographic systems can be attacked by solving a system of multivariate quadratic equations, e.g.:

- AES:
  - 8000 quadratic equations with 1600 variables over F_2 (Courtois and Pieprzyk, 2002)
  - 840 sparse quadratic equations and 1408 linear equations over 3968 variables of F_256 (Murphy and Robshaw, 2002)
- multivariate cryptographic systems, e.g. the QUAD stream cipher (cryptanalysis by Yang, Chen, Bernstein, and Chen, 2007)
The XL algorithm

- XL is an acronym for extended linearization:
  - extend the quadratic system by multiplying with appropriate monomials
  - linearize by treating each monomial as an independent variable
  - solve the linearized system
- special case of Gröbner basis algorithms
- first suggested by Lazard (1983)
- reinvented by Courtois, Klimov, Patarin, and Shamir (2000)
- alternative to Gröbner basis solvers like Faugère's F4 (1999; e.g., in Magma) and F5 (2002) algorithms
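The extend/linearize/solve steps can be illustrated on a toy system over GF(2). The system below (f1 = x0·x1 + x1 + 1, f2 = x0 + x1 + 1, common root (0, 1)) and the degree D = 3 are invented for this sketch and are not from the talk:

```python
# A minimal XL sketch over GF(2) with n = 2 variables (toy system, D = 3).
# A polynomial is a set of monomials; a monomial is a frozenset of variable
# indices (over GF(2), x_i^2 = x_i, so monomials are squarefree).
from itertools import combinations

ONE = frozenset()  # the constant monomial 1

def mul(p, q):
    """Multiply two GF(2) polynomials; addition is symmetric difference."""
    r = set()
    for a in p:
        for b in q:
            r ^= {a | b}
    return r

# Toy system: f1 = x0*x1 + x1 + 1, f2 = x0 + x1 + 1; common root (0, 1).
f1 = {frozenset({0, 1}), frozenset({1}), ONE}
f2 = {frozenset({0}), frozenset({1}), ONE}
system = [f1, f2]
n, D = 2, 3

# Extend: multiply every equation by all monomials of degree <= D - 2.
mons = [frozenset(c) for d in range(D - 1) for c in combinations(range(n), d)]
rows = [mul({m}, f) for m in mons for f in system]

# Linearize: every squarefree monomial of degree >= 1 is a new variable;
# the constant term moves to the right-hand side.
cols = [frozenset(c) for d in range(1, n + 1) for c in combinations(range(n), d)]
M = [[1 if c in r else 0 for c in cols] + [1 if ONE in r else 0] for r in rows]

# Solve: Gaussian elimination over GF(2) on the augmented matrix.
piv = {}
for col in range(len(cols)):
    for r in range(len(M)):
        if r not in piv.values() and M[r][col]:
            piv[col] = r
            for s in range(len(M)):
                if s != r and M[s][col]:
                    M[s] = [a ^ b for a, b in zip(M[s], M[r])]
            break

sol = {cols[c]: M[r][-1] for c, r in piv.items()}
print(sol[frozenset({0})], sol[frozenset({1})])  # x0 = 0, x1 = 1
```

At D = 3 the extension already forces x0 = 0 (e.g. x0·f1 reduces to x0 = 0), so the linearized system pins down the unique root.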
The XL algorithm

For b ∈ N^n, denote by x^b the monomial x_1^{b_1} x_2^{b_2} ⋯ x_n^{b_n} and by |b| = b_1 + b_2 + ⋯ + b_n the total degree of x^b.

given: a finite field K = F_q and a system A of m multivariate quadratic equations ℓ_1 = ℓ_2 = ⋯ = ℓ_m = 0, ℓ_i ∈ K[x_1, x_2, …, x_n]
choose: an operational degree D ∈ N
extend: the system A to the system R^(D) = { x^b · ℓ_i = 0 : |b| ≤ D − 2, ℓ_i ∈ A }
linearize: consider each monomial x^d with |d| ≤ D a new variable to obtain a linear system M
solve: the linear system M

Minimum degree D_0 for reliable termination (Yang and Chen):

  D_0 := min{ D : [(1 − λ)^(m−n−1) (1 + λ)^m]_[D] ≤ 0 },

where [D] denotes the coefficient of λ^D.

We use the Wiedemann algorithm instead of a Gauss solver. Thus, we do not compute a complete Gröbner basis but distinguished solutions!
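The D_0 bound can be evaluated directly by expanding the power series and scanning for the first non-positive coefficient. A sketch, where the parameters m = 20, n = 10 are illustrative only and the exponent m − n − 1 requires m > n:

```python
# Operational degree D0 (Yang-Chen): the smallest D for which the
# coefficient of lambda^D in (1 - lambda)^(m-n-1) * (1 + lambda)^m
# is non-positive. Plain coefficient lists, index = degree.

def poly_mul(p, q, deg):
    """Product of two coefficient lists, truncated at degree `deg`."""
    r = [0] * (deg + 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            if i + j <= deg:
                r[i + j] += a * b
    return r

def poly_pow(base, e, deg):
    """e-th power of a coefficient list, truncated at degree `deg`."""
    r = [1]
    for _ in range(e):
        r = poly_mul(r, base, deg)
    return r

def d0(m, n, max_deg=100):
    """Minimum operational degree for m quadratic equations in n variables."""
    series = poly_mul(poly_pow([1, -1], m - n - 1, max_deg),
                      poly_pow([1, 1], m, max_deg), max_deg)
    return next(D for D, c in enumerate(series) if c <= 0)

print(d0(20, 10))  # D0 = 4: coefficient at D = 3 is 66, at D = 4 it is -129
```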
The Wiedemann algorithm – the basic idea

given: A ∈ K^(N×N)
wanted: v ∈ K^N such that Av = 0
solution: compute the minimal polynomial f of A of degree d, i.e., f(A) = 0:

  Σ_{i=0}^{d} f_i A^i = 0
  Σ_{i=0}^{d} f_i A^i z = 0        (choose z ∈ K^N randomly)
  Σ_{i=1}^{d} f_i A^i z + f_0 z = 0
  Σ_{i=1}^{d} f_i A^i z = 0        (since f_0 = 0)
  A · ( Σ_{i=1}^{d} f_i A^(i−1) z ) = 0,  so  v = Σ_{i=1}^{d} f_i A^(i−1) z.
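A tiny instance of this derivation over GF(101), with a toy matrix invented for the sketch: instead of computing the true minimal polynomial, it uses the characteristic polynomial, which by Cayley–Hamilton also annihilates A and, for singular A, likewise has constant term f_0 = ±det(A) = 0:

```python
# The slide's kernel-vector trick over GF(p): if f annihilates the singular
# matrix A and f_0 = 0, then v = sum_{i>=1} f_i A^{i-1} z satisfies A v = 0.
p = 101

def mat_vec(A, v):
    """Matrix-vector product over GF(p)."""
    return [sum(a * b for a, b in zip(row, v)) % p for row in A]

# Singular over the integers (row 2 = row 0 + row 1), hence over GF(101).
A = [[1, 2, 3],
     [4, 5, 6],
     [5, 7, 9]]

# Characteristic polynomial of a 3x3 matrix:
# f(L) = L^3 - tr(A) L^2 + c1 L - det(A), c1 = sum of principal 2-minors.
tr = A[0][0] + A[1][1] + A[2][2]
c1 = (A[0][0]*A[1][1] - A[0][1]*A[1][0]
    + A[0][0]*A[2][2] - A[0][2]*A[2][0]
    + A[1][1]*A[2][2] - A[1][2]*A[2][1])
det = (A[0][0]*(A[1][1]*A[2][2] - A[1][2]*A[2][1])
     - A[0][1]*(A[1][0]*A[2][2] - A[1][2]*A[2][0])
     + A[0][2]*(A[1][0]*A[2][1] - A[1][1]*A[2][0]))
assert det % p == 0                      # f_0 = 0 since A is singular
f = [(-det) % p, c1 % p, (-tr) % p, 1]   # f_0, f_1, f_2, f_3

# v = f_1 z + f_2 A z + f_3 A^2 z  for a fixed "random" z.
z = [1, 0, 0]
Az = mat_vec(A, z)
AAz = mat_vec(A, Az)
v = [(f[1]*a + f[2]*b + f[3]*c) % p for a, b, c in zip(z, Az, AAz)]

print(v, mat_vec(A, v))  # A v = [0, 0, 0]
```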
The Berlekamp–Massey algorithm

Given a linearly recurrent sequence

  S = {a_i}_{i=0}^{∞},  a_i ∈ K,

compute an annihilating polynomial f of degree d such that

  Σ_{i=0}^{d} f_i a_{j+i} = 0  for all j ∈ N.

The Berlekamp–Massey algorithm requires the first 2·d elements of S as input and computes f_i ∈ K, 0 ≤ i ≤ d.

For Wiedemann, the input sequence is a_i = x A^i z with a random x ∈ K^(1×N).
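A sketch of Berlekamp–Massey over a prime field, run on the Fibonacci sequence mod 101 (an illustrative input, not from the talk). The standard algorithm returns a connection polynomial C with C_0 = 1; reversing it gives f in the convention above, Σ f_i a_{j+i} = 0:

```python
def berlekamp_massey(s, p):
    """Minimal LFSR over GF(p): returns (C, L) with C[0] = 1 such that
    sum(C[i] * s[j - i] for i in 0..L) == 0 (mod p) for L <= j < len(s)."""
    C, B = [1], [1]    # current and previous connection polynomial
    L, m, b = 0, 1, 1  # LFSR length, shift since last update, last discrepancy
    for j in range(len(s)):
        d = sum(C[i] * s[j - i] for i in range(L + 1)) % p
        if d == 0:
            m += 1
            continue
        T = C[:]
        coef = d * pow(b, p - 2, p) % p   # d / b in GF(p)
        while len(C) < len(B) + m:
            C.append(0)
        for i in range(len(B)):
            C[i + m] = (C[i + m] - coef * B[i]) % p
        if 2 * L <= j:
            L, B, b, m = j + 1 - L, T, d, 1
        else:
            m += 1
    return C, L

p = 101
fib = [0, 1]
for _ in range(10):
    fib.append((fib[-1] + fib[-2]) % p)

C, L = berlekamp_massey(fib, p)  # 2*L = 4 input terms would already suffice
f = C[::-1]                      # slide convention: sum_i f_i * a_{j+i} = 0
print(L, f)  # L = 2, f = [100, 100, 1]: a_{j+2} - a_{j+1} - a_j = 0
```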
The block Wiedemann algorithm

Due to Coppersmith (1994), three steps.

Input: A ∈ K^(N×N), parameters m, n ∈ N, and κ ∈ N of size N/m + N/n + O(1).

BW1: Compute the sequence {a_i}_{i=0}^{κ} of matrices a_i ∈ K^(n×m) using random matrices x ∈ K^(m×N) and z ∈ K^(N×n):
  a_i = (x A^i y)^T,  for y = Az.        Cost: O(N²(w_A + m))
BW2: Use block Berlekamp–Massey to compute a polynomial f with coefficients in K^(n×n).
  Coppersmith's version: O(N² · n); Thomé's version: O(N log² N · n)
BW3: Evaluate the reverse of f:
  W = Σ_{j=0}^{deg(f)} A^j z (f_{deg(f)−j})^T.   Cost: O(N²(w_A + n))
Parallelization of BW1

Input: A ∈ K^(N×N), parameters m, n ∈ N, κ ∈ N of size N/m + N/n + O(1).

Compute the sequence {a_i}_{i=0}^{κ} of matrices a_i ∈ K^(n×m) using random matrices x ∈ K^(m×N) and z ∈ K^(N×n), a_i = (x A^i y)^T for y = Az, via the intermediate matrices {t_i}_{i=0}^{κ}, t_i ∈ K^(N×n):

  t_i = y = Az      for i = 0,
  t_i = A t_{i−1}   for 0 < i ≤ κ,
  a_i = (x t_i)^T.
Parallelization of BW1

INPUT: macaulay_matrix<N, N> A;
       sparse_matrix<N, n> z;
       matrix<N, n> t_new, t_old;
       matrix<m, n> a[N/m + N/n + O(1)];
       sparse_matrix<m, N, weight> x;

x.rand();
t_old = z;
for (unsigned i = 0; i <= N/m + N/n + O(1); i++) {
    t_new = A * t_old;
    a[i] = x * t_new;
    swap(t_old, t_new);
}
RETURN a;
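The pseudocode above can be exercised in a few lines of plain Python over GF(2) (the dimensions N, m, n below are illustrative, not the talk's parameters), checking that the iterated products reproduce a_i = (x A^i y)^T with y = Az, without ever forming A^i:

```python
# BW1 as in the pseudocode: one matrix product per step, t_i = A t_{i-1},
# yields the whole sequence a_i = (x t_i)^T. Dense lists stand in for the
# sparse Macaulay matrix; the field is GF(2).
import random
random.seed(1)
N, m, n = 8, 4, 2
p = 2

def mat_mul(A, B):
    """Matrix product over GF(p)."""
    return [[sum(a * b for a, b in zip(row, col)) % p for col in zip(*B)]
            for row in A]

def transpose(M):
    return [list(r) for r in zip(*M)]

A = [[random.randrange(p) for _ in range(N)] for _ in range(N)]
x = [[random.randrange(p) for _ in range(N)] for _ in range(m)]  # m x N
z = [[random.randrange(p) for _ in range(n)] for _ in range(N)]  # N x n

steps = N // m + N // n + 2              # the O(1) slack from the slide
t = mat_mul(A, z)                        # t_0 = y = A z
a = []
for i in range(steps):
    a.append(transpose(mat_mul(x, t)))   # a_i = (x t_i)^T, an n x m block
    t = mat_mul(A, t)                    # t_{i+1} = A t_i

# Sanity check against the direct definition a_i = (x A^{i+1} z)^T.
Apow = [row[:] for row in A]
ok = True
for i in range(steps):
    ok = ok and transpose(mat_mul(mat_mul(x, Apow), z)) == a[i]
    Apow = mat_mul(Apow, A)
print(ok)
```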
Parallelization of BW1 – multicore processor

[Figure: the product t_i = A · t_{i−1} computed in parallel by splitting the rows of A across 2 and then 4 cores; implemented with OpenMP.]
Parallelization of BW1 – cluster system

[Figure: t_i = A · t_{i−1} distributed over 2 and then 4 computing nodes; each node holds a block of rows of A and produces its share of a_i ∈ K^(n×m).]
[Figure: run time in seconds vs. block width m = n (32 to 1024) for BW1, BW2 (Thomé), BW2 (Coppersmith), and BW3.]
[Figure: memory consumption in GB vs. block width m = n (32 to 1024) for BW2 (Thomé) and BW2 (Coppersmith); 36 GB marked.]
Parallelization of BW1 – cluster system

[Figure: communication pattern of the distributed product t_i = A · t_{i−1} on 4 computing nodes: parts of t_{i−1} are exchanged between the nodes during the computation; implemented first with MPI (ISend, IRecv, …), then with InfiniBand Verbs.]
[Figure: ratio of the Verbs API and MPI implementations for 2 to 256 cluster nodes.]
[Figure: run time split into communication and computation, and data sent per node, for 2 to 256 cluster nodes (InfiniBand MT26428, 2 ports of 4×QDR, 32 Gbit/s).]
Runtime for n = 16, m = 18 over F_16

[Figure: run time in minutes vs. number of cores for BW1, BW2 (Thomé), BW2 (Coppersmith), and BW3, on a NUMA system (8 to 64 cores) and a cluster (6 to 48 cores).]
Comparison to Magma F4

[Figure: time in hours and memory in GB vs. n (18 to 25, m = 2n) for XL (64 cores), XL (scaled), and Magma F4.]
Conclusions

XL with block Wiedemann as system solver is an alternative to Gröbner basis solvers, because

- in about 80% of the cases it operates at the same degree,
- it scales well on multicore systems and moderate cluster sizes, and
- it has a relatively small memory demand.

Thank you!