1
Lucas-Lehmer Primality Tester
Team: W-4
Nathan Stohs W4-1
Brian Johnson W4-2
Joe Hurley W4-3
Marques Johnson W4-4
Design Manager: Prateek Goenka
2
Agenda
• Background (Marques)
• Project Description (Marques)
• Algorithmic Description (Joe)
• Data Flow/Block Diagram (Joe)
• Design Process (Nathan)
• Simulations (Nathan)
• Floorplan/Layout (Brian)
• Conclusions (Brian)
3
History of 2^P - 1
• In the 16th century it was believed that 2^P - 1 was prime for every prime P
• In 1536 Hudalricus Regius showed that 2^11 - 1 is not prime (2047 = 23 × 89)
• The French monk Marin Mersenne published Cogitata Physica-Mathematica, in which he stated that 2^P - 1 is prime for P = 2, 3, 5, 7, 13, 17, 19, 31, 67, 127 and 257
4
Lucas-Lehmer
• François Edouard Anatole Lucas
• In 1876 he proved that 2^127 - 1 is prime using his own methods
• Derrick Lehmer refined Lucas’s method in 1930
5
Make History
• December 2005
• 43rd known Mersenne prime found!
• Dr. Curtis Cooper and Dr. Steven Boone
• Professors at Central Missouri State University
• 2^30,402,457 - 1
6
Prime Number Competitions
• Electronic Frontier Foundation
• $50,000 to the first individual or group who discovers a prime number with at least 1,000,000 decimal digits (awarded Apr. 6, 2000)
• $100,000 to the first individual or group who discovers a prime number with at least 10,000,000 decimal digits
• $150,000 to the first individual or group who discovers a prime number with at least 100,000,000 decimal digits
• $250,000 to the first individual or group who discovers a prime number with at least 1,000,000,000 decimal digits
7
rank  prime                  digits   who  when  reference
1     2^30402457 - 1         9152052  G9   2005  Mersenne 43
2     2^25964951 - 1         7816230  G8   2005  Mersenne 42
3     2^24036583 - 1         7235733  G7   2004  Mersenne 41
4     2^20996011 - 1         6320430  G6   2003  Mersenne 40
5     2^13466917 - 1         4053946  G5   2001  Mersenne 39
6     27653·2^9167433 + 1    2759677  SB8  2005
7     28433·2^7830457 + 1    2357207  SB7  2004
8     2^6972593 - 1          2098960  G4   1999  Mersenne 38
9     5359·2^5054502 + 1     1521561  SB6  2003
10    4847·2^3321063 + 1     999744   SB9  2005
8
Mersenne Prime Algorithm
• Only applies to numbers of the form 2^P - 1
• For prime P > 2
• 2^P - 1 is prime if and only if S_{P-2} is zero in this sequence:
• S_0 = 4
• S_N = (S_{N-1}^2 - 2) mod (2^P - 1)
9
Example to Show 2^7 - 1 is Prime
• 2^7 - 1 = 127
• S_0 = 4
• S_1 = (4 * 4 - 2) mod 127 = 14
• S_2 = (14 * 14 - 2) mod 127 = 67
• S_3 = (67 * 67 - 2) mod 127 = 42
• S_4 = (42 * 42 - 2) mod 127 = 111
• S_5 = (111 * 111 - 2) mod 127 = 0
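The recurrence above can be sketched directly in software. The following is a minimal illustration, not the team's program: the function name `lucas_lehmer` and the restriction to p <= 31 (so that the square fits in 64 bits) are assumptions made for this example, and it uses the `%` operator rather than the division-free hardware approach.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Lucas-Lehmer test for M = 2^p - 1, intended for odd prime p.
 * Hypothetical helper: limited to p <= 31 so that s * s fits in 64 bits. */
bool lucas_lehmer(unsigned p)
{
    assert(p > 2 && p <= 31);
    const uint64_t m = (UINT64_C(1) << p) - 1;   /* Mersenne candidate */
    uint64_t s = 4;                              /* S_0 */
    for (unsigned n = 1; n <= p - 2; n++) {
        /* S_n = (S_{n-1}^2 - 2) mod M; adding M first avoids going negative */
        s = (s * s + m - 2) % m;
    }
    return s == 0;                               /* prime iff S_{p-2} == 0 */
}
```

Calling `lucas_lehmer(7)` walks through the same five rounds as the example (14, 67, 42, 111, 0).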
10
Algorithmic description
Computations needed:
- Squaring (not a problem…)
- Add/Subtract (not a problem…)
- Modulo (2^n - 1) multiplication (?)
We knew the necessary computations, but how to translate that to gates?
11
Mechanisms behind the math
• If done with brute force, modulo 2^n - 1 could have been ugly.
  – Would need to square and find the remainder via division.
• Luckily, for that specific computation, math is on our side: the 2^n - 1 constraint saves us from division, as will be seen.
• A quick search on www.ieee.org produced inspiration.
• Reto Zimmermann. Efficient VLSI Implementation of Modulo (2^n ± 1) Addition and Multiplication. Computer Arithmetic, 1999; p158-167.
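Why the 2^n - 1 constraint avoids division: since 2^n leaves remainder 1 modulo 2^n - 1, any bits above position n can be shifted down and added to the low n bits without changing the residue. The sketch below illustrates that idea in software; the name `reduce_mersenne` and the 64-bit input width are assumptions for this example, not part of the design.

```c
#include <assert.h>
#include <stdint.h>

/* Reduce x modulo 2^n - 1 without division, for 1 < n < 32.
 * Hypothetical helper illustrating the "fold the high bits down" trick. */
uint32_t reduce_mersenne(uint64_t x, unsigned n)
{
    assert(n > 1 && n < 32);
    const uint64_t m = (UINT64_C(1) << n) - 1;
    while (x > m)
        x = (x & m) + (x >> n);          /* fold high part onto low part */
    return (uint32_t)(x == m ? 0 : x);   /* all ones is another name for 0 */
}
```

For instance, `reduce_mersenne(111 * 111 - 2, 7)` folds 12319 down to 0 using only shifts, masks and adds.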
12
Useful Math: Multiplication
Just like any other multiplication, a modulo multiplication can be computed by (modulo) summing the partial products.
So modulo multiplication is multiplication using a modulo adder.
From the Zimmermann paper
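A software sketch of this idea, under stated assumptions: each partial product a * 2^i modulo 2^n - 1 is just a rotated left by i bits within n bits, and the partial products are summed with an adder whose carry out re-enters at bit 0. The names `add_mod_mersenne` and `mul_mod_mersenne` are hypothetical, and this is an illustration of the technique rather than the paper's or the team's circuit.

```c
#include <assert.h>
#include <stdint.h>

/* Modulo-(2^n - 1) addition with end-around carry; a and b must be below 2^n - 1. */
static uint32_t add_mod_mersenne(uint32_t a, uint32_t b, unsigned n)
{
    const uint32_t m = (UINT32_C(1) << n) - 1;
    uint32_t sum = a + b;
    sum = (sum & m) + (sum >> n);   /* carry out of bit n-1 re-enters at bit 0 */
    return sum == m ? 0 : sum;      /* map the all-ones pattern to 0 */
}

/* Multiply a * b modulo 2^n - 1 by modulo-summing the partial products. */
uint32_t mul_mod_mersenne(uint32_t a, uint32_t b, unsigned n)
{
    assert(n > 1 && n < 32);
    const uint32_t m = (UINT32_C(1) << n) - 1;
    assert(a < m && b < m);
    uint32_t acc = 0;
    for (unsigned i = 0; i < n; i++) {
        if ((b >> i) & 1) {
            /* a * 2^i mod (2^n - 1): rotate a left by i within n bits */
            uint32_t pp = ((a << i) | (a >> (n - i))) & m;
            acc = add_mod_mersenne(acc, pp, n);
        }
    }
    return acc;
}
```

With n = 7, `mul_mod_mersenne(14, 14, 7)` gives 69, and subtracting 2 yields the 67 seen in the worked example.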
13
Block Diagram
[Figure: block diagram. Blocks: FSM, Counter, Count, Compare, Register (x2), Next Partial Product, Mod add, Mod Calc, Subtract 2. Signals: P, start, done, Out. Bus-width labels: 16, 4, 2, 1.]
Outer loop (x P-2): S1 = (4 * 4) mod 127 - 2 = 14, S2 = (14 * 14) mod 127 - 2 = 67, ..., S5 = (111 * 111 - 2) mod 127 = 0
Inner loop: x16
14
Design Process
The process so far:
- Found mathematical means (core algorithm)
- Found computational means (modulo multiplier, adder)
From the above, a high-level C program was written in a manner that would translate easily to Verilog and gates, or at least to more standard operations.

    /* Computes (value * value + offset) mod (2^p - 1) with shifts, masks and adds only.
       Assumes 0 <= value < 2^p and 0 <= offset < 2^p - 1. */
    int mod_square_minus(int value, int p, int offset) {
        int acc, i;
        int mod = (1 << p) - 1;
        for (acc = offset, i = 0; i < (int)(sizeof(int) * 8 - 1); i++) {
            int a = (value >> i) & 1;   /* bit i of value selects this partial product */
            int temp;
            if (a) {
                /* part of value << i at or above bit p, wrapped back to bit 0 */
                if (i > p)
                    temp = value << (i - p);
                else
                    temp = value >> (p - i);
                /* add the wrapped part plus the low p bits of value << i */
                acc = acc + temp + ((value << i) & mod);
            }
            if (acc >= mod)             /* keep the accumulator below 2^p - 1 */
                acc = acc - mod;
        }
        return acc;
    }

This translated easily into behavioral Verilog and was readily turned into a gate-level implementation. Essentially, the C was written in a low-level manner.
15
Design Process
The rest of the design can simply be thought of as a wrapper for the modulo multiplier.
The following slides contain Verilog code that was translated directly from the C code on the previous slide.
module mod_mult(out, itrCount, x, y, mod, p, reset, en, clk);
    input  [15:0] x, y, mod, p;
    output [15:0] out;
    input         reset, en, clk;
    wire   [15:0] pp, ma0, temp;
    output [3:0]  itrCount;

    counter         mycount(itrCount, reset, en, clk);
    partial_product ppg(pp, x, y, itrCount, mod, p);
    mod_add         modAdder(out, pp, temp, mod);
    dff_16_lp       partial(clk, out, temp, reset, en);
endmodule
Top level of multiplier
16
module partial_product(out, x, y, i, mod, p);
    output [15:0] out;
    input  [15:0] x, y, mod, p;
    input  [3:0]  i;

    wire [15:0] diff1, diff2, added, result, corrected, final;
    wire [15:0] high, low, shifted, toadd;
    wire        cout1, cout2, ithbit, toobig;

    // Wrapped-around part of y << i (temp in the C code): shift by i - p or p - i
    sub_16      difference1(diff1, cout1, {12'b0, i}, p);
    sub_16      difference2(diff2, cout2, p, {12'b0, i});
    shift_left  shiftL(high, y, diff1[3:0]);
    shift_right shiftR(low, y, diff2[3:0]);
    mux16       choose(high, low, shifted, cout1);

    // Low bits of y << i, masked with mod
    shift_left  shiftL2(toadd, y, i);
    and16       bigand(added, toadd, mod);

    fulladder_16 addhighlow(.out(result), .xin(added), .yin(shifted), .cin({1'b0}), .cout(nowhere));

    sub_16 correct(.out(corrected), .cout(toobig), .xin(mod), .yin(result));
    mux16  correctionMux(.out(final), .high(corrected), .low(result), .sel(toobig));

    // Output is gated by bit i of x
    shift_right ibit({15'b0, ithbit}, x, i);
    select16    checkfor0(.out(out), .x(result), .sel(ithbit));
endmodule
Partial Product Unit w/ modulo reduction
17
module mod_add(out, x, y, mod);
    input  [15:0] x, y, mod;
    output [15:0] out;

    wire        cout, isDouble, cin;
    wire [15:0] plus, lowbits, done, mod_bar, check;

    fulladder_16 add(.out(plus), .xin(x), .yin(y), .cin(cin), .cout());

    invert_16 inverter(mod_bar, mod);

    and16 hihnbits(check, plus, mod_bar);
    and16 lownbits(done, plus, mod);

    or8 (cin, check[0], check[1], check[2], check[3], check[4], check[5], check[6], check[7],
         check[8], check[9], check[10], check[11], check[12], check[13], check[14], check[15]);

    compare_16 checkfordouble(isDouble, done, 16'b1111_1111_1111_1111);
    mux16      fixdouble(.out(out), .high(16'b0), .low(done), .sel(isDouble));
endmodule
Modulo Adder
18
Final Design Process Notes
• Lessons learned: Never tweak the schematics without retesting the Verilog first. Timing issues can be subtle, and Verilog is better than schematics for catching them and for quickly fixing and retesting.
• Considering total time spent during this phase, roughly half was on the “core” and the FSM, the rest on the “wrapper”.
19
Road to verification: C
Two examples from the high-level C implementation:

Tyrion:~/Desktop/15525 nstohs$ ./prime4 7
round 1: (4 * 4 - 2) mod 127 = 14
round 2: (14 * 14 - 2) mod 127 = 67
round 3: (67 * 67 - 2) mod 127 = 42
round 4: (42 * 42 - 2) mod 127 = 111
round 5: (111 * 111 - 2) mod 127 = 0
2^7-1 is prime

Tyrion:~/Desktop/15525 nstohs$ ./prime4 11
round 1: (4 * 4 - 2) mod 2047 = 14
round 2: (14 * 14 - 2) mod 2047 = 194
round 3: (194 * 194 - 2) mod 2047 = 788
round 4: (788 * 788 - 2) mod 2047 = 701
round 5: (701 * 701 - 2) mod 2047 = 119
round 6: (119 * 119 - 2) mod 2047 = 1877
round 7: (1877 * 1877 - 2) mod 2047 = 240
round 8: (240 * 240 - 2) mod 2047 = 282
round 9: (282 * 282 - 2) mod 2047 = 1736
2^11-1 is not prime
20
Road to verification: Verilog
Samples of Verilog verification output:

Partial Product Unit, p = 7
380 ppOut= 56, x= 14, y= 14, i= 2, mod= 127, p= 7
400 ppOut= 112, x= 14, y= 14, i= 3, mod= 127, p= 7
420 ppOut= 0, x= 14, y= 14, i= 4, mod= 127, p= 7
440 ppOut= 0, x= 14, y= 14, i= 5, mod= 127, p= 7

Top Level, p = 7
itrOut= x
itrOut= 4
itrOut= 14
itrOut= 67
itrOut= 42
itrOut= 111
itrOut= 0

Top Level, p = 11
itrOut= x
itrOut= 4
itrOut= 14
itrOut= 194
itrOut= 788
itrOut= 701
itrOut= 119
itrOut= 1877
…

Tests were either specific tests on important units such as Partial_Product, or top-level tests. Note that these are the same results generated by the C code.
21
Road to verification: Schematic I
Schematic Test of our modular adder.
(128 + 68) mod 127 = 69
22
Road to verification: Schematic II
Plot of the top level output after a single iteration, p=7
Output after a single iteration is 14, the expected value.
23
Road to verification: Schematic III
[Plot: top-level output sequence 4, 14, 67, 42, 111]
24
Road to verification: Intermission
Disk space required for a full-length schematic test of p=7: 6 GB
Time required for a full-length schematic test of p=7: 5 hours
Disk space required for a full-length extractedRC test of p=7: 20 GB
Time required for a full-length extractedRC test of p=7: 8 hours
Simulations become lengthy due to tests needing to be “deep” to be useful.
25
Layout: ExtractedRC – Full Run
[Plot: extractedRC top-level output sequence 4, 14, 67, 42, 111]
26
Timing
To determine the bounds of our clock, Pathmill was used once major portions of the schematic were complete.
The critical path through our design is one loop through the modular multiplier, which runs through the modular adder and the partial products module.
The Pathmill delay was 9 ns through the modular adder and 5.2 ns through the partial products module.
This puts our total delay at 14.2 ns, which limits the schematic clock to about 70 MHz.
For extractedRC, due in part to simulation issues, a conservative 50 MHz was chosen as the final clock.
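As a quick check of the numbers above, the clock bound follows from summing the two path delays. This worked calculation assumes the critical path is only the two reported delays; register setup and clock-to-output times, which the slides do not give, are ignored. The symbol names are chosen here for illustration.

```latex
\[
t_{\mathrm{crit}} = t_{\mathrm{mod\_add}} + t_{\mathrm{partial\_product}}
                  = 9\,\mathrm{ns} + 5.2\,\mathrm{ns} = 14.2\,\mathrm{ns},
\qquad
f_{\max} = \frac{1}{t_{\mathrm{crit}}} = \frac{1}{14.2\,\mathrm{ns}} \approx 70.4\,\mathrm{MHz}
\]
\[
T_{50\,\mathrm{MHz}} = \frac{1}{50\,\mathrm{MHz}} = 20\,\mathrm{ns},
\qquad
T_{50\,\mathrm{MHz}} - t_{\mathrm{crit}} = 5.8\,\mathrm{ns}
\]
```

Under these assumptions, the 50 MHz clock leaves roughly 5.8 ns of slack per cycle over the schematic-level estimate.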
27
Issues
• extractedRC of partial_product module
• Registers switch
  – Custom design to DFFs with muxes
• Switching from parallel calculations to series
  – Transistor count vs. clock cycles
• Syncing up design between people
  – Transferring files
  – Different design styles
• LONG simulation times
• Floorplanning
  – Too much emphasis on aspect ratios and not enough on wiring
  – Couldn’t decide on one set floorplan
28
Floorplan v1.0
29
Floorplan v2.0
30
Final Floorplan
31
Pin Specifications
Pin       Type    # of Pins
Vdd!      In/Out  1
Gnd!      In/Out  1
p<0:15>   In      16
clk       In      1
start     In      1
Done      Out     1
out       Out     1
Total     -       22
32
Initial Module Specifications
Module           Transistor Count  Area (µm²)  Transistor Density
FSM              300               900         .33
mod_p            2,440             7,000       .35
mod_add          1,282             9,000       .14
partial_product  8,676             65,000      .13
count            1,656             6,000       .27
sub_16           704               3,500       .20
Registers        1,848             6,000       .30
compare          36                300         .12
Total            16,942            97,700      .17
33
Final Module Specifications
Module           Transistor Count  Area (µm²)  Transistor Density  Aspect Ratio
FSM              152               1,200       .13                 2.45
mod_p            1,280             8,603       .15                 0.79
mod_add          1,168             5,603       .21                 2.40
partial_product  7,520             54,680      .14                 1.16
count            1,424             8,701       .16                 6.88
sub_16           576               2,934       .20                 4.49
Registers        896               6,028       .15                 4.76
compare          56                201         .28                 4.41
Total            13,702            86,621      .16                 1.01
34
Chip Specifications
• Transistor Count: 13,702
• Size: 296.51µm x 292.13µm
• Area: 86,621µm²
• Aspect Ratio: 1.01:1
• Density: 0.16 transistors/µm²
35
Final Floorplan
36
Final Floorplan
37
Partial Product
[Layout figure. Labeled blocks: shift_right (x2), shift_left (x2), adder, 16-bit and, Select16, Sub_16, mux]
38
Poly Layer
Density: 7.14%
39
Active Layer
Density: 8.76%
40
Metal1 Layer
Density: 23.86%
41
Metal2 Layer
Density: 19.97%
42
Metal3 Layer
Density: 11.30%
43
Metal4 Layer
Density: 10.34%
44
Conclusions
• Plan for buffers: it will be hard to put them in after the fact
• Your design will change dramatically from start to finish so be flexible
• Communication is key
• Do layout in parallel