Lucas-Lehmer Primality Tester

Team: W-4

Nathan Stohs W4-1

Brian Johnson W4-2

Joe Hurley W4-3

Marques Johnson W4-4

Design Manager: Prateek Goenka

2

Agenda

• Background (Marques)

• Project Description (Marques)

• Algorithmic Description (Joe)

• Data Flow/Block Diagram (Joe)

• Design Process (Nathan)

• Simulations (Nathan)

• Floorplan/Layout (Brian)

• Conclusions (Brian)

3

History of 2^P - 1

• In the 16th century, it was believed that 2^P - 1 was prime for every prime P

• In 1536, Hudalricus Regius showed that 2^11 - 1 = 2047 = 23 × 89 is not prime

• French monk Marin Mersenne published Cogitata Physica-Mathematica, where he stated that 2^P - 1 was prime for P = 2, 3, 5, 7, 13, 17, 19, 31, 67, 127 and 257

4

Lucas-Lehmer

• François Edouard Anatole Lucas

• In 1876, he proved that 2^127 - 1 is prime using his own methods

• Derrick Lehmer refined Lucas’s method in 1930

5

Make History

• December 2005

• 43rd Known Mersenne Prime Found!!

• Dr. Curtis Cooper and Dr. Steven Boone

• Professors at Central Missouri State University

• 2^30,402,457 - 1

6

Prime Number Competitions

• Electronic Frontier Foundation

• $50,000 to the first individual or group who discovers a prime number with at least 1,000,000 decimal digits (awarded Apr. 6, 2000)

• $100,000 to the first individual or group who discovers a prime number with at least 10,000,000 decimal digits

• $150,000 to the first individual or group who discovers a prime number with at least 100,000,000 decimal digits

• $250,000 to the first individual or group who discovers a prime number with at least 1,000,000,000 decimal digits

8

Mersenne Prime Algorithm

• Only used for numbers of the form 2^P - 1

• Applies when P is an odd prime (P > 2)

• 2^P - 1 is prime if and only if S_{P-2} is zero in this sequence:

• S_0 = 4

• S_N = (S_{N-1}^2 - 2) mod (2^P - 1)
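
The recurrence above can be sketched directly in C as a reference model. This is a minimal sketch rather than the team's code: the function name `lucas_lehmer` is hypothetical, it uses the `%` operator instead of the division-free hardware approach, and it assumes P is an odd prime no larger than 31 so that the square fits in 64 bits.

```c
#include <stdint.h>

/* Returns 1 if 2^p - 1 is prime, 0 otherwise.
 * Assumption: p is an odd prime with p <= 31, so s * s fits in 64 bits. */
static int lucas_lehmer(unsigned p)
{
    uint64_t m = ((uint64_t)1 << p) - 1;  /* modulus 2^p - 1 */
    uint64_t s = 4;                       /* S_0 */

    for (unsigned n = 1; n <= p - 2; n++) {
        /* Add m before subtracting 2 so the unsigned value never wraps. */
        s = (s * s + m - 2) % m;
    }
    return s == 0;                        /* prime iff S_{p-2} == 0 */
}
```

Calling `lucas_lehmer(7)` walks through the same five rounds as the 2^7 - 1 example and returns 1, while `lucas_lehmer(11)` returns 0 because 2^11 - 1 = 2047 is composite.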

9

Example to Show 2^7 - 1 is Prime

• 2^7 - 1 = 127

• S0 = 4

• S1 = (4 * 4 - 2) mod 127 = 14

• S2 = (14 * 14 - 2) mod 127 = 67

• S3 = (67 * 67 - 2) mod 127 = 42

• S4 = (42 * 42 - 2) mod 127 = 111

• S5 = (111 * 111 - 2) mod 127 = 0

10

Algorithmic description

We knew the necessary computations, but how to translate that to gates?

Computations needed:

- Squaring (not a problem…)

- Add/Subtract (not a problem…)

- Modulo (2^n - 1) multiplication (?)

11

Mechanisms behind the math

• If done with brute force, modulo 2^n - 1 could have been ugly.

– Would need to square and find the remainder via division.

• Luckily, for that specific computation, math is on our side: the 2^n - 1 constraint saves us from division, as will be seen.

• A quick search on www.ieee.org produced inspiration.

• Reto Zimmermann. Efficient VLSI Implementation of Modulo (2^n ± 1) Addition and Multiplication. Computer Arithmetic, 1999; pp. 158-167.
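
One way to see why the 2^n - 1 constraint avoids division: since 2^n ≡ 1 (mod 2^n - 1), the bits above position n - 1 can simply be added back onto the low n bits. The following is an illustrative sketch of that identity, not code from the slides; the helper name `fold_mod_mersenne` is hypothetical, and it assumes 1 <= n <= 31.

```c
#include <stdint.h>

/* Reduce x modulo 2^n - 1 using only masks, shifts and adds.
 * Assumption: 1 <= n <= 31. */
static uint32_t fold_mod_mersenne(uint64_t x, unsigned n)
{
    const uint64_t mask = ((uint64_t)1 << n) - 1;  /* 2^n - 1 */

    /* 2^n == 1 (mod 2^n - 1), so high * 2^n + low == high + low. */
    while (x > mask) {
        x = (x & mask) + (x >> n);
    }
    /* All ones equals the modulus itself, a second encoding of zero. */
    return (uint32_t)(x == mask ? 0 : x);
}
```

For example, with n = 7 the product 67 * 67 = 4489 folds down to 44, and subtracting 2 gives 42, the third round of the 2^7 - 1 example.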

12

Useful Math: Multiplication

Just like any other multiplication, a modulo multiplication can be computed by (modulo) summing the partial products.

So modulo multiplication is multiplication using a modulo adder.

From the Zimmermann paper
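
As a rough illustration of summing partial products with a modulo adder, the sketch below multiplies two residues modulo 2^n - 1. It relies on the fact that multiplying by 2^i modulo 2^n - 1 is a left rotation by i within n bits. The function name `mod_mul_mersenne` is hypothetical, and the code assumes 1 <= n <= 16 with both operands already less than 2^n - 1.

```c
#include <stdint.h>

/* Multiply x * y modulo 2^n - 1 by accumulating partial products.
 * Assumptions: 1 <= n <= 16, x < 2^n - 1, y < 2^n - 1. */
static uint32_t mod_mul_mersenne(uint32_t x, uint32_t y, unsigned n)
{
    const uint32_t mod = ((uint32_t)1 << n) - 1;
    uint32_t acc = 0;

    for (unsigned i = 0; i < n; i++) {
        if ((x >> i) & 1u) {
            /* y * 2^i mod (2^n - 1): rotate y left by i within n bits. */
            uint32_t pp = ((y << i) & mod) | (y >> (n - i));
            /* Modulo add: acc and pp are both below mod, so one
             * conditional subtraction is enough. */
            acc += pp;
            if (acc >= mod) {
                acc -= mod;
            }
        }
    }
    return acc;
}
```

With n = 7, `mod_mul_mersenne(14, 14, 7)` accumulates the partial products 28, 56 and 112 and returns 69, which is 196 mod 127.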

13

Block Diagram

[Figure: block diagram of the design. Labeled blocks: FSM, Counter, Count, Compare, Register (two), Next Partial Product, Mod add, Mod Calc, Subtract 2. Labeled signals: P, start, done, Out; the datapath buses are marked 16 bits wide. Loop annotations: "Loop x16" and "Loop x(P-2)".]

Annotations for p = 7:

S1 = (4 * 4) mod 127 - 2 = 14

S2 = (14 * 14) mod 127 - 2 = 67

...

S5 = (111 * 111 - 2) mod 127 = 0

14

Design Process

The process so far:

- Found Mathematical Means (core algorithm)

- Found Computational Means (modulo multiplier, adder)

From the above, a high-level C program was written in a manner that would translate easily to Verilog and gates, using only shifts, masks, adds and compares rather than multiplication or division:

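    int mod_square_minus(int value, int p, int offset) {
        int acc, i;
        int mod = (1 << p) - 1;              /* modulus 2^p - 1 */
        /* acc starts at offset; one partial product is added per bit of value */
        for (acc = offset, i = 0; i < (sizeof(int)*8 - 1); i++) {
            int a = (value >> i) & 1;        /* i-th bit of the multiplier */
            int temp;
            if (a) {
                /* bits of (value << i) at or above position p wrap around */
                if (i - p > 0)
                    temp = value << (i - p);
                else
                    temp = value >> (p - i);
                /* add the wrapped high part and the masked low part */
                acc = acc + temp + ((value << i) & ((1 << p) - 1));
            }
            if (acc >= mod)                  /* modulo correction */
                acc = acc - mod;
        }
        return acc;
    }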

This translated easily into behavioral Verilog and was readily turned into a gate-level implementation. Essentially, it was written in a more low-level manner.

15

Design Process

The rest of the design can simply be thought of as a wrapper for the modulo multiplier.

The following slides contain Verilog code that was derived directly from the C code on the previous slide.

    module mod_mult(out, itrCount, x, y, mod, p, reset, en, clk);
        input [15:0] x, y, mod, p;
        output [15:0] out;
        input reset, en, clk;

        wire [15:0] pp, ma0, temp;
        output [3:0] itrCount;

        counter mycount(itrCount, reset, en, clk);
        partial_product ppg(pp, x, y, itrCount, mod, p);
        mod_add modAdder(out, pp, temp, mod);
        dff_16_lp partial(clk, out, temp, reset, en);
    endmodule

Top level of multiplier

16

    module partial_product(out, x, y, i, mod, p);
        output [15:0] out;
        input [15:0] x, y, mod, p;
        input [3:0] i;

        wire [15:0] diff1, diff2, added, result, corrected, final;
        wire [15:0] high, low, shifted, toadd;
        wire cout1, cout2, ithbit, toobig;

        sub_16 difference1(diff1, cout1, {12'b0, i}, p);
        sub_16 difference2(diff2, cout2, p, {12'b0, i});
        shift_left shiftL(high, y, diff1[3:0]);
        shift_right shiftR(low, y, diff2[3:0]);
        mux16 choose(high, low, shifted, cout1);

        shift_left shiftL2(toadd, y, i);
        and16 bigand(added, toadd, mod);

        fulladder_16 addhighlow(.out(result), .xin(added), .yin(shifted), .cin({1'b0}), .cout(nowhere));

        sub_16 correct(.out(corrected), .cout(toobig), .xin(mod), .yin(result));
        mux16 correctionMux(.out(final), .high(corrected), .low(result), .sel(toobig));

        shift_right ibit({15'b0, ithbit}, x, i);
        select16 checkfor0(.out(out), .x(result), .sel(ithbit));
    endmodule

Partial Product Unit w/ modulo reduction

17

    module mod_add(out, x, y, mod);
        input [15:0] x, y, mod;
        output [15:0] out;

        wire cout, isDouble, cin;
        wire [15:0] plus, lowbits, done, mod_bar, check;

        fulladder_16 add(.out(plus), .xin(x), .yin(y), .cin(cin), .cout());

        invert_16 inverter(mod_bar, mod);

        and16 hihnbits(check, plus, mod_bar);
        and16 lownbits(done, plus, mod);

        or8 (cin, check[0], check[1], check[2], check[3], check[4], check[5], check[6], check[7],
             check[8], check[9], check[10], check[11], check[12], check[13], check[14], check[15]);

        compare_16 checkfordouble(isDouble, done, 16'b1111_1111_1111_1111);
        mux16 fixdouble(.out(out), .high(16'b0), .low(done), .sel(isDouble));
    endmodule

Modulo Adder
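
The idea behind the modulo adder can be modeled in a few lines of C: a carry out of the top bit is fed back into the sum (an end-around carry), and the all-ones pattern is mapped to zero. This is only a behavioral sketch under stated assumptions, not a translation of the Verilog above; the name `mod_add_mersenne` is hypothetical, and it assumes 1 <= n <= 16 with both inputs at most 2^n - 1.

```c
#include <stdint.h>

/* Add x + y modulo 2^n - 1 with an end-around carry.
 * Assumptions: 1 <= n <= 16, x <= 2^n - 1, y <= 2^n - 1. */
static uint32_t mod_add_mersenne(uint32_t x, uint32_t y, unsigned n)
{
    const uint32_t mask = ((uint32_t)1 << n) - 1;  /* 2^n - 1 */
    uint32_t sum = x + y;

    /* A carry out of bit n-1 has weight 2^n == 1 (mod 2^n - 1),
     * so it is added back in at bit 0. With inputs <= mask this
     * cannot overflow a second time. */
    sum = (sum & mask) + (sum >> n);

    /* All ones is the modulus itself, i.e. a second encoding of zero. */
    return sum == mask ? 0 : sum;
}
```

With n = 7, `mod_add_mersenne(100, 50, 7)` returns 23, and `mod_add_mersenne(60, 67, 7)` returns 0 because the raw sum is exactly 127.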

18

Final Design Process Notes

• Lessons learned: Never tweak the schematics without retesting the Verilog first. Timing issues can be subtle, and they are quicker to catch, fix and retest in Verilog than in schematics.

• Considering total time spent during this phase, roughly half was on the “core” and the FSM, the rest on the “wrapper”.

19

Road to verification: C

2 examples of the high-level C implementation:

Tyrion:~/Desktop/15525 nstohs$ ./prime4 7
round 1: (4 * 4 - 2) mod 127 = 14
round 2: (14 * 14 - 2) mod 127 = 67
round 3: (67 * 67 - 2) mod 127 = 42
round 4: (42 * 42 - 2) mod 127 = 111
round 5: (111 * 111 - 2) mod 127 = 0
2^7-1 is prime

Tyrion:~/Desktop/15525 nstohs$ ./prime4 11
round 1: (4 * 4 - 2) mod 2047 = 14
round 2: (14 * 14 - 2) mod 2047 = 194
round 3: (194 * 194 - 2) mod 2047 = 788
round 4: (788 * 788 - 2) mod 2047 = 701
round 5: (701 * 701 - 2) mod 2047 = 119
round 6: (119 * 119 - 2) mod 2047 = 1877
round 7: (1877 * 1877 - 2) mod 2047 = 240
round 8: (240 * 240 - 2) mod 2047 = 282
round 9: (282 * 282 - 2) mod 2047 = 1736
2^11-1 is not prime

20

Road to verification: Verilog

Samples of Verilog Verification output:

Partial Product Unit, p = 7

380 ppOut= 56, x= 14, y= 14, i= 2, mod= 127, p= 7
400 ppOut= 112, x= 14, y= 14, i= 3, mod= 127, p= 7
420 ppOut= 0, x= 14, y= 14, i= 4, mod= 127, p= 7
440 ppOut= 0, x= 14, y= 14, i= 5, mod= 127, p= 7

Top Level, p = 7

itrOut= x
itrOut= 4
itrOut= 14
itrOut= 67
itrOut= 42
itrOut= 111
itrOut= 0

Top Level, p = 11

itrOut= x
itrOut= 4
itrOut= 14
itrOut= 194
itrOut= 788
itrOut= 701
itrOut= 119
itrOut= 1877
…

Tests were either specific tests on important units such as partial_product, or top-level tests. Note that the top-level values are the same results generated by the C code.

21

Road to verification: Schematic I

Schematic Test of our modular adder.

(128 + 68) mod 127 = 69

22

Road to verification: Schematic II

Plot of the top level output after a single iteration, p=7

Output after a single iteration is 14, the expected value.

23

Road to verification: Schematic III

[Plot: top-level output values 4, 14, 67, 42, 111]

24

Road to verification: Intermission

Disk space required for a full-length schematic test of p=7: 6 GB

Time required for a full-length schematic test of p=7: 5 hours

Disk space required for a full-length extractedRC test of p=7: 20 GB

Time required for a full-length extractedRC test of p=7: 8 hours

Simulations become lengthy due to tests needing to be “deep” to be useful.

25

Layout: ExtractedRC – Full Run

[Plot: extractedRC output values 4, 14, 67, 42, 111]

26

Timing

To determine the bounds of our clock, Pathmill was used once major portions of the schematic were complete.

The critical path through our design is one loop through the modular multiplier, which runs through the modular adder and the partial products module.

The Pathmill delay was 9 ns through the modular adder and 5.2 ns through the partial products module.

This already puts our total delay at 14.2 ns, which limits the schematic clock to about 70 MHz (1 / 14.2 ns ≈ 70.4 MHz).

For extractedRC, due in part to simulation issues, a conservative 50 MHz was chosen as the final clock.

27

Issues

• extractedRC of partial_product module

• Registers switch

– Custom design to DFFs with muxes

• Switching from parallel calculations to series

– Transistor count vs. clock cycles

• Syncing up design between people

– Transferring files

– Different design styles

• LONG simulation times

• Floorplanning

– Too much emphasis on aspect ratios and not enough on wiring

– Couldn’t decide on one set floorplan

28

Floorplan v1.0

29

Floorplan v2.0

30

Final Floorplan

31

Pin Specifications

Pin        Type     # of Pins
Vdd!       In/Out   1
Gnd!       In/Out   1
p<0:15>    In       16
clk        In       1
start      In       1
Done       Out      1
out        Out      1
Total      -        22

32

Initial Module Specifications

Module            Transistor Count   Area (µm²)   Transistor Density
FSM               300                900          .33
mod_p             2,440              7,000        .35
mod_add           1,282              9,000        .14
partial_product   8,676              65,000       .13
count             1,656              6,000        .27
sub_16            704                3,500        .20
Registers         1,848              6,000        .30
compare           36                 300          .12
Total             16,942             97,700       .17

33

Final Module Specifications

Module            Transistor Count   Area (µm²)   Transistor Density   Aspect Ratio
FSM               152                1,200        .13                  2.45
mod_p             1,280              8,603        .15                  0.79
mod_add           1,168              5,603        .21                  2.40
partial_product   7,520              54,680       .14                  1.16
count             1,424              8,701        .16                  6.88
sub_16            576                2,934        .20                  4.49
Registers         896                6,028        .15                  4.76
compare           56                 201          .28                  4.41
Total             13,702             86,621       .16                  1.01

34

Chip Specifications

• Transistor Count: 13,702

• Size: 296.51µm x 292.13µm

• Area: 86,621µm²

• Aspect Ratio: 1.01:1

• Density: 0.16 transistors/µm²

35

Final Floorplan

36

Final Floorplan

37

Partial Product

[Figure: partial_product layout. Labeled blocks: shift_right (two), shift_left (two), adder, 16-bit and, Select16, Sub_16, mux.]

38

Poly Layer

Density: 7.14%

39

Active Layer

Density: 8.76%

40

Metal1 Layer

Density: 23.86%

41

Metal2 Layer

Density: 19.97%

42

Metal3 Layer

Density: 11.30%

43

Metal4 Layer

Density: 10.34%

44

Conclusions

• Plan for buffers: they will be hard to put in after the fact

• Your design will change dramatically from start to finish, so be flexible

• Communication is key

• Do layout in parallel
