Classical Distributed Computing Studies. Washington DC Apache Spark Interactive Meetup 2015-09-22


Classical Distributed Computing Studies

title inspired by http://prog21.dadgum.com/210.html

Can Catalyst save us from Amdahl's Law?

(Sorry, no.)

Gene Amdahl, born 1922 in South Dakota

(CC BY 2.0) https://www.flickr.com/photos/mwichary/


WWII Naval Veteran

then went to SD State

then only got into Wisconsin for theoretical physics


While working with slide rules on physics calculations, he thought the whole thing could be faster if he made a computer to do it.

So he did.

WISC

Wisconsin Integrally Synchronized Computer

(CC BY 2.0) https://www.flickr.com/photos/pargon/

6J6 and 12AU7 Vacuum Tubes

Magnetic Drum Memory

CC BY 2.0 https://www.flickr.com/photos/mwichary/

First non-government-sponsored computer.

CC BY 2.0 https://www.flickr.com/photos/mwichary/

Invented floating point

CC BY 2.0 https://www.flickr.com/photos/mwichary/

When he filed a patent on Floating Point, he found out that von Neumann had already done so.

http://pages.cs.wisc.edu/~bezenek/Stuff/amdahl_thesis.pdf

Hired immediately by IBM and worked on the arithmetic unit for the IBM 360.

Worked on STRETCH, the first transistorized IBM computer.

via https://en.wikipedia.org/wiki/IBM_7030_Stretch

photo CC BY https://www.flickr.com/photos/jurvetson/

Then founded Amdahl Corporation in partnership with Fujitsu.

Air-cooled Amdahl 470, the first clone of the IBM S/370!

Memo while still at IBM:

"Validity of the single processor approach to achieving large scale computing capabilities"

Creates what is known as Amdahl's Law

No equation appears in the memo, which has led to it being written many different ways.

But it's easiest to understand graphically.

[Diagram: a run split into serial and parallelizable portions, comparing total run time on 1 processor with total run time under infinite parallelization]
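One common way to write it (this exact formulation is not in the memo, so treat it as a sketch): if p is the fraction of the run time that can be parallelized and n is the number of processors, the best possible speedup is 1 / ((1 - p) + p / n). A few lines of Python make the ceiling obvious:

# Amdahl's Law in its usual modern form:
#   speedup(n) = 1 / ((1 - p) + p / n)
# where p is the parallelizable fraction of the single-processor run time.
def amdahl_speedup(p: float, n: int) -> float:
    return 1.0 / ((1.0 - p) + p / n)

# Even with 95% of the work parallelizable, infinite hardware
# can never do better than about 20x.
for n in (2, 8, 64, 1024):
    print(n, round(amdahl_speedup(0.95, n), 2))
print("limit:", round(1 / (1 - 0.95), 2))  # 20.0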

If you're familiar with the Critical Path Method from business or operations research, or if you've ever worked in a restaurant or on an assembly line, Amdahl's Law should be common sense.

Now some other historical notes, eventually tying back to Spark. :)

Rear Admiral Grace Hopper

1906-1992

https://www.youtube.com/watch?v=JEpsKnWZrJ8

what do nanoseconds look like?

Table from Amdahl's PhD Thesis (1952)

https://gist.github.com/jboner/2841832

Latency Comparison Numbers

--------------------------

L1 cache reference 0.5 ns

Branch mispredict 5 ns

L2 cache reference 7 ns 14x L1 cache

Mutex lock/unlock 25 ns

Main memory reference 100 ns 20x L2 cache, 200x L1 cache

Compress 1K bytes with Zippy 3,000 ns

Send 1K bytes over 1 Gbps network 10,000 ns 0.01 ms

Read 4K randomly from SSD* 150,000 ns 0.15 ms

Read 1 MB sequentially from memory 250,000 ns 0.25 ms

Round trip within same datacenter 500,000 ns 0.5 ms

Read 1 MB sequentially from SSD* 1,000,000 ns 1 ms 4X memory

Disk seek 10,000,000 ns 10 ms 20x datacenter roundtrip

Read 1 MB sequentially from disk 20,000,000 ns 20 ms 80x memory, 20X SSD

Send packet CA->Netherlands->CA 150,000,000 ns 150 ms


L1 cache reference : 0:00:01

Branch mispredict : 0:00:10

L2 cache reference : 0:00:14

Mutex lock/unlock : 0:00:50

Main memory reference : 0:03:20

Compress 1K bytes with Zippy : 1:40:00

Send 1K bytes over 1 Gbps network : 5:33:20

Read 4K randomly from SSD : 3 days, 11:20:00

Read 1 MB sequentially from memory : 5 days, 18:53:20

Round trip within same datacenter : 11 days, 13:46:40

Read 1 MB sequentially from SSD : 23 days, 3:33:20

Disk seek : 231 days, 11:33:20

Read 1 MB sequentially from disk : 462 days, 23:06:40

Send packet CA->Netherlands->CA : 3472 days, 5:20:00

comment from https://gist.github.com/kofemann

"humanized" scale: one L1 cache reference (0.5 ns) becomes 1 second, i.e. each nanosecond becomes 2 seconds
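Those humanized values can be reproduced with a short script (a sketch; the dictionary below is just a few rows from the table above):

from datetime import timedelta

# A few raw latencies, in nanoseconds, from the table above.
latencies_ns = {
    "L1 cache reference": 0.5,
    "Main memory reference": 100,
    "Round trip within same datacenter": 500_000,
    "Disk seek": 10_000_000,
    "Send packet CA->Netherlands->CA": 150_000_000,
}

# Humanized scale: one L1 cache reference (0.5 ns) becomes 1 second,
# so each nanosecond becomes 2 seconds.
SECONDS_PER_NS = 2

for name, ns in latencies_ns.items():
    print(f"{name} : {timedelta(seconds=ns * SECONDS_PER_NS)}")

# e.g. Disk seek : 231 days, 11:33:20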

American Documentation, Volume 20, Issue 1, pages 21–26, January 1969

What computerization and statistics can add...

Karen Spärck Jones FBA

(1935-2007)

Invented Inverse Document Frequency

http://nlp.cs.swarthmore.edu/~richardw/papers/sparckjones1972-statistical.pdf

“The specificity of a term can be quantified as an inverse function of the number of documents in which it occurs.”
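That sentence is the heart of IDF. A common modern formulation (a sketch, not the exact notation of the 1972 paper) looks like this:

import math

def idf(docs_containing_term: int, total_docs: int) -> float:
    # Inverse document frequency: the rarer the term, the higher the weight.
    # Sparck Jones's paper expresses the same idea with log-based term
    # weights, roughly log(N / n).
    return math.log(total_docs / docs_containing_term)

# A term in 10 of 1,000,000 documents is far more specific (higher weight)
# than one in 500,000 of them.
print(idf(10, 1_000_000))       # ~11.5
print(idf(500_000, 1_000_000))  # ~0.69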

SparkSQL

The Promise of SparkSQL

(the Catalyst planner)

SELECT Orders.OrderID, Customers.CustomerName, Orders.OrderDate

FROM Orders

INNER JOIN Customers

ON Orders.CustomerID=Customers.CustomerID;

an imaginary SQL statement that could be parallelized

But what if Customers is on your local HDFS and Orders is in a data center at your warehouse?
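A minimal PySpark sketch of that situation (the paths, JDBC URL, and table layouts are hypothetical, and the modern SparkSession API is used):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cross-datacenter-join").getOrCreate()

# Customers lives on the local HDFS cluster: cheap, parallel reads.
spark.read.parquet("hdfs:///warehouse/customers") \
    .createOrReplaceTempView("Customers")

# Orders lives in a remote data center behind JDBC: every read pays the
# wide-area round-trip latency from the table above.
spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://warehouse.example.com/orders_db") \
    .option("dbtable", "Orders") \
    .load() \
    .createOrReplaceTempView("Orders")

result = spark.sql("""
    SELECT Orders.OrderID, Customers.CustomerName, Orders.OrderDate
    FROM Orders JOIN Customers
    ON Orders.CustomerID = Customers.CustomerID
""")

Catalyst will plan the join, but it cannot make the remote side faster; the slow link dominates the run time.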

Computerized query planning is the future, but for the time being you, the user, are going to have to recognize your latency issues.


Quick fix

CACHE [LAZY] TABLE [AS SELECT]

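A hedged sketch of how that fix applies to the hypothetical remote Orders table above (CACHE TABLE is standard Spark SQL syntax; the table names are still the made-up ones):

# Eager: pull Orders across the wide-area link once and keep it in memory.
spark.sql("CACHE TABLE Orders")

# Lazy: don't materialize Customers until the first query touches it.
spark.sql("CACHE LAZY TABLE Customers")

# Or cache only the columns the join needs, under a new name.
spark.sql("""
    CACHE TABLE order_keys AS
    SELECT OrderID, CustomerID, OrderDate FROM Orders
""")

After that, repeated joins read from the in-memory cache instead of re-crossing the slow link.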

"Premature optimization is the root of all evil"

- Donald Knuth (misquoted)

We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified.

Donald Knuth
ACM Computing Surveys, Vol. 6, No. 4, Dec. 1974
"Structured Programming with go to Statements"

Thank You
