37
Introduction Computing in databases Conclusion Computing near the data: let someone else do the heavy lifting for you Konrad Banachewicz AmstRdam, June 20 th 2011 Konrad Banachewicz Computing near the data

20110620 amst rdam_kpb

Embed Size (px)

DESCRIPTION

AmstRdam presentation

Citation preview

Page 1: 20110620 amst rdam_kpb

IntroductionComputing in databases

Conclusion

Computing near the data:let someone else do the heavy lifting for you

Konrad Banachewicz

AmstRdam, June 20th 2011

Konrad Banachewicz Computing near the data

Page 2: 20110620 amst rdam_kpb

IntroductionComputing in databases

Conclusion

”We’re drowning in data and starving for information”

Konrad Banachewicz Computing near the data

Page 3: 20110620 amst rdam_kpb

IntroductionComputing in databases

Conclusion

Data coming in from the market:

1 liquid instrument (front month DAX Future), 1 day, 1exchange → 400 MB in pure ASCII

different parameters → ”clones” of the same instrument

{ exchanges } x { instruments } x { days }...= A LOT

Konrad Banachewicz Computing near the data

Page 4: 20110620 amst rdam_kpb

IntroductionComputing in databases

Conclusion

Data coming in from the market:

1 liquid instrument (front month DAX Future), 1 day, 1exchange → 400 MB in pure ASCII

different parameters → ”clones” of the same instrument

{ exchanges } x { instruments } x { days }...= A LOT

Konrad Banachewicz Computing near the data

Page 5: 20110620 amst rdam_kpb

IntroductionComputing in databases

Conclusion

Data coming in from the market:

1 liquid instrument (front month DAX Future), 1 day, 1exchange → 400 MB in pure ASCII

different parameters → ”clones” of the same instrument

{ exchanges } x { instruments } x { days }...= A LOT

Konrad Banachewicz Computing near the data

Page 6: 20110620 amst rdam_kpb

IntroductionComputing in databases

Conclusion

Data coming in from the market:

1 liquid instrument (front month DAX Future), 1 day, 1exchange → 400 MB in pure ASCII

different parameters → ”clones” of the same instrument

{ exchanges } x { instruments } x { days }...= A LOT

Konrad Banachewicz Computing near the data

Page 7: 20110620 amst rdam_kpb

IntroductionComputing in databases

Conclusion

Problems:

memory

bandwidth

Konrad Banachewicz Computing near the data

Page 8: 20110620 amst rdam_kpb

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

Typical approach

read the data to memory

analyze there

save the results

Konrad Banachewicz Computing near the data

Page 9: 20110620 amst rdam_kpb

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

Typical approach

read the data to memory

analyze there

save the results

Konrad Banachewicz Computing near the data

Page 10: 20110620 amst rdam_kpb

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

Typical approach

read the data to memory

analyze there

save the results

Konrad Banachewicz Computing near the data

Page 11: 20110620 amst rdam_kpb

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

Typical approach

read the data to memory

analyze there

save the results

Konrad Banachewicz Computing near the data

Page 12: 20110620 amst rdam_kpb

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

But is it really necessary?

Konrad Banachewicz Computing near the data

Page 13: 20110620 amst rdam_kpb

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

In many cases what we really need is aggregate info:Example: linear regression

classic estimatorβ̂ = (XTX )−1XT y

come to think about it, what we really need are sums, sums ofsquares and cross-products

Konrad Banachewicz Computing near the data

Page 14: 20110620 amst rdam_kpb

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

In many cases what we really need is aggregate info:Example: linear regression

classic estimatorβ̂ = (XTX )−1XT y

come to think about it, what we really need are sums, sums ofsquares and cross-products

Konrad Banachewicz Computing near the data

Page 15: 20110620 amst rdam_kpb

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

In many cases what we really need is aggregate info:Example: linear regression

classic estimatorβ̂ = (XTX )−1XT y

come to think about it, what we really need are sums, sums ofsquares and cross-products

Konrad Banachewicz Computing near the data

Page 16: 20110620 amst rdam_kpb

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

Two possible approaches:

1 Ripley i Chen: extra interface, pure R

2 R + SQL

Konrad Banachewicz Computing near the data

Page 17: 20110620 amst rdam_kpb

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

Ripley i Chen

R(user) // CORBA // R(servant)

��DB

Konrad Banachewicz Computing near the data

Page 18: 20110620 amst rdam_kpb

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

Alternative

R(user) // DBoo

Two scenarios:

1 pure R processing

2 computations partially in DB

Konrad Banachewicz Computing near the data

Page 19: 20110620 amst rdam_kpb

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

base model:Yt = β1 + β2Xt + εt

estimator:

β̂ =(XTX

)−1XTY

in the DB: arithmetic operations on a limited set of columns

Konrad Banachewicz Computing near the data

Page 20: 20110620 amst rdam_kpb

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

base model:Yt = β1 + β2Xt + εt

estimator:

β̂ =(XTX

)−1XTY

in the DB: arithmetic operations on a limited set of columns

Konrad Banachewicz Computing near the data

Page 21: 20110620 amst rdam_kpb

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

base model:Yt = β1 + β2Xt + εt

estimator:

β̂ =(XTX

)−1XTY

in the DB: arithmetic operations on a limited set of columns

Konrad Banachewicz Computing near the data

Page 22: 20110620 amst rdam_kpb

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

base model:Yt = β1 + β2Xt + εt

estimator:

β̂ =(XTX

)−1XTY

in the DB: arithmetic operations on a limited set of columns

Konrad Banachewicz Computing near the data

Page 23: 20110620 amst rdam_kpb

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

Pure R processing

200000 400000 600000 800000 1000000

05

1015

2025

30

Case study 1, method 1

Dataset size (number of rows)

Exec

utio

n tim

e (s

econ

ds)

Ingres VWIngresMySQLPostgreSQLDBMS X

Konrad Banachewicz Computing near the data

Page 24: 20110620 amst rdam_kpb

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

Computations partially in DB

200000 400000 600000 800000 1000000

05

1015

2025

30

Case study 1, method 2

Dataset size (number of rows)

Exec

utio

n tim

e (s

econ

ds)

Ingres VWIngresMySQLPostgreSQLDBMS X

Konrad Banachewicz Computing near the data

Page 25: 20110620 amst rdam_kpb

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

base model:

Cov(X ,Y ) = E [XY ]− EXEY

estimator:

ˆCov(X ,Y ) =1

n

n∑i=1

XiYi −

(1

n

n∑i=1

Xi

)(1

n

n∑i=1

Yi

)

in the DB: large queries

Konrad Banachewicz Computing near the data

Page 26: 20110620 amst rdam_kpb

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

base model:

Cov(X ,Y ) = E [XY ]− EXEY

estimator:

ˆCov(X ,Y ) =1

n

n∑i=1

XiYi −

(1

n

n∑i=1

Xi

)(1

n

n∑i=1

Yi

)

in the DB: large queries

Konrad Banachewicz Computing near the data

Page 27: 20110620 amst rdam_kpb

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

base model:

Cov(X ,Y ) = E [XY ]− EXEY

estimator:

ˆCov(X ,Y ) =1

n

n∑i=1

XiYi −

(1

n

n∑i=1

Xi

)(1

n

n∑i=1

Yi

)

in the DB: large queries

Konrad Banachewicz Computing near the data

Page 28: 20110620 amst rdam_kpb

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

base model:

Cov(X ,Y ) = E [XY ]− EXEY

estimator:

ˆCov(X ,Y ) =1

n

n∑i=1

XiYi −

(1

n

n∑i=1

Xi

)(1

n

n∑i=1

Yi

)

in the DB: large queries

Konrad Banachewicz Computing near the data

Page 29: 20110620 amst rdam_kpb

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

Pure R processing

15 20 25 30 35

010

2030

4050

60

Case study 1, method 1

Dataset size (columns)

Exec

utio

n tim

e (s

econ

ds)

Ingres VWIngresMySQLPostgreSQLDBMS X

Konrad Banachewicz Computing near the data

Page 30: 20110620 amst rdam_kpb

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

Computations partially in DB

15 20 25 30 35

010

2030

4050

60

Case study 1, method 1

Dataset size (columns)

Exec

utio

n tim

e (s

econ

ds)

Ingres VWIngresMySQLPostgreSQLDBMS X

Konrad Banachewicz Computing near the data

Page 31: 20110620 amst rdam_kpb

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

calculate a quantile of the portfolio PnL

Vp = inf {u : F (u) ≥ 1− p}

estimator:V̂p = X[n(1−p)]+1

in the DB: sorting

Konrad Banachewicz Computing near the data

Page 32: 20110620 amst rdam_kpb

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

calculate a quantile of the portfolio PnL

Vp = inf {u : F (u) ≥ 1− p}

estimator:V̂p = X[n(1−p)]+1

in the DB: sorting

Konrad Banachewicz Computing near the data

Page 33: 20110620 amst rdam_kpb

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

calculate a quantile of the portfolio PnL

Vp = inf {u : F (u) ≥ 1− p}

estimator:V̂p = X[n(1−p)]+1

in the DB: sorting

Konrad Banachewicz Computing near the data

Page 34: 20110620 amst rdam_kpb

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

calculate a quantile of the portfolio PnL

Vp = inf {u : F (u) ≥ 1− p}

estimator:V̂p = X[n(1−p)]+1

in the DB: sorting

Konrad Banachewicz Computing near the data

Page 35: 20110620 amst rdam_kpb

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

Pure R processing

2000000 4000000 6000000 8000000 10000000

020

4060

8010

0

Case study 3, method 1

Dataset size (number of rows)

Exec

utio

n tim

e (s

econ

ds)

Ingres VWIngresMySQLPostgreSQLDBMS X

Konrad Banachewicz Computing near the data

Page 36: 20110620 amst rdam_kpb

IntroductionComputing in databases

Conclusion

Model 1: regressionModel 2: correlationModel 3: VaR

Computations partially in DB

200000 400000 600000 800000 1000000

020

4060

8010

0

Case study 3, method 2

Dataset size (number of rows)

Exec

utio

n tim

e (s

econ

ds)

Ingres VWIngresMySQLPostgreSQLDBMS X

Konrad Banachewicz Computing near the data

Page 37: 20110620 amst rdam_kpb

IntroductionComputing in databases

Conclusion

1 with minimal effort, significant speedups are possible

2 ODBC as minimal requirement

3 extensions: parallel computing...

Konrad Banachewicz Computing near the data