20110620 amst rdam_kpb

Introduction

Computing in databases

Conclusion

Computing near the data: let someone else do the heavy lifting for you

Konrad Banachewicz

AmstRdam, June 20th 2011


“We’re drowning in data and starving for information”




Data coming in from the market:

1 liquid instrument (front-month DAX Future), 1 day, 1 exchange → 400 MB in pure ASCII

different parameters → “clones” of the same instrument

{ exchanges } x { instruments } x { days } ... = A LOT




Problems:

memory

bandwidth




Model 1: regression

Model 2: correlation

Model 3: VaR

Typical approach

read the data to memory

analyze there

save the results





But is it really necessary?





In many cases what we really need is aggregate information. Example: linear regression

classic estimator: $\hat{\beta} = (X^T X)^{-1} X^T y$

come to think of it, what we really need are sums, sums of squares and cross-products
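To make the "sums and cross-products" remark concrete, here is the estimator written out for a single regressor with an intercept, $y_t = \beta_1 + \beta_2 x_t + \varepsilon_t$, where each row of $X$ is $(1, x_t)$. This is the standard textbook expansion under that assumption, not something taken from the slides:

```latex
X^T X = \begin{pmatrix} n & \sum_t x_t \\ \sum_t x_t & \sum_t x_t^2 \end{pmatrix},
\qquad
X^T y = \begin{pmatrix} \sum_t y_t \\ \sum_t x_t y_t \end{pmatrix}

\hat{\beta}_2 = \frac{n \sum_t x_t y_t - \sum_t x_t \sum_t y_t}{n \sum_t x_t^2 - \left(\sum_t x_t\right)^2},
\qquad
\hat{\beta}_1 = \frac{1}{n}\left(\sum_t y_t - \hat{\beta}_2 \sum_t x_t\right)
```

Only five numbers ($n$, $\sum x_t$, $\sum y_t$, $\sum x_t^2$, $\sum x_t y_t$) are needed, whatever the number of rows.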





Two possible approaches:

1. Ripley and Chen: extra interface, pure R

2. R + SQL





Ripley and Chen

R(user) → CORBA → R(servant) → DB





Alternative

R(user) ↔ DB

Two scenarios:

1. pure R processing

2. computations partially in DB





base model: $Y_t = \beta_1 + \beta_2 X_t + \varepsilon_t$

estimator:

$\hat{\beta} = (X^T X)^{-1} X^T Y$

in the DB: arithmetic operations on a limited set of columns
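A minimal sketch of the "computations partially in DB" scenario for this model. It is written in Python with the standard-library `sqlite3` module purely so that it is self-contained; the talk itself uses R against Ingres, MySQL, PostgreSQL and others. The table name `ticks`, the column names and the function `regression_in_db` are hypothetical. The database returns five aggregates and the client only solves the 2x2 normal equations.

```python
import sqlite3


def regression_in_db(conn, table, x_col, y_col):
    """Estimate Y = b1 + b2 * X from aggregates computed inside the database.

    Identifiers are interpolated into the SQL text, so they must come from
    trusted code, never from user input.
    """
    query = (
        f"SELECT COUNT(*), SUM({x_col}), SUM({y_col}), "
        f"SUM({x_col} * {x_col}), SUM({x_col} * {y_col}) FROM {table}"
    )
    # Only one row with five numbers crosses the connection.
    n, sx, sy, sxx, sxy = conn.execute(query).fetchone()
    b2 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b1 = (sy - b2 * sx) / n
    return b1, b2


# Toy data (made up for illustration): y = 1 + 2x with no noise.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ticks (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO ticks VALUES (?, ?)",
    [(float(i), 1.0 + 2.0 * i) for i in range(1, 101)],
)
b1, b2 = regression_in_db(conn, "ticks", "x", "y")
print(b1, b2)
```

The amount of data transferred stays constant as the table grows, which is the point of pushing the arithmetic to the database.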





Pure R processing

[Figure: Case study 1, method 1. Execution time (seconds) against dataset size (number of rows, 200,000 to 1,000,000) for Ingres VW, Ingres, MySQL, PostgreSQL and DBMS X.]





Computations partially in DB

[Figure: execution time (seconds) for dataset sizes from 200,000 to 1,000,000.]






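base model:

$\mathrm{Cov}(X, Y) = E[XY] - E[X]\,E[Y]$

estimator:

$\widehat{\mathrm{Cov}}(X, Y) = \frac{1}{n}\sum_{i=1}^{n} X_i Y_i - \left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)\left(\frac{1}{n}\sum_{i=1}^{n} Y_i\right)$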

in the DB: large queries
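One way to read "large queries": with many columns, the estimator above needs a sum for every column and every pair of columns, so the generated SELECT grows quadratically with the column count. The sketch below builds such a query in Python with the standard-library `sqlite3` module, used here only as a self-contained stand-in for the R + DBMS setup of the talk. The table `returns`, its columns and the function `covariance_in_db` are hypothetical.

```python
import sqlite3
from itertools import combinations_with_replacement


def covariance_in_db(conn, table, columns):
    """Covariances (1/n version) for all column pairs from one aggregate query.

    Identifiers are interpolated into the SQL text and must be trusted.
    """
    pairs = list(combinations_with_replacement(columns, 2))
    terms = ["COUNT(*)"]
    terms += [f"SUM({c})" for c in columns]
    terms += [f"SUM({a} * {b})" for a, b in pairs]
    # One (potentially very wide) query; a single row comes back.
    row = conn.execute(f"SELECT {', '.join(terms)} FROM {table}").fetchone()
    n = row[0]
    k = len(columns)
    means = {c: s / n for c, s in zip(columns, row[1:1 + k])}
    cov = {}
    for (a, b), sab in zip(pairs, row[1 + k:]):
        cov[(a, b)] = cov[(b, a)] = sab / n - means[a] * means[b]
    return cov


# Toy data, made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE returns (x REAL, y REAL, z REAL)")
conn.executemany(
    "INSERT INTO returns VALUES (?, ?, ?)",
    [(1.0, 2.0, 4.0), (2.0, 4.0, 3.0), (3.0, 6.0, 2.0), (4.0, 8.0, 1.0)],
)
cov = covariance_in_db(conn, "returns", ["x", "y", "z"])
print(cov[("x", "y")], cov[("x", "z")])
```

For k columns the query carries 1 + k + k(k+1)/2 aggregate terms, so any limit the database places on query size or result width is worth checking before scaling this up.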





Pure R processing

[Figure: execution time (seconds) against dataset size (columns, 15 to 35).]




[Figure: execution time (seconds) against dataset size (columns, 15 to 35).]






calculate a quantile of the portfolio PnL

$V_p = \inf\{u : F(u) \ge 1 - p\}$

estimator: $\hat{V}_p = X_{[n(1-p)]+1}$

in the DB: sorting
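A minimal sketch of letting the database do the sorting, again in Python with the standard-library `sqlite3` module as a self-contained stand-in for the R + DBMS setup. Two assumptions are made here and are not stated on the slide: $[\cdot]$ is read as the integer part, and the order statistics are taken in ascending order. The table `pnl`, its column `value` and the function `var_in_db` are hypothetical.

```python
import math
import sqlite3


def var_in_db(conn, table, pnl_col, p):
    """Return the ([n(1-p)] + 1)-th smallest PnL value, sorted by the database.

    Identifiers are interpolated into the SQL text and must be trusted.
    """
    if not 0 < p <= 1:
        raise ValueError("p must lie in (0, 1]")
    (n,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    # Round before flooring so that e.g. 100 * (1 - 0.9) gives 10, not 9.
    offset = math.floor(round(n * (1 - p), 9))
    # OFFSET is zero-based, so OFFSET k selects the (k + 1)-th sorted row.
    row = conn.execute(
        f"SELECT {pnl_col} FROM {table} ORDER BY {pnl_col} LIMIT 1 OFFSET ?",
        (offset,),
    ).fetchone()
    if row is None:
        raise ValueError("empty table or offset beyond the last row")
    return row[0]


# Toy data, made up for illustration: PnL values 100, 99, ..., 1 (unsorted on insert).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pnl (value REAL)")
conn.executemany(
    "INSERT INTO pnl VALUES (?)", [(float(v),) for v in range(100, 0, -1)]
)
var_95 = var_in_db(conn, "pnl", "value", 0.95)
print(var_95)
```

Only a row count and a single value leave the database; an index on the PnL column would typically let the engine avoid a full sort, though that depends on the DBMS.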





Pure R processing

[Figure: execution time (seconds) for dataset sizes from 2,000,000 to 10,000,000.]




[Figure: execution time (seconds) for dataset sizes from 200,000 to 1,000,000.]




Conclusion

1. with minimal effort, significant speedups are possible

2. ODBC as minimal requirement

3. extensions: parallel computing...

