Software, Hardware, and Database Structure Options for
Research in Financial Economics
SAS and Computing Speed
Michael Boldin, WRDS, University of Pennsylvania [email protected]
Main Questions
1. How can researchers take advantage of modern computing technology?
2. Which econometric software packages would you recommend to students?
3. How do SAS features and computing speed stack up?
Q1. How can researchers take advantage of modern computing technology?
Observations:
• Today’s PCs are better than yesterday’s ‘supercomputer’ (for single users).
• The system–hardware, software, and network connections–needs to work as a whole.
• Database management (DBMS) matters.
Q2. Which econometric software packages would you recommend to students?
Observations:
• Undergrad and Grad advice differs.
• Power, flexibility and user-friendly elements are not mutually opposing.
• Almost too many choices (and change is hard).
• Few students care about good programming practice and they keep bad habits.
Q3. How do SAS features and computing speed stack up?
• Is SAS fast enough in raw computing speed?
• Does the SAS Data step framework create performance handicaps?
• How does SAS/IML stack up to MATLAB and GAUSS in functionality?
Other issues:
• Does SAS need a better interface to C/C++ and FORTRAN modules?
• What does SAS offer as an RDBMS complement to MATLAB?
• Is greater compatibility with open-source software such as MySQL and PHP possible?
Statistical Software Evaluations: Reviewing the Reviews
Noteworthy: Jeffrey MacKie-Mason (1992) ‘Econometric Software: A User’s View’
• Could not select an unqualified winner among: Gauss, Limdep, RATS, SAS, SST, Stata, and TSP.
• Preferred TSP. Saw advantages to SAS, but found problems in PC SAS (of 1991).
• Correctly predicted movement toward matrix algebra oriented software such as GAUSS.
John Rust (1993), ‘GAUSS and MATLAB: A Comparison’
• Highlighted the advantages of matrix oriented programming for econometrics.
• Correctly predicted that users would soon be moving away from DOS.
• Incorrectly predicted that the move would be toward UNIX workstations.
Problems:
• Most other reviews just count features.
• Or worse, stress speed over all other issues.
• Within 2 years, the review is largely obsolete.
• After 5 years likely to be completely misleading if not irrelevant.
Speed Comparisons (by Stefan Steinhaus)
              Speed Score              |  Overall Score
              1997    1999    2002     |  1997    1999    2002
GAUSS         49.94   47.96   47.90    |  64.38   63.64   64.80
Mathematica    7.67   31.95   31.32    |  48.76   54.93   57.34
Matlab        39.98   34.64   65.89    |  60.03   55.85   69.74
Ox            66.21   68.12   62.22    |  47.30   49.22   58.45
O-Matrix      70.80   67.29   69.80    |  48.72   43.68   45.83
S-Plus        37.18   30.51   38.56    |  54.28   44.90   48.61
Source: http://www.scientificweb.com/ncrunch/index.html
Higher scores are better. 100 is the highest possible score in each year’s evaluation. Speed scores are not comparable across years. Overall score includes breadth of functionality and other usability considerations, using these weights: Mathematical functions 38%, Graphical functions 10%, Programming environment 9%, Data import/export 5%, Available operating systems 2%, Speed comparison 36%.
In pure speed comparisons (made comparable across years), a faster PC combined with a newer software version makes a formerly poor performer the top performer relative to the ‘best’ pairing of old hardware and software.
And how about SAS?
A Helicopter View of PC Technology
1981: IBM PC = $5,500 in today’s prices
64K memory, no hard disk, monochrome monitor, no networking capabilities
Today: Dell Pentium IV < $1000
1G memory, 3000x faster, 80 Gig hard drive, DVD/CD burner, flat screen color monitor, and built-in networking.
The Speed Issue: Moore’s Law in Action

Pentium   Clock speed   Year   MWIPS   Time index
I         120 MHz       1995      79   100.0
II        266 MHz       1997     218    36.2
III       550 MHz       1999     448    17.6
IV        1.8 GHz       2001     638    12.4
IV        3.6 GHz       2003    1342     5.9
IV        3.8 GHz       2004    3899     2.0

MWIPS = Millions of Whetstone Instructions Per Second. A higher MWIPS score is better (i.e. a faster chip), and doubling the MWIPS score translates to roughly 50% less time for an average numerical calculation.
Source: http://homepage.virgin.net/roy.longbottom/whetstone.htm
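The Time index column appears to be the inverse of the MWIPS score, scaled so that the 1995 Pentium I equals 100. The slides do not state the formula, so the function below is an assumption; it does, however, reproduce the tabulated values to one decimal place.

```python
# MWIPS figures copied from the table above; the 1995 Pentium I is the base.
MWIPS_BY_YEAR = {1995: 79, 1997: 218, 1999: 448, 2001: 638, 2003: 1342, 2004: 3899}


def time_index(mwips, base_mwips=79):
    """Relative time per calculation, with the base chip set to 100.

    Assumed formula (not stated in the slides): 100 * base / mwips.
    """
    return 100.0 * base_mwips / mwips


for year, mwips in MWIPS_BY_YEAR.items():
    print(year, round(time_index(mwips), 1))
```

Under this reading, doubling MWIPS halves the index, which is the "50% less time" rule of thumb in the note above.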
[Figure: MWIPS (scale 0 to 4500) and Time Index (scale 0 to 120) by year, 1995 to 2004]
Evaluation of Statistical Software
Three categories
1. Traditional programming languages: FORTRAN, C/C++, and Basic. Relatively new: Perl, Python, and Java.
2. Statistical packages: EVIEWS, SAS, STATA, and TSP.
3. Matrix algebra oriented computing software: GAUSS, Mathematica, MATLAB, R and Splus.
Speed & User Friendliness
Computation Speed: Fortran > C > C++ > Matlab > SAS > Perl
User Friendliness: SAS > Matlab > Perl > C++ > C > Fortran
Rankings of other languages/packages?? Java, VBasic, Stata, SPSS, SPlus/R
Are the speed differences significant?
Are ‘user’ elements only a matter of taste?
How can user friendliness and computation speed be combined in an evaluation?
Computing Speed: Only One Part of the Equation
Total Research Project Time
1. Planning
2. Data Management
3. Programming
4. Computation
5. Analysis of Results
6. Re-Evaluation (revisit & repeat prior steps)
Simple Model of Cost/Benefit (Time) Tradeoffs
Programming = (b0 + b1*x + b2*x^2) / (ease-factor)
Computation = (a0j + a1*x + a2*x^2) / (speed)
[Figure: three panels plotted against Complexity (0 to 10): User Programming (Time and Effort) Element; Computation Time Element; Differences in Costs (Total, Programming, Computation). Lines compare Package 1 (slow and easy) with Package 2 (fast and hard).]
Both programming and computing time depend on the complexity of the task, and the computing speed advantage of Package 2 may overwhelm the ease of use issue for modestly complex tasks.
Package 2 preferred for complexity level above 6
Simple Model of Cost/Benefit (Time) Tradeoffs
Programming = (b0 + b1*x + b2*x^2) / (ease-factor*2)
Computation = (a0j + a1*x + a2*x^2) / (speed*10)
[Figure: three panels plotted against Complexity (0 to 10): User Programming (Time and Effort) Element; Computation Time Element; Differences in Costs (Total, Programming, Computation). Lines compare Package 1 (slow and easy) with Package 2 (fast and hard).]
Increase in computing speed (relative to ease-factor) makes Package 1 a better choice for a larger range of tasks.
Threshold for preferring Package 2 rises
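The threshold shift can be illustrated with a small numerical sketch of the quadratic cost model. The slides give no coefficient values, so every number below (the coefficients `B` and `A`, and the ease-factor and speed pairs for each package) is hypothetical. The sketch also assumes that the ease-factor*2 and speed*10 scaling applies to both packages. It does not attempt to reproduce the threshold of 6 shown in the chart.

```python
# Hypothetical coefficients, chosen only for illustration (not from the slides).
B = (5.0, 2.0, 0.1)   # programming effort: b0, b1, b2
A = (0.0, 0.0, 0.5)   # computation load:   a0, a1, a2

# Hypothetical (ease-factor, speed) pairs.
PKG1 = (2.0, 1.0)     # Package 1: slow and easy
PKG2 = (1.0, 4.0)     # Package 2: fast and hard


def quad(coef, x):
    """Evaluate c0 + c1*x + c2*x^2."""
    c0, c1, c2 = coef
    return c0 + c1 * x + c2 * x * x


def total_time(x, ease, speed):
    """Programming time plus computation time at complexity x."""
    return quad(B, x) / ease + quad(A, x) / speed


def threshold(pkg1, pkg2, max_x=100.0, step=0.01):
    """Smallest grid complexity where Package 2 has the lower total time, else None."""
    n = int(round(max_x / step))
    for i in range(n + 1):
        x = i * step
        if total_time(x, *pkg2) < total_time(x, *pkg1):
            return x
    return None


base = threshold(PKG1, PKG2)
# Same packages after ease-factor doubles and speed rises tenfold for both.
scaled = threshold((PKG1[0] * 2, PKG1[1] * 10), (PKG2[0] * 2, PKG2[1] * 10))
print(base, scaled)
```

With these made-up numbers the crossover moves to a much higher complexity level once computation becomes cheap relative to programming, which is the qualitative point of the slide.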
Black-Scholes Calculation Speeds
1 million cases   System A (Sun V440)   System B (Pentium 4 PC)
C Program         3.0 seconds           1.5 seconds
Fortran           4.1                   --
Matlab            2.4                   1.4
SAS               4.6                   6.7
R                 --                    1.9
EXCEL VBA         --                    560
Perl              39.6                  --
* SAS code -- Black-Scholes option value calculation;
* S = Spot price, X = Exercise price, sigma = Stock return volatility;
* r = Risk-free bond rate, q = Dividend rate, tau = Time till maturity;
d1 = ( log(S/X) + ( r - q + 0.5*sigma*sigma ) * tau ) / ( sigma*sqrt(tau) );
d2 = d1 - sigma * sqrt(tau);
* Normal curve cumulative distribution function values;
N1 = cdf('normal',d1);
N2 = cdf('normal',d2);
Vc = ( S * exp(-q*tau) * N1 ) - ( X * exp(-r*tau) * N2 );
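For readers who want to check the formula outside SAS, here is a minimal Python sketch of the same call-value calculation. It is not the benchmark code behind the timings above. The function names `norm_cdf` and `black_scholes_call` are chosen here for illustration, and the normal CDF is built from the standard library's `math.erf`.

```python
import math


def norm_cdf(z):
    """Standard normal cumulative distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def black_scholes_call(S, X, r, q, sigma, tau):
    """European call value; arguments follow the SAS comments above."""
    d1 = (math.log(S / X) + (r - q + 0.5 * sigma * sigma) * tau) / (sigma * math.sqrt(tau))
    d2 = d1 - sigma * math.sqrt(tau)
    return S * math.exp(-q * tau) * norm_cdf(d1) - X * math.exp(-r * tau) * norm_cdf(d2)


# Illustrative inputs (not from the slides): at-the-money, one year to maturity.
print(round(black_scholes_call(S=100.0, X=100.0, r=0.05, q=0.0, sigma=0.2, tau=1.0), 2))
```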
SAS vs. MATLAB Computation Speed Comparison
Basic Statistics Example
Simulated Data: 1 million observations, 10 variables, in 10 groups

                 SAS            MATLAB
Data creation    3.6            1.4
Mean & std       1.6            1.4
Frequency        0.3            0.3
REG module       0.8            2.2
Sort by group    8.4            2.4
REG by group     1.1            1.4
sum              15.8 seconds   9.1 seconds

Bottom line:
• MATLAB is almost twice as fast in relative terms (it takes 42% less time in this example), but only 6.7 seconds faster in absolute terms.
• For most applications there are fewer than 1 million observations and the absolute difference is even smaller.
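The bottom-line figures follow directly from the step timings. The short sketch below recomputes the totals, the absolute gap, and the two ways of stating the relative gap; the variable names are illustrative only.

```python
# Step timings in seconds, copied from the comparison table.
sas = {"Data creation": 3.6, "Mean & std": 1.6, "Frequency": 0.3,
       "REG module": 0.8, "Sort by group": 8.4, "REG by group": 1.1}
matlab = {"Data creation": 1.4, "Mean & std": 1.4, "Frequency": 0.3,
          "REG module": 2.2, "Sort by group": 2.4, "REG by group": 1.4}

sas_total = sum(sas.values())
matlab_total = sum(matlab.values())

abs_diff = sas_total - matlab_total                    # seconds saved by MATLAB
pct_less_time = 100.0 * (1.0 - matlab_total / sas_total)
speed_ratio = sas_total / matlab_total                 # how many times faster

print(round(sas_total, 1), round(matlab_total, 1))
print(round(abs_diff, 1), round(pct_less_time), round(speed_ratio, 2))
```

The sort step alone accounts for 6.0 of the 6.7 seconds, so the gap is driven mostly by one operation.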
SAS vs. MATLAB Computation Speed Comparison
Is MATLAB’s speed advantage due to its matrix-based programming?
No. SAS also has an Interactive Matrix Language module (IML). Using SAS IML shows how alternative programming methods can matter (within the same package).
OLS Regression Example: 1 million observations, 10 variables

                     SAS IML   MATLAB
B= inv(X’X)*(X’y)    2.6       0.4
REG module           0.8       2.2

Programming the OLS matrix algebra equation in MATLAB beats MATLAB’s regress(.) function in terms of speed, while the opposite is true for SAS, where the REG module beats the IML matrix code.
Finance Research Example
CAPM (Beta) Test: Ri,t = αi + βi Rm,t
500 Beta Calculations
System A: Sun V440 (multi-user UNIX)
System B: Pentium 4 Windows PC
SAS: 1.3 / 2.5 seconds (System A), 1.2 seconds (System B)
MATLAB: 1.0 seconds
MATLAB ‘loop’ version: 17.3 seconds
• Multi-user UNIX system run time varies depending on load.
• MATLAB run time varies depending on program design: optimal vectorized code versus an inefficient loop.
• A true CAPM test would estimate multi-factor betas (βi) for 5,000 to 25,000 stocks over different sample periods. Summarizations require sorting into portfolios and applying two-stage estimation and testing techniques.
Example: SAS run = 40 minutes // MATLAB = 35 minutes
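As a reminder of what each of the 500 beta calculations involves, here is a minimal single-stock sketch. It assumes the standard one-factor OLS estimate (beta is the covariance of stock and market returns divided by the market variance); it is not the SAS or MATLAB code that was timed. The function name `capm_alpha_beta` and the sample returns are made up for illustration.

```python
def capm_alpha_beta(stock_returns, market_returns):
    """OLS estimates of alpha and beta in R_i,t = alpha_i + beta_i * R_m,t."""
    n = len(market_returns)
    if n != len(stock_returns) or n < 2:
        raise ValueError("need two return series of equal length (at least 2 points)")
    mean_s = sum(stock_returns) / n
    mean_m = sum(market_returns) / n
    # Sums of cross-products and squares around the means.
    cov = sum((s - mean_s) * (m - mean_m) for s, m in zip(stock_returns, market_returns))
    var = sum((m - mean_m) ** 2 for m in market_returns)
    beta = cov / var
    alpha = mean_s - beta * mean_m
    return alpha, beta


# Made-up returns: the stock is constructed as 0.002 + 1.5 * market.
market = [0.01, 0.02, -0.01, 0.03]
stock = [0.002 + 1.5 * m for m in market]
print(capm_alpha_beta(stock, market))
```

Running this function once per stock in a loop corresponds to the ‘loop’ design mentioned above; a vectorized version would compute all betas in one matrix operation.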
Conclusions:
Changes in technology change the equation for determining the best system; personal preferences are important.
Absolute speed (not relative speed) may matter, but programming time is overwhelmingly the larger component (in > 90% of the cases) anyway.
Software is not an either/or situation. Advice: Learn and use two or more software packages as complements.
Database management and connectivity are the key to the greatest possible flexibility.
Almost Counterintuitive General Conclusion:
Technological progress makes human factors and personal preferences most important.