A very efficient approach to random sampling in SAS® runs orders of magnitude faster than the relevant “built-in” SAS® procedures. For sampling with replacement as applied to bootstraps, seven algorithms coded in SAS® are compared, and the fastest (“OPDY”), based on the new approach and using no modules beyond Base SAS®, runs over 220x faster than Proc SurveySelect. OPDY also handles datasets many times larger than those on which two hashing algorithms crash. For sampling without replacement as applied to permutation tests, six algorithms coded in SAS® are compared, and the fastest (“OPDN”), based on the new approach and using no modules beyond Base SAS®, runs over 215x faster than Proc SurveySelect, over 350x faster than Proc NPAR1WAY (which crashes on datasets less than a tenth the size OPDN can handle), and over 720x faster than Proc Multtest. OPDN uses a simple draw-by-draw procedure that allows many without-replacement permutation samples to be created repeatedly without requiring any additional storage or memory space. Based on these results, there appear to be no faster or more scalable algorithms in SAS® for bootstraps, permutation tests, or sampling with or without replacement.
* J.D. Opdyke is Managing Director of Quantitative Strategies at DataMineIt, a consultancy specializing in applied statistical, econometric, and algorithmic solutions for the financial and consulting sectors. Clients include multiple Fortune 50 banks and credit card companies, Big 4 and economic consulting firms, venture capital firms, and large marketing and advertising firms. J.D. has been a SAS® user for over 20 years and routinely writes SAS® code that runs faster (often orders of magnitude faster) than SAS® Procs (including but not limited to Proc Logistic, Proc MultTest, Proc Summary, Proc Means, Proc NPAR1WAY, Proc Plan, and Proc SurveySelect). He earned his undergraduate degree from Yale University and his graduate degree from Harvard University, where he was a Kennedy Fellow, and has completed additional post-graduate work as an ASP Fellow in the graduate mathematics department at MIT. His other peer-reviewed publications, spanning number theory/combinatorics, statistical finance, statistical computation, applied econometrics, and hypothesis testing for statistical quality control, can be accessed at www.DataMineIt.com.
re: Bootstrap, Permutation Test, Sampling Orders of Magnitude Faster v SAS Procs
Posted: Feb 9, 2011 10:17 AM in response to: Doc@Duke
They are extremely fast and efficient implementations, not just notions or vague ideas, but actual code in the papers: modular, easily usable SAS macros. All you have to do is change the libname reference and a half dozen macro variable values when you call the macro.
I’d actually argue that it is quite surprising that they are faster BY ORDERS OF MAGNITUDE, not just 10% or 50% or even 100% … they are 70,000% faster (700x!). That is notable, and since SAS has been around for three decades, it is surprising.
And I actually have a version of the bootstrap OPDY algo already running for multivariate regression, so that is not a problem for it (take a closer look and you'll see why). And I'd bet the farm that your old %bootsamp macro, if it is anything like those I've seen posted, is orders of magnitude slower, which of course is not a problem per se if you're using smallish datasets.
The bootstrap OPDY algo ALSO can be used for many statistics/econometrics without closed forms, depending on the specific situation. I've simply used fast convergence algorithms where needed to estimate, with precision to the tenth decimal place, the necessary parameters.
So both the OPDY and OPDN algos, in SAS, are very general, and with just a little head scratching, very generalizable.
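To make the multivariate point concrete, here is a minimal sketch of with-replacement resampling by row index. It is written in Python purely for illustration and is not the OPDY macro from the papers; the names `bootstrap` and `ols_slope` and the toy data are hypothetical. The assumption is only that the data sit in an in-memory array and that each draw picks a whole row, so any statistic computed across columns (here a regression slope) can be plugged in.

```python
import random


def ols_slope(pairs):
    """Slope of y on x from an iterable of (x, y) pairs, using running sums."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in pairs:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n * sxy - sx * sy) / (n * sxx - sx * sx)


def bootstrap(rows, stat, n_boot, seed=0):
    """Evaluate stat on n_boot with-replacement resamples of rows."""
    rng = random.Random(seed)
    n = len(rows)
    results = []
    for _ in range(n_boot):
        # Draw n row indices with replacement. Each draw is O(1), and whole
        # rows are picked, so all columns of a record stay together.
        resample = (rows[rng.randrange(n)] for _ in range(n))
        results.append(stat(resample))
    return results


# Hypothetical toy data: y = 2x + 1 exactly, so every resampled slope is 2.
rows = [(float(x), 2.0 * x + 1.0) for x in range(50)]
slopes = bootstrap(rows, ols_slope, n_boot=200, seed=42)
```

Each resample costs one pass of n constant-time index draws, and the resample is consumed as it is generated rather than stored, so total work grows linearly with the number of draws.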
AND they're more robust: they execute on datasets at least an order of magnitude larger than those hash tables/objects crash on, and the same goes for some Procs (e.g. Proc NPAR1WAY - see papers for details).
So OPDN and OPDY are a) easily adaptable to many multivariate statistics/econometrics; b) easily adaptable to some statistics/econometrics without closed forms; c) orders of magnitude faster than any other implementation in SAS; d) dramatically more scalable than any other implementation in SAS because they have linear time complexity (which is not true of the relevant SAS Procs); AND e) more robust in terms of dataset size by more than an order of magnitude.
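For the without-replacement side, one common way to get a draw-by-draw sample with no extra storage is a partial Fisher-Yates shuffle done in place. The sketch below is in Python for illustration only; it is an assumption about the general technique, not the OPDN macro itself, and `partial_shuffle`, `permutation_test`, and the sample values are hypothetical names and data.

```python
import random


def partial_shuffle(values, k, rng):
    """Move a uniform random k-subset of values into positions 0..k-1, in place."""
    n = len(values)
    for i in range(k):
        # Pick one of the not-yet-drawn positions i..n-1 and swap it forward.
        j = rng.randrange(i, n)
        values[i], values[j] = values[j], values[i]


def permutation_test(treated, control, n_perm, seed=0):
    """One-sided p-value for mean(treated) - mean(control) by re-randomization."""
    rng = random.Random(seed)
    pooled = list(treated) + list(control)
    k, m = len(treated), len(control)
    total = sum(pooled)
    observed = sum(treated) / k - (total - sum(treated)) / m
    hits = 0
    for _ in range(n_perm):
        # Reuse the same array every time: after earlier swaps it is still a
        # permutation of the pooled data, so no reset or copy is needed.
        partial_shuffle(pooled, k, rng)
        s = sum(pooled[i] for i in range(k))
        if s / k - (total - s) / m >= observed:
            hits += 1
    return hits / n_perm


# Hypothetical data with clearly separated groups.
p_value = permutation_test([10, 11, 12], [1, 2, 3, 4, 5], n_perm=2000, seed=7)
```

Each permutation sample takes k swaps regardless of how many samples came before, so the total cost is linear in the number of draws and the only memory used is the pooled array itself.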