02-09-2011 09:09 AM

Bootstraps, Permutation Tests, and Sampling With and Without Replacement Orders of Magnitude Faster Using SAS®

J.D. Opdyke,* DataMineIt

download at http://www.datamineit.com/DMI_publications.htm

A very efficient approach to random sampling in SAS® achieves speed increases orders of magnitude faster than the relevant “built-in” SAS® procedures. For sampling with replacement as applied to bootstraps, seven algorithms coded in SAS® are compared, and the fastest (“OPDY”), based on the new approach and using no modules beyond Base SAS®, achieves speed increases over 220x faster than Proc SurveySelect. OPDY also handles datasets many times larger than those on which two hashing algorithms crash. For sampling without replacement as applied to permutation tests, six algorithms coded in SAS® are compared, and the fastest (“OPDN”), based on the new approach and using no modules beyond Base SAS®, achieves speed increases over 215x faster than Proc SurveySelect, over 350x faster than NPAR1WAY (which crashes on datasets less than a tenth the size OPDN can handle), and over 720x faster than Proc Multtest. OPDN utilizes a simple draw-by-draw procedure that allows for the repeated creation of many without-replacement permutation samples without requiring any additional storage or memory space. Based on these results, there appear to be no faster or more scalable algorithms in SAS® for bootstraps, permutation tests, or sampling with or without replacement.

Keywords: Bootstrap, Permutation, SAS, Scalable, Replacement, Sampling

JEL Classifications: C12, C13, C14, C15, C63, C88

Mathematics Subject Classification: 62G09, 62G10, 62F40

© 2011 by John Douglas Opdyke. All rights reserved. Short sections of text, not to exceed two paragraphs, may be quoted without explicit permission provided that full credit, including © notice, is given to the source.

* J.D. Opdyke is Managing Director of Quantitative Strategies at DataMineIt, a consultancy specializing in applied statistical, econometric, and algorithmic solutions for the financial and consulting sectors. Clients include multiple Fortune 50 banks and credit card companies, Big 4 and economic consulting firms, venture capital firms, and large marketing and advertising firms. J.D. has been a SAS® user for over 20 years and routinely writes SAS® code faster (often orders of magnitude faster) than SAS® Procs (including but not limited to Proc Logistic, Proc MultTest, Proc Summary, Proc Means, Proc NPAR1WAY, Proc Plan, and Proc SurveySelect). He earned his undergraduate degree from Yale University, his graduate degree from Harvard University where he was a Kennedy Fellow, and has completed additional post-graduate work as an ASP Fellow in the graduate mathematics department at MIT. Additional peer-reviewed publications of his, spanning number theory/combinatorics, statistical finance, statistical computation, applied econometrics, and hypothesis testing for statistical quality control, can be accessed at www.DataMineIt.com.
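For readers wondering what a draw-by-draw, without-replacement procedure can look like, here is a minimal Python sketch (it is not the paper's SAS macro). It assumes the classic sequential selection idea: walk the records once and pick each one with probability (picks still needed) / (records not yet seen), which gives exactly the requested number of picks with no duplicates and no shuffled copy of the data. The function names, the difference-in-means statistic, and the sample values are all hypothetical; whether OPDN works exactly this way should be checked against the paper.

```python
import random


def draw_sample(values, n, rng):
    """One pass over `values`, returning exactly n items without replacement.

    Assumed technique: sequential selection sampling (illustrative only).
    """
    needed = n
    total = len(values)
    chosen = []
    for i, v in enumerate(values):
        if needed == 0:
            break
        # Pick this record with probability needed / (records not yet seen).
        if rng.random() * (total - i) < needed:
            chosen.append(v)
            needed -= 1
    return chosen


def permutation_mean_diffs(values, n_treat, n_perm, seed=0):
    """Difference in group means for n_perm random relabelings.

    Only a running sum is kept per permutation, so no permuted copy of the
    data is ever stored. Requires 0 < n_treat < len(values).
    """
    rng = random.Random(seed)
    total = len(values)
    grand_sum = sum(values)
    diffs = []
    for _ in range(n_perm):
        needed = n_treat
        treat_sum = 0.0
        for i, v in enumerate(values):
            if needed == 0:
                break
            if rng.random() * (total - i) < needed:
                treat_sum += v
                needed -= 1
        # The control group's sum follows from the grand total.
        ctrl_sum = grand_sum - treat_sum
        diffs.append(treat_sum / n_treat - ctrl_sum / (total - n_treat))
    return diffs


# Hypothetical usage: the first 4 values are the "treatment" group.
data = [12.1, 9.8, 11.4, 10.9, 8.7, 9.1, 10.2, 8.9, 9.5, 9.0]
observed = sum(data[:4]) / 4 - sum(data[4:]) / 6
diffs = permutation_mean_diffs(data, n_treat=4, n_perm=2000, seed=1)
p_value = sum(abs(d) >= abs(observed) for d in diffs) / len(diffs)
```

Each permutation costs one pass over the data at most, so total work grows linearly with both the number of records and the number of permutation samples.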

Posted in reply to jdopdyke

02-09-2011 09:52 AM

Very nice ideas. Not surprising that the data step is faster than the procs. The permutation testing is very nice.

The bootstrap is more limited in that it only works for univariable statistics that have a closed form. It does not appear as useful for a more complex statistic, such as a regression or rank-based one.

Kudos for the work, but I think that I am still going to need my old %bootsamp macro.

Doc Muhlbaier

Duke

Posted in reply to Doc_Duke

02-09-2011 10:17 AM

They are extremely fast and efficient implementations: not just notions or vague ideas, but actual code in the papers, in the form of modular, easily usable SAS macros. All you have to do is change the libname reference and a half dozen macro variable values when you call the macro.

I’d actually argue that it is quite surprising that they are faster BY ORDERS OF MAGNITUDE, not just 10% or 50% or even 100% … they are 70,000% faster (700x!). That is notable, and since SAS has been around for three decades, it is surprising.

And I actually have a version of the bootstrap OPDY algo already running for multivariate regression, so that is not a problem for it (take a closer look and you'll see why). And I'd bet the farm that your old %bootsamp macro, if anything like those I've seen posted, is orders of magnitude slower, which of course is not a problem per se if you're using smallish datasets.

The bootstrap OPDY algo ALSO can be used for many statistics/econometrics without closed forms, depending on the specific situation. I've simply used fast convergence algorithms where needed to estimate, with precision to the tenth decimal place, the necessary parameters.

So both the OPDY and OPDN algos, in SAS, are very general, and with just a little head scratching, very generalizable.

AND they're more robust: they execute on datasets at least an order of magnitude larger than those hash tables/objects crash on, and the same goes for some Procs (e.g. Proc NPAR1WAY - see papers for details).

So OPDN and OPDY are a) easily adaptable to many multivariate statistics/econometrics; b) easily adaptable to some statistics/econometrics without closed forms; c) orders of magnitude faster than any other implementation in SAS; d) dramatically more scalable than any other implementation in SAS because they have linear time complexity (which is not true of the relevant SAS Procs); AND e) more robust in terms of dataset size by more than an order of magnitude.

Tough to beat.
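To make the with-replacement and linear-time points in this exchange concrete, here is a minimal Python sketch of a bootstrap of the mean. It is not the OPDY macro. It assumes the data sit in an in-memory array and that each bootstrap draw is a random index into that array, with the statistic accumulated on the fly so that no resampled dataset is written out. The function names, the choice of the mean, and the percentile interval are hypothetical choices for illustration.

```python
import random


def bootstrap_means(values, n_boot, seed=0):
    """Mean of each of n_boot with-replacement resamples of `values`.

    Work is proportional to n_boot * len(values), i.e. linear in the
    number of records for a fixed number of resamples.
    """
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(n_boot):
        running_sum = 0.0
        for _ in range(n):
            # One with-replacement draw: a uniform random index.
            running_sum += values[rng.randrange(n)]
        means.append(running_sum / n)
    return means


def percentile_interval(stats, alpha=0.05):
    """Simple percentile interval from a list of bootstrap statistics."""
    ordered = sorted(stats)
    k = len(ordered)
    lo = max(0, int(alpha / 2 * k))
    hi = min(k - 1, int((1 - alpha / 2) * k) - 1)
    return ordered[lo], ordered[max(lo, hi)]


# Hypothetical usage on made-up values.
data = [12.1, 9.8, 11.4, 10.9, 8.7, 9.1, 10.2, 8.9, 9.5, 9.0]
means = bootstrap_means(data, n_boot=1000, seed=42)
low, high = percentile_interval(means, alpha=0.05)
```

A statistic without a closed form would replace the running sum with whatever per-resample estimation it needs, which is where the cost per resample can grow beyond one pass.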
