Bootstrap - with really large data set

Contributor
Posts: 62

Hello,

I am working with a large data set (around 7 million observations) and I need to use the bootstrap to calculate some estimates.

I use this code for bootstrap:

data outboot / view=outboot;
  do replicate=1 to 1000;
    do _i=1 tu numrecs;
      p=int(1+numrecs*(ranuni(39573293)));
      set mydata point=p nobs=numrecs;
      output;
    end;
  end;
run;

But afterwards I need to run some more procedures on each sample (such as a logistic regression), and here SAS takes ages. Any suggestions on how to make it run faster?

Thanks

Super User
Posts: 7,401

Re: Bootstrap - with really large data set

Hi,


Well, you have a typo in that code: tu rather than to. I am not sure about SAS views, but SQL views are basically just the stored code, not the actual data, so each time you access the view you re-run its code rather than just reading data. So my suggestion would be to create a physical dataset (i.e. remove the /view) and use that instead of re-running the code on each access.

Other than that, I would need to see specifics to give more help. With 7 million records, are you sure you can't process the data in blocks, make sure it's pre-sorted, etc.? It is still going to take a while with that number of rows, but there may be small modifications that speed things up.

Contributor
Posts: 62

Re: Bootstrap - with really large data set

I know that a data view does not hold real data, but the problem is that if I try to create a real dataset (and I need 1000 samples), SAS runs out of memory when I use this code:

sasfile mydata load;

data outboot;
  do replicate=1 to 1000;
    do _i=1 to numrecs;
      p=int(1+numrecs*(ranuni(39573293)));
      set mydata point=p nobs=numrecs;
      output;
    end;
  end;
run;

sasfile mydata close;

And I need to use all 7 million records; they are my entire population of interest.

Super User
Posts: 7,401

Re: Bootstrap - with really large data set

What software are you using? Your IT group may be able to set some additional parameters to allow SAS access to more memory; unfortunately I can't help more there.

From my side, if you're running out of memory creating the data, then you will end up having to use some sort of intermediary, e.g. a swap file or partial processing. This will make things run slower, however, as it has to load some data, do some processing, save that data, etc. rather than just accessing the data directly. Maybe get some more memory if you're dealing with large data, or shift the processing and memory usage onto a server rather than running locally.

Super User
Posts: 6,939

Re: Bootstrap - with really large data set

So you generate a new dataset that has 1000 * 7,000,000 records. No wonder you run out of disk space and time. Even if you only need 10 microseconds per iteration, that still amounts to 70,000 seconds, close to a day.
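The estimate above can be reproduced with a throwaway DATA step. This is only a sketch of the arithmetic; the 10 microseconds per row is the assumed figure from this post, not a measurement, and the variable names are made up for illustration.

```sas
/* Back-of-the-envelope runtime estimate; all inputs are assumptions */
data _null_;
  reps        = 1000;       /* bootstrap replicates */
  nrows       = 7000000;    /* rows drawn per replicate */
  sec_per_row = 10e-6;      /* assumed cost per row: 10 microseconds */
  total_sec   = reps * nrows * sec_per_row;
  total_hours = total_sec / 3600;
  put total_sec= total_hours=;   /* written to the SAS log */
run;
```

That works out to 70,000 seconds, or roughly 19.4 hours, before any modelling is run on the samples.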

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Super User
Posts: 6,939

Re: Bootstrap - with really large data set

Another thing to watch: since you have a set statement, the data step will keep iterating (beginning each time with a new do replicate = 1 to 1000) until it reaches an end-of-file, which it will never do with the point= option. This causes an infinite loop anyway.

Put a stop; statement immediately before the run; statement.
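For reference, a minimal sketch of the resampling step with the loop bounds spelled correctly (`to numrecs`) and the STOP added. It uses the same data set and variable names as the original post and is not tuned in any way; dialing the replicate count down for a first test is probably wise.

```sas
/* Sketch only: same logic as the posted step, plus STOP */
data outboot;
  do replicate = 1 to 1000;
    do _i = 1 to numrecs;                       /* numrecs comes from NOBS= below */
      p = int(1 + numrecs * ranuni(39573293));  /* random row number in 1..numrecs */
      set mydata point=p nobs=numrecs;          /* direct-access read of row p */
      output;
    end;
  end;
  stop;  /* POINT= never reaches end-of-file, so end the step explicitly */
run;
```

Note that this still writes 1000 times the number of input rows, so it fixes the infinite loop but not the size or speed problem.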

Respected Advisor
Posts: 3,777

Re: Bootstrap - with really large data set

As mentioned, your program is an infinite loop; however, I think you're doing the sampling all wrong. You should be using PROC SURVEYSELECT.

Read this http://www2.sas.com/proceedings/forum2007/183-2007.pdf before you do anything else.

Contributor
Posts: 62

Re: Bootstrap - with really large data set

To data _null_,

Regarding the paper you reference, http://www2.sas.com/proceedings/forum2007/183-2007.pdf: I use the newer method and code that the same authors suggested in 2011.

Respected Advisor
Posts: 3,777

Re: Bootstrap - with really large data set

Please provide the link.

Use PROC SURVEYSELECT.
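A minimal sketch of what that could look like, assuming the input data set is called mydata as in the original post; the output name, seed and replicate count are simply carried over from the DATA step version.

```sas
/* Sketch: bootstrap resampling with PROC SURVEYSELECT */
proc surveyselect data=mydata out=outboot
     seed=39573293
     method=urs     /* unrestricted random sampling, i.e. with replacement */
     samprate=1     /* each replicate draws as many rows as the input has */
     reps=1000;     /* number of bootstrap replicates */
run;
```

Without the OUTHITS option the output holds one row per distinct selected record, with a NumberHits variable counting how often it was drawn and a Replicate variable identifying the sample. That is smaller than writing every draw as its own row, though with 1000 replicates of 7 million records it will still be a very large data set.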

Contributor
Posts: 62

Re: Bootstrap - with really large data set

Respected Advisor
Posts: 3,777

Re: Bootstrap - with really large data set

I tested that code with STOP to make it work properly, but with a rather small data set. When you used it, what was the code you executed? I ask because in the code in the paper, PROC UNIVARIATE needs NOPRINT added to make it produce only the output data set. If the PROC LOGISTIC you are running is producing printed output, that could be contributing to your extended execution time. Add the STOP and dial back the reps until you get your entire program working, and post that part here too.

Also, do you really need this kind of sample? This is a sample of size N drawn with replacement. Could you get the same result with smaller samples?
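To illustrate the point about printed output, here is a hedged sketch of one BY-group call that fits the model once per replicate and writes only a data set. It assumes outboot contains Replicate and NumberHits columns, as PROC SURVEYSELECT writes them when OUTHITS is omitted; death, spc, var1 and var2 are placeholder names taken from this thread, and boot_est is a made-up output name.

```sas
/* Sketch only: data set and variable names are assumptions */
proc logistic data=outboot desc noprint outest=boot_est;
  by Replicate;       /* one model fit per bootstrap replicate */
  freq NumberHits;    /* count each row as many times as it was drawn */
  model death = spc var1 var2;
run;
```

If the samples come from the DATA step version instead, every draw is already its own row, so the FREQ statement would be dropped and the BY variable would be replicate.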

Super User
Posts: 6,939

Re: Bootstrap - with really large data set

What I don't get:

This just produces 1000 reshufflings of the original data set, with some records omitted by chance and some records multiplied by chance. And it uses the worst imaginable method to achieve that.

Contributor
Posts: 62

Re: Bootstrap - with really large data set

This is the first time I have used the bootstrap, so I am not sure whether the method I use is suitable for my case.

In general, for each sample I also need to calculate the relative risk using the code below:

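/* Stack sc twice: spc=1 in the first copy, spc=0 in the second */
data pop;
  set sc (in=a)
      sc (in=b);
  if a then spc=1;
  if b then spc=0;
run;

/* Fit the model on sc and score the stacked copy */
proc logistic data=sc desc;
  class spc/param=ref ref=first;
  model death=spc var1 var2.../rl;
  score data=pop out=pred_risk;
run;

/* Mean predicted risk for each spc level */
proc means data=pred_risk nway;
  class spc;
  var p_1;
  output out=pop_risk mean=pop_risk;
run;

/* One row with columns spc_0 and spc_1 */
proc transpose data=pop_risk out=pop_risk prefix=spc_;
  id spc;
  var pop_risk;
run;

data pop_risk;
  set pop_risk;
  adj_rr=spc_1/spc_0;  /* adjusted relative risk */
run;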

Super User
Posts: 6,939

Re: Bootstrap - with really large data set

And a final one: I just tested your point= access method with a data set that usually takes ~1 minute to read sequentially, without the factor of 1000 replications, of course. I stopped it after 2 hours of runtime and don't even know how far it got.

Bottom line: use another method to generate your sample(s), unless you have a LOT of time on your hands.

I suggest you run a simple data step that assigns a random number to each row, and then sort the data set by that number. Then you can read it sequentially for your tests.
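A minimal sketch of that suggestion, assuming the input is called mydata as in the original post; shuffled and _rand are names made up for illustration.

```sas
/* Sketch: attach a random key to every row, then sort by it */
data shuffled;
  call streaminit(39573293);   /* fix the seed so the order is reproducible */
  set mydata;
  _rand = rand('uniform');     /* random sort key in (0,1) */
run;

proc sort data=shuffled;
  by _rand;
run;
```

Both steps read the data sequentially, which is the point of the suggestion. Keep in mind that this yields a random reordering in which each row appears exactly once, not a sample drawn with replacement.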

Update:

Sequential read of ~6 million records: 30 seconds.

Reading ~60,000 records with point=: 1 minute.

So be prepared for it to take about 200 times as long as reading your 7 million records sequentially. And that is for one single pass!
