## Fast quasi-random sampling

Solved
Regular Contributor
Posts: 161

# Fast quasi-random sampling

Dear All:

I would like to run my code on a somewhat random sample of my main dataset before running it on the full dataset.  I know that proc surveyselect does this in a rigorous manner, but it takes time to run proc surveyselect and generate the sample dataset.

My question is: is there a way to extract a somewhat random subset from the main dataset as fast as possible?  I'm thinking of extracting every Nth observation; for example, taking every 10th observation from 10,000 lines gives a 10% sample.  Something like that?  I only know the firstobs= and obs= options, but they don't seem to fit the bill.

The main priority here is speed: extracting such a sample without extensive I/O and without indexing, sorting, etc.

Thank you, guys!

Accepted Solutions
Solution
‎09-09-2014 01:33 PM
Super User
Posts: 6,789

## Re: Fast quasi-random sampling

You can avoid reading the entire data set in this way:

data subset;
  do _n_=1000 to _nobs_ by 1000;
    set hugefile point=_n_ nobs=_nobs_;
    output;
  end;
  stop;
run;

That will retrieve observation #1000, then #2000, then #3000, and so on.  You'll get speed with less I/O, but you'll have to try it to see how much you actually save.

Don't forget the STOP statement or your DATA step will become an infinite loop (making it quite difficult to save on I/O).
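A small variant on the same idea: starting the stride at a random offset keeps the step from always picking the same physical rows on every run.  This is just a sketch under the same assumptions (a dataset named hugefile and a 1-in-1000 stride); try it on your own data:

```sas
data subset;
  /* random starting point in 1..1000, so repeated runs
     don't always sample the identical observations */
  _start_ = ceil(ranuni(0) * 1000);
  do _n_ = _start_ to _nobs_ by 1000;
    set hugefile point=_n_ nobs=_nobs_;
    output;
  end;
  stop;
  drop _start_;
run;
```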

All Replies
Regular Contributor
Posts: 233

## Re: Fast quasi-random sampling

You may not get an exact count, but the code below gets approximately the right proportion.

data test1;
  do i=1 to 10000;
    if ranuni(123) <= 0.1 then output test1;
  end;
run;

Log:

1                                                          The SAS System                           12:37 Tuesday, September 9, 2014

15         data test1;
16         do i=1 to 10000;
17              if ranuni(123) <= 0.1 then output test1;
18         end;
19         run;

NOTE: The data set WORK.TEST1 has 999 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

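If the goal is the actual sample rather than a count, the same ranuni test can be used as a subsetting IF while reading the main dataset.  A sketch, assuming the main dataset is named hugefile; note that unlike the POINT= approach, this still reads every observation, so the I/O saving is limited:

```sas
data sample10;
  set hugefile;
  if ranuni(123) <= 0.1;  /* keep roughly 10% of observations */
run;
```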

Regular Contributor
Posts: 161