BookmarkSubscribeRSS Feed
tampham92
Obsidian | Level 7

Hi, I have a very large data set (up to 5M observations) ranging from 2004 through 2019 by month. I would like to randomly select 10% of my data to generate a insample data for modeling. Since the dataset is very large, a normal proc survey select takes forever. I am wondering whether I can use  ds2 to perform random sample selection. Thanks.

4 REPLIES 4
Rick_SAS
SAS Super FREQ

You say PROC SURVEYSELECT "takes forever." Can you post the code you are using? Do you need to perform stratified sampling by month, or do you just want 10% of the total number of observations?

 

Most of the time for extracting a subset is reading the original data and writing the sample, which will not change if you use the DATA step or DS2. Here is an example of using SURVEYSELECT to extract 10% of a data set with 5M observations. It takes about 2.5 seconds for this example.

 

%let N = 5000000;
data Have;
array x[5];
do i = 1 to &N;
   do j = 1 to dim(x);
      x[j] = i + j;;
      output;
   end; 
end;
run;

proc surveyselect data=Have out=Sample seed=12345
     method=srs       /* sample w/o replacement */
     samprate=0.1;     /* number of observations in sample */
run;

 

If you have twice as many variables, the writing phase will take twice as long. If you have many character variables, that will also slow down the writing, but I don't think either of those facts will change if you use DS2, so consider using a KEEP statement on the DATA= statement to limit the input data set to only those variables that will be used in subsequent analyses.

tampham92
Obsidian | Level 7

I use the same proc surveyselect here as you have. I even tried proc sql combining with "having ranuni(xxx) between 0.45 and 0.55" and the amount of time it takes too run is too long

Rick_SAS
SAS Super FREQ

Please post the SAS log for your SURVEYSELECT call.

ballardw
Super User

@tampham92 wrote:

Hi, I have a very large data set (up to 5M observations) ranging from 2004 through 2019 by month. I would like to randomly select 10% of my data to generate a insample data for modeling. Since the dataset is very large, a normal proc survey select takes forever. I am wondering whether I can use  ds2 to perform random sample selection. Thanks.


Methinks that you maybe are leaving out details. The following code created a data set with 5,000,000 records and selected a 10% sample in a total of about 0.4 seconds;

1    data work.junk;
2      do i= 1 to  5000000;
3      output;
4      end;
5    run;

NOTE: The data set WORK.JUNK has 5000000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           0.16 seconds
      cpu time            0.12 seconds


6
7    proc surveyselect data=work.junk
8       out=work.selected samprate=0.1;
9    run;

NOTE: The data set WORK.SELECTED has 500000 observations and 1 variables.
NOTE: PROCEDURE SURVEYSELECT used (Total process time):
      real time           0.23 seconds
      cpu time            0.15 seconds

The created data set may have been cached but still, I have a hard time understanding why the code might "take forever"?

 

Is the source data set on a network drive with performance issues? Is your session running on a server with marginal resources or your user settings limiting resources? Other network traffic involved?

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 4 replies
  • 1204 views
  • 0 likes
  • 3 in conversation