Hi, I have a very large data set (up to 5M observations) ranging from 2004 through 2019 by month. I would like to randomly select 10% of my data to generate a insample data for modeling. Since the dataset is very large, a normal proc survey select takes forever. I am wondering whether I can use ds2 to perform random sample selection. Thanks.
You say PROC SURVEYSELECT "takes forever." Can you post the code you are using? Do you need to perform stratified sampling by month, or do you just want 10% of the total number of observations?
Most of the time for extracting a subset is reading the original data and writing the sample, which will not change if you use the DATA step or DS2. Here is an example of using SURVEYSELECT to extract 10% of a data set with 5M observations. It takes about 2.5 seconds for this example.
%let N = 5000000;
data Have;
array x[5];
do i = 1 to &N;
do j = 1 to dim(x);
x[j] = i + j;;
output;
end;
end;
run;
proc surveyselect data=Have out=Sample seed=12345
method=srs /* sample w/o replacement */
samprate=0.1; /* number of observations in sample */
run;
If you have twice as many variables, the writing phase will take twice as long. If you have many character variables, that will also slow down the writing, but I don't think either of those facts will change if you use DS2, so consider using a KEEP statement on the DATA= statement to limit the input data set to only those variables that will be used in subsequent analyses.
I use the same proc surveyselect here as you have. I even tried proc sql combining with "having ranuni(xxx) between 0.45 and 0.55" and the amount of time it takes too run is too long
Please post the SAS log for your SURVEYSELECT call.
@tampham92 wrote:
Hi, I have a very large data set (up to 5M observations) ranging from 2004 through 2019 by month. I would like to randomly select 10% of my data to generate a insample data for modeling. Since the dataset is very large, a normal proc survey select takes forever. I am wondering whether I can use ds2 to perform random sample selection. Thanks.
Methinks that you maybe are leaving out details. The following code created a data set with 5,000,000 records and selected a 10% sample in a total of about 0.4 seconds;
1 data work.junk;
2 do i= 1 to 5000000;
3 output;
4 end;
5 run;
NOTE: The data set WORK.JUNK has 5000000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 0.16 seconds
cpu time 0.12 seconds
6
7 proc surveyselect data=work.junk
8 out=work.selected samprate=0.1;
9 run;
NOTE: The data set WORK.SELECTED has 500000 observations and 1 variables.
NOTE: PROCEDURE SURVEYSELECT used (Total process time):
real time 0.23 seconds
cpu time 0.15 seconds
The created data set may have been cached but still, I have a hard time understanding why the code might "take forever"?
Is the source data set on a network drive with performance issues? Is your session running on a server with marginal resources or your user settings limiting resources? Other network traffic involved?
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.