I'm working on a dataset that has over 7 million records and a few hundred variables. I want a small sample, around 1,000 will do, to mess around with graphs and tables and things so every experiment doesn't have to be run on the large data set. Random records are fine, but I do need a few from each year I have data for (55 years). Here's what I have, but it's been running all day and I'm wondering if there is a more efficient way to do this, or if this will even work.
proc surveyselect data=bigdataset out=sampledataset sampsize=1000;
strata year;
run;
Thank you!
Try to load the file into RAM first using SASFILE if you have enough then to make it super fast. But check with your administrator first - are you using a SAS Server? It's possible they've limited resources as this isn't a data intensive SURVEYSELECT so any other solution I suspect will be less efficient though it's trivial to do manually.
https://documentation.sas.com/doc/en/vdmmlcdc/1.0/lestmtsref/n0osyhi338pfaan1plin9ioilduk.htm
Agree with @Reeza . "Running all day" is surprising for this amount of data.
Are you using local PC SAS, with data stored on a network?
I would try something like:
data want;
set mylib.bigdata(obs=10000); *increase obs and see how it slows down;
run;
And keep increasing the value for obs unless/until it becomes painful. If you can pull the whole 7M record down to your work library in a reasonable amount of time, maybe do that and run SURVEYSELECT against the local dataset.
But this seems suspicious. You might want to have a chat with your local support about network speeds. If you're at home on WiFi, try plugging into the router.
Your proc surveyselect will select 1000 obs per year. I am not a stats-guy, so i can't suggest a fix for the code.
Maybe posting proc contents of the dataset is helpful, so that we actually know how big "bigdataset" is.
I would doubt that PROC SUVERYSELECT running so slowly ,
Maybe it is due to you have hundreds of variables in big table .
data have;
call streaminit(123456789);
do year=1 to 55;
do i=1 to 1E6;
var=rand('uniform');
output;
end;
end;
drop i ;
run;
data temp;
set have;
id+1;
run;
proc surveyselect noprint data=temp seed=123 out=key(keep=id) sampsize=1000;
strata year;
run;
data want;
if _n_=1 then do;
if 0 then set key;
declare hash h(dataset:'key',hashexp:20);
h.definekey('id');
h.definedone();
end;
set temp;
if h.check()=0;
drop id;
run;
Doesn't sound like it should be taking that long. But if the size of the dataset (mainly number of variables) is an issue then perhaps run the SURVEYSELECT on a smaller dataset.
data rows;
set bigdatset(keep=year);
point=_n_;
run;
proc surveyselect data=rows out=sample_rows sampsize=1000;
strata year;
run;
data sampledataset ;
set sample_rows(keep=point);
set bigdatset point=point;
run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.