BookmarkSubscribeRSS Feed
smg3141
Obsidian | Level 7

I'm working on a dataset that has over 7 million records and a few hundred variables. I want a small sample, around 1,000 will do, to mess around with graphs and tables and things so every experiment doesn't have to be run on the large data set. Random records are fine, but I do need a few from each year I have data for (55 years). Here's what I have, but it's been running all day and I'm wondering if there is a more efficient way to do this, or if this will even work. 

 

proc surveyselect data=bigdataset out=sampledataset sampsize=1000;

strata year;

run;

 

Thank you!

9 REPLIES 9
Reeza
Super User
What are your computer specs? I would expect that to run in under 15 minutes typically on a standard desktop with 8GB of RAM.
Is the data sorted by year ahead of time?

Are you familiar with the option obs= option in SAS to help you test programs? Though I suspect you need a random sample to be able to test it all since it has multiple years.

smg3141
Obsidian | Level 7
Yes, the data is sorted by year. My computer is good, 16GB of RAM. Everything is being done over our network though, so maybe that's the reason it's so slow.
Reeza
Super User

Try to load the file into RAM first using SASFILE if you have enough then to make it super fast. But check with your administrator first - are you using a SAS Server? It's possible they've limited resources as this isn't a data intensive SURVEYSELECT so any other solution I suspect will be less efficient though it's trivial to do manually. 

 

https://documentation.sas.com/doc/en/vdmmlcdc/1.0/lestmtsref/n0osyhi338pfaan1plin9ioilduk.htm

 

 

 

 

Quentin
Super User

Agree with @Reeza .  "Running all day" is surprising for this amount of data. 

 

Are you using local PC SAS, with data stored on a network?

 

I would try something like:

 

data want;
  set mylib.bigdata(obs=10000); *increase obs and see how it slows down;
run;

And keep increasing the value for obs unless/until it becomes painful.  If you can pull the whole 7M record down to your work library in a reasonable amount of time, maybe do that and run SURVEYSELECT against the local dataset.  

 

But this seems suspicious.  You might want to have a chat with your local support about network speeds.  If you're at home on WiFi, try plugging into the router.

 

The Boston Area SAS Users Group is hosting free webinars!
Next webinar will be in January 2025. Until then, check out our archives: https://www.basug.org/videos. And be sure to subscribe to our our email list.
andreas_lds
Jade | Level 19

Your proc surveyselect will select 1000 obs per year. I am not a stats-guy, so i can't suggest a fix for the code.

Maybe posting proc contents of the dataset is helpful, so that we actually know how big "bigdataset" is.

Ksharp
Super User

I would doubt that PROC SUVERYSELECT running so slowly ,

Maybe it is due to you have hundreds of variables in big table .

 

 

data have;
call streaminit(123456789);
do year=1 to 55;
  do i=1 to 1E6;
      var=rand('uniform');
      output;
  end;
end;
drop i ;
run;







data temp;
 set have;
 id+1;
run;
proc surveyselect noprint data=temp seed=123 out=key(keep=id) sampsize=1000;
strata year;
run;
data want;
 if _n_=1 then do;
  if 0 then set key;
  declare hash h(dataset:'key',hashexp:20);
  h.definekey('id');
  h.definedone();
 end;
 set temp;
 if h.check()=0;
 drop id;
 run;
Ksharp
Super User
Or @Rick_SAS maybe have a good idea.
Tom
Super User Tom
Super User

Doesn't sound like it should be taking that long.  But if the size of the dataset (mainly number of variables) is an issue then perhaps run the SURVEYSELECT on a smaller dataset.

data rows;
  set bigdatset(keep=year);
  point=_n_;
run;
proc surveyselect data=rows out=sample_rows sampsize=1000;
  strata year;
run;
data sampledataset ;
  set sample_rows(keep=point);
  set bigdatset point=point;
run;
Ksharp
Super User
Tom,
This code is awesome !

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 9 replies
  • 1613 views
  • 8 likes
  • 6 in conversation