BookmarkSubscribeRSS Feed
smg3141
Obsidian | Level 7

I'm working on a dataset that has over 7 million records and a few hundred variables. I want a small sample, around 1,000 will do, to mess around with graphs and tables and things so every experiment doesn't have to be run on the large data set. Random records are fine, but I do need a few from each year I have data for (55 years). Here's what I have, but it's been running all day and I'm wondering if there is a more efficient way to do this, or if this will even work. 

 

proc surveyselect data=bigdataset out=sampledataset sampsize=1000;

strata year;

run;

 

Thank you!

9 REPLIES 9
Reeza
Super User
What are your computer specs? I would expect that to run in under 15 minutes typically on a standard desktop with 8GB of RAM.
Is the data sorted by year ahead of time?

Are you familiar with the option obs= option in SAS to help you test programs? Though I suspect you need a random sample to be able to test it all since it has multiple years.

smg3141
Obsidian | Level 7
Yes, the data is sorted by year. My computer is good, 16GB of RAM. Everything is being done over our network though, so maybe that's the reason it's so slow.
Reeza
Super User

Try to load the file into RAM first using SASFILE if you have enough then to make it super fast. But check with your administrator first - are you using a SAS Server? It's possible they've limited resources as this isn't a data intensive SURVEYSELECT so any other solution I suspect will be less efficient though it's trivial to do manually. 

 

https://documentation.sas.com/doc/en/vdmmlcdc/1.0/lestmtsref/n0osyhi338pfaan1plin9ioilduk.htm

 

 

 

 

Quentin
Super User

Agree with @Reeza .  "Running all day" is surprising for this amount of data. 

 

Are you using local PC SAS, with data stored on a network?

 

I would try something like:

 

data want;
  set mylib.bigdata(obs=10000); *increase obs and see how it slows down;
run;

And keep increasing the value for obs unless/until it becomes painful.  If you can pull the whole 7M record down to your work library in a reasonable amount of time, maybe do that and run SURVEYSELECT against the local dataset.  

 

But this seems suspicious.  You might want to have a chat with your local support about network speeds.  If you're at home on WiFi, try plugging into the router.

 

BASUG is hosting free webinars ! Check out our recordings of past webinars: https://www.basug.org/videos. Be sure to subscribe to our email list for notification of future BASUG events.
andreas_lds
Jade | Level 19

Your proc surveyselect will select 1000 obs per year. I am not a stats-guy, so i can't suggest a fix for the code.

Maybe posting proc contents of the dataset is helpful, so that we actually know how big "bigdataset" is.

Ksharp
Super User

I would doubt that PROC SUVERYSELECT running so slowly ,

Maybe it is due to you have hundreds of variables in big table .

 

 

data have;
call streaminit(123456789);
do year=1 to 55;
  do i=1 to 1E6;
      var=rand('uniform');
      output;
  end;
end;
drop i ;
run;







data temp;
 set have;
 id+1;
run;
proc surveyselect noprint data=temp seed=123 out=key(keep=id) sampsize=1000;
strata year;
run;
data want;
 if _n_=1 then do;
  if 0 then set key;
  declare hash h(dataset:'key',hashexp:20);
  h.definekey('id');
  h.definedone();
 end;
 set temp;
 if h.check()=0;
 drop id;
 run;
Ksharp
Super User
Or @Rick_SAS maybe have a good idea.
Tom
Super User Tom
Super User

Doesn't sound like it should be taking that long.  But if the size of the dataset (mainly number of variables) is an issue then perhaps run the SURVEYSELECT on a smaller dataset.

data rows;
  set bigdatset(keep=year);
  point=_n_;
run;
proc surveyselect data=rows out=sample_rows sampsize=1000;
  strata year;
run;
data sampledataset ;
  set sample_rows(keep=point);
  set bigdatset point=point;
run;
Ksharp
Super User
Tom,
This code is awesome !

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 1272 views
  • 8 likes
  • 6 in conversation