jamesf
Fluorite | Level 6

I am trying to run PROC SURVEYSELECT with METHOD=PPS on a large number of records (~10 million), and the task never finishes.

 

The code works when I limit the input to particular sub-samples, but on other sub-samples it runs without ever finishing, and I have to do a full SAS reset. I am unsure how to get any debugging information from the runs that never finish.

 

If I limit my input data to state 3 (200,000 units, 9,000 selections), the run takes 58 seconds. If I limit my input data to state 1 (2,500,000 units, 140,000 selections), the code runs for 3+ hours until I force-terminate it.

 

Any assistance as to how to start narrowing down the cause would be greatly appreciated. 

 

My data is of the form

Data set: nh_p_stra

STATE  SampleSize
1      2000
2      3000
3      4000

Data set: data_in

Unit_id  STATE  weight
0000001  1      1.2
0000002  1      2.1
0000003  1      0.9
0000004  2      0.8
0000005  2      1.3

 

My code is of the form

 

proc surveyselect method=pps data=data_in sampsize=nh_p_stra out=selection_out seed=123;

strata state;

id unit_id;

size weight;

run;

 

ballardw
Super User

I suspect that you are just terminating a tad too soon. Depending on how you terminated it, the output data set should have been populated. How far did it get?

 

Or run a loop that selects one state at a time, generating a separate output data set for each, then combine the results later.

Something like:

 

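data _null_;
   length str $ 200;
   do state = 1 to 50;
      /* Build one PROC statement per state. The WHERE= option goes on  */
      /* both data sets so the strata in data_in and nh_p_stra match.   */
      /* CATX inserts the blanks that CATT would strip before seed=.    */
      str = catx(' ',
         'proc surveyselect method=pps',
         cats('data=data_in(where=(state=', state, '))'),
         cats('sampsize=nh_p_stra(where=(state=', state, '))'),
         cats('out=selection', state),
         'seed=123;');
      call execute(str);
      call execute('strata state; id unit_id; size weight; run;');
   end;
run;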

I suspect the issue is just the size of your data set, as in the number of records and possibly the values of your weight variable. There are going to be a lot of calculations: for instance, the sum of weight per stratum as a start, then the proportion of that total each record represents. If you haven't read the documentation details on calculating PPS samples, you may not realize the scope of the computations involved in a data set this big.
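The per-stratum totals and proportions mentioned above can be inspected before launching a long run. This is a minimal sketch that assumes the data set and variable names from the original post (data_in, nh_p_stra, state, weight, SampleSize); the output table name stratum_check and its column names are made up for illustration.

```sas
/* Summarize each stratum: unit count, total size measure, and the   */
/* largest expected inclusion probability, n * max(weight) / total.  */
proc sql;
   create table stratum_check as
   select d.state,
          count(*)      as n_units,
          sum(d.weight) as total_weight,
          min(d.weight) as min_weight,
          max(d.weight) as max_weight,
          n.SampleSize,
          n.SampleSize * max(d.weight) / sum(d.weight) as max_incl_prob
   from data_in as d
        inner join nh_p_stra as n
        on d.state = n.state
   group by d.state, n.SampleSize
   order by d.state;
quit;

proc print data=stratum_check noobs;
run;
```

Strata where max_incl_prob comes out above 1, or where min_weight is zero, negative, or missing, are worth a closer look before sampling, since an inclusion probability cannot exceed 1.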

 

jamesf
Fluorite | Level 6

Thanks for the help. Yeah, it turned out I terminated it too soon. It seems the run time goes up non-linearly with the size of the strata. I was able to get the code dialed in on the small states, before doing an overnight run on the larger states.
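One way to see how run time grows with stratum size is to time single-state runs before committing to the large ones. The macro below is only a sketch: the name time_state and the output names selection_<state> are hypothetical, and it assumes state is numeric as in the sample data.

```sas
/* Time one state's PPS selection and write the elapsed seconds to the log */
%macro time_state(st);
   %local t0;
   %let t0 = %sysfunc(datetime());

   proc surveyselect method=pps
        data=data_in(where=(state=&st))
        sampsize=nh_p_stra(where=(state=&st))
        out=selection_&st seed=123 noprint;
      strata state;
      id unit_id;
      size weight;
   run;

   %put NOTE: state=&st elapsed=%sysevalf(%sysfunc(datetime()) - &t0) seconds;
%mend time_state;

/* Start with a small state, then move up in size */
%time_state(3)
```

Comparing the elapsed times for a few states of increasing size gives a rough basis for estimating how long the largest state will take.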

Watts
SAS Employee

One factor to check is system options and resources. The METHOD=PPS algorithm involves sorting, and sorting can become slow when system resources get low.
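As a starting point for checking resources, something like the following prints the relevant settings and turns on detailed step statistics. Which of these options the procedure's internal sort actually uses is an assumption here; the list is just the usual memory- and sort-related options.

```sas
/* Write detailed timing and memory figures for each step to the log */
options fullstimer;

/* Show the current values of memory- and sort-related system options */
proc options option=(memsize sortsize sumsize cpucount threads);
run;
```

With FULLSTIMER on, the log for a small-state run shows real time, CPU time, and memory used, which helps tell a slow-but-progressing job from one that is starved for memory.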

 

Also, consider whether a different selection method might be appropriate for the task. Some other types of PPS selection are generally faster, e.g., systematic, sequential, or with-replacement. Equal-probability methods are another option. (Doc is here.)
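For example, the original call could be switched to systematic PPS selection by changing only the METHOD= value. This is a sketch, not a recommendation: each method is a different sample design with its own properties, so check the documentation before substituting one for another. The output name selection_sys is made up.

```sas
/* Same design inputs as the original call, but systematic PPS selection. */
/* METHOD=PPS_SEQ (sequential) and METHOD=PPS_WR (with replacement) are   */
/* the other PPS variants mentioned above.                                */
proc surveyselect method=pps_sys
     data=data_in
     sampsize=nh_p_stra
     out=selection_sys
     seed=123;
   strata state;
   id unit_id;
   size weight;
run;
```

Systematic selection depends on the order of units within each stratum, so the sort order of data_in becomes part of the design.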

 

Another possibility is to make the strata smaller (by using additional stratification, or by simply subdividing them).

 

 
