bshoun228
Calcite | Level 5

I have a code base that synthesizes a zip code onto a claims dataset based on its county code.  The zip data comes from one dataset, and the claims dataset has the FIPS/county code.  I can provide samples of both.
We have been running this for several years now without major issues.  However, the time to run this job has grown to around 10-11 days, whereas in the past it was closer to 6.  I believe there have been changes to the environment, but unfortunately I cannot affect those changes, so I have to work around them.
This is likely a big ask, but I was wondering if there is a possibility of optimizing the code for a better production experience (shortened run time and QC).

 

I am attaching a zip file of the code, data, and format that I hope has all of the information.  I realize this is a big ask of anyone; I'm just kind of stuck trying to figure out if there is a better way.

9 REPLIES
ChrisNZ
Tourmaline | Level 20

Please provide a set of files that can be run as-is by just changing the path at the top of the program.

Do not expect us to run something that will take more than a few minutes. Someone may be able to, but help yourself by making it easier for us.

Also, provide the original log if needed, i.e. if the supplied data does not allow us to see where the bottlenecks (the steps that consume all the time) are.

Lastly, you have provided a 600-line, uncommented program.

You need to work on isolating the issue into a much smaller package.

bshoun228
Calcite | Level 5

Thank you for taking a look.  I realized this was a big ask of someone.  I didn't really know the best way to articulate what I was looking for without the entire package. 
I am attaching a log (which I hope makes sense) for one macro run for one facility.  I'm sure I'm overthinking the best way to show and communicate what I'm looking for.

ChrisNZ
Tourmaline | Level 20
I'm not able to look at this log as it stands: it's dirty, and you need to clean it up and remove the warnings and bad notes like:
NOTE: Numeric values have been converted to character
ballardw
Super User

@bshoun228 wrote:

Thank you for taking a look.  I realized this was a big ask of someone.  I didn't really know the best way to articulate what I was looking for without the entire package. 
I am attaching a log (which I hope makes sense) for one macro run for one facility.  I'm sure I'm overthinking the best way to show and communicate what I'm looking for.


How about a LOG with the actual CODE executing in it? You have multiple macros hiding everything, so set OPTIONS MPRINT; so that the log has the details of the generated code.

 

Your code also has %include files that were not in the ZIP file, so those should be provided. I realize your code is writing some of them, but there is no way we can run code to generate anything from this, so use the SOURCE2 option so the contents are included in the log. Considering most of those have "looper" in the name, I suspect they are contributing to the time.
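For reference, SOURCE2 can be set either as a system option or on the %INCLUDE statement itself; a minimal sketch (the file path here is hypothetical):

/* Echo macro-generated code and %include contents in the log */
options mprint source2;

/* SOURCE2 can also be set per-include (hypothetical path) */
%include "/project/code/looper_010001.sas" / source2;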

 

I am not actually seeing where SURVEYSELECT is likely to be the part that needs optimizing UNLESS you are looping over "providers" or "fips" to create one sample at a time. Perhaps you should be using STRATA provider, and then you could provide a SAMPSIZE= data set with the sample sizes for all of the providers and select them all in one pass. This assumes you are providing different sample sizes for different providers. A SAMPSIZE data set would have one observation for each provider (i.e., each level of the strata variable) in the data, and a variable named either _NSIZE_ or SAMPLESIZE with the desired sample size. The data and SAMPSIZE data sets would both be sorted by the strata variable.

 

A quick example using a data set you should have so you can actually run this code.

/* Desired sample size per stratum; the variable must be _NSIZE_ or SampleSize */
data work.ssize;
   input sex $1. _nsize_;
datalines;
F 4
M 0
;

/* The input data must be sorted by the strata variable */
proc sort data=sashelp.class
   out=work.class;
   by sex;
run;

/* One pass selects all strata using the per-stratum sizes */
proc surveyselect data=work.class sampsize=work.ssize
   out=work.selected;
   strata sex;
run;

Work.ssize is the data set used with the SAMPSIZE= option. I make it in sorted order by SEX (the desired strata variable). Note that the name and type of the variable need to match the strata variable in the DATA= set used in SURVEYSELECT.

I used an example with 0 to select for Sex=M just to show that it is possible.

 

If you call a procedure repeatedly with a largish input set, loading the same input data set multiple times takes time.

If you are splitting a large set into multiple smaller sets to select from one at a time, you are adding lots of time and complexity in general (I note you have a PROC APPEND apparently sticking results back together, which is another level of complexity).

bshoun228
Calcite | Level 5

I am attaching a cleaner log, with the options set so the code is printed in the log, along with the input file used for the macro "Doit".  Please accept my apologies for not including all of the relevant information.  My past questions and research in the community were much simpler in nature.

 

There are two main datasets used in this process.  One is a claims file (ip_clm_23q3.csv) that has the FIPS code on each record.  The other is af22.csv, which provides the provider (matching the first file), a zip code, and the count of records that came from that zip code.  Since we have a FIPS-to-zip crosswalk (zip23fips_fmt), we can 'synthesize' a zip onto the claim record based on that FIPS.  We use the distribution of the records by zip to assign a zip to each claim record, so if you ran a frequency on the area file and on the claims file, you should get the same counts overall.
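To make the process concrete, here is a minimal sketch of that kind of distribution-based assignment done in a single pass rather than per facility. The dataset and variable names (zipdist, claims, provider, fips, zip, count) are hypothetical, and it assumes a prepared zip-distribution table whose counts per provider/FIPS group match the claim counts exactly, as described above:

/* Expand the zip distribution to one row per claim to be assigned,
   with a random sort key for shuffling within each group */
data zip_rows;
   set zipdist;                 /* hypothetical: provider, fips, zip, count */
   do _i = 1 to count;
      _rand = rand('uniform');
      output;
   end;
   drop _i count;
run;

proc sort data=zip_rows; by provider fips _rand; run;

/* Shuffle the claims within each provider/FIPS group the same way */
data claims_r;
   set claims;                  /* hypothetical claims file carrying fips */
   _rand = rand('uniform');
run;

proc sort data=claims_r; by provider fips _rand; run;

/* One-to-one positional pairing within BY groups assigns each claim a zip */
data claims_with_zip;
   merge claims_r(drop=_rand) zip_rows(drop=_rand);
   by provider fips;
run;

Because the counts match, the positional match-merge pairs each claim with exactly one zip, so a frequency of the output reproduces the area-file distribution.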

I do have both tables indexed on the relevant variables (provider, fips, zip, etc.), but it sounds like sorting by those would be more efficient.

 

The routine begins with provider 010001 and the total count of records.  From that point we grab all of the FIPS codes associated with those records.  In my exported log, I chose a facility with 100 records as opposed to one with thousands.

 

ballardw
Super User

Minor but indicative.

 

There are multiple steps involving creating data sets Cntr2, Cntr3, etc. like this that appear only to be intended to create a macro variable:

data cntr4;
length cntr2 $3;
cntr2=4-1;               /* numeric result assigned to a character variable:
                            this is what triggers the conversion NOTE */
cntr2=compress(cntr2);
run;

proc sql noprint;
select cntr2 into :cntr2 from cntr4;   /* whole step exists only to set &cntr2 */
quit;

 

If the 4 comes from a macro variable that only ever holds integer values, then this one line could replace both the data step and the PROC SQL call. I know they aren't taking much time, but proliferating data steps and procedure calls do add up. I couldn't see where any of these CntrX data sets were used for anything except creating the same-named macro variable.

%let cntr2 = %eval(&macrovariable. -1);
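As a quick runnable check of that replacement (the macro variable name is hypothetical):

%let macrovariable = 4;
%let cntr2 = %eval(&macrovariable. - 1);
%put &=cntr2;   /* writes CNTR2=3 to the log */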

 

What I am seeing is apparently a bunch of data sets named in part with provider, data source or time frame, and FIPS code values. My feeling, especially because of the LOOP over provider and FIPS if not quite the data source, is that BY-group processing still seems more likely to reduce run time.

ballardw
Super User

A brief reading of your code shows about 20 data steps and one call to PROC SURVEYSELECT, so the topic likely should not be "optimize surveyselect".

 

Your code also does not include:

1) the definition of the apparently multiple formats starting with &ZIP that are used in $zip&popyr.fips

2) the definition of a macro used in multiple places, %dsobs

 

I would be examining the LOG to see which steps are taking the most time and see if those can be addressed.

 

I am wondering if your data is actually so volatile, with zip codes and generated FIPS code values, that those steps need to be rerun every time. Perhaps you should segregate that part and only recreate it when something in your data indicates it needs to be updated. You don't mention the size of the files, but I think there is a lot of redundant and likely inefficient code just in those steps.

FreelanceReinh
Jade | Level 19

@ballardw wrote:

If you are splitting a large set into multiple smaller sets to select one at a time you are adding lots of time and complexity in general ...

Indeed. Your sample log file mentions 694 observation counts being processed across the various steps. These numbers range from 0 to 120, with an average of about 10. So it really seems like you're working with large numbers of tiny datasets. But most of the time (in general) it is much more efficient to work with larger datasets, possibly comprising the information of many of those small datasets in the form of BY groups. DATA steps and most procedures (such as PROC FREQ, which you are using) support BY-group processing. A single DATA or PROC step using a BY statement (e.g., by fips;) can replace many similar steps that each process one BY value at a time (indicated in your log by WHERE statements of the form where fips='xxxxx' with a lot of individual values xxxxx). In PROC SURVEYSELECT the STRATA statement plays the role of a BY statement, as ballardw has already pointed out.
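As a concrete sketch of the difference (dataset and variable names here are hypothetical), the many per-value steps collapse into one sort plus one PROC FREQ:

/* The per-value pattern: one step per FIPS code */
/* proc freq data=claims(where=(fips='01001')); tables zip; run; */
/* proc freq data=claims(where=(fips='01003')); tables zip; run; */
/* ... hundreds more ...                                          */

/* The BY-group equivalent: one sort, one PROC FREQ */
proc sort data=claims out=claims_sorted;
   by fips;
run;

proc freq data=claims_sorted;
   by fips;            /* one table per FIPS value, in a single pass */
   tables zip;
run;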

Patrick
Opal | Level 21

Also, an elapsed time of 6 days is already excessive.

Browsing through your code, my first guess is that you're reading/writing from/to network drives with some really low throughput.

 

When it comes to performance investigation and tuning, the first thing you need to know is which steps consume the most time.

Please set the options below at the beginning of your program and then share the full SAS log with us.

options source2 mprint mprintnest fullstimer;

 

Update: Looking into the log you shared, all the steps execute within a fraction of a second, so it's likely, as others have already mentioned, that you're calling these steps within a macro many, many times. As others have already stated, changing the approach/design to process all the data in one go (BY-group processing) will very likely improve performance massively.

 

The log you've shared also uses PROC PRINTTO with a log line number of 427333. There is no such PRINTTO in the SAS code you shared, so there is certainly something missing that you haven't told us about yet. This is also supported by the fact that the sum of the real times of the steps in the log is nowhere near the total elapsed time of your program (219:37:24.15).

NOTE: PROCEDURE PRINTTO used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
      

427333     proc printto; run;

 

 
