topic Re: array to big in SAS Programming

array to big

Batman — Sat, 11 Apr 2020 22:46:42 GMT

Hello, I'm trying to do some distance analysis. Basically, I want to determine the share of patients who are with 30 miles of certain medical providers. I grouped patients by zip code and I have a separate list of zip codes where 1 or more of the providers has a practice. For each patient zip code, loop through an array of provider zips and measure the distance for each provider zip code using the ZIPCITYDISTANCE function. I then would take the minimum value of all the distances and that would be distance value for that patient zip code.

I ran into trouble building the array of provider zip codes because I bumped into the size limit for array values. Can anyone suggest work around? I'm wondering if a hash table might do the same thing without the limit, but am not sure of the syntax.

200 PROC SQL NOPRINT;
201 /*CREATE AN ARRY OF PROVIDER ZIP CODES*/
202 SELECT QUOTE(PROV_ZIP,"'") INTO :ZIPLIST SEPARATED BY ' '
203 FROM VALID_PROVZ
204 WHERE VALID=1;
ERROR: THE LENGTH OF THE VALUE OF THE MACRO VARIABLE ZIPLIST (65540) EXCEEDS THE MAXIMUM LENGTH (65534). THE VALUE HAS BEEN
TRUNCATED TO 65534 CHARACTERS.
205 %LET ZIPCNT=&SQLOBS;
206 %PUT &ZIPCNT;
8194
207
NOTE: THE SAS SYSTEM STOPPED PROCESSING THIS STEP BECAUSE OF ERRORS.
NOTE: THE PROCEDURE SQL USED 0.05 CPU SECONDS AND 24308K.

NOTE: THE ADDRESS SPACE HAS USED A MAXIMUM OF 1940K BELOW THE LINE AND 650284K ABOVE THE LINE.

208 DATA CALC_DIST(KEEP=ZIP_CD VALID MINDIST);
209 SET VALID_BENEZ;
210 IF ZIP_CD NE '';

211 ARRAY PROVZIPS{&ZIPCNT} $ (&ZIPLIST); /*PROVIDER ZIP CODE ARRAY*/
212 ARRAY PROVDIST{&ZIPCNT}; /*DISTANCE ARRAY*/
213 IF VALID=1 THEN DO I=1 TO &ZIPCNT;
214 IF PROVZIPS(I) NE '' THEN PROVDIST(I)=
NOTE: THE QUOTED STRING CURRENTLY BEING PROCESSED HAS BECOME MORE THAN 262 CHARACTERS LONG. YOU MIGHT HAVE UNBALANCED QUOTATION
MARKS.
215 ZIPCITYDISTANCE(ZIP_CD,PROVZIPS(I));
216 END;
217 MINDIST=MIN(OF PROVDIST(*));
218 RUN;

Re: array to big

Tom — Sat, 11 Apr 2020 23:18:04 GMT

Why do you need an array?

You have one dataset with patient ZIP codes and a second dataset with provider ZIP code.

Just join them.

proc sql ;
create table patzip as select ZIP_CD,count(distinct PATIENT) as n_patients from VALID_BENEZ ;
create table provzip as select distinct PROV_ZIP from VALID_PROVZ
create table want as
  select ZIP_CD  as pat_zip
       , PROV_ZIP 
       , n_patients
       , ZIPCITYDISTANCE(ZIP_CD,PROVZIPS) as distance
  from patzip, provzip  
  where calculated distance < 30
;
quit;

If you want to summarize by provider zip code you can add an additional step.

create table summary as 
  select prov_zip,sum(n_patients) as n_patients_in_30miles
  from want;
  group by prov_zip
;

You could then link this back to individual providers by their zip code.

Re: array to big

Kurt_Bremser — Sun, 12 Apr 2020 11:08:00 GMT

You do not have an array problem, you have a macro variable problem.

This happens because you abuse the macro processor for something that it's not meant for: processing data.

Use a join or a hash object.

Re: array to big

Batman — Sun, 12 Apr 2020 13:28:04 GMT

Thanks, that did work although it took a long time and I also got this message "NOTE: THE EXECUTION OF THIS QUERY INVOLVES PERFORMING ONE OR MORE CARTESIAN PRODUCT JOINS THAT CAN NOT BE OPTIMIZED."

Do you have any suggested hash code, I'm wondering it might be more efficient.

Re: array to big

Batman — Sun, 12 Apr 2020 13:38:37 GMT

Understood. Do you have any suggested hash code? I've mainly used hash tables as a lookup in the past and this is a little different.

Re: array to big

Tom — Sun, 12 Apr 2020 14:50:36 GMT

It is doing a cartesian product on purpose since you want to check the distance for each pair of zip code.

Make sure to only keep the valid provider zip codes.

To improve performance you probably will want keep only the zip codes in the provider subset. Then you can remerge with the list of actual providers based on provider zip code once you have the patient counts for that zip code to see the results by provider.

Not sure hash will improve performance a lot. You might be able to do it in one step with hash and that might improve performance a little by eliminating some of the IO. But you still need to compare every possible pair of zipcodes.