Solved: Re: Simplifying PROC FREQ output

rlafond · Posted 02-14-2020 08:24 AM

I have a table designed like this:

ID HCC_1 HCC_2.... HCC_254

313 0 1 1

It has 58k rows and 129 columns. One column is the ID field (the key) and the other 128 columns are binary (1 or 0) indicators.

Essentially I'm trying to figure out the frequency that any two indicators are both 1 for all the records vs. one of them being 0.

If I use the following code my SAS freezes:

PROC FREQ DATA=WORK.HCC_COMBORID_STEP3;
TABLES (HCC_1 -- HCC_254) * (HCC_1 -- HCC_254);
RUN;

I can make this project workable by just doing this:

PROC FREQ DATA=WORK.HCC_COMBORID_STEP3;
TABLES (HCC_1) * (HCC_1 -- HCC_254);
where hcc_1 = 1;
RUN;

But that would mean doing that 128 times and that seems wildly inefficient.

Any suggestions for how I can simplify this?

PaigeMiller · Posted 02-14-2020 08:55 AM

Your original statement said 254 variables. So this remains a point of confusion.

Nevertheless, to find "frequently paired indicators" I would think you are going to want to create some large output data set which you can then narrow down to the "frequently paired indicators", whatever that means.

/* UNTESTED CODE */

ods select none;
ods output crosstabfreqs=freqs;
proc freq data=have;
    tables (hcc_1-hcc_128)*(hcc_1-hcc_128);
run;

Data set FREQS will be huge, and this may take a long time to run, but it should have all the information you need.

--
Paige Miller

View solution in original post

PaigeMiller · Posted 02-14-2020 08:36 AM

Essentially I'm trying to figure out the frequency that any two indicators are both 1 for all the records vs. one of them being 0.

Any two?? Meaning you need examine all 254*253/2 pairs? Or do you just need to know that an observation has 2 ones and 252 zeros? Can you explain this further?

--
Paige Miller

rlafond · Posted 02-14-2020 08:43 AM

Every record in my table has 128 binary indicators. My goal is to find patterns in the table by identifying frequently paired indicators (1,1), so I'm trying to evaluate all 129*128 combinations. Does that make sense?

PaigeMiller · Posted 02-14-2020 08:55 AM

Your original statement said 254 variables. So this remains a point of confusion.

Nevertheless, to find "frequently paired indicators" I would think you are going to want to create some large output data set which you can then narrow down to the "frequently paired indicators", whatever that means.

/* UNTESTED CODE */

ods select none;
ods output crosstabfreqs=freqs;
proc freq data=have;
    tables (hcc_1-hcc_128)*(hcc_1-hcc_128);
run;

Data set FREQS will be huge, and this may take a long time to run, but it should have all the information you need.

--
Paige Miller

rlafond · Posted 02-14-2020 09:00 AM

Thank you, I'm trying it now. I apologize for the confusion about the column naming convention and how I presented it. The names are deliberate but it was misleading.

ballardw · Posted 02-14-2020 10:27 AM

You don't give any indication of how sparse your data may be populated with multiple 1 values in the data.

So the first thing I might try would be to see if reducing the data to records with 2 or more ones may help. Easily done:

data temp;
   set have;
   array b  hcc_1-hcc_128;
   totalones = sum(of b(*));
   if totalones ge 2;
run;

If this data set has significantly reduced the number of records that might help.

Another approach might be to look at clustering procedures to find similar clusters of values of your binary variables.

Something like:

proc fastclus data=have;
   var hcc_1 - hcc_128;
run;

Without a MAXCLUSTER option on the proc statement the procedure will try to find 100 clusters of similar combinations of the variables. You might want to consider using the reduced data set I suggested as the data source.

Astounding · Posted 02-15-2020 04:07 PM

Suggestion: switch from PROC FREQ to PROC CORR. It has a better chance of finishing, and probably gives you a better way of searching for meaningful patterns. Here's an example of why that would be. Perhaps some of your variables might look like this:

0 1

HCC_1 25 25

HCC_2 25 25

HCC_3 78 1

HCC_4 1 20

So PROC FREQ would reveal that HCC_1 and HCC_2 have a 25% chance of matching "1" values. Yet their relationship is statistically random. It would reveal that HCC_3 and HCC_4 match on "1" only 20% of the time. Yet their relationship is compelling. PROC CORR would reveal this, while PROC FREQ would not.

PaigeMiller · Posted 02-15-2020 06:37 PM

@Astounding wrote:

Suggestion: switch from PROC FREQ to PROC CORR. It has a better chance of finishing, and probably gives you a better way of searching for meaningful patterns. Here's an example of why that would be. Perhaps some of your variables might look like this:

0 1

HCC_1 25 25

HCC_2 25 25

HCC_3 78 1

HCC_4 1 20

So PROC FREQ would reveal that HCC_1 and HCC_2 have a 25% chance of matching "1" values. Yet their relationship is statistically random. It would reveal that HCC_3 and HCC_4 match on "1" only 20% of the time. Yet their relationship is compelling. PROC CORR would reveal this, while PROC FREQ would not.

I originally thought this might be a good idea, but as I thought about it more, if the question is to find the prevalence of when BOTH of the two variables are 1, then correlations won't do it, because it will (in layman's terms) treat (0,0) as correlation and (1,1) as correlation, and (0,1) and (1,0) as "not correlation". Which isn't the same as the original question.

Then I thought about arrays and looping, which could then determine if both variables in the pair equal 1. But, there are some problems here because you need an output array that has variables named something like HCC_1xHCC_2, HCC_1xHCC_3 and so on until all possible pairs are created. That's a lot of typing, don't make any mistakes. But wait ... you could create a macro loop that creates a macro variable with all of these variable names, and then use the macro variable in an ARRAY statement. Well, that's still a lot of programming, and so I return to the idea above of PROC FREQ with ODS output and then parsing the resulting data set.

I think you could also get tricky with PROC GLMMOD and perhaps accomplish this. GLMMOD will create variables named for all the interactions of HCC_n and HCC_m (or maybe those go in an output data set as a label, I don't really remember). But I'm not sure it will then do the proper determination that both pairs of variables are 1. Experimentation might be required!!

--
Paige Miller

Tom · Posted 02-15-2020 07:13 PM

Are you just looking to see how often each pair of variables are both true for the same observation?

Perhaps you could just count it yourself.

I made a sample dataset with 100 observations of 5 boolean variables X1 to X5.

Here is code to count the number of times each pair was both true.

%let n=5 ;

data pairs;
  set have end=eof;
  array x [&n];
  array pairs [%sysfunc(comb(&n,2))] _temporary_;
  index=0;
  do lower=1 to &n-1;
   do upper=lower+1 to &n;
     index=index+1;
     if x[lower] and x[upper] then pairs[index]=sum(pairs[index],1);
   end;
  end;
  if eof then do;
    index=0;
    totaln=_n_;
    do lower=1 to &n-1;
      do upper=lower+1 to &n;
        index=index+1;
        count=pairs[index];
        percent=count/totaln;
        output;
      end;
    end;
  end;
  keep lower upper totaln count percent;
run;

Obs    lower    upper    totaln    count    percent

  1      1        2        100       25       0.25
  2      1        3        100       28       0.28
  3      1        4        100       20       0.20
  4      1        5        100       21       0.21
  5      2        3        100       22       0.22
  6      2        4        100       20       0.20
  7      2        5        100       19       0.19
  8      3        4        100       30       0.30
  9      3        5        100       24       0.24
 10      4        5        100       24       0.24

Tom · Posted 02-15-2020 07:27 PM

If that is what you want you can also ask PROC TRANSPOSE and PROC SQL to help you get it.

proc transpose data=have out=step1;
  by id;
  var x1-x5;
run;

proc sql ;
  create table pairs2 as 
  select a._name_ as lower
       , b._name_ as upper 
       , sum(a.col1*b.col1) as count
  from step1 a inner join step1 b
    on a.id=b.id and a._name_ < b._name_
  group by lower,upper
  ;
quit;

Obs    lower    upper    count

  1     x1       x2        25
  2     x1       x3        28
  3     x1       x4        20
  4     x1       x5        21
  5     x2       x3        22
  6     x2       x4        20
  7     x2       x5        19
  8     x3       x4        30
  9     x3       x5        24
 10     x4       x5        24

PaigeMiller · Posted 02-15-2020 08:33 PM

Brilliant!

--
Paige Miller

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away