I have a table designed like this:
ID HCC_1 HCC_2.... HCC_254
313 0 1 1
It has 58k rows and 129 columns. One column is the ID field (the key) and the other 128 columns are binary (1 or 0) indicators.
Essentially I'm trying to figure out the frequency that any two indicators are both 1 for all the records vs. one of them being 0.
If I use the following code my SAS freezes:
PROC FREQ DATA=WORK.HCC_COMBORID_STEP3;
TABLES (HCC_1 -- HCC_254) * (HCC_1 -- HCC_254);
RUN;
I can make this project workable by just doing this:
PROC FREQ DATA=WORK.HCC_COMBORID_STEP3;
TABLES (HCC_1) * (HCC_1 -- HCC_254);
where hcc_1 = 1;
RUN;
But that would mean doing that 128 times and that seems wildly inefficient.
Any suggestions for how I can simplify this?
Your original statement said 254 variables. So this remains a point of confusion.
Nevertheless, to find "frequently paired indicators" I would think you are going to want to create some large output data set which you can then narrow down to the "frequently paired indicators", whatever that means.
/* UNTESTED CODE */
ods select none;
ods output crosstabfreqs=freqs;
proc freq data=have;
tables (hcc_1-hcc_128)*(hcc_1-hcc_128);
run;
Data set FREQS will be huge, and this may take a long time to run, but it should have all the information you need.
Essentially I'm trying to figure out the frequency that any two indicators are both 1 for all the records vs. one of them being 0.
Any two?? Meaning you need examine all 254*253/2 pairs? Or do you just need to know that an observation has 2 ones and 252 zeros? Can you explain this further?
Every record in my table has 128 binary indicators. My goal is to find patterns in the table by identifying frequently paired indicators (1,1), so I'm trying to evaluate all 129*128 combinations. Does that make sense?
Your original statement said 254 variables. So this remains a point of confusion.
Nevertheless, to find "frequently paired indicators" I would think you are going to want to create some large output data set which you can then narrow down to the "frequently paired indicators", whatever that means.
/* UNTESTED CODE */
ods select none;
ods output crosstabfreqs=freqs;
proc freq data=have;
tables (hcc_1-hcc_128)*(hcc_1-hcc_128);
run;
Data set FREQS will be huge, and this may take a long time to run, but it should have all the information you need.
Thank you, I'm trying it now. I apologize for the confusion about the column naming convention and how I presented it. The names are deliberate but it was misleading.
You don't give any indication of how sparse your data may be populated with multiple 1 values in the data.
So the first thing I might try would be to see if reducing the data to records with 2 or more ones may help. Easily done:
data temp; set have; array b hcc_1-hcc_128; totalones = sum(of b(*)); if totalones ge 2; run;
If this data set has significantly reduced the number of records that might help.
Another approach might be to look at clustering procedures to find similar clusters of values of your binary variables.
Something like:
proc fastclus data=have; var hcc_1 - hcc_128; run;
Without a MAXCLUSTER option on the proc statement the procedure will try to find 100 clusters of similar combinations of the variables. You might want to consider using the reduced data set I suggested as the data source.
Suggestion: switch from PROC FREQ to PROC CORR. It has a better chance of finishing, and probably gives you a better way of searching for meaningful patterns. Here's an example of why that would be. Perhaps some of your variables might look like this:
0 1
HCC_1 25 25
HCC_2 25 25
HCC_3 78 1
HCC_4 1 20
So PROC FREQ would reveal that HCC_1 and HCC_2 have a 25% chance of matching "1" values. Yet their relationship is statistically random. It would reveal that HCC_3 and HCC_4 match on "1" only 20% of the time. Yet their relationship is compelling. PROC CORR would reveal this, while PROC FREQ would not.
@Astounding wrote:
Suggestion: switch from PROC FREQ to PROC CORR. It has a better chance of finishing, and probably gives you a better way of searching for meaningful patterns. Here's an example of why that would be. Perhaps some of your variables might look like this:
0 1
HCC_1 25 25
HCC_2 25 25
HCC_3 78 1
HCC_4 1 20
So PROC FREQ would reveal that HCC_1 and HCC_2 have a 25% chance of matching "1" values. Yet their relationship is statistically random. It would reveal that HCC_3 and HCC_4 match on "1" only 20% of the time. Yet their relationship is compelling. PROC CORR would reveal this, while PROC FREQ would not.
I originally thought this might be a good idea, but as I thought about it more, if the question is to find the prevalence of when BOTH of the two variables are 1, then correlations won't do it, because it will (in layman's terms) treat (0,0) as correlation and (1,1) as correlation, and (0,1) and (1,0) as "not correlation". Which isn't the same as the original question.
Then I thought about arrays and looping, which could then determine if both variables in the pair equal 1. But, there are some problems here because you need an output array that has variables named something like HCC_1xHCC_2, HCC_1xHCC_3 and so on until all possible pairs are created. That's a lot of typing, don't make any mistakes. But wait ... you could create a macro loop that creates a macro variable with all of these variable names, and then use the macro variable in an ARRAY statement. Well, that's still a lot of programming, and so I return to the idea above of PROC FREQ with ODS output and then parsing the resulting data set.
I think you could also get tricky with PROC GLMMOD and perhaps accomplish this. GLMMOD will create variables named for all the interactions of HCC_n and HCC_m (or maybe those go in an output data set as a label, I don't really remember). But I'm not sure it will then do the proper determination that both pairs of variables are 1. Experimentation might be required!!
Are you just looking to see how often each pair of variables are both true for the same observation?
Perhaps you could just count it yourself.
I made a sample dataset with 100 observations of 5 boolean variables X1 to X5.
Here is code to count the number of times each pair was both true.
%let n=5 ;
data pairs;
set have end=eof;
array x [&n];
array pairs [%sysfunc(comb(&n,2))] _temporary_;
index=0;
do lower=1 to &n-1;
do upper=lower+1 to &n;
index=index+1;
if x[lower] and x[upper] then pairs[index]=sum(pairs[index],1);
end;
end;
if eof then do;
index=0;
totaln=_n_;
do lower=1 to &n-1;
do upper=lower+1 to &n;
index=index+1;
count=pairs[index];
percent=count/totaln;
output;
end;
end;
end;
keep lower upper totaln count percent;
run;
Obs lower upper totaln count percent 1 1 2 100 25 0.25 2 1 3 100 28 0.28 3 1 4 100 20 0.20 4 1 5 100 21 0.21 5 2 3 100 22 0.22 6 2 4 100 20 0.20 7 2 5 100 19 0.19 8 3 4 100 30 0.30 9 3 5 100 24 0.24 10 4 5 100 24 0.24
If that is what you want you can also ask PROC TRANSPOSE and PROC SQL to help you get it.
proc transpose data=have out=step1;
by id;
var x1-x5;
run;
proc sql ;
create table pairs2 as
select a._name_ as lower
, b._name_ as upper
, sum(a.col1*b.col1) as count
from step1 a inner join step1 b
on a.id=b.id and a._name_ < b._name_
group by lower,upper
;
quit;
Obs lower upper count 1 x1 x2 25 2 x1 x3 28 3 x1 x4 20 4 x1 x5 21 5 x2 x3 22 6 x2 x4 20 7 x2 x5 19 8 x3 x4 30 9 x3 x5 24 10 x4 x5 24
Brilliant!
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.