BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
rlafond
Obsidian | Level 7

I have a table designed like this:

 

ID          HCC_1         HCC_2....      HCC_254

313         0                     1                    1 

 

It has 58k rows and 129 columns. One column is the ID field (the key) and the other 128 columns are binary (1 or 0) indicators.

 

Essentially I'm trying to figure out the frequency that any two indicators are both 1 for all the records vs. one of them being 0.

 

If I use the following code my SAS freezes:

 

PROC FREQ DATA=WORK.HCC_COMBORID_STEP3;
TABLES (HCC_1 -- HCC_254) * (HCC_1 -- HCC_254);
RUN;

I can make this project workable by just doing this:

 

PROC FREQ DATA=WORK.HCC_COMBORID_STEP3;
TABLES (HCC_1) * (HCC_1 -- HCC_254);
where hcc_1 = 1;
RUN;

But that would mean doing that 128 times and that seems wildly inefficient.

 

Any suggestions for how I can simplify this?

1 ACCEPTED SOLUTION

Accepted Solutions
PaigeMiller
Diamond | Level 26

Your original statement said 254 variables. So this remains a point of confusion.

 

Nevertheless, to find "frequently paired indicators" I would think you are going to want to create some large output data set which you can then narrow down to the "frequently paired indicators", whatever that means.

 

/* UNTESTED CODE */

ods select none;
ods output crosstabfreqs=freqs;
proc freq data=have;
    tables (hcc_1-hcc_128)*(hcc_1-hcc_128);
run;

Data set FREQS will be huge, and this may take a long time to run, but it should have all the information you need. 

--
Paige Miller

View solution in original post

10 REPLIES 10
PaigeMiller
Diamond | Level 26

Essentially I'm trying to figure out the frequency that any two indicators are both 1 for all the records vs. one of them being 0.

 

Any two?? Meaning you need examine all 254*253/2 pairs? Or do you just need to know that an observation has 2 ones and 252 zeros? Can you explain this further?

--
Paige Miller
rlafond
Obsidian | Level 7

Every record in my table has 128 binary indicators. My goal is to find patterns in the table by identifying frequently paired indicators (1,1), so I'm trying to evaluate all 129*128 combinations. Does that make sense?

PaigeMiller
Diamond | Level 26

Your original statement said 254 variables. So this remains a point of confusion.

 

Nevertheless, to find "frequently paired indicators" I would think you are going to want to create some large output data set which you can then narrow down to the "frequently paired indicators", whatever that means.

 

/* UNTESTED CODE */

ods select none;
ods output crosstabfreqs=freqs;
proc freq data=have;
    tables (hcc_1-hcc_128)*(hcc_1-hcc_128);
run;

Data set FREQS will be huge, and this may take a long time to run, but it should have all the information you need. 

--
Paige Miller
rlafond
Obsidian | Level 7

Thank you, I'm trying it now. I apologize for the confusion about the column naming convention and how I presented it. The names are deliberate but it was misleading.

ballardw
Super User

You don't give any indication of how sparse your data may be populated with multiple 1 values in the data.

So the first thing I might try would be to see if reducing the data to records with 2 or more ones may help. Easily done:

 

data temp;
   set have;
   array b  hcc_1-hcc_128;
   totalones = sum(of b(*));
   if totalones ge 2;
run;

If this data set has significantly reduced the number of records that might help.

 

Another approach might be to look at clustering procedures to find similar clusters of values of your binary variables.

Something like:

proc fastclus data=have;
   var hcc_1 - hcc_128;
run;

Without a MAXCLUSTER option on the proc statement the procedure will try to find 100 clusters of similar combinations of the variables. You might want to consider using the reduced data set I suggested as the data source.

Astounding
PROC Star

Suggestion:  switch from PROC FREQ to PROC CORR.  It has a better chance of finishing, and probably gives you a better way of searching for meaningful patterns.  Here's an example of why that would be.  Perhaps some of your variables might look like this:

 

 

       0     1

 

HCC_1  25   25

HCC_2  25   25

 

HCC_3  78    1

HCC_4   1   20

 

 

So PROC FREQ would reveal that HCC_1 and HCC_2 have a 25% chance of matching "1" values.  Yet their relationship is statistically random.  It would reveal that HCC_3 and HCC_4 match on "1" only 20% of the time.  Yet their relationship is compelling.  PROC CORR would reveal this, while PROC FREQ would not.

PaigeMiller
Diamond | Level 26

@Astounding wrote:

Suggestion:  switch from PROC FREQ to PROC CORR.  It has a better chance of finishing, and probably gives you a better way of searching for meaningful patterns.  Here's an example of why that would be.  Perhaps some of your variables might look like this:

 

 

       0     1

 

HCC_1  25   25

HCC_2  25   25

 

HCC_3  78    1

HCC_4   1   20

 

 

So PROC FREQ would reveal that HCC_1 and HCC_2 have a 25% chance of matching "1" values.  Yet their relationship is statistically random.  It would reveal that HCC_3 and HCC_4 match on "1" only 20% of the time.  Yet their relationship is compelling.  PROC CORR would reveal this, while PROC FREQ would not.


I originally thought this might be a good idea, but as I thought about it more, if the question is to find the prevalence of when BOTH of the two variables are 1, then correlations won't do it, because it will (in layman's terms) treat (0,0) as correlation and (1,1) as correlation, and (0,1) and (1,0) as "not correlation". Which isn't the same as the original question.

 

Then I thought about arrays and looping, which could then determine if both variables in the pair equal 1. But, there are some problems here because you need an output array that has variables named something like HCC_1xHCC_2, HCC_1xHCC_3 and so on until all possible pairs are created. That's a lot of typing, don't make any mistakes. But wait ... you could create a macro loop that creates a macro variable with all of these variable names, and then use the macro variable in an ARRAY statement. Well, that's still a lot of programming, and so I return to the idea above of PROC FREQ with ODS output and then parsing the resulting data set.

 

I think you could also get tricky with PROC GLMMOD and perhaps accomplish this. GLMMOD will create variables named for all the interactions of HCC_n and HCC_m (or maybe those go in an output data set as a label, I don't really remember). But I'm not sure it will then do the proper determination that both pairs of variables are 1. Experimentation might be required!!

--
Paige Miller
Tom
Super User Tom
Super User

Are you just looking to see how often each pair of variables are both true for the same observation?

Perhaps you could just count it yourself.

I made a sample dataset with 100 observations of 5 boolean variables X1 to X5.

Here is code to count the number of times each pair was both true.

%let n=5 ;

data pairs;
  set have end=eof;
  array x [&n];
  array pairs [%sysfunc(comb(&n,2))] _temporary_;
  index=0;
  do lower=1 to &n-1;
   do upper=lower+1 to &n;
     index=index+1;
     if x[lower] and x[upper] then pairs[index]=sum(pairs[index],1);
   end;
  end;
  if eof then do;
    index=0;
    totaln=_n_;
    do lower=1 to &n-1;
      do upper=lower+1 to &n;
        index=index+1;
        count=pairs[index];
        percent=count/totaln;
        output;
      end;
    end;
  end;
  keep lower upper totaln count percent;
run;
Obs    lower    upper    totaln    count    percent

  1      1        2        100       25       0.25
  2      1        3        100       28       0.28
  3      1        4        100       20       0.20
  4      1        5        100       21       0.21
  5      2        3        100       22       0.22
  6      2        4        100       20       0.20
  7      2        5        100       19       0.19
  8      3        4        100       30       0.30
  9      3        5        100       24       0.24
 10      4        5        100       24       0.24
Tom
Super User Tom
Super User

If that is what you want you can also ask PROC TRANSPOSE and PROC SQL to help you get it.

proc transpose data=have out=step1;
  by id;
  var x1-x5;
run;

proc sql ;
  create table pairs2 as 
  select a._name_ as lower
       , b._name_ as upper 
       , sum(a.col1*b.col1) as count
  from step1 a inner join step1 b
    on a.id=b.id and a._name_ < b._name_
  group by lower,upper
  ;
quit;
Obs    lower    upper    count

  1     x1       x2        25
  2     x1       x3        28
  3     x1       x4        20
  4     x1       x5        21
  5     x2       x3        22
  6     x2       x4        20
  7     x2       x5        19
  8     x3       x4        30
  9     x3       x5        24
 10     x4       x5        24
PaigeMiller
Diamond | Level 26

Brilliant!

 
--
Paige Miller

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 1279 views
  • 5 likes
  • 5 in conversation