BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Bond007
Obsidian | Level 7

Hi All,

 

I have found the correlation between numeric variables in a dataset. Here is the code I used to do the same. (output n*n matrix format)

proc template;
      edit Base.Corr.StackedMatrix;
         column (RowName RowLabel) (Matrix) * (Matrix2);
         edit matrix;
            cellstyle _val_  = -1.00 as {backgroundcolor=CXEEEEEE},
                      _val_ <= -0.75 as {backgroundcolor=red},
                      _val_ <= -0.50 as {backgroundcolor=blue},
                      _val_ <= -0.25 as {backgroundcolor=cyan},
                      _val_ <=  0.25 as {backgroundcolor=white},
                      _val_ <=  0.50 as {backgroundcolor=cyan},
                      _val_ <=  0.75 as {backgroundcolor=blue},
                      _val_ <   1.00 as {backgroundcolor=red},
                      _val_  =  1.00 as {backgroundcolor=CXEEEEEE};
            end;
         end;
      run;
   
   ods html body='corr.html' style=statistical;
   ods listing close;
   proc corr data=cars noprob;
      ods select PearsonCorr;
   run;
   ods listing;
   ods html close;
   
   proc template;
   delete Base.Corr.StackedMatrix;
run;

Output: xyz.PNG

I want to compare the character variables using chi square and get a score, similar to the correlation process done for numeric variables. (Required output format n*n: Same as above)

Do I have to edit the proc freq template to get the output I want? How do I achieve the n*n output?

Please help.

 

Thanks,

Bond

1 ACCEPTED SOLUTION

Accepted Solutions
Rick_SAS
SAS Super FREQ

Assuming that n is the number of variables, there are only n*(n-1)/2 elements, since the matrix you ask for is symmetric and the diagonal elements are uninteresting. Here's how to get the numbers you want. If you really want the info displayed as an n*n matrix, you restructure the statistics into a matrix yourself.

 

The basic idea is to enumerate the pairwise combinations, use ODS OUTPUT to save the statistics to a SAS data set,

and then display the statistics in the form you want. Here is the way you could create the chi-square statistics for the 6 pairwise tests for the four variables BP_Status, Chol_Status, Smoking_Status, and Weight_Status in the Sashelp.Heart data:

 

ods exclude all;
ods output ChiSq(persist)=chisq(where=(statistic="Chi-Square")
           rename=(Value=ChiSq));

proc freq data=sashelp.heart;
   tables BP_Status * Chol_Status / nocol norow chisq;
   tables BP_Status * Smoking_Status / nocol norow chisq;
   tables BP_Status * Weight_Status/ nocol norow chisq;
   tables Chol_Status * Smoking_Status / nocol norow chisq;
   tables Chol_Status * Weight_Status/ nocol norow chisq;
   tables Smoking_Status * Weight_Status/ nocol norow chisq;
run;

ods exclude none;

proc print data=ChiSq;
   var Table Statistic DF ChiSq Prob;
run;

 

To visualize the data, you might consider a bar chart of the pairwise associations. You can make a similar bar chart for correlations.

View solution in original post

10 REPLIES 10
Ksharp
Super User

Recently @Rick_SAS  has written a blog about building a recommendation system . which used COSINE to test the correlation of two statement. 

I think it is very suited for your question .

Bond007
Obsidian | Level 7

I have to use chi square to calculate the relationship between the character variables in a dataset, considering the correlation can be used for numeric variables only.

Rick_SAS
SAS Super FREQ

I am not aware of a correlation-type measurement for character variables. You might be thinking about testing for the assoication between levels of two categorical variables by using PROC FREQ. For example, if you want to know whether there is an association between weights of patients (measured as (Underweight, Normal, or Overweight) and cholesterol levels (measured as Desirable, Borderline, or High), you can run the following PROC FREQ step and use the CHISQ option to test for association. 

 

proc freq data=sashelp.heart;
   tables Chol_status*Weight_status / nocol norow chisq;
run;

You can use multiple TABLES statements if you want to test several pairs of character variables.

 

By the way, PROC FREQ can also create graphs of your two-way tables.

Bond007
Obsidian | Level 7

Using the code you mentioned, I will get a similar output of two tables.

abc.PNG

All I need is the Chi-Square value when two variables are compared and not other details. 

If possible, I would prefer the output to be in a format of n*n matrix when the chi-square code is subjected to all the character variables. The output format should be similar to the one that I have mentioned in the question (which I got for correlation of numeric values).

Rick_SAS
SAS Super FREQ

Assuming that n is the number of variables, there are only n*(n-1)/2 elements, since the matrix you ask for is symmetric and the diagonal elements are uninteresting. Here's how to get the numbers you want. If you really want the info displayed as an n*n matrix, you restructure the statistics into a matrix yourself.

 

The basic idea is to enumerate the pairwise combinations, use ODS OUTPUT to save the statistics to a SAS data set,

and then display the statistics in the form you want. Here is the way you could create the chi-square statistics for the 6 pairwise tests for the four variables BP_Status, Chol_Status, Smoking_Status, and Weight_Status in the Sashelp.Heart data:

 

ods exclude all;
ods output ChiSq(persist)=chisq(where=(statistic="Chi-Square")
           rename=(Value=ChiSq));

proc freq data=sashelp.heart;
   tables BP_Status * Chol_Status / nocol norow chisq;
   tables BP_Status * Smoking_Status / nocol norow chisq;
   tables BP_Status * Weight_Status/ nocol norow chisq;
   tables Chol_Status * Smoking_Status / nocol norow chisq;
   tables Chol_Status * Weight_Status/ nocol norow chisq;
   tables Smoking_Status * Weight_Status/ nocol norow chisq;
run;

ods exclude none;

proc print data=ChiSq;
   var Table Statistic DF ChiSq Prob;
run;

 

To visualize the data, you might consider a bar chart of the pairwise associations. You can make a similar bar chart for correlations.

Bond007
Obsidian | Level 7
proc freq data=sashelp.heart;
   tables BP_Status * Chol_Status / nocol norow chisq;
   tables BP_Status * Smoking_Status / nocol norow chisq;
   tables BP_Status * Weight_Status/ nocol norow chisq;
   tables Chol_Status * Smoking_Status / nocol norow chisq;
   tables Chol_Status * Weight_Status/ nocol norow chisq;
   tables Smoking_Status * Weight_Status/ nocol norow chisq;
run;

Is there any way in SAS where the tables statement could be called for entire dataset to do the set of table commands mentioned above internally with all variable combinations?  For example: If the dataset has 50-100 character variables, it is difficult to write so many table statements for each combination.

ballardw
Super User

@Bond007 wrote:
proc freq data=sashelp.heart;
   tables BP_Status * Chol_Status / nocol norow chisq;
   tables BP_Status * Smoking_Status / nocol norow chisq;
   tables BP_Status * Weight_Status/ nocol norow chisq;
   tables Chol_Status * Smoking_Status / nocol norow chisq;
   tables Chol_Status * Weight_Status/ nocol norow chisq;
   tables Smoking_Status * Weight_Status/ nocol norow chisq;
run;

Is there any way in SAS where the tables statement could be called for entire dataset to do the set of table commands mentioned above internally with all variable combinations?  For example: If the dataset has 50-100 character variables, it is difficult to write so many table statements for each combination.


Entire dataset might be a tad excessive.

You could use something like:

 

tables _character_ * _character_ /options.

Caution: Run that on a small data set with a just a few variables. 50 variables would mean 50*50 = 2500 tables with at least 50 pretty meaningless for the variables when compared with them selves Plus almost half will duplicate results: var1*var2 will have the same chisq as var2*var1. Note that to use the ODS select you have to have the output go to a destination and 2500 tables may fill up the results window. YMMV.

 

So you could use a tables statement like

tables ( <list of variable names here>) *( <list of other variables here>) / options.

to compare everyone of the first list with everyone of the second list.

You can use the lists like A -- D (that is two dashes) to indicate variables that are in sequential columns in the data set or BP: (a colon immediately after the starting characters of a variable name) to use all the variable names that start with the letters BP.

 

I would suggest looking at which variables actually make sense to test in a chisq (Hint: addresses, identification variables, phone numbers and such are almost certainly not worth pursuing. Geographic groups like city, zipcode, state might). Break them into two lists and if there are a few specific comparisons that would not get done with the two lists then add them:

 

tables (<list1>)*(<list2>)   var1*var2 var1*var7 /<options>;

for instance.

 

 

 

ballardw
Super User

@Bond007 wrote:

Using the code you mentioned, I will get a similar output of two tables.

 

All I need is the Chi-Square value when two variables are compared and not other details. 

If possible, I would prefer the output to be in a format of n*n matrix when the chi-square code is subjected to all the character variables. The output format should be similar to the one that I have mentioned in the question (which I got for correlation of numeric values).

 


 One way to get an output data set of just the basic chi-square statistics:

ods select chisq;
proc freq data=sashelp.heart ;
   ods output chisq=work.chisq;
   tables (Chol_status smoking_status)*Weight_status / nocol norow chisq;
run;

You would have to parse the data set a bit to select the statistic(s) you want and separate the variable names from the TABLE variable.

Something like:

data work.report;
   set work.chisq;
   where statistic='Chi-Square';
   length var1 var2 $ 32;
   var1 = scan(table,2,' * ');
   var2 = scan(table,3,' * ');
run;

proc report data=work.report;
   column var2 var1,prob;
   define var2 /group;
   define var1/across;
   define prob/ ' '  ;
run;

 

Note that chi-square statistic values without additional information tend to be a tad difficult to interpret. So I use the p-value above in the example.

Rick_SAS
SAS Super FREQ

> is there any way in SAS where the tables statement could be called for entire dataset to do the set of table

> commands mentioned above internally with all variable combinations?  

 

No, not automatically. You would have to generate those statements yourself by using either MACRO or by using CALL EXECUTE to generate the code in the DATA step.

 

The reason this is not "automatic" is that I've never seen anyone want to perform this kind of analysis. It is not clear to me that it is a useful computation. Chi-square tests for association are quite different from correlations. For large data, almost all the tests will reject the hypothesis of association. You will also encounter computational issues when you encounter a character variable such as Name, ID, address, ... that has many levels. PROC FREQ will not perform well when you ask for a frequency table of Name*Address. because the  table will be huge and will require a LOT of memory,

 

You can google for how to generate all pairwise combinations of variables in SAS, but I encourage you to rethink what you are trying to accomplish and whether your efforts are likely to give you any meaningful insights into your data.

Bond007
Obsidian | Level 7

Thanks for the help. Much appreciated.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 10 replies
  • 3311 views
  • 3 likes
  • 4 in conversation