Hello
I want to run a code that calculate Kramer correlation matrix between categorical variables.
I have a question please about improving the code.
My code check correlation between any pair of variables
In my example there are 4 variables X Y Z R so it calculate correlation between:
X,X X,Y X,Z X,R
Y,Y Y,X Y,Z Y,R
Z,Z Z,X Z,Y Z,R
R,R R,X R,Y R,Z
So it calculate 16 correlations
However, correlation of X,Y is same as correlation of Y,X (and so on)
How can we improve the code that it will calculate the correlation matrix quicker?
Another question-
I have a warning message -
WARNING: 31% of the cells have expected counts less than 5, for the table of X by _X
Chi-Square may not be a valid test.
What is the way to solve it?
/****categorical Variables list****/
%let list1var=X Y Z R;
%let list2var=&list1var.;
/**Number of varaibles***/
%let n =%sysfunc(countw(&list1var));
%put n=&n;
%macro CRAMV;
%do i = 1 %to &n.;
%let var1=%scan(&list1var.,&i.,+);
%do j = 1 %to &n.;
%let var2=%scan(&list2var.,&j.,+);
ods select none;
PROC FREQ DATA=Raw_Data_tbl;
TABLES &VAR1.* (&VAR2.)/CHISQ MISSING NOPERCENT NOCOL;
OUTPUT OUT=ttt CRAMV ;
RUN;
DATA want;
RETAIN VAR1 VAR2;
SET ttt;
LENGTH VAR1 VAR2 $32.;
VAR1 ="&VAR1";
VAR2 ="&VAR2";
RUN;
PROC APPEND DATA=want BASE=Kramer_All force;quit;
%end;
%end;
%mend;
%CRAMV;
/***Create matrix***/
title;
ods select all;
proc tabulate data=Kramer_All;
class VAR1 VAR2;
var _CRAMV_;
table (VAR1=''),VAR2=''*(_CRAMV_=''*min=''*f=6.5 );
run;
but it also calculate correlation between:
ods select none;
ods output ChiSq= ChiSq;
PROC FREQ DATA=sashelp.heart(obs=1000);
TABLES (sex status bp_status)*(sex status bp_status)/CHISQ MISSING NOPERCENT NOCOL;
RUN;
ods select all;
data ChiSq2;
set ChiSq(where=(Statistic="Cramer's V"));
row=scan(table,-2,' *');
col=scan(table,-1,' *');
run;
proc tabulate data=ChiSq2 ;
class row col;
var value;
table row='',col=''*value=''*sum=''*f=6.3;
run;
"WARNING: 31% of the cells have expected counts less than 5, for the table of X by _X
Chi-Square may not be a valid test."
Yes. You need EXACT Chisquare Test , Like:
PROC FREQ DATA=sashelp.heart(obs=1000);
TABLES sex *status/CHISQ MISSING NOPERCENT NOCOL ;
exact chisq ;
RUN;
ods select none;
ods output ChiSq= ChiSq;
PROC FREQ DATA=sashelp.heart(obs=1000);
TABLES (sex status bp_status)*(sex status bp_status)/CHISQ MISSING NOPERCENT NOCOL;
RUN;
ods select all;
data ChiSq2;
set ChiSq(where=(Statistic="Cramer's V"));
row=scan(table,-2,' *');
col=scan(table,-1,' *');
run;
proc tabulate data=ChiSq2 ;
class row col;
var value;
table row='',col=''*value=''*sum=''*f=6.3;
run;
"WARNING: 31% of the cells have expected counts less than 5, for the table of X by _X
Chi-Square may not be a valid test."
Yes. You need EXACT Chisquare Test , Like:
PROC FREQ DATA=sashelp.heart(obs=1000);
TABLES sex *status/CHISQ MISSING NOPERCENT NOCOL ;
exact chisq ;
RUN;
Thank you so much.
So as I understand only one change is needed to change chisq to exact chisq.
(What is the difference in the statistical names? I want to ready the theory)
About my question how to improve running time of kramer matrix.
What is the way?
Thanks
The question is by using my code ( user define the variables via macro variable) how to cause the code to calculate only one time the correlation between 2 variables
(In my code for example calculated correlation between x and y and then between y and x but they are same...)
You seem to be asking about how to write a macro loop that iterates over (i,j) pairs where i <= j. Look at this example. Hope it helps.
%let n = 4;
%macro TRI;
%do i = 1 %to &n.;
%do j = &i. %to &n.;
%put &=i, &=j;
%end;
%end;
%mend;
%TRI;
The key is to start to J loop at &i instead of at 1.
Thank you.
The correlation matrix is not full (not all arguments have value) when I apply this method.
I need to define that Correlation between I and J equal to Correlation between J and I
I used this way that is working 100%
Any idea of any other way to fill the all arguments in matrix?
%let list1var=sex+status+bp_status;
%let list2var=&list1var.;
%let n =%sysfunc(countw(&list1var));
%macro CRAMV;
%do i = 1 %to &n.;
%let var1=%scan(&list1var.,&i.,+);
%do j = &i. %to &n.;
%let var2=%scan(&list2var.,&j.,+);
ods select none;
PROC FREQ DATA=sashelp.heart;
TABLES &VAR1.* (&VAR2.) /CHISQ MISSING NOPERCENT NOCOL;
OUTPUT OUT=ttt CRAMV ;
RUN;
DATA ttt&i.&j.;
RETAIN VAR1 VAR2;
SET ttt;
LENGTH VAR1 VAR2 $32.;
VAR1 ="&VAR1";
VAR2 ="&VAR2";
RUN;
DATA RRR&j.&i.(DROP=VAR1 VAR2 rename=(_VAR2_=VAR2 _VAR1_=VAR1));
RETAIN VAR1 VAR2;
SET ttt&i.&j.;
LENGTH VAR1 VAR2 $32.;
_VAR2_ =VAR1;
_VAR1_ =VAR2;
RUN;
Data qqq;
set ttt&i.&j. RRR&j.&i.;
Run;
PROC APPEND DATA=qqq BASE=Kramer_All_Way2 force;
RUN;
%end;
%end;
%mend;
%CRAMV;
title;
ods select all;
proc tabulate data=Kramer_All_Way2;
class VAR1 VAR2;
var _CRAMV_;
table (VAR1=''),VAR2='' *(_CRAMV_=''*min=''*f=6.5 );
run;
I want to summarize the ways to create karmer matrix ,which calculate correlation between categorical variables.
In my original code I calculated the correlation between each pair of variables.
In the real life with 100,000 rows and 20 variables it took long time ( 10 minutes or more) to run it.
Then I saw how to improve it by calculate only once the correlation between any pair ( so no need to calculate correlation between X and W and W and X. It is enought to calculate only between X and W)
Then I saw that can calculate the correlation between all variables in one step and it took 1 minute or less to calculate it.
Thank you Mr. Ksharp!
Really big time saving!
/***************WAY1---Very slow running time**********************************/
/***************WAY1---Very slow running time**********************************/
/***************WAY1---Very slow running time**********************************/
proc delete data=work._all_;Run;
%let list1var=sex+status+bp_status;
%let list2var=&list1var.;
%let n =%sysfunc(countw(&list1var));
%macro CRAMV;
%do i = 1 %to &n.;
%let var1=%scan(&list1var.,&i.,+);
%do j = 1 %to &n.;
%let var2=%scan(&list2var.,&j.,+);
ods select none;
PROC FREQ DATA=sashelp.heart;
TABLES &VAR1.* (&VAR2.) /CHISQ MISSING NOPERCENT NOCOL;
OUTPUT OUT=ttt CRAMV ;
RUN;
DATA ttt;
RETAIN VAR1 VAR2;
SET ttt;
LENGTH VAR1 VAR2 $32.;
VAR1 ="&VAR1";
VAR2 ="&VAR2";
RUN;
PROC APPEND DATA=ttt BASE=Kramer_All_Way1 force;
RUN;
%end;
%end;
%mend;
%CRAMV;
title;
ods select all;
proc tabulate data=Kramer_All_Way1;
class VAR1 VAR2;
var _CRAMV_;
table (VAR1=''),VAR2='' *(_CRAMV_=''*min=''*f=6.5 );
run;
/***************WAY2---Very slow running time**********************************/
/***************WAY2---Very slow running time**********************************/
/***************WAY2---Very slow running time**********************************/
proc delete data=work._all_;Run;
%let list1var=sex+status+bp_status;
%let list2var=&list1var.;
%let n =%sysfunc(countw(&list1var));
%macro CRAMV;
%do i = 1 %to &n.;
%let var1=%scan(&list1var.,&i.,+);
%do j = &i. %to &n.;
%let var2=%scan(&list2var.,&j.,+);
ods select none;
PROC FREQ DATA=sashelp.heart;
TABLES &VAR1.* (&VAR2.) /CHISQ MISSING NOPERCENT NOCOL;
OUTPUT OUT=ttt CRAMV ;
RUN;
DATA ttt&i.&j.;
RETAIN VAR1 VAR2;
SET ttt;
LENGTH VAR1 VAR2 $32.;
VAR1 ="&VAR1";
VAR2 ="&VAR2";
RUN;
DATA RRR&j.&i.(DROP=VAR1 VAR2 rename=(_VAR2_=VAR2 _VAR1_=VAR1));
RETAIN VAR1 VAR2;
SET ttt&i.&j.;
LENGTH VAR1 VAR2 $32.;
_VAR2_ =VAR1;
_VAR1_ =VAR2;
RUN;
Data qqq;
set ttt&i.&j. RRR&j.&i.;
Run;
PROC APPEND DATA=qqq BASE=Kramer_All_Way2 force;
RUN;
%end;
%end;
%mend;
%CRAMV;
title;
ods select all;
proc tabulate data=Kramer_All_Way2;
class VAR1 VAR2;
var _CRAMV_;
table (VAR1=''),VAR2='' *(_CRAMV_=''*min=''*f=6.5 );
run;
/****Other way for way2 to fill the matrix with all arguments****/
/****Other way for way2 to fill the matrix with all arguments****/
/****Other way for way2 to fill the matrix with all arguments****/
proc delete data=work._all_;Run;
%let list1var=sex+status+bp_status;
%let list2var=&list1var.;
%let n =%sysfunc(countw(&list1var));
%macro CRAMV;
%do i = 1 %to &n.;
%let var1=%scan(&list1var.,&i.,+);
%do j = &i. %to &n.;
%let var2=%scan(&list2var.,&j.,+);
ods select none;
PROC FREQ DATA=sashelp.heart;
TABLES &VAR1.* (&VAR2.) /CHISQ MISSING NOPERCENT NOCOL;
OUTPUT OUT=ttt CRAMV ;
RUN;
DATA ttt;
RETAIN VAR1 VAR2;
SET ttt;
LENGTH VAR1 VAR2 $32.;
VAR1 ="&VAR1";
VAR2 ="&VAR2";
RUN;
PROC APPEND DATA=ttt BASE=Kramer_All_Way2b force;
RUN;
%end;
%end;
%mend;
%CRAMV;
title;
ods select all;
data Kramer_All_Way2b;
set Kramer_All_Way2b;
/***Add Rows **/
output;
if var1 ne var2 then do;
temp=var1;
var1=var2;
var2=temp;
output;
end;
drop temp;
run;
proc tabulate data=Kramer_All_Way2b;
class VAR1 VAR2;
var _CRAMV_;
table (VAR1=''),VAR2='' *(_CRAMV_=''*min=''*f=6.5 );
run;
/*****************WAY3---Much better running time!!!*******************/
/*****************WAY3---Much better running time!!!*******************/
/*****************WAY3---Much better running time!!!*******************/
%let list_var=sex status bp_status;
ods select none;
ods output ChiSq= ChiSq;
PROC FREQ DATA=sashelp.heart;
TABLES (&list_var.)*(&list_var.)/CHISQ MISSING NOPERCENT NOCOL;
RUN;
ods select all;
data ChiSq2;
set ChiSq(where=(Statistic="Cramer's V"));
row=scan(table,-2,' *');
col=scan(table,-1,' *');
run;
proc tabulate data=ChiSq2 ;
class row col;
var value;
table row='',col=''*value=''*sum=''*f=6.5;
run;
Just add one more data step to impute these missing value.
%let list1var=sex+status+bp_status;
%let list2var=&list1var.;
%let n =%sysfunc(countw(&list1var));
%macro CRAMV;
%do i = 1 %to &n.;
%let var1=%scan(&list1var.,&i.,+);
%do j = &i. %to &n.;
%let var2=%scan(&list2var.,&j.,+);
ods select none;
PROC FREQ DATA=sashelp.heart;
TABLES &VAR1.* (&VAR2.) /CHISQ MISSING NOPERCENT NOCOL;
OUTPUT OUT=ttt CRAMV ;
RUN;
DATA ttt;
RETAIN VAR1 VAR2;
SET ttt;
LENGTH VAR1 VAR2 $32.;
VAR1 ="&VAR1";
VAR2 ="&VAR2";
RUN;
PROC APPEND DATA=ttt BASE=Kramer_All_Way2 force;
RUN;
%end;
%end;
%mend;
%CRAMV;
title;
ods select all;
data Kramer_All_Way2;
set Kramer_All_Way2;
output;
if var1 ne var2 then do;
temp=var1;var1=var2;var2=temp;output;
end;
drop temp;
run;
proc tabulate data=Kramer_All_Way2;
class VAR1 VAR2;
var _CRAMV_;
table (VAR1=''),VAR2='' *(_CRAMV_=''*min=''*f=6.5 );
run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.