Hello,
I have 20 years of data and for every year, I estimated the correlations for all the variables using the following code.
proc corr data=... out=pearson_corr;
var x1 x2 x3 x4;
with x1 x2 x3 x4;
by year
run;
In the next step I would like to get a correlation matrix which shows the mean of this 20 correlation matrixes.
Does anyone have an idea how this is possible? Using proc means I dont get the right result.
Thanks a lot for your help!
@DomUk wrote:
Thanks a lot. I get already the right numbers, however on the x-axis the varaibles are not sorted (so the diagonal "1-line" is not there, instead the 1-values are distributed thorughout the whole table.
I see. Your variable names are not alphabetically sorted (unlike x1, x2, x3, x4), are they? In this case you may want to retrieve their sort order from dataset pearson_corr and finally use this information to sort the output dataset from PROC SUMMARY.
As an example, let's assume the variable names are h, d, s, q.
%let vars=h d s q;
/* Compute correlation coefficients by year */
proc corr data=have out=pearson_corr noprint;
var &vars;
by year;
run;
/* Retrieve sort order of analysis variables */
data _vars(keep=_seqno _name_);
set pearson_corr(firstobs=4);
_seqno=_n_;
if _type_~=:'C' then stop;
run;
/* Compute mean values of correlation coefficients */
proc summary data=pearson_corr nway;
class _name_;
var &vars;
output out=_tmp(drop=_type_ _freq_) mean=;
run;
/* Sort the resulting dataset */
proc sql;
create table want as
select a.* from _tmp a, _vars b
where a._name_=b._name_
order by _seqno;
drop table _vars, _tmp;
quit;
EDIT: If you store the variable names in macro variable VARS (as shown above), you can also use this macro variable to obtain the desired sort order. That is, you can omit the DATA step (data _vars ...) and replace the PROC SQL step by this modified version:
proc sql;
create table want as
select * from _tmp
order by whichc(upcase(_name_),"%sysfunc(tranwrd(%upcase(&vars),%str( ),%str(",")))");
drop table _tmp;
quit;
If you have SAS/IML, you can do this in a few lines of code:
data Have;
call streaminit(12345);
do year = 1 to 20;
do i = 1 to 100;
x1 = rand("Normal");
x2 = x1 + rand("Normal");
x3 = x2 + rand("Normal");
x4 = x3 + rand("Normal");
output;
end;
end;
run;
proc corr data=Have out=pearson_corr noprint;
by year;
var x1 x2 x3 x4;
with x1 x2 x3 x4;
run;
proc iml;
use pearson_corr(where=(_TYPE_='CORR'));
read all var {x1 x2 x3 x4} into X;
close;
/* "flatten" the 4x4 matrices into a 1 x 16 row vector */
Y = shape(X, 0, 16);
/* compute the mean of each column */
mean = mean(Y);
/* reshape into 4x4 matrix */
meanCorr = shape(mean, 0, 4);
print meanCorr;
If you don't have SAS/IML, you can do the same operations, but you'll need to use the DATA step and arrays to reshape the data, then use PROC MEANS to find the means of the 16-variable data.
Hello @DomUk,
Try this:
proc summary data=pearson_corr nway;
class _name_;
var x:;
output out=want(keep=_name_ x:) mean=;
run;
[Edit: Removed redundant statement where _type_=:'C';]
@DomUk wrote:
Thanks a lot. I get already the right numbers, however on the x-axis the varaibles are not sorted (so the diagonal "1-line" is not there, instead the 1-values are distributed thorughout the whole table.
I see. Your variable names are not alphabetically sorted (unlike x1, x2, x3, x4), are they? In this case you may want to retrieve their sort order from dataset pearson_corr and finally use this information to sort the output dataset from PROC SUMMARY.
As an example, let's assume the variable names are h, d, s, q.
%let vars=h d s q;
/* Compute correlation coefficients by year */
proc corr data=have out=pearson_corr noprint;
var &vars;
by year;
run;
/* Retrieve sort order of analysis variables */
data _vars(keep=_seqno _name_);
set pearson_corr(firstobs=4);
_seqno=_n_;
if _type_~=:'C' then stop;
run;
/* Compute mean values of correlation coefficients */
proc summary data=pearson_corr nway;
class _name_;
var &vars;
output out=_tmp(drop=_type_ _freq_) mean=;
run;
/* Sort the resulting dataset */
proc sql;
create table want as
select a.* from _tmp a, _vars b
where a._name_=b._name_
order by _seqno;
drop table _vars, _tmp;
quit;
EDIT: If you store the variable names in macro variable VARS (as shown above), you can also use this macro variable to obtain the desired sort order. That is, you can omit the DATA step (data _vars ...) and replace the PROC SQL step by this modified version:
proc sql;
create table want as
select * from _tmp
order by whichc(upcase(_name_),"%sysfunc(tranwrd(%upcase(&vars),%str( ),%str(",")))");
drop table _tmp;
quit;
You changed
set pearson_corr(firstobs=4);
to
set pearson_corr(firstobs=21);
probably assuming that the "4" meant the number of variables -- but that was pure coincidence. The SET statement must start at the fourth observation irrespective of the number of variables.
Yes, it's possible to obtain average p-values as well.
Here's a complete example:
data have;
call streaminit(27182818);
do year=1 to 20;
do _n_=1 to 100;
d=rand('uniform');
h=rand('uniform');
s=rand('uniform');
q=rand('uniform');
output;
end;
end;
run;
%let vars=h d s q; /* Variable order may be defined differently than in the dataset. */
/* Compute correlation coefficients by year */
ods output pearsoncorr=pearson_corr;
proc corr data=have;
var &vars;
by year;
run;
/* Compute mean values of correlation coefficients and p-values */
proc summary data=pearson_corr nway;
class Variable;
var _numeric_;
output out=_tmp(drop=year _:) mean=;
run;
/* Sort the resulting dataset */
proc sql;
create table want as
select * from _tmp
order by whichc(upcase(Variable),"%sysfunc(tranwrd(%upcase(&vars),%str( ),%str(",")))");
drop table _tmp;
quit;
Result:
Variable h d s q Ph Pd Ps Pq h 1.00000 0.00485 0.01906 0.01108 . 0.5860 0.4575 0.5198 d 0.00485 1.00000 0.00465 -0.00434 0.5860 . 0.4407 0.4343 s 0.01906 0.00465 1.00000 -0.01984 0.4575 0.4407 . 0.5118 q 0.01108 -0.00434 -0.01984 1.00000 0.5198 0.4343 0.5118 .
The variable list _numeric_ in the PROC SUMMARY step includes year (which is then dropped from dataset _tmp). Alternatively, you could use a different variable list:
proc summary data=pearson_corr nway;
class Variable;
var &vars p:;
output out=_tmp(drop=_:) mean=;
run;
This works in the current example (where none of the variable names in &vars starts with "P"). In other cases you might need to construct the variable list for the VAR statement differently (e.g. similarly to the TRANWRD technique used in the PROC SQL step).
Hi @DomUk,
Yes, you can ignore that note. It refers to the WHICHC expression (in the ORDER BY clause), which indeed "doesn't appear in [the] SELECT clause" -- but exactly this is intended.
What doesn't work? Explain. Show us.
Of course, you can apply the same techniques to Spearman correlation coefficients that we used for Pearson correlation coefficients. Just specify the appropriate option SPEARMAN in the PROC CORR statement.
proc corr data=mylib.endversion spearman;
Otherwise you will get this warning in the log:
WARNING: Output 'SpearmanCorr' was not created. Make sure that the output object name, label, or path is spelled correctly. Also, verify that the appropriate procedure options are used to produce the requested output object. For example, verify that the NOPRINT option is not used.
The request to "verify that the appropriate procedure options are used to produce the requested output object" means: Check column "Option" in the table of ODS table names, which is found under "Details" in the procedure documentation. For PROC CORR: https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/procstat/procstat_corr_details24.htm.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.