I have hourly measurements from different parts of the city for one complete year (VAR1 - VAR5). I am trying to do both regression and correlation between the measurements between different parts of the city. For some reason, when I use a "BY" group variable, in this case month, it returns the same correlation coefficients for each month.
1 | 1 | 0.60846 | 0.53326 | 0.28989 | 0.38048 | 0.38778 |
1 | 2 | 0.58190 | 0.53971 | 0.27384 | 0.31625 | 0.33136 |
1 | 3 | 0.57963 | 0.54900 | 0.27453 | 0.29811 | 0.32408 |
2 | 1 | 0.59448 | 0.55839 | 0.28766 | 0.31432 | 0.35230 |
2 | 2 | 0.34333 | 0.58383 | 0.48293 | 0.12345 | 0.12345 |
2 | 3 | 0.12345 | 0.12345 | 0.12345 | 0.12345 | 0.12345 |
proc corr data = HAVE outp = corr_out noprint;
var var1 var2 var3 var4 var5;
by month;
run;
proc reg data = HAVE outest = reg_out noprint;
BY month;
model var1 = var2 var3 var4 var5;
run;
For the correlation matrix, I get the same correlation coefficients between variables, regardless of month. For regression, I get the same RMSE for all months, but with different intercepts for each month. I am looking for the correlation coefficients between variables separately by month and then the same for regression.
Examples below:
1 | CORR | var1 | 1.00000 | 0.90649 | 0.77150 | 0.94332 | 0.96095 |
1 | CORR | var2 | 0.90649 | 1.00000 | 0.49266 | 0.96825 | 0.83921 |
1 | CORR | var3 | 0.77150 | 0.49266 | 1.00000 | 0.67258 | 0.86697 |
1 | CORR | var4 | 0.94332 | 0.96825 | 0.67258 | 1.00000 | 0.93250 |
1 | CORR | var5 | 0.96095 | 0.83921 | 0.86697 | 0.93250 | 1.00000 |
2 | CORR | var1 | 1.00000 | 0.90649 | 0.77150 | 0.94332 | 0.96095 |
2 | CORR | var2 | 0.90649 | 1.00000 | 0.49266 | 0.96825 | 0.83921 |
2 | CORR | var3 | 0.77150 | 0.49266 | 1.00000 | 0.67258 | 0.86697 |
2 | CORR | var4 | 0.94332 | 0.96825 | 0.67258 | 1.00000 | 0.93250 |
2 | CORR | var5 | 0.96095 | 0.83921 | 0.86697 | 0.93250 | 1.00000 |
1 | MODEL1 | PARMS | var1 | .002905584 | -0.043973 | 1.16554 | 0.28112 | -0.93889 | 0.32547 | -1 |
2 | MODEL1 | PARMS | var1 | .002905584 | -0.080612 | 1.16554 | 0.28112 | -0.93889 | 0.32547 | -1 |
3 | MODEL1 | PARMS | var1 | .002905584 | -0.061661 | 1.16554 | 0.28112 | -0.93889 | 0.32547 | -1 |
4 | MODEL1 | PARMS | var1 | .002905584 | -0.067980 | 1.16554 | 0.28112 | -0.93889 | 0.32547 | -1 |
5 | MODEL1 | PARMS | var1 | .002905584 | -0.083593 | 1.16554 | 0.28112 | -0.93889 | 0.32547 | -1 |
Hi, just checking, have you first sorted your data? Also, is the data actually different between the different months?
The data are sorted and different between months.
Okay, the issue is strange. What happens when you try to use the WHERE statement to select the individual month instead of the BYstatement? Do you still end up with the same coefficients?
Can you please post your data as a SAS datastep
I have attached it as a .txt file
I used this code, and I did get different intercepts. Some of the coefficients were the same for some of the variables for each month though, and it's probably because of how the data is.
proc reg data = example outest = reg_out noprint;
BY month;
model channel1 = channel2 channel3 channel4 channel5;
run;
For, example when I plotted just two variables, and grouped the regression line by month, the regression lines were on top of each other.
proc sgplot data = example;
reg x= channel1 y = channel2 / group = month;
run;
Hello @JohnPederson,
Thanks for providing sample data (to be read without the DSD option of the INFILE statement). It turns out that the values of variable pclinton differ between any two months only by a constant for every hour (hour=0, 1, ..., 23). See the output of a step like this:
proc sql;
select a.hour, b.pclinton-a.pclinton as dpclinton
from example(where=(month=1)) a,
example(where=(month=2)) b
where a.hour=b.hour;
quit;
Therefore, in a plot like this
proc sgplot data=example;
series x=hour y=pclinton / group=month;
run;
we see 12 parallel "curves."
The same holds for all eleven other analysis variables phrm, ..., channel6 as well. The variable labels "... Predicted Values" suggest that pclinton, phrm, etc. do not contain measured values, but predictions based on some statistical model, which explains the "systematic" differences described above.
Since the correlation coefficient is invariant under linear transformations f(x)=ax+b with a>0, in particular translations (f(x)=x+c), the correlation between, e.g., pclinton and phrm must be the same for every month: If the ("predicted") values are X0, ..., X23 (for pclinton) and Y0, ..., Y23 (for phrm) for one month, they are X0+c, ..., X23+c and Y0+d, ..., Y23+d for another month, with constants c and d depending only on the month. The same applies to all other pairs of analysis variables (excluding missing values).
The "parallel" results of your linear regressions can be explained similarly.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.