- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I have hourly measurements from different parts of the city for one complete year (VAR1 - VAR5). I am trying to do both regression and correlation between the measurements between different parts of the city. For some reason, when I use a "BY" group variable, in this case month, it returns the same correlation coefficients for each month.
1 | 1 | 0.60846 | 0.53326 | 0.28989 | 0.38048 | 0.38778 |
1 | 2 | 0.58190 | 0.53971 | 0.27384 | 0.31625 | 0.33136 |
1 | 3 | 0.57963 | 0.54900 | 0.27453 | 0.29811 | 0.32408 |
2 | 1 | 0.59448 | 0.55839 | 0.28766 | 0.31432 | 0.35230 |
2 | 2 | 0.34333 | 0.58383 | 0.48293 | 0.12345 | 0.12345 |
2 | 3 | 0.12345 | 0.12345 | 0.12345 | 0.12345 | 0.12345 |
proc corr data = HAVE outp = corr_out noprint;
var var1 var2 var3 var4 var5;
by month;
run;
proc reg data = HAVE outest = reg_out noprint;
BY month;
model var1 = var2 var3 var4 var5;
run;
For the correlation matrix, I get the same correlation coefficients between variables, regardless of month. For regression, I get the same RMSE for all months, but with different intercepts for each month. I am looking for the correlation coefficients between variables separately by month and then the same for regression.
Examples below:
1 | CORR | var1 | 1.00000 | 0.90649 | 0.77150 | 0.94332 | 0.96095 |
1 | CORR | var2 | 0.90649 | 1.00000 | 0.49266 | 0.96825 | 0.83921 |
1 | CORR | var3 | 0.77150 | 0.49266 | 1.00000 | 0.67258 | 0.86697 |
1 | CORR | var4 | 0.94332 | 0.96825 | 0.67258 | 1.00000 | 0.93250 |
1 | CORR | var5 | 0.96095 | 0.83921 | 0.86697 | 0.93250 | 1.00000 |
2 | CORR | var1 | 1.00000 | 0.90649 | 0.77150 | 0.94332 | 0.96095 |
2 | CORR | var2 | 0.90649 | 1.00000 | 0.49266 | 0.96825 | 0.83921 |
2 | CORR | var3 | 0.77150 | 0.49266 | 1.00000 | 0.67258 | 0.86697 |
2 | CORR | var4 | 0.94332 | 0.96825 | 0.67258 | 1.00000 | 0.93250 |
2 | CORR | var5 | 0.96095 | 0.83921 | 0.86697 | 0.93250 | 1.00000 |
1 | MODEL1 | PARMS | var1 | .002905584 | -0.043973 | 1.16554 | 0.28112 | -0.93889 | 0.32547 | -1 |
2 | MODEL1 | PARMS | var1 | .002905584 | -0.080612 | 1.16554 | 0.28112 | -0.93889 | 0.32547 | -1 |
3 | MODEL1 | PARMS | var1 | .002905584 | -0.061661 | 1.16554 | 0.28112 | -0.93889 | 0.32547 | -1 |
4 | MODEL1 | PARMS | var1 | .002905584 | -0.067980 | 1.16554 | 0.28112 | -0.93889 | 0.32547 | -1 |
5 | MODEL1 | PARMS | var1 | .002905584 | -0.083593 | 1.16554 | 0.28112 | -0.93889 | 0.32547 | -1 |
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi, just checking, have you first sorted your data? Also, is the data actually different between the different months?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
The data are sorted and different between months.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Okay, the issue is strange. What happens when you try to use the WHERE statement to select the individual month instead of the BYstatement? Do you still end up with the same coefficients?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Can you please post your data as a SAS datastep
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I have attached it as a .txt file
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I used this code, and I did get different intercepts. Some of the coefficients were the same for some of the variables for each month though, and it's probably because of how the data is.
proc reg data = example outest = reg_out noprint;
BY month;
model channel1 = channel2 channel3 channel4 channel5;
run;
For, example when I plotted just two variables, and grouped the regression line by month, the regression lines were on top of each other.
proc sgplot data = example;
reg x= channel1 y = channel2 / group = month;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello @JohnPederson,
Thanks for providing sample data (to be read without the DSD option of the INFILE statement). It turns out that the values of variable pclinton differ between any two months only by a constant for every hour (hour=0, 1, ..., 23). See the output of a step like this:
proc sql;
select a.hour, b.pclinton-a.pclinton as dpclinton
from example(where=(month=1)) a,
example(where=(month=2)) b
where a.hour=b.hour;
quit;
Therefore, in a plot like this
proc sgplot data=example;
series x=hour y=pclinton / group=month;
run;
we see 12 parallel "curves."
The same holds for all eleven other analysis variables phrm, ..., channel6 as well. The variable labels "... Predicted Values" suggest that pclinton, phrm, etc. do not contain measured values, but predictions based on some statistical model, which explains the "systematic" differences described above.
Since the correlation coefficient is invariant under linear transformations f(x)=ax+b with a>0, in particular translations (f(x)=x+c), the correlation between, e.g., pclinton and phrm must be the same for every month: If the ("predicted") values are X0, ..., X23 (for pclinton) and Y0, ..., Y23 (for phrm) for one month, they are X0+c, ..., X23+c and Y0+d, ..., Y23+d for another month, with constants c and d depending only on the month. The same applies to all other pairs of analysis variables (excluding missing values).
The "parallel" results of your linear regressions can be explained similarly.