Re: BY group not working in PROC CORR or PROC REG

JohnPederson · Posted 07-06-2021 05:43 PM

I have hourly measurements from different parts of the city for one complete year (VAR1 - VAR5). I am trying to do both regression and correlation between the measurements between different parts of the city. For some reason, when I use a "BY" group variable, in this case month, it returns the same correlation coefficients for each month.

Obs month hour var1 var2 var3 var4 var5123456

1	1	0.60846	0.53326	0.28989	0.38048	0.38778
1	2	0.58190	0.53971	0.27384	0.31625	0.33136
1	3	0.57963	0.54900	0.27453	0.29811	0.32408
2	1	0.59448	0.55839	0.28766	0.31432	0.35230
2	2	0.34333	0.58383	0.48293	0.12345	0.12345
2	3	0.12345	0.12345	0.12345	0.12345	0.12345

proc corr data = HAVE outp = corr_out noprint;
var var1 var2 var3 var4 var5;
	by month;
run;

proc reg data = HAVE outest = reg_out noprint;
	BY month;
	model var1 = var2 var3 var4 var5;
run;

For the correlation matrix, I get the same correlation coefficients between variables, regardless of month. For regression, I get the same RMSE for all months, but with different intercepts for each month. I am looking for the correlation coefficients between variables separately by month and then the same for regression.

Examples below:

Obs month _TYPE_ _NAME_ var1 var2 var3 var4 var5123451314151617

1	CORR	var1	1.00000	0.90649	0.77150	0.94332	0.96095
1	CORR	var2	0.90649	1.00000	0.49266	0.96825	0.83921
1	CORR	var3	0.77150	0.49266	1.00000	0.67258	0.86697
1	CORR	var4	0.94332	0.96825	0.67258	1.00000	0.93250
1	CORR	var5	0.96095	0.83921	0.86697	0.93250	1.00000
2	CORR	var1	1.00000	0.90649	0.77150	0.94332	0.96095
2	CORR	var2	0.90649	1.00000	0.49266	0.96825	0.83921
2	CORR	var3	0.77150	0.49266	1.00000	0.67258	0.86697
2	CORR	var4	0.94332	0.96825	0.67258	1.00000	0.93250
2	CORR	var5	0.96095	0.83921	0.86697	0.93250	1.00000

Obs month _MODEL_ _TYPE_ _DEPVAR_ _RMSE_ Intercept var2 var3 var4 var5 var112345

1	MODEL1	PARMS	var1	.002905584	-0.043973	1.16554	0.28112	-0.93889	0.32547	-1
2	MODEL1	PARMS	var1	.002905584	-0.080612	1.16554	0.28112	-0.93889	0.32547	-1
3	MODEL1	PARMS	var1	.002905584	-0.061661	1.16554	0.28112	-0.93889	0.32547	-1
4	MODEL1	PARMS	var1	.002905584	-0.067980	1.16554	0.28112	-0.93889	0.32547	-1
5	MODEL1	PARMS	var1	.002905584	-0.083593	1.16554	0.28112	-0.93889	0.32547	-1

djrisks · Posted 07-06-2021 06:22 PM

Hi, just checking, have you first sorted your data? Also, is the data actually different between the different months?

JohnPederson · Posted 07-07-2021 09:45 AM

The data are sorted and different between months.

djrisks · Posted 07-07-2021 10:13 AM

Okay, the issue is strange. What happens when you try to use the WHERE statement to select the individual month instead of the BYstatement? Do you still end up with the same coefficients?

Sajid01 · Posted 07-07-2021 10:24 AM

Can you please post your data as a SAS datastep

JohnPederson · Posted 07-07-2021 11:29 AM

I have attached it as a .txt file

djrisks · Posted 07-07-2021 01:45 PM

I used this code, and I did get different intercepts. Some of the coefficients were the same for some of the variables for each month though, and it's probably because of how the data is.

proc reg data = example outest = reg_out noprint;
	BY month;
	model channel1 = channel2 channel3 channel4 channel5;
run;

For, example when I plotted just two variables, and grouped the regression line by month, the regression lines were on top of each other.

proc sgplot data = example;
  reg x= channel1 y = channel2 / group = month;
run;

FreelanceReinh · Posted 07-07-2021 02:32 PM

Hello @JohnPederson,

Thanks for providing sample data (to be read without the DSD option of the INFILE statement). It turns out that the values of variable pclinton differ between any two months only by a constant for every hour (hour=0, 1, ..., 23). See the output of a step like this:

proc sql;
select a.hour, b.pclinton-a.pclinton as dpclinton
from example(where=(month=1)) a,
     example(where=(month=2)) b
where a.hour=b.hour;
quit;

Therefore, in a plot like this

proc sgplot data=example;
series x=hour y=pclinton / group=month;
run;

we see 12 parallel "curves."

The same holds for all eleven other analysis variables phrm, ..., channel6 as well. The variable labels "... Predicted Values" suggest that pclinton, phrm, etc. do not contain measured values, but predictions based on some statistical model, which explains the "systematic" differences described above.

Since the correlation coefficient is invariant under linear transformations f(x)=ax+b with a>0, in particular translations (f(x)=x+c), the correlation between, e.g., pclinton and phrm must be the same for every month: If the ("predicted") values are X0, ..., X23 (for pclinton) and Y0, ..., Y23 (for phrm) for one month, they are X0+c, ..., X23+c and Y0+d, ..., Y23+d for another month, with constants c and d depending only on the month. The same applies to all other pairs of analysis variables (excluding missing values).

The "parallel" results of your linear regressions can be explained similarly.

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!