JohnPederson
Calcite | Level 5

I have hourly measurements from different parts of the city for one complete year (VAR1 - VAR5). I am trying to run both regression and correlation analyses on the measurements from the different parts of the city. For some reason, when I use a BY-group variable, in this case month, PROC CORR returns the same correlation coefficients for each month.

 

Obs  month  hour     var1     var2     var3     var4     var5
  1      1     1  0.60846  0.53326  0.28989  0.38048  0.38778
  2      1     2  0.58190  0.53971  0.27384  0.31625  0.33136
  3      1     3  0.57963  0.54900  0.27453  0.29811  0.32408
  4      2     1  0.59448  0.55839  0.28766  0.31432  0.35230
  5      2     2  0.34333  0.58383  0.48293  0.12345  0.12345
  6      2     3  0.12345  0.12345  0.12345  0.12345  0.12345
 
proc corr data = HAVE outp = corr_out noprint;
	var var1 var2 var3 var4 var5;
	by month;
run;

proc reg data = HAVE outest = reg_out noprint;
	BY month;
	model var1 = var2 var3 var4 var5;
run;

For the correlation matrix, I get the same correlation coefficients between variables, regardless of month. For regression, I get the same RMSE for all months, but with different intercepts for each month. I am looking for the correlation coefficients between variables separately by month and then the same for regression.

 

Examples below:

 

Obs  month  _TYPE_  _NAME_     var1     var2     var3     var4     var5
  1      1  CORR    var1    1.00000  0.90649  0.77150  0.94332  0.96095
  2      1  CORR    var2    0.90649  1.00000  0.49266  0.96825  0.83921
  3      1  CORR    var3    0.77150  0.49266  1.00000  0.67258  0.86697
  4      1  CORR    var4    0.94332  0.96825  0.67258  1.00000  0.93250
  5      1  CORR    var5    0.96095  0.83921  0.86697  0.93250  1.00000
 13      2  CORR    var1    1.00000  0.90649  0.77150  0.94332  0.96095
 14      2  CORR    var2    0.90649  1.00000  0.49266  0.96825  0.83921
 15      2  CORR    var3    0.77150  0.49266  1.00000  0.67258  0.86697
 16      2  CORR    var4    0.94332  0.96825  0.67258  1.00000  0.93250
 17      2  CORR    var5    0.96095  0.83921  0.86697  0.93250  1.00000

 

Obs  month  _MODEL_  _TYPE_  _DEPVAR_      _RMSE_  Intercept     var2     var3      var4     var5  var1
  1      1  MODEL1   PARMS   var1      .002905584  -0.043973  1.16554  0.28112  -0.93889  0.32547    -1
  2      2  MODEL1   PARMS   var1      .002905584  -0.080612  1.16554  0.28112  -0.93889  0.32547    -1
  3      3  MODEL1   PARMS   var1      .002905584  -0.061661  1.16554  0.28112  -0.93889  0.32547    -1
  4      4  MODEL1   PARMS   var1      .002905584  -0.067980  1.16554  0.28112  -0.93889  0.32547    -1
  5      5  MODEL1   PARMS   var1      .002905584  -0.083593  1.16554  0.28112  -0.93889  0.32547    -1
djrisks
Barite | Level 11

Hi, just checking, have you first sorted your data? Also, is the data actually different between the different months?

JohnPederson
Calcite | Level 5

The data are sorted and different between months.

djrisks
Barite | Level 11

Okay, the issue is strange. What happens when you use a WHERE statement to select an individual month instead of the BY statement? Do you still end up with the same coefficients?
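A minimal sketch of that check, assuming the data set HAVE and the variables var1 - var5 from the original post; the output data set names corr_m1 and corr_m2 are made up for illustration:

```sas
/* Run PROC CORR on one month at a time using WHERE instead of BY */
proc corr data = HAVE outp = corr_m1 noprint;
	where month = 1;
	var var1 var2 var3 var4 var5;
run;

proc corr data = HAVE outp = corr_m2 noprint;
	where month = 2;
	var var1 var2 var3 var4 var5;
run;

/* Compare only the correlation rows of the two OUTP= data sets */
proc compare base    = corr_m1(where = (_TYPE_ = 'CORR'))
             compare = corr_m2(where = (_TYPE_ = 'CORR'));
run;
```

If PROC COMPARE reports no unequal values, the identical coefficients come from the data themselves rather than from the BY processing.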

Sajid01
Meteorite | Level 14

Can you please post your data as a SAS data step?

JohnPederson
Calcite | Level 5

I have attached it as a .txt file

djrisks
Barite | Level 11

I used this code, and I did get different intercepts. Some of the coefficients were the same for some of the variables for each month though, and it's probably because of how the data is.

 

proc reg data = example outest = reg_out noprint;
	BY month;
	model channel1 = channel2 channel3 channel4 channel5;
run;

For example, when I plotted just two variables and grouped the regression line by month, the regression lines were on top of each other.

proc sgplot data = example;
  reg x= channel1 y = channel2 / group = month;
run;

[Image: Scatter by month.png]

 

FreelanceReinh
Jade | Level 19

Hello @JohnPederson,

 

Thanks for providing sample data (to be read without the DSD option of the INFILE statement). It turns out that, for every hour (hour=0, 1, ..., 23), the values of variable pclinton differ between any two months only by a constant. See the output of a step like this:

proc sql;
select a.hour, b.pclinton-a.pclinton as dpclinton
from example(where=(month=1)) a,
     example(where=(month=2)) b
where a.hour=b.hour;
quit;

Therefore, in a plot like this

proc sgplot data=example;
series x=hour y=pclinton / group=month;
run;

we see 12 parallel "curves."

 

The same holds for all eleven other analysis variables phrm, ..., channel6 as well. The variable labels "... Predicted Values" suggest that pclinton, phrm, etc. do not contain measured values, but predictions based on some statistical model, which explains the "systematic" differences described above.

 

Since the correlation coefficient is invariant under linear transformations f(x)=ax+b with a>0, in particular translations (f(x)=x+c), the correlation between, e.g., pclinton and phrm must be the same for every month: If the ("predicted") values are X0, ..., X23 (for pclinton) and Y0, ..., Y23 (for phrm) for one month, they are X0+c, ..., X23+c and Y0+d, ..., Y23+d for another month, with constants c and d depending only on the month. The same applies to all other pairs of analysis variables (excluding missing values).
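The translation case can be written out as a short derivation. The primed symbols and the bars for monthly means are notation introduced here; X_h, Y_h, c and d are as in the paragraph above, and the argument assumes the shifts are exactly constant across hours:

```latex
X'_h = X_h + c, \qquad Y'_h = Y_h + d \qquad (h = 0, \dots, 23)

\bar{X'} = \bar{X} + c, \qquad \bar{Y'} = \bar{Y} + d
\;\Longrightarrow\;
X'_h - \bar{X'} = X_h - \bar{X}, \qquad Y'_h - \bar{Y'} = Y_h - \bar{Y}

r' = \frac{\sum_h (X'_h - \bar{X'})(Y'_h - \bar{Y'})}
          {\sqrt{\sum_h (X'_h - \bar{X'})^2}\,\sqrt{\sum_h (Y'_h - \bar{Y'})^2}}
   = \frac{\sum_h (X_h - \bar{X})(Y_h - \bar{Y})}
          {\sqrt{\sum_h (X_h - \bar{X})^2}\,\sqrt{\sum_h (Y_h - \bar{Y})^2}}
   = r
```

The constants cancel as soon as the monthly mean is subtracted, so every term of the correlation formula is identical in the two months.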

 

The "parallel" results of your linear regressions can be explained similarly.
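A sketch of that argument for the regression, under the same assumption of exactly constant monthly shifts. Here y stands for the dependent variable, x_j for the regressors, and c, d_j for the shifts of y and x_j between two months; these symbols are introduced for illustration only:

```latex
y_h = \beta_0 + \sum_j \beta_j x_{jh} + e_h
\qquad \text{(least-squares fit for one month, residuals } e_h\text{)}

y'_h = y_h + c, \qquad x'_{jh} = x_{jh} + d_j

y'_h = \Bigl(\beta_0 + c - \sum_j \beta_j d_j\Bigr) + \sum_j \beta_j x'_{jh} + e_h
```

The shifted data are fitted by the same slopes and the same residuals, with only the intercept absorbing the constants. The least-squares solution for the other month therefore has identical slope estimates and an identical _RMSE_, and differs only in the intercept, which matches the OUTEST= output in the original post.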
