It was helpful to use work.source instead of source. However, I still lose CVDf1, CVDf2, CVDf3, and CVDf4 after a regression step. The final answer is CVDf5, and the program does produce it, even though that answer depends on CVDf1, CVDf2, CVDf3, and CVDf4. If I stop before the final data step that derives CVDf5 (and the proc corr that follows it), then CVDf1, CVDf2, CVDf3, and CVDf4 are included in work.source. If I run through the final step to derive CVDf5, then CVDf5 is included in the output, but CVDf1, CVDf2, CVDf3, and CVDf4 are not. The log says: WARNING: Variable CVDF1 not found in data set WORK.SOURCE.
How does it lose CVDf1, CVDf2, CVDf3, and CVDf4 in the final proc corr? The last data step must have those variables because CVDf5, the final answer, depends on them.
Thanks
SAS code:
data work.source;
set projects.source;
CVDf1=
- pmeat17KCsW * 5.49 * 0.0443
- rmeat17KCsW * 50.70 * 0.0560
- fish17KCsW * 10.01 * 0.0412
- milk17KCsW * 25.37 * 0.0398
- poultry16KCsW * 45.06 * 0.0867
- eggs16KCsW * 19.47 * 0.1544
- SFA16KCsW * 191.27 * 0.0603 * 0.46
- PUFA17KCsW * 82.24 * 0.1033 * 0.46
- TFA17KCsW * 13.40 * 0.0128 * 0.46
- Alcohol17KCsW * 81.71 * 0.0047
+ Sugarb17KCsW * 297.65 * 0.0136
+ potatoes16KCsW * 84.16 * 0.0024
- corn16KCsW * 34.67 * 0.0037
- fruits17KCsW * 40.39 * 0.1291
- Vegetables17KCsW * 80.14 * 0.0127
- nutsseeds17KCsW * 8.51 * 0.0797
- wgrains17KCsW * 55.65 * 0.0376
- legumes17KCsW * 51.66 * 0.0005
+ rice16KCsW * 141.23 * 0.0001
- swtpot16KCsW * 22.67 * 0.0270
;
CVDf2=
+ smoke17msW * 0.2046 * 0.08899
+ SLTobacco17msW * 0.0680 * 0.08179
+ kidneydz17msW * 0.056 * 0.037636
;
CVDf3=
+ T1DM17msW * 10.34 * 0.1169
+ T2DM17msW * 17.47 * 0.05193
;
run;quit;
proc corr data=work.source fisher;
label
CVDf1=Combination of 20 diet risk factors
CVDf2= Kidney Dz, Smoking tobacco, sublingual tobacco
CVDf3=Types 1 and 2 DM;
var CVDf1 CVDf2 CVDf3 SBP17msW sex_IDsW;
with CVD2017m;
run;quit;
Proc reg data=work.source;
label
CVDf1=Combination of 20 diet risk factors
CVDf2= Kidney Dz, Smoking tobacco, sublingual tobacco
CVDf3=Types 1 and 2 DM;
;
model CVD2017msW=CVDf1 CVDf2 CVDf3 SBP17msW sex_IDsW
/ selection=STEPWISE
slentry=.25 slstay=.25;
run; quit;
*
CVDf1 0.01476 0.00041890 716.32252 1242.01 <.0001
CVDf2 16.45730 0.78178 255.58277 443.15 <.0001
CVDf3 0.11176 0.00481 311.03348 539.29 <.0001
SBP17msW 0.17041 0.00881 215.71423 374.02 <.0001
sex_IDsW -0.17070 0.01396 86.21686 149.49 <.0001
1 CVDf1 Combination of 20 diet risk factors 1 0.1717 0.1717 3424.47 1626.24 <.0001
2 CVDf2 Child wt, air pollution, Kidney Dz 2 0.1530 0.3247 1345.69 1776.62 <.0001
3 CVDf3 Types 1 and 2 DM 3 0.0630 0.3877 491.270 806.32 <.0001
4 SBP17msW Systolic BP mm Hg 4 0.0250 0.4126 153.489 333.47 <.0001
5 sex_IDsW Sex male 1 and female 2 5 0.0110 0.4236 6.0000 149.49 <.0001
;
*CVD formula R2=0.4236;
data work.source;
set projects.source;
label
CVDf1=Combination of 20 diet risk factors
CVDf2= Kidney Dz, Smoking tobacco, sublingual tobacco
CVDf3=Types 1 and 2 DM
CVDf4=Mult reg CVDf1 CVDf2 CVDf3 SBP sex ;
CVDf4=
+ CVDf1 * 0.01476
+ CVDf2 * 16.45730
+ CVDf3 * 0.11176
+ SBP17msW * 0.17041
- sex_IDsW * 0.17070
;
run; quit;
proc corr data=work.source fisher;
label
CVDf1=Combination of 20 diet risk factors
CVDf2= Kidney Dz, Smoking tobacco, sublingual tobacco
CVDf3=Types 1 and 2 DM
CVDf4=Mult reg CVDf1 CVDf2 CVDf3 SBP sex ;
var CVDf1 CVDf2 CVDf3 CVDf4 SBP17msW sex_IDsW;
with CVD2017m;
run;quit;
*CVD formula R2=0.4236;
data work.source;
set projects.source;
CVDf5=
- pmeat17KCsW * 0.09
- rmeat17KCsW * 1.14
- fish17KCsW * 0.17
- milk17KCsW * 0.39
- poultry16KCsW * 1.54
- eggs16kcsW * 1.23
- SFA16KCsW * 2.10
- PUFA17KCsW * 1.56
- TFA17KCsW * 0.03
- ALCOHOL17KCsW * 0.13
+ Sugarb17KCsW * 1.59
+ potatoes16KCsW * 0.09
- corn16kcsW * 0.06
- fruits17KCsW * 2.12
- vegetables17KCsW* 0.38
- nutsseeds17KCsW * 0.27
- wgrains17KCsW * 0.88
- legumes17kcsW * 0.01
+ rice16kcsW * 0.00
- swtpot16kcsW * 0.27
+ smoke17msW * 8.45
+ SLTobacco17msW * 2.57
+ kidneydz17msW * 0.98
+ T1DM17msW * 3.80
+ T2DM17msW * 2.85
+ SBP17msW * 4.83
- sex_IDsW * 4.84
;
run; quit;
proc corr data=work.source fisher;
label
CVDf1=Combination of 20 diet risk factors
CVDf2= Kidney Dz, Smoking tobacco, sublingual tobacco
CVDf3=Types 1 and 2 DM
CVDf4=Mult reg CVDf1 CVDf2 CVDf3 SBP sex
CVDf5=Final CVD risk factor formula;
var CVDf5 SBP17msW sex_IDsW;
with CVD2017m;
run;quit;
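A reduced sketch of the same pattern, in case it helps narrow this down. It is not my project code: sashelp.class stands in for projects.source, work.demo stands in for work.source, and f1 and f2 are hypothetical stand-ins for CVDf1 and CVDf5. Both DATA steps write the same output data set, and both read the original table on the SET statement.

```sas
/* Hypothetical stand-ins: sashelp.class plays the role of projects.source, */
/* work.demo plays the role of work.source, f1 and f2 play CVDf1 and CVDf5. */
data work.demo;
  set sashelp.class;
  f1 = height * 2;      /* first derived variable */
run;

data work.demo;         /* same output name as the step above */
  set sashelp.class;    /* reads the original table again, not work.demo */
  f2 = weight * 2;      /* second derived variable */
run;

/* List the variable names that work.demo holds after both steps */
proc contents data=work.demo short;
run;
```

In this sketch the second step rebuilds work.demo from whatever SET reads, so f1 should no longer be listed afterwards. If the same thing applies to my program, changing the later steps to read work.source on the SET statement (rather than projects.source) may be what keeps CVDf1 through CVDf4, but that is an assumption I have not confirmed against the real data.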
results:
The REG Procedure
Model: MODEL1
Dependent Variable: CVD2017msW CVD/100k/year ages 15-69
Number of Observations Read 7846
Number of Observations Used 7846
Stepwise Selection: Step 1
Variable CVDf1 Entered: R-Square = 0.1717 and C(p) = 3424.469
Analysis of Variance
Source DF Sum of Squares Mean Square F Value Pr > F
Model 1 1347.15123 1347.15123 1626.24 <.0001
Error 7844 6497.84877 0.82838
Corrected Total 7845 7845.00000
Variable Parameter Estimate Standard Error Type II SS F Value Pr > F
Intercept -5.1482E-14 0.01028 2.07946E-23 0.00 1.0000
CVDf1 0.01917 0.00047543 1347.15123 1626.24 <.0001
Bounds on condition number: 1, 1
Stepwise Selection: Step 2
Variable CVDf2 Entered: R-Square = 0.3247 and C(p) = 1345.693
Analysis of Variance
Source DF Sum of Squares Mean Square F Value Pr > F
Model 2 2547.22254 1273.61127 1885.50 <.0001
Error 7843 5297.77746 0.67548
Corrected Total 7845 7845.00000
Variable Parameter Estimate Standard Error Type II SS F Value Pr > F
Intercept -4.5343E-14 0.00928 1.61312E-23 0.00 1.0000
CVDf1 0.01823 0.00042990 1214.49377 1797.98 <.0001
CVDf2 21.80856 0.51740 1200.07130 1776.62 <.0001
Bounds on condition number: 1.0027, 4.0109
Stepwise Selection: Step 3
Variable CVDf3 Entered: R-Square = 0.3877 and C(p) = 491.2699
Analysis of Variance
Source DF Sum of Squares Mean Square F Value Pr > F
Model 3 3041.15785 1013.71928 1654.84 <.0001
Error 7842 4803.84215 0.61258
Corrected Total 7845 7845.00000
Variable Parameter Estimate Standard Error Type II SS F Value Pr > F
Intercept -4.4852E-14 0.00884 1.57839E-23 0.00 1.0000
CVDf1 0.01438 0.00043120 681.70150 1112.84 <.0001
CVDf2 23.56816 0.49661 1379.71384 2252.30 <.0001
CVDf3 0.13686 0.00482 493.93531 806.32 <.0001
Bounds on condition number: 1.1212, 9.7564
Stepwise Selection: Step 4
Variable SBP17msW Entered: R-Square = 0.4126 and C(p) = 153.4894
Analysis of Variance
Source DF Sum of Squares Mean Square F Value Pr > F
Model 4 3237.12365 809.28091 1377.11 <.0001
Error 7841 4607.87635 0.58766
Corrected Total 7845 7845.00000
Variable Parameter Estimate Standard Error Type II SS F Value Pr > F
Intercept -1.2225E-13 0.00865 1.17256E-22 0.00 1.0000
CVDf1 0.01453 0.00042241 695.29876 1183.16 <.0001
CVDf2 23.97928 0.48692 1425.21520 2425.22 <.0001
CVDf3 0.11914 0.00482 359.11233 611.08 <.0001
SBP17msW 0.16191 0.00887 195.96581 333.47 <.0001
Bounds on condition number: 1.1686, 17.406
Stepwise Selection: Step 5
Variable sex_IDsW Entered: R-Square = 0.4236 and C(p) = 6.0000
Analysis of Variance
Source DF Sum of Squares Mean Square F Value Pr > F
Model 5 3323.34051 664.66810 1152.45 <.0001
Error 7840 4521.65949 0.57674
Corrected Total 7845 7845.00000
Variable Parameter Estimate Standard Error Type II SS F Value Pr > F
Intercept -1.2845E-13 0.00857 1.29454E-22 0.00 1.0000
CVDf1 0.01476 0.00041890 716.32252 1242.01 <.0001
CVDf2 16.45730 0.78178 255.58277 443.15 <.0001
CVDf3 0.11176 0.00481 311.03348 539.29 <.0001
SBP17msW 0.17041 0.00881 215.71423 374.02 <.0001
sex_IDsW -0.17070 0.01396 86.21686 149.49 <.0001
Bounds on condition number: 2.6811, 43.454
All variables left in the model are significant at the 0.2500 level.
All variables have been entered into the model.
Summary of Stepwise Selection
Step Variable Entered Variable Removed Label Number Vars In Partial R-Square Model R-Square C(p) F Value Pr > F
1 CVDf1 Combination of 20 diet risk factors 1 0.1717 0.1717 3424.47 1626.24 <.0001
2 CVDf2 Kidney Dz, Smoking tobacco, sublingual tobacco 2 0.1530 0.3247 1345.69 1776.62 <.0001
3 CVDf3 Types 1 and 2 DM 3 0.0630 0.3877 491.270 806.32 <.0001
4 SBP17msW Systolic BP mm Hg 4 0.0250 0.4126 153.489 333.47 <.0001
5 sex_IDsW Sex male 1 and female 2 5 0.0110 0.4236 6.0000 149.49 <.0001
The REG Procedure
Model: MODEL1
Dependent Variable: CVD2017msW CVD/100k/year ages 15-69
Panel of heat maps of residuals by regressors for CVD2017msW.
The CORR Procedure
1 With Variables: CVD2017m
6 Variables: CVDf1 CVDf2 CVDf3 CVDf4 SBP17msW sex_IDsW
Simple Statistics
Variable N Mean Std Dev Sum Minimum Maximum Label
CVD2017m 7846 543.66067 288.00939 4265562 73.47499 1844 CVD/100k/year ages 15-69
CVDf1 0 . . . . . Combination of 20 diet risk factors
CVDf2 0 . . . . . Kidney Dz, Smoking tobacco, sublingual tobacco
CVDf3 0 . . . . . Types 1 and 2 DM
CVDf4 0 . . . . . Mult reg CVDf1 CVDf2 CVDf3 SBP sex
SBP17msW 7846 4.9045E-13 1.00000 3.84807E-9 -2.43011 3.23505 Systolic BP mm Hg
sex_IDsW 7846 0 1.00000 0 -0.99994 0.99994 Sex male 1 and female 2
Pearson Correlation Coefficients
Prob > |r| under H0: Rho=0
Number of Observations
With variable: CVD2017m (CVD/100k/year ages 15-69)
Variable r Prob > |r| N
CVDf1 . . 0
CVDf2 . . 0
CVDf3 . . 0
CVDf4 . . 0
SBP17msW 0.19492 <.0001 7846
sex_IDsW -0.39527 <.0001 7846
Pearson Correlation Statistics (Fisher's z Transformation)
Variable With Variable N Sample Correlation Fisher's z Bias Adjustment Correlation Estimate 95% Confidence Limits p Value for H0:Rho=0
CVDf1 CVD2017m 0 . . . . . . .
CVDf2 CVD2017m 0 . . . . . . .
CVDf3 CVD2017m 0 . . . . . . .
CVDf4 CVD2017m 0 . . . . . . .
SBP17msW CVD2017m 7846 0.19492 0.19745 0.0000124 0.19491 0.173531 0.216106 <.0001
sex_IDsW CVD2017m 7846 -0.39527 -0.41803 -0.0000252 -0.39525 -0.413755 -0.376410 <.0001
The CORR Procedure
1 With Variables: CVD2017m
3 Variables: CVDf5 SBP17msW sex_IDsW
Simple Statistics
Variable N Mean Std Dev Sum Minimum Maximum Label
CVD2017m 7846 543.66067 288.00939 4265562 73.47499 1844 CVD/100k/year ages 15-69
CVDf5 7846 2.6151E-12 18.11747 2.05182E-8 -48.22646 83.18385 Final CVD risk factor formula
SBP17msW 7846 4.9045E-13 1.00000 3.84807E-9 -2.43011 3.23505 Systolic BP mm Hg
sex_IDsW 7846 0 1.00000 0 -0.99994 0.99994 Sex male 1 and female 2
Pearson Correlation Coefficients, N = 7846
Prob > |r| under H0: Rho=0
With variable: CVD2017m (CVD/100k/year ages 15-69)
Variable r Prob > |r|
CVDf5 0.65244 <.0001
SBP17msW 0.19492 <.0001
sex_IDsW -0.39527 <.0001
Pearson Correlation Statistics (Fisher's z Transformation)
Variable With Variable N Sample Correlation Fisher's z Bias Adjustment Correlation Estimate 95% Confidence Limits p Value for H0:Rho=0
CVDf5 CVD2017m 7846 0.65244 0.77953 0.0000416 0.65241 0.639517 0.664941 <.0001
SBP17msW CVD2017m 7846 0.19492 0.19745 0.0000124 0.19491 0.173531 0.216106 <.0001
sex_IDsW CVD2017m 7846 -0.39527 -0.41803 -0.0000252 -0.39525 -0.413755 -0.376410 <.0001
Log:
217 *CVD formula R2=0.4236;
218 data work.source;
219 set projects.source;
220 CVDf5=
221 -pmeat17KCsW*0.09
222 -rmeat17KCsW*1.14
223 -fish17KCsW*0.17
224 -milk17KCsW*0.39
225 -poultry16KCsW*1.54
226 -eggs16kcsW*1.23
227 -SFA16KCsW*2.10
228 -PUFA17KCsW*1.56
229 -TFA17KCsW*0.03
230 -ALCOHOL17KCsW*0.13
231 +Sugarb17KCsW*1.59
232 +potatoes16KCsW*0.09
233 -corn16kcsW*0.06
234 -fruits17KCsW*2.12
235 -vegetables17KCsW*0.38
236 -nutsseeds17KCsW*0.27
237 -wgrains17KCsW*0.88
238 -legumes17kcsW*0.01
239 +rice16kcsW*0.00
240 -swtpot16kcsW*0.27
241 +smoke17msW*8.45
242 +SLTobacco17msW*2.57
243 +kidneydz17msW*0.98
244 +T1DM17msW*3.80
245 +T2DM17msW*2.85
246 +SBP17msW*4.83
247 -sex_IDsW*4.84
248
249 ;
250 run;
NOTE: There were 7846 observations read from the data set PROJECTS.SOURCE.
NOTE: The data set WORK.SOURCE has 7846 observations and 1273 variables.
NOTE: DATA statement used (Total process time):
real time 0.11 seconds
user cpu time 0.02 seconds
system cpu time 0.10 seconds
memory 5043.34k
OS Memory 68032.00k
Timestamp 04/30/2021 02:16:01 AM
Step Count 299 Switch Count 15
Page Faults 0
Page Reclaims 743
Page Swaps 0
Voluntary Context Switches 51
Involuntary Context Switches 0
Block Input Operations 0
Block Output Operations 167952
250 ! quit;
251
252 proc corr data=work.source fisher;
253 label
254 CVDf1=Combination of 20 diet risk factors
255 CVDf2= Kidney Dz, Smoking tobacco, sublingual tobacco
WARNING: Variable CVDF1 not found in data set WORK.SOURCE.
256 CVDf3=Types 1 and 2 DM
WARNING: Variable CVDF2 not found in data set WORK.SOURCE.
257 CVDf4=Mult reg CVDf1 CVDf2 CVDf3 SBP sex
WARNING: Variable CVDF3 not found in data set WORK.SOURCE.
258 CVDf5=Final CVD risk factor formula;
WARNING: Variable CVDF4 not found in data set WORK.SOURCE.
259 var CVDf5 SBP17msW sex_IDsW;
260 with CVD2017m;
261 run;
NOTE: PROCEDURE CORR used (Total process time):
real time 0.08 seconds
user cpu time 0.06 seconds
system cpu time 0.02 seconds
memory 2373.78k
OS Memory 64952.00k
Timestamp 04/30/2021 02:16:01 AM
Step Count 300 Switch Count 13
Page Faults 0
Page Reclaims 246
Page Swaps 0
Voluntary Context Switches 36
Involuntary Context Switches 0
Block Input Operations 0
Block Output Operations 8
261 ! quit;
262
263 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
275