Solved: Risk factors combined variable in a multiple regression was not working

dkcundiffMD · Posted 02-01-2024 12:21 AM

SAS on Demand for Academics

I partitioned a large risk factor and health outcome analysis database into a subset of interest and standardized the dietary and other variables. I then tried to run a multiple regression with the dietary variables in one combined variable, plus the non-dietary variables, with non-communicable disease early deaths as the dependent variable. First, I used proc corr to check that all the standardized risk factors and the standardized dependent variable were present in the data. They were not found.

I have used combined dietary variables before in multiple regression equations. Why does it not work here?

FreelanceReinh · Posted 02-01-2024 05:30 AM

Hello @dkcundiffMD,

The DATA step

data source2;
 set Projects.source;
where (NCD17m2 <1070.22659 and afoods2 <400) or afoods2 <149;
  NCD17m2f1=  
- pmeat17KC2s * 1.45735 * 0.159272828
- rmeat17KC2s * 17.0053 * 0.197695837
...
;
run;

must have caused a lot of notes in the log saying

NOTE: Variable pmeat17KC2s is uninitialized.
NOTE: Variable rmeat17KC2s is uninitialized.
...

and

NOTE: Missing values were generated as a result of performing an operation on missing values.
      Each place is given by: (Number of times) at (Line):(Column).
...

because none of those standardized variables are contained in dataset PROJECTS.SOURCE, which the SET statement reads. They were defined in a previous DATA step that created another WORK dataset named SOURCE2.

So, the above DATA step must read that SOURCE2 dataset and should ideally write the results to a dataset with a different name to avoid overwriting the previous one (and the risk of confusion). But first you need to rerun the DATA step creating the standardized variables because now you have overwritten it. Better yet, check if PROC STDIZE can compute the standardized variables with much less code and without copying (rounded!) descriptive statistics into the code window.

The structure of the code could be like this:

/* Create the analysis subset */

data source2;
set Projects.source;
where (NCD17m2 <1070.22659 and afoods2 <400) or afoods2 <149;
run;

/* Create standardized variables */

proc stdize data=source2 out=source2s sprefix=s_ method=...;
var NCD17m2 Pmeat17KC2 ...;
run;

/* Compute correlations */

proc corr data=source2s;
...;
run;

/* Combine risk factors */

data source2c;
set source2s;
NCD17m2f1= ...;
run;

/* Compute more correlations */

proc corr data=source2c;
...;
run;

Most likely the DATA step creating dataset SOURCE2C could be modified to read not only SOURCE2S, but also an output dataset from the statistical procedure (PROC CORR?) which has computed the statistics used in the formula for variable NCD17m2f1.
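
A rough sketch of that idea, under loudly stated assumptions: the thread does not show where the numbers in the formula come from, so this example simply uses correlations with the outcome as hypothetical weights. The names corr_out, r_pmeat and r_rmeat are made up for illustration, s_Rmeat17KC2 is an assumed variable name (following the s_ prefix used above), and the formula is a placeholder, not the original poster's.

```sas
/* Sketch only: using correlations as weights is an assumption;        */
/* the thread does not show how the real coefficients were obtained.   */
proc corr data=source2s outp=corr_out noprint;
   var  s_Pmeat17KC2 s_Rmeat17KC2;   /* s_Rmeat17KC2 is an assumed name */
   with s_NCD17m2;
run;

data source2c;
   /* Read the single CORR row once; its values are retained for all rows */
   if _n_ = 1 then set corr_out(where=(_type_='CORR')
                                keep=_type_ s_Pmeat17KC2 s_Rmeat17KC2
                                rename=(s_Pmeat17KC2=r_pmeat
                                        s_Rmeat17KC2=r_rmeat));
   set source2s;
   /* Hypothetical combination; replace with the actual formula */
   NCD17m2f1 = - s_Pmeat17KC2 * r_pmeat - s_Rmeat17KC2 * r_rmeat;
   drop _type_ r_pmeat r_rmeat;
run;
```

The point of the design is that the statistics are read from a dataset at full precision instead of being typed (rounded) into the code window.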

dkcundiffMD · Posted 02-02-2024 12:18 AM

That solution worked to get the variables standardized in a new workspace (source3). Thanks.

Now I can't get proc stdize to work. My effort is attached. It would save me many hours to quickly standardize lots of variables.

Thanks again.

FreelanceReinh · Posted 02-02-2024 05:10 AM

By default, the output dataset from PROC STDIZE (i.e. SOURCE3S in your example) uses the original names for the standardized variables: see the description of the OUT= option in the documentation. But you can opt for new names by specifying the SPREFIX= option and for including the original variables in the output dataset by using the OPREFIX= option (alone or together with the SPREFIX= option). In both cases you name a prefix, not a suffix, to be added to the original variable names. Then use the names of the standardized variables to refer to them in later steps (e.g. PROC CORR). To check the variable names you can use PROC CONTENTS and/or PROC PRINT (the latter with an OBS= dataset option such as

proc print data=source3s(obs=5);
run;

to limit the output to a few [here: 5] observations.)
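
As a rough sketch of the naming behavior described above (the attached code is not shown in this thread, so this is not a correction of it): METHOD=STD and the prefix s_ are assumptions chosen for illustration, SOURCE3 is assumed to hold the unstandardized variables, and the VAR list uses only the two variable names that appear earlier in the thread.

```sas
/* Sketch only: METHOD=STD and the prefix s_ are assumptions */
proc stdize data=source3 out=source3s method=std sprefix=s_;
   var NCD17m2 Pmeat17KC2;   /* extend with the remaining variables */
run;

/* Check which variable names ended up in the output dataset */
proc contents data=source3s;
run;

/* Refer to the standardized variables by their prefixed names */
proc corr data=source3s;
   var s_NCD17m2 s_Pmeat17KC2;
run;
```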

Also note that you don't need to repeat the WHERE condition over and over again. Once the condition has been applied (i.e. in dataset SOURCE2 in the code I suggested), all observations in the dataset necessarily satisfy the condition and applying the WHERE statement again is just redundant -- as long as the variable values are not changed. (If you instruct PROC STDIZE to use the original names for the standardized variables, then their values do change, of course, and you certainly don't want to apply the same WHERE condition that suited the variables before standardization.)

dkcundiffMD · Posted 02-04-2024 01:30 PM

Now I'm having a problem with the PROC CONTENTS syntax.

Thanks for your help.

FreelanceReinh · Posted 02-04-2024 01:53 PM

You're welcome. If you just want to see the list of variables in a dataset, say, in dataset SOURCE5, you can use the simple syntax

proc contents data=source5;
run;

You'll rarely need more options.
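
For completeness, a small sketch of two such options. PROC CONTENTS has no VAR statement, so a variable list inside the procedure produces errors; if a restricted listing is wanted, one possibility is the KEEP= dataset option. The prefix s_ here is an assumption (it only works if SOURCE5 actually contains variables whose names start with s_).

```sas
/* Sketch: list only variables whose names start with s_ (assumed prefix) */
proc contents data=source5(keep=s_:);
run;

/* Sketch: list all variables in creation order instead of alphabetically */
proc contents data=source5 varnum;
run;
```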

dkcundiffMD · Posted 02-04-2024 06:24 PM

That works! I was putting in variables and getting errors. Thanks.
