dkcundiffMD
Quartz | Level 8

SAS on Demand for Academics

 

I partitioned a large risk factor and health outcome analysis database into a subset of interest and standardized the dietary and other variables. I then tried to run a multiple regression with non-communicable disease early deaths as the dependent variable, using the dietary variables combined into one variable together with the non-dietary variables. Using the standardized risk factors and the non-communicable disease early deaths data, I ran proc corr to check that all the standardized variables were present. They were not found.

 

I have used combined dietary variables before in multiple regression equations. Why does it not work here?

 

 
Accepted Solution
FreelanceReinh
Jade | Level 19

Hello @dkcundiffMD,

 

The DATA step

data source2;
  set Projects.source;
  where (NCD17m2 < 1070.22659 and afoods2 < 400) or afoods2 < 149;
  NCD17m2f1 =
    - pmeat17KC2s * 1.45735 * 0.159272828
    - rmeat17KC2s * 17.0053 * 0.197695837
    ...
  ;
run;

must have caused a lot of notes in the log saying

NOTE: Variable pmeat17KC2s is uninitialized.
NOTE: Variable rmeat17KC2s is uninitialized.
...

and

NOTE: Missing values were generated as a result of performing an operation on missing values.
      Each place is given by: (Number of times) at (Line):(Column).
...

because none of those standardized variables are contained in dataset PROJECTS.SOURCE, which the SET statement reads. They were defined in a previous DATA step that created a different dataset, the WORK dataset SOURCE2.

So, the above DATA step must read that SOURCE2 dataset and should ideally write the results to a dataset with a different name to avoid overwriting the previous one (and the risk of confusion). But first you need to rerun the DATA step creating the standardized variables, because the step above has now overwritten SOURCE2. Better yet, check if PROC STDIZE can compute the standardized variables with much less code and without copying (rounded!) descriptive statistics into the code window.

 

The structure of the code could be like this:

/* Create the analysis subset */
data source2;
  set Projects.source;
  where (NCD17m2 < 1070.22659 and afoods2 < 400) or afoods2 < 149;
run;

/* Create standardized variables */
proc stdize data=source2 out=source2s sprefix=s_ method=...;
  var NCD17m2 Pmeat17KC2 ...;
run;

/* Compute correlations */
proc corr data=source2s;
  ...;
run;

/* Combine risk factors */
data source2c;
  set source2s;
  NCD17m2f1 = ...;
run;

/* Compute more correlations */
proc corr data=source2c;
  ...;
run;

Most likely the DATA step creating dataset SOURCE2C could be modified to read not only SOURCE2S, but also an output dataset from the statistical procedure (PROC CORR?) that computed the statistics used in the formula for variable NCD17m2f1, so that those values do not have to be copied into the code by hand.
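
As a rough illustration of that idea, the sketch below saves correlations in a dataset and reads them into a DATA step. It rests on several assumptions: SASHELP.CLASS stands in for the real data, Age plays the role of the outcome, and the statistics are taken to be Pearson correlations (the thread does not say which statistics the formula actually uses). The names CORR_OUT, CLASS_C, r_height, r_weight and combo are hypothetical.

```sas
/* Save the correlation matrix in a dataset instead of copying numbers by hand. */
/* SASHELP.CLASS is a stand-in for the real data (assumption).                  */
proc corr data=sashelp.class outp=corr_out noprint;
   var age height weight;
run;

/* Read the row of correlations with Age once; variables read by SET are        */
/* retained, so r_height and r_weight are available for every observation.      */
data class_c;
   if _n_ = 1 then
      set corr_out(where=(_type_ = 'CORR' and upcase(_name_) = 'AGE')
                   keep=_type_ _name_ height weight
                   rename=(height=r_height weight=r_weight));
   set sashelp.class;
   /* Hypothetical combined variable: a correlation-weighted sum. */
   combo = r_height * height + r_weight * weight;
   drop _type_ _name_;
run;
```

Reading the statistics from a dataset keeps their full precision and means the formula updates itself whenever the subset changes.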

dkcundiffMD
Quartz | Level 8

That solution worked to get the variables standardized in a new dataset (source3). Thanks.

Now I can't get proc stdize to work. My effort is attached. Being able to standardize many variables quickly would save me many hours.

Thanks again.

FreelanceReinh
Jade | Level 19

By default, the output dataset from PROC STDIZE (i.e. SOURCE3S in your example) uses the original names for the standardized variables: see the description of the OUT= option in the documentation. But you can opt for new names by specifying the SPREFIX= option and for including the original variables in the output dataset by using the OPREFIX= option (alone or together with the SPREFIX= option). In both cases you name a prefix, not a suffix, to be added to the original variable names. Then use the names of the standardized variables to refer to them in later steps (e.g. PROC CORR). To check the variable names you can use PROC CONTENTS and/or PROC PRINT (the latter with an OBS= dataset option such as

proc print data=source3s(obs=5);
run;

to limit the output to a few [here: 5] observations.)
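
A minimal sketch of this naming behavior follows, assuming SASHELP.CLASS as stand-in data and the hypothetical output name CLASS_S. With SPREFIX=s_, the standardized versions of Height and Weight should appear as s_Height and s_Weight next to the originals. The PROC CONTENTS listing is the place to confirm the names before using them.

```sas
/* SASHELP.CLASS stands in for the real data (assumption). */
proc stdize data=sashelp.class out=class_s method=std sprefix=s_;
   var height weight;
run;

/* List the variable names actually present in the output dataset. */
proc contents data=class_s;
run;

/* Refer to the standardized variables by their prefixed names. */
proc corr data=class_s;
   var s_height s_weight;
run;
```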

 

Also note that you don't need to repeat the WHERE condition over and over again. Once the condition has been applied (i.e. in dataset SOURCE2 in the code I suggested), all observations in the dataset necessarily satisfy the condition and applying the WHERE statement again is just redundant -- as long as the variable values are not changed. (If you instruct PROC STDIZE to use the original names for the standardized variables, then their values do change, of course, and you certainly don't want to apply the same WHERE condition that suited the variables before standardization.)

dkcundiffMD
Quartz | Level 8

Now I'm having a problem with the proc contents syntax.

Thanks for your help.

FreelanceReinh
Jade | Level 19

You're welcome. If you just want to see the list of variables in a dataset, say, in dataset SOURCE5, you can use the simple syntax

proc contents data=source5;
run;

Only rarely will you need more options.
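
For the rare cases, here is a small sketch using SASHELP.CLASS as stand-in data (an assumption). PROC CONTENTS has no VAR statement, so if you only want to see certain variables, one option is a KEEP= dataset option.

```sas
/* Restrict the listing to selected variables with a KEEP= dataset option. */
proc contents data=sashelp.class(keep=height weight);
run;

/* VARNUM lists the variables in creation order rather than alphabetically. */
proc contents data=sashelp.class varnum;
run;
```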

dkcundiffMD
Quartz | Level 8
That works! I was putting in variables and getting errors. Thanks.
