Solved: Replacing missing values in the DATA statement that combines multiple ...

DRNelson · Posted 02-17-2022 06:56 PM

This is something I could just avoid, but I would rather try to understand it. EXAMPLE: The early years of the NHANES data do not calculate a ratio of two variables, but later years do, and it is URDACT. If I try to replace the missing in the DATA when I combine them with the SET statement, I get a single value. If I do the same replacement in a DATA step after the combining, I get the correct results. What logic am I missing? THANKS!

data alb_cr;
set dat_.LAB16 dat_.L16_b dat_.L16_c dat_.alb_cr_d /*I removed the names of the rest of the data sets*/... ;

if missing(URDACT) then URDACT=(URXUMA*100/URXUCR);

proc print data=alb_cr(obs=10);
var seqn URDACT;

run;

data alb_cr;
set dat_.LAB16 dat_.L16_b dat_.L16_c dat_.alb_cr_d /*I removed the names of the rest of the data sets*/... ;

data alb_cr;
set alb_cr;
if missing(URDACT) then URDACT=(URXUMA*100/URXUCR);

proc print data=alb_cr(obs=10);
var seqn URDACT;

run;

Obs SEQN URDACT

1 2 6.27586
2 3 6.27586
3 5 6.27586
4 6 6.27586
5 7 6.27586
6 8 6.27586
7 9 6.27586
8 10 6.27586
9 11 6.27586
10 12 6.27586

Obs SEQN URDACT

1 2 6.2759
2 3 8.2540
3 5 3.5465
4 6 4.0323
5 7 5.2344
6 8 5.0413
7 9 4.1379
8 10 3.9785
9 11 2.6506

Astounding · Posted 02-17-2022 10:07 PM

Here's the nature of the "feature" you are observing.

Any variable that comes from a SAS data set is automatically retained. That applies to URDACT. Even though it doesn't exist on every incoming SAS data set, it does exist. So it gets retained.

When you combine several data sets, some containing URDACT and some not, SAS has to decide when to set URDACT to a missing value (overriding this automatic retain feature). So when the SET statement switches from one incoming data set to another, SAS begins with a missing value for URDACT. Then it reads in the data for the next data set.

Here's the tricky part. When the next incoming data set does not contain URDACT, it has a missing value. The IF/THEN condition is true, and the computation takes place assigning URDACT a nonmissing value. That value now gets retained, so on the next observation from the same data, the retained value is still there and URDACT is no longer missing. So the calculated value just sits there, and repeats.

The bottom line: when reading observations from a data set that does not contact URDACT, the software sets URDACT to missing for the first observation only. It does NOT reset URDACT on every observation, just the first one (for each data set).

View solution in original post

ballardw · Posted 02-17-2022 07:35 PM

Ideally you would provide example data to demonstrate the problem. LIke "what single value" are you seeing?

What is the format assigned to URDACT and what does it display when a value is missing?

When you get the same result for a calculation you need to prove that the values URXUMA and URXUCR are actually different before determining if there is an error.

What do those proc print results look like when you do:

proc print data=alb_cr(obs=10);
var seqn URDACT URXUMA URXUCR;

run;

It is much better to show LOG with the code and notes for a question like this so we can see what actually was run plus the notes that might be generated. Because you are "replacing" the same data set so many times if there is an error (that would show in the log) then the set might not actually be replaced when you think it is. Your comment about "removed data sets" might have a syntax issue that we can't see because of incomplete code and the log might reflect that.

I do note that both SEQN=2 records have the same value for the calculation barring a rounding difference (one reason to ask about FORMATS assigned to a variable. The code you show, if there weren't any problems, I would expect proc print to show the same number of decimals if the set statement is identical when using all of those data set.

I haven't worked with NHANES much but I do work with other CDC sponsored data and some of the sets variables might change meaning from year to year as well as coding of values in the data set.

Style comment: Use TITLES with out so we know what we are looking at why it is important. I have to guess that first output.

It really is a good idea to use RUN; statements.

Astounding · Posted 02-17-2022 10:07 PM

Here's the nature of the "feature" you are observing.

Any variable that comes from a SAS data set is automatically retained. That applies to URDACT. Even though it doesn't exist on every incoming SAS data set, it does exist. So it gets retained.

When you combine several data sets, some containing URDACT and some not, SAS has to decide when to set URDACT to a missing value (overriding this automatic retain feature). So when the SET statement switches from one incoming data set to another, SAS begins with a missing value for URDACT. Then it reads in the data for the next data set.

Here's the tricky part. When the next incoming data set does not contain URDACT, it has a missing value. The IF/THEN condition is true, and the computation takes place assigning URDACT a nonmissing value. That value now gets retained, so on the next observation from the same data, the retained value is still there and URDACT is no longer missing. So the calculated value just sits there, and repeats.

The bottom line: when reading observations from a data set that does not contact URDACT, the software sets URDACT to missing for the first observation only. It does NOT reset URDACT on every observation, just the first one (for each data set).

Replacing missing values in the DATA statement that combines multiple data sets in a SET statement

Re: Replacing missing values in the DATA statement that combines multiple data sets in a SET stateme

Re: Replacing missing values in the DATA statement that combines multiple data sets in a SET stateme

Re: Replacing missing values in the DATA statement that combines multiple data sets in a SET stateme

Replacing missing values in the DATA statement that combines multiple data sets in a SET statement

Re: Replacing missing values in the DATA statement that combines multiple data sets in a SET stateme

Re: Replacing missing values in the DATA statement that combines multiple data sets in a SET stateme

Re: Replacing missing values in the DATA statement that combines multiple data sets in a SET stateme

Registration is open

SAS Training: Just a Click Away