Solved: Re: strange behavior SET statement

karolis_b · Posted 04-05-2024 04:07 AM

I'm facing strange behavior of SET statement when reading more than one SAS dataset with different column names. As per documentation, in execution phase variables values are set to missing in PDV. However, it seems that is not a case in the scenario provided below. Would some one able to explain more detailed what is happening?

data M(drop=Height) F(drop=Weight);
set sashelp.class;
if sex='M' then output M;
else if sex='F' then output F;
run;

data result;
set M F;
if age < 13 then do;
   Height = 1;
   Weight = 1;
end;
run;

I expect to have Height and Weights values =1 only where age < 13. However, Height fills to all rows from M table after first condition was correct and same with Weight from F table. I understand that I could explicitly call missing those columns after SET statement, but why it's not done automatically?

https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/basess/n053a58fwk57v7n14h8x7y7u34y4.htm

FreelanceReinh · Posted 04-05-2024 05:05 AM

Hello @karolis_b,

See the more detailed description in section "The Execution Phase" of "How the DATA Step Processes Data" and also sections Default DATA Step Behavior and Redundancy in the documentation of the RETAIN statement.

Since HEIGHT and WEIGHT are read with the SET statement, they are retained. If the dataset currently being read does not supply values for one of these variables, the value assigned by the most recent assignment statement is not overwritten, so it is repeated. There is one detail, though, that I think is missing in the documentation linked above: When SET starts reading from a new dataset (in your example: dataset F), values from the previous dataset (here: dataset M) are not retained (edit: more precisely: they are reset to missing [the actual "retaining" from one iteration of the DATA step to the next does occur; it just occurs before the SET statement is executed]). Otherwise, in your example, William's weight would be carried over into Alice's observation, but this is not the case.

View solution in original post

karolis_b · Posted 04-05-2024 04:21 AM

it seems that PDV is created with all variables from SET statement datasets, but only those are set missing, which are available at the pointer. So if the program reads observation from table M, only variables from M table are set to missing.

Ksharp · Posted 04-05-2024 04:45 AM

ANY variable from dataset would be retained .

althought 'height' not in M dataset, but in F dataset, therefore 'height' would be retained, and you would see "Height fills to all rows from M table after first condition was correct".

Same thing happen to variable 'weight'.

ballardw · Posted 04-05-2024 04:54 AM

I expect to have Height and Weights values =1 only where age < 13

Why do you expect that? Data set M every observation has values for Weight. Every observation in F has values for Height.

Except for ages less than 13 then the observation will have the value it had from either M or F.

When you use a single SET statement with 2 or more data sets then you have told SAS to "stack" or append the second (and third or fourth or ...) observations after the previous set's values are brought in. The automatic "missing" values are only applied to the observations that did not have the variable in the source set. Otherwise it keeps the value of the variable.

It might help if showed exactly what you expected to see as the output. Then we could show code to replicate that.

Note: multiple SET statements can be even more confusing.

karolis_b · Posted 04-05-2024 05:07 AM

I expect having Height=1 only where age < 13.

At the moment I get as provided below. As I've mentioned, how to achieve the desired result I do know. I'm more interested how come that the values are retained.

FreelanceReinh · Posted 04-05-2024 05:05 AM

Hello @karolis_b,

See the more detailed description in section "The Execution Phase" of "How the DATA Step Processes Data" and also sections Default DATA Step Behavior and Redundancy in the documentation of the RETAIN statement.

Since HEIGHT and WEIGHT are read with the SET statement, they are retained. If the dataset currently being read does not supply values for one of these variables, the value assigned by the most recent assignment statement is not overwritten, so it is repeated. There is one detail, though, that I think is missing in the documentation linked above: When SET starts reading from a new dataset (in your example: dataset F), values from the previous dataset (here: dataset M) are not retained (edit: more precisely: they are reset to missing [the actual "retaining" from one iteration of the DATA step to the next does occur; it just occurs before the SET statement is executed]). Otherwise, in your example, William's weight would be carried over into Alice's observation, but this is not the case.

karolis_b · Posted 04-05-2024 05:16 AM

Thank you, I believe I got confused with this, where I thought all variables in PDV are set to missing before reading the next observation

How the DATA Step Works: A Basic Introduction

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!