I try to understand the mechanism of SET statement in DATA step by following code:
data c1;
set sashelp.class;
data c2;
set c1;
keep name;
if _n_<=3;
data c3;
set c1;
keep age;
if _n_<=3;
run;
data c_out;
put 'initial' @10 _all_;
set c2 c3;
put 'mid' @10 _all_;
set c3 c2;
put 'last' @10 _all_;
run;
I find it hard to explain why the first and the third obs has missing value(name at obs 1 and age at obs 4). Maybe it's caused by SAS reset all variables in PDV to missing value every time SET reads a new dataset and finishes reading a dataset?
I agree with my good friend @yabwon more than copilot, but still want to add something / try explaining differently.
When you use a SET statement, the variables read in on that SET statement are automatically retained.
When you use a SET statement to concatenate datasets e.g.:
set c2 c3;
you would not want a value read from the last record of c2 to be retained, and be applied to every record in c3. Therefore the SET statement sensibly has a rule: whenever it reads from a new dataset, it resets the variables read by the SET statement to missing before reading the first record of the new dataset. (It doesn't actually reset the full PDV).
Note that the rule about when to reset variables to missing applies to each SET statement independently.
I would say:
During compile time, both SET statements are compiled, and both SET statements are going to read in the variables NAME and AGE. So they are retained.
_N_=1
1) Before reading the first record from C2, the first SET statement resets NAME and AGE to missing. Then it reads the first record of C2, which gives Name the value 'Alfred'.
2) Before reading the first record from C3, the second SET statement resets NAME and AGE to missing. Then it reads the first record of C3, which gives Age the value 14.
_N_=2
1) The first SET statement will read the second record from C2. Because this is not a new dataset (i.e. the prior record was also read from C2), no variables are reset to missing. The SET statement reads the second record of C2, gives Name the value 'Alice.'
2) the second SET statement will read the second record from C3. Because this is not a new dataset (i.e. the prior record was also read from C3), no variables are reset to missing. The SET statement reads the second record of C3, gives Age the value 13.
_N_=3
1) The first SET statement will read the third record from C2. Because this is not a new dataset (i.e. the prior record was also read from C2), no variables are reset to missing. The SET statement reads the second record of C2, gives Name the value 'Barbara.'
2) the second SET statement will read the third record from C3. Because this is not a new dataset (i.e. the prior record was also read from C3), no variables are reset to missing. The SET statement reads the third record of C3, gives Age the value 13.
_N_=4
1) The first SET statement will read the first record from C3. Because this *is* a new dataset (i.e. the prior record was read from C2), Name and Age are reset to missing. The SET statement reads the first record of C3, gives Age the value 14.
2) the second SET statement will read the first record from C2. Because this *is* a new dataset (i.e. the prior record was read from C3), Name and Age are set to missing. The SET statement reads the first record of C2, gives Name the value 'Alfred'.
@odahviing I simply copy/pasted your question as-is into Copilot. I've only "browsed" through the answer but it looks right to me and provides certainly much more detail than I could - and certainly explains things in much more detail than I could.
You're on the right track! The behaviour you're observing is indeed related to how the Program Data Vector (PDV) is managed in SAS when using the SET
statement.
Here's a step-by-step explanation:
Creating c1
, c2
, and c3
:
c1
is created by copying all observations from sashelp.class
.c2
is created by copying the first 3 observations from c1
and keeping only the name
variable.c3
is created by copying the first 3 observations from c1
and keeping only the age
variable.Creating c_out
:
PUT
statement outputs the current values of all variables in the PDV.SET
statement reads from c2
and c3
. When it reads from c2
, it initializes the name
variable and sets age
to missing. When it reads from c3
, it initializes the age
variable and sets name
to missing.SET
statement reads from c3
and c2
. When it reads from c3
, it initializes the age
variable and sets name
to missing. When it reads from c2
, it initializes the name
variable and sets age
to missing.The key point is that each SET
statement reinitializes the PDV, setting variables not in the current dataset to missing. This is why you see missing values for name
in the first observation and age
in the fourth observation.
Here's a simplified breakdown of the c_out
dataset creation:
First Observation:
SET c2 c3
: Reads from c2
, initializes name
, sets age
to missing.SET c3 c2
: Reads from c3
, initializes age
, sets name
to missing.Second Observation:
SET c2 c3
: Reads from c2
, initializes name
, sets age
to missing.SET c3 c2
: Reads from c2
, initializes name
, sets age
to missing.Third Observation:
SET c2 c3
: Reads from c3
, initializes age
, sets name
to missing.SET c3 c2
: Reads from c3
, initializes age
, sets name
to missing.Fourth Observation:
SET c2 c3
: Reads from c3
, initializes age
, sets name
to missing.SET c3 c2
: Reads from c2
, initializes name
, sets age
to missing.I hope this clarifies the behavior you're seeing! If you have any more questions or need further explanation, feel free to ask.
When you do
set c2 c3;
the DATA step compiler creates a read which reads all variables contained in either dataset; when this read then executes, it will "read" all variables not contained in the current dataset as missing. That's why it seems that values are not retained; in fact, they are retained, but overwritten with missing values.
PS after testing it is revealed that this only happens in the first iteration where a new dataset is read in a particular SET statement.
@Patrick The Copilot answer does not seem to be 100% correct.
From the explanation one may have impression that setting a variable not existing in a data set to missing is happening with every data read, but it happens only before first read.
Using C2, C3, and C2 again:
data XXX;
put 'initial' @10 _all_;
set c2 c3 c2 indsname=indsname;
put 'mid' @10 _all_;
if _N_=5 then do; name="XYZ";age=100; end;
put 'last' @10 _all_;
output;
put;
run;
We can clearly see that after setting name to XYZ for _N_=5 the value remains in PDV not changed - retained - until we start reading C2 for the second time, log:
1 data XXX;
2 put 'initial' @10 _all_;
3 set c2 c3 c2 indsname=indsname;
4 put 'mid' @10 _all_;
5 if _N_=5 then do; name="XYZ";age=100; end;
6 put 'last' @10 _all_;
7 output;
8 put;
9 run;
initial indsname= Name= Age=. _ERROR_=0 _N_=1
mid indsname=WORK.C2 Name=Alfred Age=. _ERROR_=0 _N_=1
last indsname=WORK.C2 Name=Alfred Age=. _ERROR_=0 _N_=1
initial indsname=WORK.C2 Name=Alfred Age=. _ERROR_=0 _N_=2
mid indsname=WORK.C2 Name=Alice Age=. _ERROR_=0 _N_=2
last indsname=WORK.C2 Name=Alice Age=. _ERROR_=0 _N_=2
initial indsname=WORK.C2 Name=Alice Age=. _ERROR_=0 _N_=3
mid indsname=WORK.C2 Name=Barbara Age=. _ERROR_=0 _N_=3
last indsname=WORK.C2 Name=Barbara Age=. _ERROR_=0 _N_=3
initial indsname=WORK.C2 Name=Barbara Age=. _ERROR_=0 _N_=4
mid indsname=WORK.C3 Name= Age=14 _ERROR_=0 _N_=4
last indsname=WORK.C3 Name= Age=14 _ERROR_=0 _N_=4
initial indsname=WORK.C3 Name= Age=14 _ERROR_=0 _N_=5
mid indsname=WORK.C3 Name= Age=13 _ERROR_=0 _N_=5
last indsname=WORK.C3 Name=XYZ Age=100 _ERROR_=0 _N_=5
initial indsname=WORK.C3 Name=XYZ Age=100 _ERROR_=0 _N_=6
mid indsname=WORK.C3 Name=XYZ Age=13 _ERROR_=0 _N_=6
last indsname=WORK.C3 Name=XYZ Age=13 _ERROR_=0 _N_=6
initial indsname=WORK.C3 Name=XYZ Age=13 _ERROR_=0 _N_=7
mid indsname=WORK.C2 Name=Alfred Age=. _ERROR_=0 _N_=7
last indsname=WORK.C2 Name=Alfred Age=. _ERROR_=0 _N_=7
initial indsname=WORK.C2 Name=Alfred Age=. _ERROR_=0 _N_=8
mid indsname=WORK.C2 Name=Alice Age=. _ERROR_=0 _N_=8
last indsname=WORK.C2 Name=Alice Age=. _ERROR_=0 _N_=8
initial indsname=WORK.C2 Name=Alice Age=. _ERROR_=0 _N_=9
mid indsname=WORK.C2 Name=Barbara Age=. _ERROR_=0 _N_=9
last indsname=WORK.C2 Name=Barbara Age=. _ERROR_=0 _N_=9
initial indsname=WORK.C2 Name=Barbara Age=. _ERROR_=0 _N_=10
NOTE: There were 3 observations read from the data set WORK.C2.
NOTE: There were 3 observations read from the data set WORK.C3.
NOTE: There were 3 observations read from the data set WORK.C2.
NOTE: The data set WORK.XXX has 9 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.01 seconds
I think the process goes a bit differently (log below):
0) Compilation phase sets up PDV with NAME and AGE and initialise them to missing. In this case the RETAIN statement added by me just replaces initialization values with "#########" and 666 just to highlight what will happen in the next step.
_N_=1
1) the first set statement want to read C2 (see value indsname=set1), but it sees that AGE is not present in C2 so changes it to missing, and next reads first observation from C2 (obs1=1), what makes NAME=Alfred.
2) the second set statement want to read C3 (see value indsname=set2), but it sees that MANE is not present in C3 so changes it to missing, and next reads first observation from C3 (obs2=1), what makes AGE=14.
3) content of PDV is outputed: NAME="" AGE=14.
_N_=2
0) values of PDV variables are retained,
1) the first set statement reads C2 (see value indsname=set1), variable AGE is not present in C2 so its value stays unchanged, and observation 2 from C2 (obs1=2) is read, what makes NAME=Alice.
2) the second set statement reads C3 (see value indsname=set2), variable NAME is not present in C3 so its value stays unchanged, and observation 2 from C3(obs1=2) is read, what makes AGE=13.
3) content of PDV is outputed: NAME="Alice" AGE=13.
_N_=3
0) values of PDV variables are retained,
1) the first set statement reads C2 (see value indsname=set1), variable AGE is not present in C2 so its value stays unchanged, and observation 3 from C2 (obs1=3) is read, what makes NAME=Barbara.
2) the second set statement reads C3 (see value indsname=set2), variable NAME is not present in C3 so its value stays unchanged, and observation 3 from C3(obs1=3) is read, what makes AGE=13 (this is new 13, because values are 14,13,13, would be more visible if third value be for example 12).
3) content of PDV is outputed: NAME="Barbara" AGE=13.
_N_=4
0) values of PDV variables are retained,
1) the first set statement want to read C3 (see value indsname=set1), but it sees that NAME is not present in C3 so changes it to missing (from previous value "Barbara"), and next reads first observation from C3 (obs1=1), what makes AGE=14.
2) the second set statement want to read C2 (see value indsname=set2), but it sees that AGE is not present in C2 so changes it to missing, and next reads first observation from C2 (obs2=1), what makes NAME=Alfred.
3) content of PDV is outputed: NAME="Alfred" AGE=.
_N_=5
... etc. etc. ...
1 %let putForm=@10 name= $char8. @24 age= 8. @32 obs1= @40 set1=$char8. @54 obs2= @62 set2=$char8. @76_N_=;
2 data c_out;
3 length name $ 8 age 8;
4 retain name "#########" age 666;
5
6 put 'initial' &putForm.;
7 set c2 c3 curobs=obs1 indsname=set1;
8 put 'mid' &putForm.;
9 set c3 c2 curobs=obs2 indsname=set2;
10 put 'last' &putForm.;
11 put;
12 run;
initial name=######## age=666 obs1=. set1= obs2=. set2= _N_=1
mid name=Alfred age=. obs1=1 set1=WORK.C2 obs2=. set2= _N_=1
last name= age=14 obs1=1 set1=WORK.C2 obs2=1 set2=WORK.C3 _N_=1
initial name= age=14 obs1=1 set1=WORK.C2 obs2=1 set2=WORK.C3 _N_=2
mid name=Alice age=14 obs1=2 set1=WORK.C2 obs2=1 set2=WORK.C3 _N_=2
last name=Alice age=13 obs1=2 set1=WORK.C2 obs2=2 set2=WORK.C3 _N_=2
initial name=Alice age=13 obs1=2 set1=WORK.C2 obs2=2 set2=WORK.C3 _N_=3
mid name=Barbara age=13 obs1=3 set1=WORK.C2 obs2=2 set2=WORK.C3 _N_=3
last name=Barbara age=13 obs1=3 set1=WORK.C2 obs2=3 set2=WORK.C3 _N_=3
initial name=Barbara age=13 obs1=3 set1=WORK.C2 obs2=3 set2=WORK.C3 _N_=4
mid name= age=14 obs1=1 set1=WORK.C3 obs2=3 set2=WORK.C3 _N_=4
last name=Alfred age=. obs1=1 set1=WORK.C3 obs2=1 set2=WORK.C2 _N_=4
initial name=Alfred age=. obs1=1 set1=WORK.C3 obs2=1 set2=WORK.C2 _N_=5
mid name=Alfred age=13 obs1=2 set1=WORK.C3 obs2=1 set2=WORK.C2 _N_=5
last name=Alice age=13 obs1=2 set1=WORK.C3 obs2=2 set2=WORK.C2 _N_=5
initial name=Alice age=13 obs1=2 set1=WORK.C3 obs2=2 set2=WORK.C2 _N_=6
mid name=Alice age=13 obs1=3 set1=WORK.C3 obs2=2 set2=WORK.C2 _N_=6
last name=Barbara age=13 obs1=3 set1=WORK.C3 obs2=3 set2=WORK.C2 _N_=6
initial name=Barbara age=13 obs1=3 set1=WORK.C3 obs2=3 set2=WORK.C2 _N_=7
NOTE: There were 3 observations read from the data set WORK.C2.
NOTE: There were 3 observations read from the data set WORK.C3.
NOTE: There were 3 observations read from the data set WORK.C3.
NOTE: There were 3 observations read from the data set WORK.C2.
NOTE: The data set WORK.C_OUT has 6 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.01 seconds
Bart
I agree with my good friend @yabwon more than copilot, but still want to add something / try explaining differently.
When you use a SET statement, the variables read in on that SET statement are automatically retained.
When you use a SET statement to concatenate datasets e.g.:
set c2 c3;
you would not want a value read from the last record of c2 to be retained, and be applied to every record in c3. Therefore the SET statement sensibly has a rule: whenever it reads from a new dataset, it resets the variables read by the SET statement to missing before reading the first record of the new dataset. (It doesn't actually reset the full PDV).
Note that the rule about when to reset variables to missing applies to each SET statement independently.
I would say:
During compile time, both SET statements are compiled, and both SET statements are going to read in the variables NAME and AGE. So they are retained.
_N_=1
1) Before reading the first record from C2, the first SET statement resets NAME and AGE to missing. Then it reads the first record of C2, which gives Name the value 'Alfred'.
2) Before reading the first record from C3, the second SET statement resets NAME and AGE to missing. Then it reads the first record of C3, which gives Age the value 14.
_N_=2
1) The first SET statement will read the second record from C2. Because this is not a new dataset (i.e. the prior record was also read from C2), no variables are reset to missing. The SET statement reads the second record of C2, gives Name the value 'Alice.'
2) the second SET statement will read the second record from C3. Because this is not a new dataset (i.e. the prior record was also read from C3), no variables are reset to missing. The SET statement reads the second record of C3, gives Age the value 13.
_N_=3
1) The first SET statement will read the third record from C2. Because this is not a new dataset (i.e. the prior record was also read from C2), no variables are reset to missing. The SET statement reads the second record of C2, gives Name the value 'Barbara.'
2) the second SET statement will read the third record from C3. Because this is not a new dataset (i.e. the prior record was also read from C3), no variables are reset to missing. The SET statement reads the third record of C3, gives Age the value 13.
_N_=4
1) The first SET statement will read the first record from C3. Because this *is* a new dataset (i.e. the prior record was read from C2), Name and Age are reset to missing. The SET statement reads the first record of C3, gives Age the value 14.
2) the second SET statement will read the first record from C2. Because this *is* a new dataset (i.e. the prior record was read from C3), Name and Age are set to missing. The SET statement reads the first record of C2, gives Name the value 'Alfred'.
"whenever it reads from a new dataset, it resets the variables read by the SET statement to missing before reading the first record of the new dataset. (It doesn't actually reset the full PDV)." very very simple and concise, I'll write it down in my SAS learning notebook. thanks
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.