Solved: Re: Use multiple SET STATEMENTS with multiple datasets to read data, b...

odahviing · Posted 07-09-2025 02:25 AM

I try to understand the mechanism of SET statement in DATA step by following code:

data c1;
set sashelp.class;
data c2;
set c1;
keep name;
if _n_<=3;
data c3;
set c1;
keep age;
if _n_<=3;
run;

data c_out;
put 'initial' @10 _all_;
set c2 c3;
put 'mid' @10 _all_;
set c3 c2;
put 'last' @10 _all_;
run;

I find it hard to explain why the first and the third obs has missing value(name at obs 1 and age at obs 4). Maybe it's caused by SAS reset all variables in PDV to missing value every time SET reads a new dataset and finishes reading a dataset?

Quentin · Posted 07-09-2025 09:20 AM

I agree with my good friend @yabwon more than copilot, but still want to add something / try explaining differently.

When you use a SET statement, the variables read in on that SET statement are automatically retained.

When you use a SET statement to concatenate datasets e.g.:

set c2 c3;

you would not want a value read from the last record of c2 to be retained, and be applied to every record in c3. Therefore the SET statement sensibly has a rule: whenever it reads from a new dataset, it resets the variables read by the SET statement to missing before reading the first record of the new dataset. (It doesn't actually reset the full PDV).

Note that the rule about when to reset variables to missing applies to each SET statement independently.

I would say:

During compile time, both SET statements are compiled, and both SET statements are going to read in the variables NAME and AGE. So they are retained.

_N_=1

1) Before reading the first record from C2, the first SET statement resets NAME and AGE to missing. Then it reads the first record of C2, which gives Name the value 'Alfred'.

2) Before reading the first record from C3, the second SET statement resets NAME and AGE to missing. Then it reads the first record of C3, which gives Age the value 14.

_N_=2

1) The first SET statement will read the second record from C2. Because this is not a new dataset (i.e. the prior record was also read from C2), no variables are reset to missing. The SET statement reads the second record of C2, gives Name the value 'Alice.'

2) the second SET statement will read the second record from C3. Because this is not a new dataset (i.e. the prior record was also read from C3), no variables are reset to missing. The SET statement reads the second record of C3, gives Age the value 13.

_N_=3

1) The first SET statement will read the third record from C2. Because this is not a new dataset (i.e. the prior record was also read from C2), no variables are reset to missing. The SET statement reads the second record of C2, gives Name the value 'Barbara.'

2) the second SET statement will read the third record from C3. Because this is not a new dataset (i.e. the prior record was also read from C3), no variables are reset to missing. The SET statement reads the third record of C3, gives Age the value 13.

_N_=4

1) The first SET statement will read the first record from C3. Because this *is* a new dataset (i.e. the prior record was read from C2), Name and Age are reset to missing. The SET statement reads the first record of C3, gives Age the value 14.

2) the second SET statement will read the first record from C2. Because this *is* a new dataset (i.e. the prior record was read from C3), Name and Age are set to missing. The SET statement reads the first record of C2, gives Name the value 'Alfred'.

View solution in original post

Patrick · Posted 07-09-2025 03:04 AM

@odahviing I simply copy/pasted your question as-is into Copilot. I've only "browsed" through the answer but it looks right to me and provides certainly much more detail than I could - and certainly explains things in much more detail than I could.

You're on the right track! The behaviour you're observing is indeed related to how the Program Data Vector (PDV) is managed in SAS when using the SET statement.

Here's a step-by-step explanation:

Creating c1, c2, and c3:
- c1 is created by copying all observations from sashelp.class.
- c2 is created by copying the first 3 observations from c1 and keeping only the name variable.
- c3 is created by copying the first 3 observations from c1 and keeping only the age variable.
Creating c_out:
- The PUT statement outputs the current values of all variables in the PDV.
- The first SET statement reads from c2 and c3. When it reads from c2, it initializes the name variable and sets age to missing. When it reads from c3, it initializes the age variable and sets name to missing.
- The second SET statement reads from c3 and c2. When it reads from c3, it initializes the age variable and sets name to missing. When it reads from c2, it initializes the name variable and sets age to missing.

The key point is that each SET statement reinitializes the PDV, setting variables not in the current dataset to missing. This is why you see missing values for name in the first observation and age in the fourth observation.

Here's a simplified breakdown of the c_out dataset creation:

First Observation:
- SET c2 c3: Reads from c2, initializes name, sets age to missing.
- SET c3 c2: Reads from c3, initializes age, sets name to missing.
Second Observation:
- SET c2 c3: Reads from c2, initializes name, sets age to missing.
- SET c3 c2: Reads from c2, initializes name, sets age to missing.
Third Observation:
- SET c2 c3: Reads from c3, initializes age, sets name to missing.
- SET c3 c2: Reads from c3, initializes age, sets name to missing.
Fourth Observation:
- SET c2 c3: Reads from c3, initializes age, sets name to missing.
- SET c3 c2: Reads from c2, initializes name, sets age to missing.

I hope this clarifies the behavior you're seeing! If you have any more questions or need further explanation, feel free to ask.

odahviing · Posted 07-09-2025 03:33 AM

Well, in fact, I still feel confused for the explanation "The key point is that each SET statement reinitializes the PDV, setting variables not in the current dataset to missing.". Because when I submit this code : data c_out; set c2; set c3; run; when processing set c3, variable name, which is not in c3, doesn't reset to missing. I think this contradict with what AI said. and the explanation of iterations also doesn't make sense, SAS doesn't set name and age to miss in iteration 2 and 3.

Kurt_Bremser · Posted 07-09-2025 05:52 AM

When you do

set c2 c3;

the DATA step compiler creates a read which reads all variables contained in either dataset; when this read then executes, it will "read" all variables not contained in the current dataset as missing. That's why it seems that values are not retained; in fact, they are retained, but overwritten with missing values.

PS after testing it is revealed that this only happens in the first iteration where a new dataset is read in a particular SET statement.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

yabwon · Posted 07-09-2025 05:11 AM

@Patrick The Copilot answer does not seem to be 100% correct.

From the explanation one may have impression that setting a variable not existing in a data set to missing is happening with every data read, but it happens only before first read.

Using C2, C3, and C2 again:

data XXX;
  put 'initial' @10 _all_;
  set c2 c3 c2 indsname=indsname;
  put 'mid' @10 _all_;
  if _N_=5 then do; name="XYZ";age=100; end;
  put 'last' @10 _all_;
  output;
  put;
run;

We can clearly see that after setting name to XYZ for _N_=5 the value remains in PDV not changed - retained - until we start reading C2 for the second time, log:

1          data XXX;
2            put 'initial' @10 _all_;
3            set c2 c3 c2 indsname=indsname;
4            put 'mid' @10 _all_;
5            if _N_=5 then do; name="XYZ";age=100; end;
6            put 'last' @10 _all_;
7            output;
8            put;
9          run;

initial  indsname=  Name=  Age=. _ERROR_=0 _N_=1
mid      indsname=WORK.C2 Name=Alfred Age=. _ERROR_=0 _N_=1
last     indsname=WORK.C2 Name=Alfred Age=. _ERROR_=0 _N_=1

initial  indsname=WORK.C2 Name=Alfred Age=. _ERROR_=0 _N_=2
mid      indsname=WORK.C2 Name=Alice Age=. _ERROR_=0 _N_=2
last     indsname=WORK.C2 Name=Alice Age=. _ERROR_=0 _N_=2

initial  indsname=WORK.C2 Name=Alice Age=. _ERROR_=0 _N_=3
mid      indsname=WORK.C2 Name=Barbara Age=. _ERROR_=0 _N_=3
last     indsname=WORK.C2 Name=Barbara Age=. _ERROR_=0 _N_=3

initial  indsname=WORK.C2 Name=Barbara Age=. _ERROR_=0 _N_=4
mid      indsname=WORK.C3 Name=  Age=14 _ERROR_=0 _N_=4
last     indsname=WORK.C3 Name=  Age=14 _ERROR_=0 _N_=4

initial  indsname=WORK.C3 Name=  Age=14 _ERROR_=0 _N_=5
mid      indsname=WORK.C3 Name=  Age=13 _ERROR_=0 _N_=5
last     indsname=WORK.C3 Name=XYZ Age=100 _ERROR_=0 _N_=5

initial  indsname=WORK.C3 Name=XYZ Age=100 _ERROR_=0 _N_=6
mid      indsname=WORK.C3 Name=XYZ Age=13 _ERROR_=0 _N_=6
last     indsname=WORK.C3 Name=XYZ Age=13 _ERROR_=0 _N_=6

initial  indsname=WORK.C3 Name=XYZ Age=13 _ERROR_=0 _N_=7
mid      indsname=WORK.C2 Name=Alfred Age=. _ERROR_=0 _N_=7
last     indsname=WORK.C2 Name=Alfred Age=. _ERROR_=0 _N_=7

initial  indsname=WORK.C2 Name=Alfred Age=. _ERROR_=0 _N_=8
mid      indsname=WORK.C2 Name=Alice Age=. _ERROR_=0 _N_=8
last     indsname=WORK.C2 Name=Alice Age=. _ERROR_=0 _N_=8

initial  indsname=WORK.C2 Name=Alice Age=. _ERROR_=0 _N_=9
mid      indsname=WORK.C2 Name=Barbara Age=. _ERROR_=0 _N_=9
last     indsname=WORK.C2 Name=Barbara Age=. _ERROR_=0 _N_=9

initial  indsname=WORK.C2 Name=Barbara Age=. _ERROR_=0 _N_=10
NOTE: There were 3 observations read from the data set WORK.C2.
NOTE: There were 3 observations read from the data set WORK.C3.
NOTE: There were 3 observations read from the data set WORK.C2.
NOTE: The data set WORK.XXX has 9 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds

I think the process goes a bit differently (log below):

0) Compilation phase sets up PDV with NAME and AGE and initialise them to missing. In this case the RETAIN statement added by me just replaces initialization values with "#########" and 666 just to highlight what will happen in the next step.

_N_=1

1) the first set statement want to read C2 (see value indsname=set1), but it sees that AGE is not present in C2 so changes it to missing, and next reads first observation from C2 (obs1=1), what makes NAME=Alfred.

2) the second set statement want to read C3 (see value indsname=set2), but it sees that MANE is not present in C3 so changes it to missing, and next reads first observation from C3 (obs2=1), what makes AGE=14.

3) content of PDV is outputed: NAME="" AGE=14.

_N_=2

0) values of PDV variables are retained,

1) the first set statement reads C2 (see value indsname=set1), variable AGE is not present in C2 so its value stays unchanged, and observation 2 from C2 (obs1=2) is read, what makes NAME=Alice.

2) the second set statement reads C3 (see value indsname=set2), variable NAME is not present in C3 so its value stays unchanged, and observation 2 from C3(obs1=2) is read, what makes AGE=13.

3) content of PDV is outputed: NAME="Alice" AGE=13.

_N_=3

0) values of PDV variables are retained,

1) the first set statement reads C2 (see value indsname=set1), variable AGE is not present in C2 so its value stays unchanged, and observation 3 from C2 (obs1=3) is read, what makes NAME=Barbara.

2) the second set statement reads C3 (see value indsname=set2), variable NAME is not present in C3 so its value stays unchanged, and observation 3 from C3(obs1=3) is read, what makes AGE=13 (this is new 13, because values are 14,13,13, would be more visible if third value be for example 12).

3) content of PDV is outputed: NAME="Barbara" AGE=13.

_N_=4

0) values of PDV variables are retained,

1) the first set statement want to read C3 (see value indsname=set1), but it sees that NAME is not present in C3 so changes it to missing (from previous value "Barbara"), and next reads first observation from C3 (obs1=1), what makes AGE=14.

2) the second set statement want to read C2 (see value indsname=set2), but it sees that AGE is not present in C2 so changes it to missing, and next reads first observation from C2 (obs2=1), what makes NAME=Alfred.

3) content of PDV is outputed: NAME="Alfred" AGE=.

_N_=5

... etc. etc. ...

1          %let putForm=@10 name= $char8. @24 age= 8. @32 obs1= @40 set1=$char8. @54 obs2= @62 set2=$char8. @76_N_=;
2          data c_out;
3          length name $ 8 age 8;
4          retain name "#########" age 666;
5          
6          put 'initial' &putForm.;
7          set c2 c3 curobs=obs1 indsname=set1;
8          put 'mid' &putForm.;
9          set c3 c2 curobs=obs2 indsname=set2;
10         put 'last' &putForm.;
11         put;
12         run;

initial  name=######## age=666 obs1=.  set1=         obs2=.  set2=         _N_=1
mid      name=Alfred   age=.   obs1=1  set1=WORK.C2  obs2=.  set2=         _N_=1
last     name=         age=14  obs1=1  set1=WORK.C2  obs2=1  set2=WORK.C3  _N_=1

initial  name=         age=14  obs1=1  set1=WORK.C2  obs2=1  set2=WORK.C3  _N_=2
mid      name=Alice    age=14  obs1=2  set1=WORK.C2  obs2=1  set2=WORK.C3  _N_=2
last     name=Alice    age=13  obs1=2  set1=WORK.C2  obs2=2  set2=WORK.C3  _N_=2

initial  name=Alice    age=13  obs1=2  set1=WORK.C2  obs2=2  set2=WORK.C3  _N_=3
mid      name=Barbara  age=13  obs1=3  set1=WORK.C2  obs2=2  set2=WORK.C3  _N_=3
last     name=Barbara  age=13  obs1=3  set1=WORK.C2  obs2=3  set2=WORK.C3  _N_=3

initial  name=Barbara  age=13  obs1=3  set1=WORK.C2  obs2=3  set2=WORK.C3  _N_=4
mid      name=         age=14  obs1=1  set1=WORK.C3  obs2=3  set2=WORK.C3  _N_=4
last     name=Alfred   age=.   obs1=1  set1=WORK.C3  obs2=1  set2=WORK.C2  _N_=4

initial  name=Alfred   age=.   obs1=1  set1=WORK.C3  obs2=1  set2=WORK.C2  _N_=5
mid      name=Alfred   age=13  obs1=2  set1=WORK.C3  obs2=1  set2=WORK.C2  _N_=5
last     name=Alice    age=13  obs1=2  set1=WORK.C3  obs2=2  set2=WORK.C2  _N_=5

initial  name=Alice    age=13  obs1=2  set1=WORK.C3  obs2=2  set2=WORK.C2  _N_=6
mid      name=Alice    age=13  obs1=3  set1=WORK.C3  obs2=2  set2=WORK.C2  _N_=6
last     name=Barbara  age=13  obs1=3  set1=WORK.C3  obs2=3  set2=WORK.C2  _N_=6

initial  name=Barbara  age=13  obs1=3  set1=WORK.C3  obs2=3  set2=WORK.C2  _N_=7
NOTE: There were 3 observations read from the data set WORK.C2.
NOTE: There were 3 observations read from the data set WORK.C3.
NOTE: There were 3 observations read from the data set WORK.C3.
NOTE: There were 3 observations read from the data set WORK.C2.
NOTE: The data set WORK.C_OUT has 6 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

odahviing · Posted 07-09-2025 08:26 PM

Indeed, this explanations is more reasonable. I now know how to use indsname= & curobs= to make the logic of SAS clear, and a better method to show results I want to know in log, thanks!

Quentin · Posted 07-09-2025 09:20 AM

I agree with my good friend @yabwon more than copilot, but still want to add something / try explaining differently.

When you use a SET statement, the variables read in on that SET statement are automatically retained.

When you use a SET statement to concatenate datasets e.g.:

set c2 c3;

you would not want a value read from the last record of c2 to be retained, and be applied to every record in c3. Therefore the SET statement sensibly has a rule: whenever it reads from a new dataset, it resets the variables read by the SET statement to missing before reading the first record of the new dataset. (It doesn't actually reset the full PDV).

Note that the rule about when to reset variables to missing applies to each SET statement independently.

I would say:

During compile time, both SET statements are compiled, and both SET statements are going to read in the variables NAME and AGE. So they are retained.

_N_=1

1) Before reading the first record from C2, the first SET statement resets NAME and AGE to missing. Then it reads the first record of C2, which gives Name the value 'Alfred'.

2) Before reading the first record from C3, the second SET statement resets NAME and AGE to missing. Then it reads the first record of C3, which gives Age the value 14.

_N_=2

1) The first SET statement will read the second record from C2. Because this is not a new dataset (i.e. the prior record was also read from C2), no variables are reset to missing. The SET statement reads the second record of C2, gives Name the value 'Alice.'

2) the second SET statement will read the second record from C3. Because this is not a new dataset (i.e. the prior record was also read from C3), no variables are reset to missing. The SET statement reads the second record of C3, gives Age the value 13.

_N_=3

1) The first SET statement will read the third record from C2. Because this is not a new dataset (i.e. the prior record was also read from C2), no variables are reset to missing. The SET statement reads the second record of C2, gives Name the value 'Barbara.'

2) the second SET statement will read the third record from C3. Because this is not a new dataset (i.e. the prior record was also read from C3), no variables are reset to missing. The SET statement reads the third record of C3, gives Age the value 13.

_N_=4

1) The first SET statement will read the first record from C3. Because this *is* a new dataset (i.e. the prior record was read from C2), Name and Age are reset to missing. The SET statement reads the first record of C3, gives Age the value 14.

2) the second SET statement will read the first record from C2. Because this *is* a new dataset (i.e. the prior record was read from C3), Name and Age are set to missing. The SET statement reads the first record of C2, gives Name the value 'Alfred'.

odahviing · Posted 07-09-2025 08:32 PM

"whenever it reads from a new dataset, it resets the variables read by the SET statement to missing before reading the first record of the new dataset. (It doesn't actually reset the full PDV)." very very simple and concise, I'll write it down in my SAS learning notebook. thanks

Use multiple SET STATEMENTS with multiple datasets to read data, but output missing value.

Re: Use multiple SET STATEMENTS with multiple datasets to read data, but output missing value.

Re: Use multiple SET STATEMENTS with multiple datasets to read data, but output missing value.

Re: Use multiple SET STATEMENTS with multiple datasets to read data, but output missing value.

Re: Use multiple SET STATEMENTS with multiple datasets to read data, but output missing value.

Re: Use multiple SET STATEMENTS with multiple datasets to read data, but output missing value.

Re: Use multiple SET STATEMENTS with multiple datasets to read data, but output missing value.

Re: Use multiple SET STATEMENTS with multiple datasets to read data, but output missing value.

Re: Use multiple SET STATEMENTS with multiple datasets to read data, but output missing value.

The 2025 SAS Hackathon has begun!

SAS Training: Just a Click Away