Hi,
I really don't understand why those 2 datasteps are acting differently.
I am using SAS 9.4.
It seems that when I declare variables using dataset with OBS=0 it declares all variables as RETAIN variables.
Let's say we have 2 datasets:
data d1;
infile datalines;
input ID Name1$30.;
datalines;
1 john
3 mike
4 tony
2 isaac
;
data d2;
infile datalines;
input ID Name2$30.;
datalines;
9 marc
2 blake
;
And then I have 2 options of declaring variables from DATA2.
DATA new;
SET d1;
FORMAT name2 $30.;
	
	IF _N_ = 2 THEN DO;
		name2 = 'new name';
	END;
RUN;
DATA new2;
SET d1 d2(obs=0);
	
	IF _N_ = 2 THEN DO;
		name2 = 'new name2';
	END;
RUN;
I get following results:
Dataset NEW:
ID Name1 name2
1 john
3 mike new name
4 tony
2 isaac
But dataset NEW2:
ID Name2 name2
1 john
3 mike new name2
4 tony new name2
2 isaac new name2
My question is WHY is second option (using obs=0) of declaration of variables acting as if I would define variables as RETAIN?
Is that a SAS feature or am I missing something here?
The reason I am using obs=0 is sometimes I want to add many variables and instead of listing all of them this seems to be a fast way but I do not want them to be as a RETAIN variables.
Any explanation would be much appreciated.
All variables that come from datasets are retained (which really means they are not set to missing when a new iteration starts). The values of the variables are only changed when the next observation is read. That is how SAS is able to do a one many merge.
Note that when the dataset does not contribute to a new BY group then the variables are set to missing.
So you might try finding a common variable you can merge "by". But you will still see the value being retained within the BY group if the other dataset(s) contribute multiple observations for the same BY group.
Or just add you own code to clear the variables. An easy way is to add these two statements to the end of the data step.
output;
call missing(of _all_);If the idea was just to add the variables then you might do something like this instead.
data want ;
  if 0 then set for_variable_defintions ;
  call missing(of _all_);
  set real_dataset ;
  retain newvar 0 ;
run;
When you use the _ALL_ variable list it will include only the variables that the SAS compiler has already seen. So in the code above that means only the variables defined in the set statement that will never execute because it is the THEN clause of an IF statement that is always false. So in particular the variable NEWVAR that does want to be retained is not yet defined and so is not set to missing by the call missing() function call.
Even though you set OBS=0 descriptor information for that data set is created. When you submit a SAS Data step it is first compiled and then executed. During the compilation PDV and Descriptor portion are set.
For more information how data step is processed got through this: https://support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer.htm#a000985872.htm
In NEW2, since you declared name2 as part of a data set, name2 will only get refreshed when a value is read from a dataset.
Since this does not happen, there is no refresh.
The refresh part makes sense but looking at the functionality using SET with 2 datasets seems a wrong way to go in my case.
E.g. If we look only at the variable 'Name2' then observations that belong to Data1 will have name2 acting as a retain variable and observations of Data2 will be refreshed for each observation.
So If i would say
IF id=2 THEN name2 = 'Tim';
Then it could happen that observations from Data1 which have ID different than 2 could have value of name2 equal to 'Tim'. (would need to change the order in Data1 and put ID=2 observation at the beginning).
I did another test and used the
BY id;
option (and sort it prior of course) and then it does not retain values anymore.
I have to admit I am a bit confused about the rules here. I would expect it to refresh the whole observation including the variables that are not in the current observation.
All variables that come from datasets are retained (which really means they are not set to missing when a new iteration starts). The values of the variables are only changed when the next observation is read. That is how SAS is able to do a one many merge.
Note that when the dataset does not contribute to a new BY group then the variables are set to missing.
So you might try finding a common variable you can merge "by". But you will still see the value being retained within the BY group if the other dataset(s) contribute multiple observations for the same BY group.
Or just add you own code to clear the variables. An easy way is to add these two statements to the end of the data step.
output;
call missing(of _all_);If the idea was just to add the variables then you might do something like this instead.
data want ;
  if 0 then set for_variable_defintions ;
  call missing(of _all_);
  set real_dataset ;
  retain newvar 0 ;
run;
When you use the _ALL_ variable list it will include only the variables that the SAS compiler has already seen. So in the code above that means only the variables defined in the set statement that will never execute because it is the THEN clause of an IF statement that is always false. So in particular the variable NEWVAR that does want to be retained is not yet defined and so is not set to missing by the call missing() function call.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
