Hi everyone,
This post is a follow-up of the previous one with more detail -honestly, I'm not entirely sure what happened to the previous question. The problematic piece of code is this one:
_________
data test1;
input visit $16.;
datalines;
First
Second
;
data nulll;
input index;
datalines;
.
;
data test2;
stop;
set nulll;
run;
data test;
set test2 test1;
if index(VISIT, 'Fir')^=0 then do;
index=100;
end;
proc print data=test;
run;
__________
The output variable "index" contains two values of 100 whereas the logic implies that only one should be there.
On the other hand if we split the last data step in two it works properly (that is, firstly set test1 and test2 to get test and then set test to do the procedures) but I'm afraid to post the full code as it may get flagged as spam.
Can someone please explain what the issue is - I believe it's something to do with setting the datasets and performing operations on the data in the same step.
Thanks in advance,
teli4
Variables from incoming datasets are automatically retained. index is in dataset test2, so it is retained. With the first observation from test1, index is set to 100, and keeps that value, since it is never overwritten.
Can you describe why all the messing around with a 0 observation data set is necessary.
@teli4 wrote:
Hello there! The names of the given variables and the ones that should be obtained have to be taken from an external file and are fixed - the way this dummy dataset works is it ensures that the variables are properly named. Of course, the code above is simplified as much as I could and is just there to give a brief overview of the problem. Also, thanks for taking the time to consider this question!
This sounds to me like you have a process that is flawed at the start. Why would variables that should be there not be in the data set? Or possibly be named incorrectly?
You might also consider MERGE instead of SET for such a marginal purpose.
Variables from incoming datasets are automatically retained. index is in dataset test2, so it is retained. With the first observation from test1, index is set to 100, and keeps that value, since it is never overwritten.
Thank You very much for taking the time to answer! However, as I am at a beginner level, I'm a bit confused - with the set procedure I ensure that all variable from both data sets will be included (I'm assuming that's what you meant with retaining the variable index). However, shouldn't the logical operator "if" ensure that the condition is checked for each row, that is, for each observation of the variable 'visit'? That would imply that one of those logical checks would fail and one will not, so the value will be overwritten, no?
Your IF does not have an ELSE branch, so only the the action following the THEN branch will be executed, once. Since the resulting variable is retained, the value does not change thereafter.
Thanks! So that means that the "if" operator does not check each row (i.e. each observation of the particular variable) if an "else" statement is not provided but rather uses only the outcome of the first true logical result and plugs it all the way down?
The IF statement checks its condition in EVERY observation. Since only a THEN branch is supplied, only that action can be executed. The result of that execution (which happens once) is kept because of the automatic retain, caused by the resulting variable being present in an incoming dataset.
But why does that happen when the 2 datasets (test1 and test2) are set AND the if condition is used altogether in the same data step but the whole things works properly when we split the above code by first setting test1 and test2 in one dataset and then applying the IF condition in the new data set (that is, use 2 steps)? What I mean is why does the code
_________________________________
data test1;
input visit $16.;
datalines;
First
Second
;
data nulll;
input index;
datalines;
.
;
data test2;
stop;
set nulll;
run;
data test;
set test2 test1;
/*Here is the difference - I've split the data step in the code into 2 data sets */
data test;
set test;
if index(VISIT, 'Fir')^=0 then do;
index=100;
end;
proc print data=test;
run;
work as intended?
When you set both datasets, and use the condition, the following happens:
If you do it in separate steps, in the second step
Thank You very much, much appreciated. One last question before I accept this as the as the solution - would you say that the second approach is better (i.e. setting the datasets and then using IF) when not using an ELSE statement?
Once again, thanks! I'm sorry if I took too much out of your time.
If you want something to happen when a condition is not met, then you have to use the ELSE. Don't rely on "it's going to happen on its own". See Maxim 31.
Whether you combine steps for efficiency, or keep them separate for coding clarity, is up to you.
Before you wonder about the effects of a missing ELSE, you need to get familiar with the workings of the DATA step as such. When are variables set to missing, and which variables will be set to missing at that moment.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.