I have encountered the following situation a handful of times and it has always confused me. As I read through a DATA step I notice that a variable is used before it is declared or assigned an initial value. That is, the first mention of a variable as I read the code from top to bottom is in a statement that assumes the variable already has a value. I think I remember reading that the DATA step does some pre-processing, perhaps in which all variables are created, before any statements are executed. Would someone please explain when referring to a variable before it is declared is allowed in a DATA step, and how to think about this situation correctly?
Thanks!
Adam Black
There are a lot more ways to create a variable in SAS than in other programming languages. Perhaps this reference will help
As Doc implied, that seems like an easy question but it really isn't. If you are at the point of "I think I remember reading that the data step does some pre-processing <more>", I think you should look at ... "Executable and Declarative Statements" ... SAS(R) 9.3 Statements: Reference ... to get some understanding of what can be done during data step execution versus what occurs prior to execution. There's a classic paper (a bit dated but still worth a read) "The SAS Supervisor" ...
http://www.lexjansen.com/nesug/nesug88/sas_supervisor.pdf ... that has a great explanation as to what happens prior to and at execution of a data step.
Also worth a look is "DATA Step Statements by Category" ... SAS(R) 9.3 Statements: Reference
Thank you for the references! I have found "The SAS Supervisor..." paper particularly helpful. I was aware of the different ways to declare variables in SAS but did not understand how to think about undeclared variables as in the following example.
data output;
if new_var = . then put "new_var exists but was never declared";
run;
It sounds like in this example the SAS supervisor creates new_var and initializes it to missing at compile time. Then the IF statement is performed during execution. This seems like dangerous behavior to me. I could imagine that a typo in a variable name would be a difficult error to find, since it would not create a warning. Instead SAS automatically defines a new variable.
Thanks for the help.
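A minimal sketch of that compile-versus-execution reading, offered as an illustration rather than a description of SAS internals: PUT _ALL_ dumps the program data vector at the top of the step, before any statement has had a chance to assign new_var.

```sas
data _null_;
  /* new_var was added to the program data vector at compile time,
     so it is already listed here with a missing value */
  put _all_;
  if new_var = . then put "new_var is missing at execution time";
run;
```

The log should list new_var with a missing value alongside the automatic variables _N_ and _ERROR_, followed by the "uninitialized" note for new_var.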
Are you sure the variable wasn't coming from an existing data set such as a SET, MERGE or UPDATE statement?
Provide some examples of the code in question from the Data statement to the questionable line(s) of code.
Here is an example where a variable is referred to before it is given a value (it is never given a value, as evidenced by the "un-init" message).
It is implicitly declared by the SAS data step compiler. Is this the kind of thing you are talking about? If you could give an example, that would be most helpful.
Thank you all for your help. Here are a couple examples of what I am talking about. The simplest example is the following.
data output;
if new_var = . then put "new_var exists and was never declared";
run;
A more complicated example comes from a problem I was trying to solve involving a sophisticated merge. Imagine we have a dataset with babies and the days they were born. We also have a dataset with doctors containing flags for the days they worked at the hospital. I wanted to create a dataset that would list all the possible baby-doctor combinations such that the doctor might have delivered the baby, i.e., the doctor worked on the baby's birthday. Below is the solution, which I adapted from code someone posted online in response to this question.
data babies;
input baby_name $ birth_day birth_day_name $;
datalines;
Jake 1 day1
Sonny 4 day4
North 5 day5
Apple 6 day6
;
run;
data doctors;
input DrLastname $ day1 day2 day3 day4 day5 day6;
datalines;
Jones 1 0 0 1 1 1
Lewis 1 1 1 0 0 1
Smith 0 1 1 1 0 1
;
run;
data babies_doctors_array;
  array drnames[3] $10 _temporary_;
  array drdays[3,6] _temporary_;
  /* load doctors dataset into temp arrays */
  if _n_=1 then do i = 1 to nobs_doctors;
    set doctors point=i nobs=nobs_doctors;
    array days day1-day6;
    drnames[i]=DrLastname;
    do j = 1 to dim(days);
      drdays[i,j]=days[j];
    end;
  end;
  /* go through babies to find doctors that worked on their birthday */
  set babies;
  do k = 1 to nobs_doctors;
    if drdays[k,birth_day]=1 then do;
      babys_doctor = drnames[k];
      output;
    end;
  end;
  keep baby_name birth_day babys_doctor;
run;
proc print data=babies_doctors_array; run;
The variable nobs_doctors is used in the do loop before the set statement in which it is declared.
The most recent case of this I've encountered that prompted me to start this discussion looks like it is a coding error to me. Here is a really stripped down version of the code.
data raw;
format dos date9.;
input id dos mmddyy. comp1 comp2 comp3;
datalines;
1 121299 1 0 0
1 121299 0 1 0
1 101103 0 1 0
2 030400 1 1 0
2 030400 0 0 0
2 040400 0 0 1
3 041190 0 1 0
4 092090 0 0 1
4 051589 0 1 0
5 040300 0 0 0
5 071710 1 0 0
5 070899 0 1 0
6 030299 0 1 0
7 121200 1 0 0
;
run;
proc print data=raw;run;
proc sort data=raw;
by id dos;
run;
data fin;
set raw;
by id dos;
/* not sure about using compsum before it is defined */
if compsum = 0 then no_comps = 1;
compsum = sum(comp1, comp2, comp3);
run;
This just looks like a mistake to me and illustrates why I think this behavior is dangerous. It makes this kind of coding error hard to catch.
Thanks again for all your help.
-Adam
Hi ... one comment on your statement "The variable nobs_doctors is used in the do loop before the set statement in which it is declared."
This is related to COMPILE versus EXECUTION and this works fine ...
data all;
input x @@;
datalines;
1 2 3 4 5 6 7 8 9 10
;
data even;
do _n_=1 to howmany;
set all nobs=howmany;
if ^mod(x,2) then output;
end;
run;
since a value is assigned to the variable HOWMANY prior to data step execution.
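One way to see this for yourself, as a small sketch that reuses the ALL data set from the step above: print HOWMANY and then STOP, so the SET statement is compiled but never executed.

```sas
data _null_;
  /* HOWMANY already holds the observation count of ALL here */
  put howmany=;
  stop;                    /* end the step before SET ever executes */
  set all nobs=howmany;    /* compiled, so NOBS= is populated, but never run */
run;
```

With the ten-observation ALL data set, the log should show howmany=10 even though no observation was read.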
It might be worth mentioning that the variable HOWMANY is defined/declared by its first use and in this case gets the proper data type numeric. In this example I is implicitly declared character and SAS doesn't want that.
As to @adam_black's second example, I understand how that can be viewed as an error, but I don't know if SAS can detect that unless you could turn off implicit declaration (may be possible, I don't know). I know you can turn the UN-INIT note into an error with the NOTE2ERR option.
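A hedged sketch of a related setting: SAS 9.4 documents a VARINITCHK= system option that controls how an uninitialized variable is reported. This assumes your release supports it, so check the system options reference for your version before relying on it. The data set and variable names are only illustrative.

```sas
options varinitchk=error;   /* assumption: available in your SAS release */

data out;
  new_var1 = new_var2;      /* new_var2 is never assigned a value */
run;

options varinitchk=note;    /* restore the usual note-only behavior */
```

With the option set to ERROR, the uninitialized reference should be reported as an error rather than a note, which makes a misspelled variable name much harder to overlook.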
In the following code...
data all;
input x @@;
datalines;
1 2 3 4 5 6 7 8 9 10
;
data even;
do _n_=1 to howmany;
set all nobs=howmany;
if ^mod(x,2) then output;
end;
run;
I would love it if you could explain the control flow of the second data step. Namely, is there an implied loop created by the SET statement, or is the implied loop overridden by the outer DO loop? Thanks again for your help!
Hi. The OUTER loop rules. There's one pass through the data step since all the observations in the data set are read within the loop. The data step still returns to the beginning one more time to see if there are any more observations to be read. Since there are none, the data step is finished. You can see that by modifying the code as follows and looking at the LOG (you'll see two AT START and one AT END) ...
data even;
PUT "AT START";
do _n_=1 to howmany;
set all nobs=howmany;
if ^mod(x,2) then output;
end;
PUT "AT END";
run;
However, there's really no reason to do what I did in my example. Without the outer do loop, the data step would work the same way. I only did that to show you that even though the SET statement with NOBS=HOWMANY occurs after the start of the loop that uses the variable HOWMANY, the loop still works (example of what happens prior to execution versus what happens during execution of a data step).
One situation where you might actually use a loop within a data step is that it's one way to separate statements that need only to be executed once rather than on each pass through the data step. The execute-once statements occur outside the loop, while those executed as each observation is read occur within the loop.
Another great use of loops to read data is the DOW, especially handy when looking at groups of observations within your data with BY-GROUP processing. If you don't know about the DOW, there's no need to explain it here since there are a number of really good papers to read ...
HOW to DOW (if you have never read a Paul Dorfman paper, it's a treat)
http://support.sas.com/resources/papers/proceedings12/156-2012.pdf
an earlier version of some material in the above ...
The DOW-Loop Unrolled (another one by Paul Dorfman, why not read the best)
http://analytics.ncsu.edu/sesug/2007/SD08.pdf
The above includes the following which is what I meant by isolating the repeated execution of some statements from others to be executed once ...
The DOW-loop (Whitlock DO-loop) is a nested repetitive DATA step programming
structure, intentionally organized in order to allow for programmatically and logically
natural isolation of DO-loop instructions related to a certain break-event from actions
performed before and after the loop, and without resorting to superfluous conditional
statements.
Just search for "SAS DOW" and you'll find more.
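For a quick taste before reading the papers, here is a minimal DOW-loop sketch against the sorted RAW data set from earlier in the thread. The output data set TOTALS and the variable TOTAL are hypothetical names chosen for illustration.

```sas
data totals;
  /* one pass of this loop consumes one whole BY group */
  do until (last.id);
    set raw;
    by id;
    /* TOTAL is not retained, so it starts out missing for each new id */
    total = sum(total, comp1, comp2, comp3);
  end;
  /* reached once per id: the implicit OUTPUT writes one row per group */
  keep id total;
run;
```

The statements inside the loop run for every observation, while anything placed after END runs once per BY group, which is the isolation the quoted passage describes.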
And about this ...
/* not sure about using compsum before it is defined */
if compsum = 0 then no_comps = 1;
compsum = sum(comp1, comp2, comp3);
The statement with the SUM function is just an assignment statement (and not at all DANGEROUS) ... as explained in on-line HELP ...
Syntax
variable=expression;
Arguments
variable
names a new or existing variable.
Range: variable can be a variable name, array reference, or SUBSTR function.
Tip: Variables that are created by the Assignment statement are not automatically retained.
expression
is any SAS expression.
Tip: expression can contain the variable that is used on the left side of the equal sign. When a variable appears on both sides of a statement, the original value on the right side is used to evaluate the expression, and the result is stored in the variable on the left side of the equal sign. For more information, see Expressions in SAS Language Reference: Concepts.
Prior to the assignment statement, the value of COMPSUM is MISSING, not 0, so the condition COMPSUM = 0 is never true and NO_COMPS is never assigned a value.
This works fine ...
data fin;
set raw;
compsum = sum(of comp:);
run;
Values on the RIGHT of the = are known (variables or constants). Values on the LEFT of the = can be new or already known variables. If you are still squeamish about declaring variables, you could always assign an initial value in a RETAIN statement (DECLARATIVE, happens once, not EXECUTABLE) ...
retain compsum 0;
A RETAIN statement has other consequences, but in your case you are assigning a new value with every pass through the data step and the retaining of the variable value is of no consequence.
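If the intent of the original step was to flag observations with no complications (an assumption about what the author meant), a sketch of the reordered step would compute COMPSUM first and test it afterwards:

```sas
data fin;
  set raw;
  by id dos;
  /* assign COMPSUM first so the test below sees the current observation's value */
  compsum = sum(comp1, comp2, comp3);
  if compsum = 0 then no_comps = 1;
run;
```

In this order NO_COMPS is set to 1 on observations where all three comp flags are 0 and stays missing otherwise.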
In response to "Values on the RIGHT of the = are known (variables or constants). Values on the LEFT of the = can be new or already known variables."
The behavior that I think is odd is that SAS allows new variables on the RIGHT side of the =.
For example..
data out;
new_var1 = new_var2;
run;
NOTE: Variable new_var2 is uninitialized.
NOTE: The data set WORK.OUT has 1 observations and 2 variables.
SAS does print a note telling me that new_var2 is uninitialized, but allows it nevertheless. This note could get lost in the log of a large program, making a variable name typo a hard error to find.
Part of the convenience of SAS data step coding.
Here is a contrived example that takes advantage of that.
data new ;
lag_age = age ;
set sashelp.class ;
put _n_ age lag_age ;
run;