The SAS Supervisor is responsible for how SAS processes jobs, so an understanding of its function is important.
While the details of how it works have changed over time, the basics of the SAS Supervisor have remained reasonably consistent.
THE SAS SUPERVISOR
Don Henderson & Merry Rabb
This paper was originally presented many, many SUGIs ago, specifically, SUGI 83 in New Orleans. It was one of the presentations at the very first Tutorials section. It has been available online as a scanned image thanks to NESUG. That image was converted to text using OCR so that it could be published on the sasCommunity.org site in a searchable form and has been moved to the SAS Communities Library.
This tutorial discusses the functions of the SAS Supervisor during the execution of a SAS DATA Step program and is a repeat presentation of a paper given in the Tutorial and the Advanced Tutorial sessions of SUGI 12. The functions of the SAS Supervisor can be categorized as follows:
The actions of the Supervisor during both the compile and execution phases of a SAS job will be illustrated.
When a SAS DATA Step program is written, the DATA Step module must be integrated within the structure of the SAS System. This integration is done by the SAS Supervisor. Gaining a more complete understanding of what the Supervisor does and how our program is controlled by it is crucial to using the SAS System more effectively.
There are distinct compile and execute steps for all SAS jobs. This fact is not readily apparent since a single program, the SAS Supervisor, handles the compile and execution (including linkage-editing) steps of a SAS job. There is a distinct compile step and execution step for each DATA or PROC step in a SAS job. The DATA and PROC steps are compiled and executed independently according to their sequence in the program. In particular, the first DATA/PROC step is compiled and then executed; this is then followed by the compilation and the execution for the next DATA/PROC step, etc. The SAS Supervisor controls this processing.
The SAS programmer has tools that allow him or her to take full advantage of the compile/execute structure for SAS jobs. For example, through the use of the Macro Language, the programmer has control over the sequence of DATA/PROC steps seen by the Supervisor and of the statements contained within each step. There are other tools and techniques available to exercise control over Supervisor functions within a given DATA Step, such as conditional execution of a read operation, or reading data within a loop. The following sections discuss the actions of the Supervisor during compilation and execution of a DATA Step, and the coding techniques that can be used to control or override the Supervisor's default actions.
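As a minimal sketch of the Macro Language point above, the hypothetical macro below decides which steps the Supervisor ever sees. The macro name REPORT and its DETAIL= parameter are invented for illustration; SASHELP.CLASS is the sample data set shipped with SAS.

```sas
/* Hypothetical macro: DETAIL= decides whether the PROC PRINT step is generated. */
%macro report(detail=N);

   proc means data=sashelp.class mean;
      var height;
   run;

   %if %upcase(&detail) = Y %then %do;
      proc print data=sashelp.class;
      run;
   %end;

%mend report;

%report(detail=Y)   /* the Supervisor sees PROC MEANS, then PROC PRINT */
%report(detail=N)   /* the Supervisor sees PROC MEANS only */
```

In both calls each generated step is still compiled and executed in sequence; the macro only changes which steps exist.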
During the compilation of a DATA Step, the Supervisor creates both permanent and transient (in that they disappear after the compilation or execution of the current DATA Step) entities. The primary permanent entity is the directory or header portion of the SAS data set (the data is added to the data set at execution time). The transient entities include a variety of buffers, flags and work areas which, at execution time, control the creation of the desired output. The following is a partial list of the more important actions taken by the SAS Supervisor during the compilation of a DATA Step:
The last four actions in the above list will be discussed in the following subsections.
The Program Data Vector (PDV) is a logical buffer which includes all variables referenced either explicitly or implicitly in the DATA Step. It is used at execution time as the location where the working values of variables are stored as they are processed by the DATA Step "program." The PDV is created at compile time by the SAS Supervisor. Variables are added to the PDV sequentially as they are encountered during the parsing and interpretation of SAS source statements. The following rules are used in defining the variables and their attributes to the PDV:
The use of these rules is illustrated for a sample program in Figure 1.
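As a rough, hypothetical sketch (not the paper's Figure 1), the step below shows variables entering the PDV in the order in which they are first referenced, with attributes fixed by that first reference. The data set and variable names are invented for illustration.

```sas
data work.pdv_demo;
   length name $ 12;          /* NAME enters the PDV first: character, length 12 */
   input id name $ amount;    /* ID and AMOUNT are added next: numeric, length 8 */
   ratio = amount / 100;      /* RATIO is added last: numeric, length 8          */
   put _all_;                 /* write the PDV to the log on each iteration      */
   datalines;
1 Alpha 250
2 Beta 75
;
run;

/* VARNUM lists the variables in their order of creation. */
proc contents data=work.pdv_demo varnum;
run;
```

The PUT _ALL_ output also shows the automatic variables _ERROR_ and _N_, which are part of the PDV but are not written to the output data set.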
The specification of the list of variables to be copied from the PDV to the output SAS data set is best illustrated by another logical buffer called the DROP/KEEP Table (DKT). The DKT has a one-to-one relationship to the PDV in that it contains a column for each variable in the PDV. It contains a row for each output data set. However, unlike the PDV, the elements of the DKT can only take the values of "D" or "K," for drop and keep. Furthermore, its values are supplied at compile time and cannot be altered during the execution phase of a DATA Step. The Supervisor uses the following rules in setting DKT values:
The use of these rules is illustrated in Figure 2.
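A hypothetical sketch (not the paper's Figure 2) of a step that would produce a two-row DKT, one row per output data set. All names are invented for illustration.

```sas
data work.totals (keep=id total)    /* DKT row 1: ID and TOTAL are "K"  */
     work.detail (drop=total);      /* DKT row 2: TOTAL is "D"          */
   input id qty price;
   total = qty * price;
   drop price;                      /* PRICE is "D" in every DKT row    */
   datalines;
1 2 5.00
2 3 1.50
;
run;
```

WORK.TOTALS receives ID and TOTAL, and WORK.DETAIL receives ID and QTY. Because DROP and KEEP are resolved at compile time, placing the DROP statement inside an IF-THEN clause would not make it conditional.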
The specification of the variables that are to be initialized to missing between every execution of the DATA Step program by the SAS Supervisor is also illustrated by a buffer with a one-to-one correspondence to the PDV. The elements of this Initialize To Missing Vector (ITMV) can take three possible values:
These values, like the values in the DKT, are defined at compile time and cannot be changed at execution time. The ITMV values for all variables are initially set to "Y" and are changed to "N" for the following situations:
All variables which are referenced in SET, MERGE or UPDATE statements will have ITMV values set to "N" or "R" according to the following rules:
These rules are illustrated in Figure 3.
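A hypothetical sketch (not the paper's Figure 3) contrasting a variable that is reset between executions with variables that are not. It relies on standard DATA Step behavior: variables named in a RETAIN statement and sum-statement accumulators are not reset, while a variable created by an ordinary assignment is. All names are invented for illustration.

```sas
data work.itmv_demo;
   input x @@;
   retain first_x;            /* RETAIN: not reinitialized to missing           */
   total + x;                 /* sum statement: TOTAL is retained, starts at 0  */
   if _n_ = 1 then do;
      first_x = x;
      flag = 'set once';      /* plain assignment: FLAG is reset each execution */
   end;
   put _n_= x= first_x= total= flag=;
   datalines;
10 20 30
;
run;
```

In the log, FIRST_X stays at 10 and TOTAL accumulates across the three executions, while FLAG is blank on the second and third executions because it is only assigned when _N_ is 1.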
In addition to the above buffers or vectors, other flag variables are created during the compile phase of a DATA Step program. The Data Step Failed Flag (DSFF) and the End Data Step Flag (EDSF) are created at compile time; their values are supplied at execution time. The Output Statement Present Flag (OSPF) is created and its value is supplied at compile time. OSPF is set to "Y" if there is any output statement present in the DATA Step program, otherwise it is set to "N." The DSFF, EDSF and OSPF are all used by the SAS Supervisor during the execution phase to control DATA Step processing. Values for the PDV, DKT, ITMV, DSFF, EDSF and OSPF for a sample program at the completion of the compile phase are illustrated in Figure 4.
It should be noted that the above represents only a subset of the SAS Supervisor compile time functions. All of the following statements (non-executable or information statements) do all of their work at compile time:
Because these statements have their effect at compile time, their location within the DATA Step code is irrelevant; they may be placed at the beginning, at the end, or anywhere within the DATA Step program. An exception to this is the LENGTH statement, which should always be placed at the beginning of the DATA Step. This ensures that variables are added to the PDV by the reference on the LENGTH statement. These points should be kept in mind when writing and debugging SAS programs.
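A minimal sketch of both points, using invented names: the LENGTH statement comes first so that it defines CITY in the PDV, while the LABEL statement works equally well at the end of the step.

```sas
data work.len_demo;
   length city $ 20;            /* first reference: CITY is $20 in the PDV       */
   city = 'Cary';
   output;
   city = 'New Orleans';        /* fits; without LENGTH, CITY would be $4        */
   output;
   label city = 'Host city';    /* compile-time statement: position is irrelevant */
run;
```

If the LENGTH statement were removed, the first assignment would define CITY with a length of 4 and the second value would be truncated.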
Once the DATA Step has been successfully compiled and all of the above described buffers and flags have been created, the execution phase of the DATA Step can begin. This is illustrated by the simple program flow in Figure 5.
The SAS DATA Step can be viewed as a subroutine which is executed repeatedly by the SAS Supervisor, usually until there is no more input data. In a typical SAS job, the Supervisor does the following:
The details of what happens during the execution of the DATA Step program (step 2 above) are controlled by the user's SAS code. The details of how the Supervisor performs steps 1, 3 and 4, as described above, will be discussed in this section along with a description of how the buffers and flags (created during the compile phase) are used. It should be remembered that the actions of the SAS Supervisor within the execution phase of a DATA Step are geared towards one goal: the repeated execution of a DATA Step program. In other words, the DATA Step program can be viewed as the inside of a read-write loop.
The SAS Supervisor performs initialization before every execution of our DATA Step program using the PDV and the ITMV as follows:
The DATA Step program is then executed (called). The programming statements that comprise the DATA Step are executed, supplying values for the variables in the PDV.
Once the DATA Step program has finished, control is returned to the SAS Supervisor which decides whether to copy the contents of the PDV to the output SAS data set. The OSPF, DSFF, DKT and PDV are used to do this as follows:
The OUTPUT routine, which is also invoked when an OUTPUT statement is executed from within the DATA Step program, can be described as follows:
The value for the OSPF is set at compile time. Values for DSFF and EDSF are set at execution time. The setting of these flags is discussed in the following paragraphs, which also address the looping or repeated execution of the DATA Step program done by the SAS Supervisor.
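The effect of the OSPF on the Supervisor's default output can be seen in a small sketch. The two hypothetical steps below differ only in the presence of an OUTPUT statement; neither reads any input, so each executes once.

```sas
data work.implicit;          /* no OUTPUT statement: OSPF = "N"                 */
   do i = 1 to 3;
      sq = i * i;
   end;
run;                         /* default output at the end: 1 obs (i=4, sq=9)    */

data work.explicit;          /* OUTPUT statement present: OSPF = "Y"            */
   do i = 1 to 3;
      sq = i * i;
      output;                /* the only place observations are written         */
   end;
run;                         /* 3 obs; no default output when the step returns  */
```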
On referring to Figure 5, the question arises as to how the SAS Supervisor knows when to stop executing the DATA Step program. The more detailed flow diagram given in Figure 6 is a more accurate representation of execution time processing, which can be described as follows:
In writing a SAS DATA Step program, it is crucial to keep in mind the statements that return control to the Supervisor and how they affect the values of the DSFF and EDSF flags. Executing any of the following statements causes an immediate return to the SAS Supervisor with the indicated values for the flags:
|Statement||DSFF||EDSF|
|IF false <expression>||Y||N|
|Failed read operation (i.e., INPUT, SET, MERGE or UPDATE)
On reviewing the above table and the rules for the SAS Supervisor OUTPUT, it is clear that when OSPF="Y", DELETE, a false subsetting IF <expression> and RETURN are equivalent, since the default Supervisor OUTPUT is dependent on OSPF="N" and DSFF="N". When OSPF="Y", the value of DSFF has no impact on the Supervisor's default OUTPUT. This is not readily apparent without an understanding of how the Supervisor works during DATA Step execution.
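The difference that appears when OSPF="N" can be sketched with two hypothetical steps that contain no OUTPUT statement. RETURN leaves the default output in place, while DELETE suppresses it.

```sas
data work.use_return;
   input x @@;
   if x < 0 then return;     /* default output still occurs for this observation */
   y = sqrt(x);
   datalines;
4 -1 9
;
run;                         /* 3 observations; Y is missing where x = -1        */

data work.use_delete;
   input x @@;
   if x < 0 then delete;     /* no default output for this observation           */
   y = sqrt(x);
   datalines;
4 -1 9
;
run;                         /* 2 observations                                   */
```

Adding an explicit OUTPUT statement after the assignment to Y in both steps would make them produce the same two observations.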
The SAS read operations SET and MERGE perform two general actions when executed:
These actions are performed according to a set of rules, depending on which type of read operation is being performed and whether or not a BY statement is present. The rules for each type of statement are examined below.
When a SET statement references more than one SAS data set, and no BY statement is present, the data sets listed on the SET statement are concatenated. The SET statement performs the following actions when executed:
When a SET statement referencing more than one SAS data set has a BY statement associated with it, the data sets listed on the SET statement are interleaved. The SET statement performs the following actions when executed:
When a MERGE statement with no BY statement is present, the observations in the data sets listed on the MERGE statement are merged one-to-one. The MERGE statement performs the following actions when executed:
When a MERGE statement with a BY statement is executed, the observations in the data sets listed are merged according to the values of the variables on the BY statement.
The MERGE statement performs the following actions when executed:
Understanding these rules can be helpful when writing or debugging a SAS program with complex DATA Step code. The following sections discuss how the SAS programmer can take control from the SAS Supervisor within a DATA Step, and how the rules governing the actions of the Supervisor and of the read operations apply under those circumstances.
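The four kinds of read operation described above can be compared with a small sketch. The data sets WORK.A and WORK.B and their variables are invented for illustration, and both are already in ID order as the BY statement requires.

```sas
data work.a;
   input id x;
   datalines;
1 10
3 30
;
run;

data work.b;
   input id y;
   datalines;
2 200
3 300
;
run;

/* SET without BY: concatenation, all of A followed by all of B (4 obs) */
data work.concat;
   set work.a work.b;
run;

/* SET with BY: interleave, observations in ID order (4 obs) */
data work.interleave;
   set work.a work.b;
   by id;
run;

/* MERGE without BY: one-to-one by position (2 obs); ID comes from B */
data work.one_to_one;
   merge work.a work.b;
run;

/* MERGE with BY: match-merge on ID (3 obs: ID = 1, 2, 3) */
data work.match;
   merge work.a work.b;
   by id;
run;
```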
SAS programmers can take control from the SAS Supervisor within a DATA Step by placing the read operation statement inside a DO loop. When the SAS Supervisor does the looping, the read operation executes once for each execution of the DATA Step. There is some overhead involved in returning control to the Supervisor each time. Therefore, the programmer might consider reading large data sets within a DO loop in order to increase the efficiency of the DATA Step. There might be other reasons for reading data within a loop, such as when a data set must be searched to find a specific observation.
The program shown in Figure 7 merges two SAS data sets in one execution of the DATA Step. Because control is not returned to the Supervisor each time, an OUTPUT statement must be present inside the loop. If no OUTPUT statement were present, the output SAS data set would have only one observation. It is also important to remember that the value of _N_ is only incremented when the Supervisor begins a new execution of the DATA Step. Thus the value of _N_ in the PDV will be equal to 1 for every observation processed.
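A sketch in the spirit of that program (not a reproduction of Figure 7) might look like the following. The data sets, the END= variable DONE, and the explicit STOP statement are assumptions made for illustration.

```sas
data work.a;
   input id x;
   datalines;
1 10
3 30
;
run;

data work.b;
   input id y;
   datalines;
2 200
3 300
;
run;

data work.merged;
   do until (done);
      merge work.a work.b end=done;
      by id;
      output;            /* required: the Supervisor's default output never runs here */
      put _n_= id=;      /* _N_ is 1 for every observation                            */
   end;
   stop;                 /* end the step once all input has been read                 */
run;
```

The output data set contains three observations, all written during a single execution of the DATA Step.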
Looping also has an effect on how variables in the PDV are initialized to missing. Variables with ITMV values of "Y" are only initialized to missing when program control is returned to the SAS Supervisor. These variables will NOT be initialized to missing as long as the loop is executing. Note that looping has no effect on the initialization of variables with ITMV values of "R" or "N".
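The effect on initialization can be sketched by reading the same hypothetical data set both ways. The variable BIG is created by an assignment, so it is reset to missing only when control returns to the Supervisor.

```sas
data work.readings;
   input x @@;
   datalines;
10 30 5
;
run;

/* Programmer-controlled loop: BIG is never reset inside the loop. */
data work.loop_init;
   do until (done);
      set work.readings end=done;
      if x > 20 then big = 'yes';
      output;
   end;
run;                     /* BIG = ' ', 'yes', 'yes' (carried over to x = 5)    */

/* Supervisor-controlled loop: BIG is reset before each execution. */
data work.supervisor_init;
   set work.readings;
   if x > 20 then big = 'yes';
run;                     /* BIG = ' ', 'yes', ' '                              */
```

When reading inside a loop, a variable such as BIG therefore has to be reset explicitly at the top of the loop if the Supervisor's behavior is wanted.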
It is important to remember which statements automatically return control to the Supervisor when executed. These statements should not be placed within the DO loop where the data is being read, since their execution will cause an exit from the DO loop.
Any SAS read operation can be executed conditionally. For example, suppose the programmer has a SAS data set with a single observation that contains a constant. This constant value is needed for each execution of the DATA Step. The SAS data set can be read by executing the SET statement only once, as shown in Figure 8.
Since variables read from a SET statement referencing a single SAS data set have ITMV values set to "N", the constant will not be initialized to missing on subsequent executions of the DATA Step. Therefore, a RETAIN statement listing the variables in the data set OVERALL is not necessary.
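A sketch of this technique (not a reproduction of Figure 8) is shown below. The data set name OVERALL comes from the text; WORK.SALES and all variable names are invented for illustration.

```sas
data work.sales;
   input region $ amount;
   datalines;
East 40
West 60
;
run;

data work.overall;          /* single observation holding the constant */
   grand_total = 100;
run;

data work.pct;
   if _n_ = 1 then set work.overall;    /* read once; GRAND_TOTAL is not reset  */
   set work.sales;                      /* read on every execution              */
   pct = 100 * amount / grand_total;
run;
```

Both observations in WORK.PCT carry GRAND_TOTAL=100 even though OVERALL is read only on the first execution.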
Any SAS data set may be referenced at compile time only. For example, suppose the programmer wants to add variables to the PDV, even though those variables will not be read in this DATA Step. This can be accomplished by "executing" the read operation based on a condition that will never be true. For example, at compile time the statement "IF 0 THEN SET INVDESC;" adds the variables in INVDESC to the PDV. At execution time no data will ever be read since 0 is false. This can be a useful technique for creating a "shell" of a SAS data set, where values are to be added later using FSEDIT or UPDATE.
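A minimal sketch of the "shell" technique follows. The contents of INVDESC are invented for illustration, and the STOP statement is an addition made here so that the step ends without writing an observation.

```sas
data work.invdesc;          /* hypothetical contents */
   input item $ qty;
   datalines;
A100 5
B200 12
;
run;

data work.shell;
   if 0 then set work.invdesc;   /* compile time: ITEM and QTY enter the PDV */
   stop;                         /* execution time: end before any output    */
run;
```

WORK.SHELL has the variables ITEM and QTY with the attributes they have in INVDESC, and zero observations.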
A SET statement that is referenced at compile time only can also be used in conjunction with the NOBS and POINT options.
The program in Figure 9 sets a macro variable whose value is the number of observations in data set INVDESC. The SET statement is never executed because the "IF 0" condition is never true. However, the value for the NOBS= variable (N_OBS) is supplied at compile time. The only executable statements in the DATA Step are "CALL SYMPUT" and "STOP."
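A sketch along those lines (not a reproduction of Figure 9) is shown below. The macro variable name INVCOUNT, the PUT format, and the contents of INVDESC are assumptions made for illustration; N_OBS is the NOBS= variable named in the text.

```sas
data work.invdesc;          /* hypothetical contents */
   input item $ qty;
   datalines;
A100 5
B200 12
;
run;

data _null_;
   if 0 then set work.invdesc nobs=n_obs;                 /* never executed           */
   call symput('invcount', strip(put(n_obs, best12.)));   /* N_OBS is already filled  */
   stop;
run;

%put INVDESC observations: &invcount;
```

The macro variable is available to any step that follows, for example as a loop bound or in a title.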
These examples illustrate that by gaining a more complete understanding of what the SAS Supervisor is doing at both compile and execution time, SAS programmers can make more informed decisions as to what they can let the SAS Supervisor do and what they should control themselves. This understanding should also permit the development of more flexible and efficient SAS programs.
Feel free to contact @DonH if you have questions.