I often use the keep and drop statements in the data step. They can be placed on the first line or later on before the run statement. (see below) I was wondering if there is a difference based on placement. I know SAS runs the first line it reads and then subsequent lines are in order. However, since technically the want data set has not been created then it seems it can't run the keep statement when it reads the first line. It matters when combining large datasets.
data want (keep = var 1 var 2);
set have1 have2 have3;
run;
OR
data want;
set have1 have2 have3;
keep var 1 var 2;
run;
Your second program has the wrong syntax.
Either it should be this which is no different to your first program and drops the variables while writing the output dataset
data want;
keep var 1 var 2; '
set have1 have2 have3;
run;
Or perhaps you intended this:
data want;
set have1 have2 have3 (keep = var 1 var 2);
run;
This drops while reading the input dataset so is more efficient.
Your second program has the wrong syntax.
Either it should be this which is no different to your first program and drops the variables while writing the output dataset
data want;
keep var 1 var 2; '
set have1 have2 have3;
run;
Or perhaps you intended this:
data want;
set have1 have2 have3 (keep = var 1 var 2);
run;
This drops while reading the input dataset so is more efficient.
@axescot78 wrote:
I often use the keep and drop statements in the data step. They can be placed on the first line or later on before the run statement. (see below) I was wondering if there is a difference based on placement. I know SAS runs the first line it reads and then subsequent lines are in order. However, since technically the want data set has not been created then it seems it can't run the keep statement when it reads the first line. It matters when combining large datasets.
data want (keep = var 1 var 2); set have1 have2 have3; run;
OR
data want; set have1 have2 have3; keep = var 1 var 2; run;
When you use the Keep or Drop on the DATA statement you are only applying to the specific data set when it is written.
The data step can create more than one data set so this can be handy such as pseudo code:
data want (keep = var 1 var 2) want2 (drop= var1 var2) ; set have1 have2 have3; if <some condition goes here> then output want; else output want2; run;
An example: I deal with some data sources that have some issues with data entry. So when I read the text data I separate out records that have certain issues, such as an "end date" before a "start date". I only write enough variables to the output set to help write the problem record report (keep= ). The records without any problems go into the working data set with all of the varaibles.
KEEP statements are never "RUN". They are what the documentation calls non-executable. They are only "COMPILED". That is they change the definition of what is going to happen when the data step actually runs but there is no executable code generated for them. You cannot put a KEEP statement inside an IF/THEN statement and have it keep different variables based on different values of the data. So where you place a KEEP statement in the data step does not really matter. Although if you are using variable lists (X1-X5 A--B C: _ALL_ _NUMERIC_ _CHARACTER_ etc. ) in your KEEP statement the set of variables available to the compiler can be different based on where you place the statement. But that is not about when it runs.
There are some subtle differences between using the KEEP statement and the KEEP= dataset option. For the simple single output dataset like your example the two forms you show will create the same result. And you could even consider the third options of add the KEEP= option to the input dataset.
data want1(keep=name);
set sashelp.class;
run;
data want2;
set sashelp.class;
keep name;
run;
data want3;
set sashelp.class(keep=name);
run;
But logically there is a difference.
The step above that is making WANT3 is the most different. By adding the KEEP= option on the input dataset the other variables (AGE, HEIGHT, etc) are never even brought into the data step.
The difference between the first two is subtler. Using the KEEP statement means that the only those variables will be written to ANY of the output datasets the DATA step is producing. Having the KEEP= option on the output dataset means than even though the data step was prepared to send the other variables to WANT1 SAS will ignore those variable will actually writing the data in WANT1.
Consider the impact of adding a RENAME statement to your code so that NAME is changed to OLD_NAME.
1028 data want1(keep=name); 1029 set sashelp.class; 1030 rename name=old_name; 1031 run; WARNING: The variable name in the DROP, KEEP, or RENAME list has never been referenced. NOTE: There were 19 observations read from the data set SASHELP.CLASS. NOTE: The data set WORK.WANT1 has 19 observations and 0 variables. NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.00 seconds 1032 1033 1034 data want2; 1035 set sashelp.class; 1036 keep name; 1037 rename name=old_name; 1038 run; NOTE: There were 19 observations read from the data set SASHELP.CLASS. NOTE: The data set WORK.WANT2 has 19 observations and 1 variables.
Since the variable that is being prepared by the data step to send to the output dataset now have OLD_NAME instead of NAME the KEEP= option on the first dataset did not find any such variable and we ended up with a dataset with ZERO variables.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.