Solved: keep/drop statement placement, what's the difference?

axescot78 · Posted 09-01-2021 12:18 AM

I often use the keep and drop statements in the data step. They can be placed on the first line or later on before the run statement. (see below) I was wondering if there is a difference based on placement. I know SAS runs the first line it reads and then subsequent lines are in order. However, since technically the want data set has not been created then it seems it can't run the keep statement when it reads the first line. It matters when combining large datasets.

data want (keep = var 1 var 2);
	set have1 have2 have3;
run;

OR

data want;
	set have1 have2 have3;
	keep var 1 var 2; 
run;

SASKiwi · Posted 09-01-2021 12:46 AM

Your second program has the wrong syntax.

Either it should be this which is no different to your first program and drops the variables while writing the output dataset

data want;
  keep var 1 var 2; '
  set have1 have2 have3; 
run;

Or perhaps you intended this:

data want;
	set have1 have2 have3 (keep = var 1 var 2); 
run;

This drops while reading the input dataset so is more efficient.

View solution in original post

SASKiwi · Posted 09-01-2021 12:46 AM

Your second program has the wrong syntax.

Either it should be this which is no different to your first program and drops the variables while writing the output dataset

data want;
  keep var 1 var 2; '
  set have1 have2 have3; 
run;

Or perhaps you intended this:

data want;
	set have1 have2 have3 (keep = var 1 var 2); 
run;

This drops while reading the input dataset so is more efficient.

axescot78 · Posted 09-01-2021 12:25 PM

that second option didn't drop any of the columns. it kept all of it.

ballardw · Posted 09-01-2021 11:02 AM

@axescot78 wrote:

I often use the keep and drop statements in the data step. They can be placed on the first line or later on before the run statement. (see below) I was wondering if there is a difference based on placement. I know SAS runs the first line it reads and then subsequent lines are in order. However, since technically the want data set has not been created then it seems it can't run the keep statement when it reads the first line. It matters when combining large datasets.
data want (keep = var 1 var 2);
	set have1 have2 have3;
run;
OR
data want;
	set have1 have2 have3; keep = var 1 var 2; 
run;

When you use the Keep or Drop on the DATA statement you are only applying to the specific data set when it is written.

The data step can create more than one data set so this can be handy such as pseudo code:

data want (keep = var 1 var 2)
        want2 (drop= var1 var2)
;
	set have1 have2 have3;
    if <some condition goes here> then output want;
    else output want2;
run;

An example: I deal with some data sources that have some issues with data entry. So when I read the text data I separate out records that have certain issues, such as an "end date" before a "start date". I only write enough variables to the output set to help write the problem record report (keep= ). The records without any problems go into the working data set with all of the varaibles.

Tom · Posted 09-01-2021 01:59 PM

KEEP statements are never "RUN". They are what the documentation calls non-executable. They are only "COMPILED". That is they change the definition of what is going to happen when the data step actually runs but there is no executable code generated for them. You cannot put a KEEP statement inside an IF/THEN statement and have it keep different variables based on different values of the data. So where you place a KEEP statement in the data step does not really matter. Although if you are using variable lists (X1-X5 A--B C: _ALL_ _NUMERIC_ _CHARACTER_ etc. ) in your KEEP statement the set of variables available to the compiler can be different based on where you place the statement. But that is not about when it runs.

There are some subtle differences between using the KEEP statement and the KEEP= dataset option. For the simple single output dataset like your example the two forms you show will create the same result. And you could even consider the third options of add the KEEP= option to the input dataset.

data want1(keep=name);
  set sashelp.class;
run;

data want2;
  set sashelp.class;
  keep name;
run;

data want3;
  set sashelp.class(keep=name);
run;

But logically there is a difference.

The step above that is making WANT3 is the most different. By adding the KEEP= option on the input dataset the other variables (AGE, HEIGHT, etc) are never even brought into the data step.

The difference between the first two is subtler. Using the KEEP statement means that the only those variables will be written to ANY of the output datasets the DATA step is producing. Having the KEEP= option on the output dataset means than even though the data step was prepared to send the other variables to WANT1 SAS will ignore those variable will actually writing the data in WANT1.

Consider the impact of adding a RENAME statement to your code so that NAME is changed to OLD_NAME.

1028  data want1(keep=name);
1029    set sashelp.class;
1030    rename name=old_name;
1031  run;

WARNING: The variable name in the DROP, KEEP, or RENAME list has never been referenced.
NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: The data set WORK.WANT1 has 19 observations and 0 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


1032
1033
1034  data want2;
1035    set sashelp.class;
1036    keep name;
1037    rename name=old_name;
1038  run;

NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: The data set WORK.WANT2 has 19 observations and 1 variables.

Since the variable that is being prepared by the data step to send to the output dataset now have OLD_NAME instead of NAME the KEEP= option on the first dataset did not find any such variable and we ended up with a dataset with ZERO variables.

keep/drop statement placement, what's the difference?

Re: keep/drop statement placement, what's the difference?

Re: keep/drop statement placement, what's the difference?

Re: keep/drop statement placement, what's the difference?

Re: keep/drop statement placement, what's the difference?

Re: keep/drop statement placement, what's the difference?

keep/drop statement placement, what's the difference?

Re: keep/drop statement placement, what's the difference?

Re: keep/drop statement placement, what's the difference?

Re: keep/drop statement placement, what's the difference?

Re: keep/drop statement placement, what's the difference?

Re: keep/drop statement placement, what's the difference?

Click image to register for webinar

Classroom Training Available!