UNDERSTANDING DROP/KEEP WITH/IN PDV

A-junamu_- · Posted 07-07-2022 09:12 PM

I've seen so many different combinations of drop/keep statement, that my head is spinning. I need to understand the drop/keep statement and how to effectively use it. I understand the drop/keep after or in the data statement or at the end is the same thing, just a matter of processing time. Everything else needs clarity especially how compile time affects this.

For this example in the picture, SAS reads the 8 columns from the input table(customer) then creates two columns in the output table(Canada & US) and then drops the other columns(state & zip)?

If the two columns(state & zip) will be dropped at the beginning from the data step, why include it in the set statement?

For the drop country listed at the bottom, does it make a difference if dropped earlier or at the data statement?

I've seen the drop country added several ways and still used in the code to process other commands(like below). So what is the difference?

data Canada;
set customer;
drop country;
if country = "Canada" then output Canada;
else if country = "US" then output us;
run;

data Canada;
drop country;
set customer;
if country = "Canada" then output Canada;
else if country = "US" then output us;
run;

Any help is appreciated.

Tom · Posted 07-07-2022 10:35 PM

DROP and KEEP are data step STATEMENTS. Together they determine which of the variables in the PDV (the variable that exist while the data step is running) are going to be written to any output datasets.

DROP= and KEEP= are dataset OPTIONS.

If you use them on an INPUT dataset then they determine which variables will be read from that dataset (and so contribute to the set of variables in the data step aka the PDV).

If you use them on an OUTPUT dataset then they determine which of the variables (of those that ARE AVAILABLE TO BE WRITTEN) will actually be written to that particular dataset.

If the code in your first picture the variables STATE and ZIP will not be written to CANADA and the variable PROVINCE will not be written to US. And the variable COUNTRY will not be written ANY of the datasets since it is not available for writing since it is flagged to be dropped by the DROP statement.

How could you drop COUNTRY "earlier"? what would that even mean? If you don't read it in from CUSTOMER then it will not exist for the IF statement to use.

Note that since a DROP statement is not EXECUTABLE like an assignment statement or an IF statement then it does not matter where in the data step you put the DROP statement. You could move it to before the SET statement or before the IF statement and the effect is the same. Personally I put DROP and KEEP statements at the bottom of the data step because then I know I am done writing the code so I know all of the variables I used and which ones need to be included in the output dataset(s).

A-junamu_- · Posted 07-12-2022 06:28 PM

Which one are you referring to? Look at the two examples I showed. I've seen one of those used before and the statement still executed. Why did it execute if dropped earlier?

For the 1st code set, is it because the if statement is referencing the input table and the drop statement is referencing the output table so the column is not affected?

Here's another example:

Tom · Posted 07-12-2022 08:00 PM

I cannot understand what you are asking about.

In the new photograph the DROP statements means that END_DATE will NOT be written to STORM_COMPLETE. Where you place that DROP statement within the DATA step makes no difference since it is not "executed". Instead it is just instructions to the data step compiler for how to structure the dataset. Just like the LABEL statement or a FORMAT statement or an INFORMAT statement. It does not "occur" any earlier or later. It just helps define what the data step does in terms of which variables are available to be written out to the output dataset(s).

PS It is really hard to respond to photographs of code.

Tom · Posted 07-12-2022 08:12 PM

Maybe it would help to make physical model you can try yourself.

Take a deck of playing cards and divide it into four stacks by SUIT. So you have a CLUB, DIAMOND, HEART and SPADE stack. That is your input dataset HAVE with four variables.

Now run this program.

data want;
  set have;
  drop club ;
run;

So first you pick up one card from each stack and put them into your hand. The hand represents the variables that are active while the data step is running (also called the Program Data Vector or the PDV). Next place the DIAMOND, HEART and SPACE cards into three new stacks that will represent the WANT dataset. You will have to discard the CLUB card since that variable is being dropped so there is no stack to place it into. Repeat until you run out of cards in the original four stacks.

How would you modify this example to simulate the effect of a DROP= dataset option on the input dataset?
How would you modify this example to simulate the effect of a DROP= dataset option on the output dataset?

If you want to get tricky how would you modify it to represent the RENAME statement or the RENAME= dataset option on either input or output dataset?

UNDERSTANDING DROP/KEEP WITH/IN PDV

Re: UNDERSTANDING DROP/KEEP WITH/IN PDV

Re: UNDERSTANDING DROP/KEEP WITH/IN PDV

Re: UNDERSTANDING DROP/KEEP WITH/IN PDV

Re: UNDERSTANDING DROP/KEEP WITH/IN PDV

UNDERSTANDING DROP/KEEP WITH/IN PDV

Re: UNDERSTANDING DROP/KEEP WITH/IN PDV

Re: UNDERSTANDING DROP/KEEP WITH/IN PDV

Re: UNDERSTANDING DROP/KEEP WITH/IN PDV

Re: UNDERSTANDING DROP/KEEP WITH/IN PDV

Registration is open

SAS Training: Just a Click Away