Solved: Efficiency of using Keep function in set statement

daszlosek · Posted 05-24-2017 11:13 AM

Hello, all, I was trying to concatenate some variables, but I noticed it was taking quite some time to run the code. I tried using a keep statement in the set statement to decrease the time, and it took less CPU time but more real time (to my surprise). I assumed that since the keep statement pulls only the variable of interest, it would take less time (and it does take less CPU time!). Just for the heck of it, I ran both codes 10 times and the averages are shown below the code snippets. This is more of a curiosity than an actual problem.

So I guess my overall question is, why does it take more real-time?

/*Code Snippet 1*/
data want;
set Lib.have;
cat_var = catt(var1,"_",var2,"_",var3,"_",var4);
run;
/* CPU 9.62
   Real-time 1.21 
*/

/*Code Snippet 2*/
data want;
set Lib.have(keep= var1 var2 var3 var4);
cat_var = catt(var1,"_",var2,"_",var3,"_",var4);
run;

/* CPU 5.17
   Real-time 2.35 
*/

mkeintz · Posted 05-24-2017 01:34 PM

If you're reading from a SAS data set file, then all the data is read from disk to the input buffer (i.e. total volume of disk input is unchanged no matter what variables are in the keep list). So the keep option doesn't prevent any data from being read from disk. But, once data is in the input buffer, variable values are copied to the program data vector. That's where the burden is presumably reduced by use keep.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

RW9 · Posted 05-24-2017 11:28 AM

Well, I am just guessing here, CPU time is less, as less columns need to be processed - makes sense. Actual time could be any number of things, but I guess the write to new dataset is taking longer as it has to write a new header file, whereas in the first example its merely a copy of the existing one. Just a guess.

As a tip, catx might be more approprate there:

cat_var = catx('_',var1,var2,var3,var4);

Or if they are the only four:

cat_var = catx('_',of var:);

mkeintz · Posted 05-24-2017 11:37 AM

I'd suggest using data _NULL_ instead of data WANT for your comparisons - it takes away the confounding effect of output operations. Or at least don't forget to "proc delete data=want;run;" prior to each of your two tests.

And you are running this on a single-user machine, getting consistent repeated results, right? With no other processes competing for computer resources?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

daszlosek · Posted 05-24-2017 12:03 PM

There is some slight competition with R , but the averages have very small standard deviations (on the order of 0.01 seconds).

Tom · Posted 05-24-2017 12:00 PM

Not sure why you are seeing the times you have.

Also let's assume that the calculated variable is NOT in the source data set in the first case to eliminate the difference between the two cases. (Note that it is definitely NOT in the effective source data in the second case because even it it was it was dropped by not being included in the KEEP= dataset option).

When you ask SAS to use the KEEP= (or DROP=) dataset option you are telling it to

Read the data from the dataset
Copy only some of the variables into the PDV (program data vector).

So I am surprised that the second one doesn't have more CPU, but perhaps the COPY step is not getting counted as CPU cycles somehow?

Another difference that exist between the two steps is that in the first you could be writing many other variables into the OUTPUT dataset. That should increase CPU and real time.

Do you have the COMPRESS option turned on? Perhaps for the second step SAS noticed that you have too few variables for the COMPRESS option to be useful and so turned it off.

Also if you are pulling your source data from an external database then SAS will push the selection of the variables to transfer to SAS into the database query. That will drop out of CPU calculations but might introduce a delay that will count in the real time.

daszlosek · Posted 05-24-2017 12:13 PM

Ah, I believe I was misinterpreting how the keep function works. I thought that the keep statement made it only read 'some of the data' from the variables I have selected. What you are saying is that the program actually reads the entire data then pulls out those of interest and copies them, which is actually more computationally expensive. correct? Here, I thought it would be less expensive due to not pulling out all other variables.

You bring up a good point about the COMPRESS option, and I would have to agree that is probably was turned off due to too few variables.

mkeintz · Posted 05-24-2017 01:34 PM

If you're reading from a SAS data set file, then all the data is read from disk to the input buffer (i.e. total volume of disk input is unchanged no matter what variables are in the keep list). So the keep option doesn't prevent any data from being read from disk. But, once data is in the input buffer, variable values are copied to the program data vector. That's where the burden is presumably reduced by use keep.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

daszlosek · Posted 05-24-2017 03:48 PM

Very interesting! Thank you for the continued knowledge.

Efficiency of using Keep function in set statement

Re: Efficiency of using Keep function in set statement

Re: Efficiency of using Keep function in set statement

Re: Efficiency of using Keep function in set statement

Re: Efficiency of using Keep function in set statement

Re: Efficiency of using Keep function in set statement

Re: Efficiency of using Keep function in set statement

Re: Efficiency of using Keep function in set statement

Re: Efficiency of using Keep function in set statement

Classroom Training Available!