DATA Step, Macro, Functions and more

Efficiency of using Keep function in set statement

Accepted Solution Solved
Reply
Frequent Contributor
Posts: 76
Accepted Solution

Efficiency of using Keep function in set statement

Hello, all, I was trying to concatenate some variables, but I noticed it was taking quite some time to run the code. I tried using a keep statement in the set statement to decrease the time, and it took less CPU time but more real time (to my surprise). I assumed that since the keep statement pulls only the variable of interest, it would take less time (and it does take less CPU time!). Just for the heck of it, I ran both codes 10 times and the averages are shown below the code snippets. This is more of a curiosity than an actual problem. 

 

So I guess my overall question is, why does it take more real-time? 

 

 

/*Code Snippet 1*/
data want;
set Lib.have;
cat_var = catt(var1,"_",var2,"_",var3,"_",var4);
run;
/* CPU 9.62
Real-time 1.21 */ /*Code Snippet 2*/ data want; set Lib.have(keep= var1 var2 var3 var4); cat_var = catt(var1,"_",var2,"_",var3,"_",var4); run;

/* CPU 5.17
Real-time 2.35 */

 

 


Accepted Solutions
Solution
‎05-24-2017 03:48 PM
Trusted Advisor
Posts: 1,018

Re: Efficiency of using Keep function in set statement

[ Edited ]
Posted in reply to daszlosek

If you're reading from a SAS data set file, then all the data is read from disk to the input buffer (i.e. total volume of disk input is unchanged no matter what variables are in the keep list).  So the keep option doesn't prevent any data from being read from disk.   But, once data is in the input buffer, variable values are copied to the program data vector.  That's where the burden is presumably reduced by use keep.

View solution in original post


All Replies
Super User
Super User
Posts: 7,942

Re: Efficiency of using Keep function in set statement

Posted in reply to daszlosek

Well, I am just guessing here, CPU time is less, as less columns need to be processed - makes sense.  Actual time could be any number of things, but I guess the write to new dataset is taking longer as it has to write a new header file, whereas in the first example its merely a copy of the existing one.  Just a guess. 

 

As a tip, catx might be more approprate there:

cat_var = catx('_',var1,var2,var3,var4);

 Or if they are the only four:

cat_var = catx('_',of var:);

 

Trusted Advisor
Posts: 1,018

Re: Efficiency of using Keep function in set statement

Posted in reply to daszlosek

I'd suggest using data _NULL_ instead of data WANT for your comparisons - it takes away the confounding effect of output operations.  Or at least don't forget to "proc delete data=want;run;" prior to each of your two tests.

 

And you are running this on a single-user machine, getting consistent repeated results, right?   With no other processes competing for computer resources?

 

Frequent Contributor
Posts: 76

Re: Efficiency of using Keep function in set statement

There is some slight competition with R , but the averages have very small standard deviations (on the order of 0.01 seconds).
Super User
Super User
Posts: 7,039

Re: Efficiency of using Keep function in set statement

[ Edited ]
Posted in reply to daszlosek

Not sure why you are seeing the times you have.

Also let's assume that the calculated variable is NOT in the source data set in the first case to eliminate the difference between the two cases.  (Note that it is definitely NOT in the effective source data in the second case because even it it was it was dropped by not being included in the KEEP= dataset option).

 

When you ask SAS to use the KEEP= (or DROP=) dataset option you are telling it to

  1. Read the data from the dataset
  2. Copy only some of the variables into the PDV (program data vector).

So I am surprised that the second one doesn't have more CPU, but perhaps the COPY step is not getting counted as CPU cycles somehow?

 

Another difference  that exist between the two steps is that in the first you could be writing many other variables into the OUTPUT dataset.  That should increase CPU and real time.  

 

Do you have the COMPRESS option turned on?  Perhaps for the second step SAS noticed that you have too few variables for the COMPRESS option to be useful and so turned it off.

 

Also if you are pulling your source data from an external database then SAS will push the selection of the variables to transfer to SAS into the database query.  That will drop out of CPU calculations but might introduce a delay that will count in the real time.

Frequent Contributor
Posts: 76

Re: Efficiency of using Keep function in set statement

Ah, I believe I was misinterpreting how the keep function works. I thought that the keep statement made it only read 'some of the data' from the variables I have selected. What you are saying is that the program actually reads the entire data then pulls out those of interest and copies them, which is actually more computationally expensive. correct? Here, I thought it would be less expensive due to not pulling out all other variables.



You bring up a good point about the COMPRESS option, and I would have to agree that is probably was turned off due to too few variables.


Solution
‎05-24-2017 03:48 PM
Trusted Advisor
Posts: 1,018

Re: Efficiency of using Keep function in set statement

[ Edited ]
Posted in reply to daszlosek

If you're reading from a SAS data set file, then all the data is read from disk to the input buffer (i.e. total volume of disk input is unchanged no matter what variables are in the keep list).  So the keep option doesn't prevent any data from being read from disk.   But, once data is in the input buffer, variable values are copied to the program data vector.  That's where the burden is presumably reduced by use keep.

Frequent Contributor
Posts: 76

Re: Efficiency of using Keep function in set statement

Very interesting! Thank you for the continued knowledge.
☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 7 replies
  • 157 views
  • 0 likes
  • 4 in conversation