BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
daszlosek
Quartz | Level 8

Hello, all, I was trying to concatenate some variables, but I noticed it was taking quite some time to run the code. I tried using a keep statement in the set statement to decrease the time, and it took less CPU time but more real time (to my surprise). I assumed that since the keep statement pulls only the variable of interest, it would take less time (and it does take less CPU time!). Just for the heck of it, I ran both codes 10 times and the averages are shown below the code snippets. This is more of a curiosity than an actual problem. 

 

So I guess my overall question is, why does it take more real-time? 

 

 

/*Code Snippet 1*/
data want;
set Lib.have;
cat_var = catt(var1,"_",var2,"_",var3,"_",var4);
run;
/* CPU 9.62
Real-time 1.21 */ /*Code Snippet 2*/ data want; set Lib.have(keep= var1 var2 var3 var4); cat_var = catt(var1,"_",var2,"_",var3,"_",var4); run;

/* CPU 5.17
Real-time 2.35 */

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

If you're reading from a SAS data set file, then all the data is read from disk to the input buffer (i.e. total volume of disk input is unchanged no matter what variables are in the keep list).  So the keep option doesn't prevent any data from being read from disk.   But, once data is in the input buffer, variable values are copied to the program data vector.  That's where the burden is presumably reduced by use keep.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

7 REPLIES 7
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Well, I am just guessing here, CPU time is less, as less columns need to be processed - makes sense.  Actual time could be any number of things, but I guess the write to new dataset is taking longer as it has to write a new header file, whereas in the first example its merely a copy of the existing one.  Just a guess. 

 

As a tip, catx might be more approprate there:

cat_var = catx('_',var1,var2,var3,var4);

 Or if they are the only four:

cat_var = catx('_',of var:);

 

mkeintz
PROC Star

I'd suggest using data _NULL_ instead of data WANT for your comparisons - it takes away the confounding effect of output operations.  Or at least don't forget to "proc delete data=want;run;" prior to each of your two tests.

 

And you are running this on a single-user machine, getting consistent repeated results, right?   With no other processes competing for computer resources?

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
daszlosek
Quartz | Level 8
There is some slight competition with R , but the averages have very small standard deviations (on the order of 0.01 seconds).
Tom
Super User Tom
Super User

Not sure why you are seeing the times you have.

Also let's assume that the calculated variable is NOT in the source data set in the first case to eliminate the difference between the two cases.  (Note that it is definitely NOT in the effective source data in the second case because even it it was it was dropped by not being included in the KEEP= dataset option).

 

When you ask SAS to use the KEEP= (or DROP=) dataset option you are telling it to

  1. Read the data from the dataset
  2. Copy only some of the variables into the PDV (program data vector).

So I am surprised that the second one doesn't have more CPU, but perhaps the COPY step is not getting counted as CPU cycles somehow?

 

Another difference  that exist between the two steps is that in the first you could be writing many other variables into the OUTPUT dataset.  That should increase CPU and real time.  

 

Do you have the COMPRESS option turned on?  Perhaps for the second step SAS noticed that you have too few variables for the COMPRESS option to be useful and so turned it off.

 

Also if you are pulling your source data from an external database then SAS will push the selection of the variables to transfer to SAS into the database query.  That will drop out of CPU calculations but might introduce a delay that will count in the real time.

daszlosek
Quartz | Level 8
Ah, I believe I was misinterpreting how the keep function works. I thought that the keep statement made it only read 'some of the data' from the variables I have selected. What you are saying is that the program actually reads the entire data then pulls out those of interest and copies them, which is actually more computationally expensive. correct? Here, I thought it would be less expensive due to not pulling out all other variables.



You bring up a good point about the COMPRESS option, and I would have to agree that is probably was turned off due to too few variables.


mkeintz
PROC Star

If you're reading from a SAS data set file, then all the data is read from disk to the input buffer (i.e. total volume of disk input is unchanged no matter what variables are in the keep list).  So the keep option doesn't prevent any data from being read from disk.   But, once data is in the input buffer, variable values are copied to the program data vector.  That's where the burden is presumably reduced by use keep.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
daszlosek
Quartz | Level 8
Very interesting! Thank you for the continued knowledge.

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 7 replies
  • 3298 views
  • 0 likes
  • 4 in conversation