BookmarkSubscribeRSS Feed
Suryak
Calcite | Level 5
options stimer;

data du;
do until(age=15);
set sashelp.class;
end;
run;


data ifst;
set sashelp.class;
if age=15 then output;
run;
5 REPLIES 5
mkeintz
PROC Star

This is likely faster:

 

data wh15;
  set sashelp.class;
  where age=15;
run;

This is because the WHERE statement exports the filtering to the data engine.  Unlike your two alternatives, the data step never sees the unwanted observations, so never needs to populate the program data vector with them.

 

Of course you will need a lot larger dataset than sashelp.class to find any difference.

 

Also if you are doing a test of input speed, don't contaminate the test with (implied) output operations.  Use data _null_, as in:

 

data _null_;
do until(age=15);
set sashelp.class;
end;
run;


data _null_;
set sashelp.class;
if age=15 then output;
run;
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
PGStats
Opal | Level 21

INCORRECT STATETENTS

 

Also note that both steps are not equivalent:

 

If age=15 doesn't exist in sashelp.class,
dataset du will contain the last observation of sashelp.class
but dataset ifst will be empty.

 

If age=15 occurs more than once in sashelp.class,
dataset du will contain the first such observation
but dataset ifst will contain all of them.

 

EDIT: See @Quentin's comment below. My statements above were INCORRECT. I should have tested. Sorry.

PG
Quentin
Super User

@PGStats , they looks equivalent to me.  I don't think either of your statements about what happens if age=15 doesn't exist, or if there are multiple records with age=15, are correct.

 

1
2    data du;
3    do until(age=99);
4    set sashelp.class;
5    end;
6    run;

NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: The data set WORK.DU has 0 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time           0.04 seconds
      cpu time            0.00 seconds


7
8
9    data ifst;
10   set sashelp.class;
11   if age=99 then output;
12   run;

NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: The data set WORK.IFST has 0 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


13
14
15   data du;
16   do until(age=15);
17   set sashelp.class;
18   end;
19   run;

NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: The data set WORK.DU has 4 observations and 5 variables.
NOTE: Compressing data set WORK.DU increased size by 100.00 percent.
      Compressed is 2 pages; un-compressed would require 1 pages.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


20
21
22   data ifst;
23   set sashelp.class;
24   if age=15 then output;
25   run;

NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: The data set WORK.IFST has 4 observations and 5 variables.
NOTE: Compressing data set WORK.IFST increased size by 100.00 percent.
      Compressed is 2 pages; un-compressed would require 1 pages.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

 

BASUG is hosting free webinars Next up: Don Henderson presenting on using hash functions (not hash tables!) to segment data on June 12. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.
Tom
Super User Tom
Super User

There might be a little extra overhead to setup the DO loop, but really both are reading the same observations and performing the same comparison of AGE to the constant 15. 

It is probably going to be even better to use a WHERE statement, especially if your source dataset is some remote database since then SAS will not even transfer the other observations from the database to SAS.

 

Which one to use is based on WHY you are doing it and what the rest of the data step is actually doing.

You probably want to use the version that makes understanding, creating, validating, and maintaining the code the easiest.  

 

SASKiwi
PROC Star

In my experience, if you are doing simple, sequential DATA step reads, experimenting with DO loops is a waste of time - stick with just a SET statement as adding DO's will have minimal impact on performance. If you want to optimize your code focus on other areas, especially any really slow processes.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 416 views
  • 2 likes
  • 6 in conversation