Hi ,
I need to create one dataset (only female) from multiple datasets ( all contains same variables ) I received. They are large files and I will be using same code to subset. What is the most efficient way to do this?
I thought of two ways to do this
method1 : subset each dataset and concatenate 10 smaller datasets
data want1; (repeat 10 times)
set have1;
if female;
run
data want_final
set want1 want2 want3 want4 want5 want6 want7 want8 want9 want10;
run;
method2; subset all at once
data final;
set have1 have2 have3 have4 have5 have6 have7 have8 have9 have10;
if female;
run;
I am actually not sure if method2 is even correct way to subset data.
I am open to any suggestions. Thanks!!
method 2 works well;
data w;
set sashelp.class sashelp.class;
if sex='F';
run;
method 2 works well;
data w;
set sashelp.class sashelp.class;
if sex='F';
run;
Method 2 should work.
If the individual datasets were large it might be worth subsetting each one using a where dataset option similar to:
data final; set have1 (where=(female=1)) have2 (where=(female=1)) have3 (where=(female=1)) have4 (where=(female=1)) have5 (where=(female=1)) have6 (where=(female=1)) have7 (where=(female=1)) have8 (where=(female=1)) have9 (where=(female=1)) have10 (where=(female=1)) ; run;
I may be a little more specific if female is numeric and has values other than zero but the general approach is valid.
I get to use this sort of logic with data sets from different layouts where the combination of variables is somewhat different in some sets such as a code value was changed from "3" to "03". Moderately complex where options can be applied but function calls may be problematic.
could you do this...
data final;
set have1-have10(where=(female eq 1));
run;
Is there performance advantage to where data set option vs. where statement?
I don''t think so. In the test below, the difference in the run times for the two cases are within the margin of confounding variations dictated by competing processes other than SAS. On the other hand, in both cases there's a great deal of difference between whether the subsetting condition is true or false because if true, the record has to be moved from the buffer - and otherwise it has not.
Also, IF evaluates this simple condition about 3 times faster than WHERE does it in the buffer, which is why IF works much faster if this advantage is not offset by not having to move many records from the buffer (if WHERE evaluates the condition false). The advantage of IF grows with the complexity of the condition to be evaluated.
Conversely, the advantage of WHERE grows with the relative number of unmoved records plus the record length, since the longer the record, the more work it takes to move it. Not surprisingly, IF performs about the same regardless of whether the condition is true or false, as it works only after every record has been already moved, anyway.
1 data sex ; 2 retain sex "F" ; 3 do _n_ = 1 to 1E8 ; 4 output ; 5 end ; 6 run ; NOTE: The data set WORK.SEX has 100000000 observations and 1 variables. NOTE: DATA statement used (Total process time): real time 3.22 seconds
7 data _null_ ; 8 set sex (where = (sex eq "F")) ; 9 run ; NOTE: There were 100000000 observations read from the data set WORK.SEX. WHERE sex='F'; NOTE: DATA statement used (Total process time): real time 9.38 seconds
10 data _null_ ; 11 set sex ; where sex eq "F" ; 12 run ; NOTE: There were 100000000 observations read from the data set WORK.SEX. WHERE sex='F'; NOTE: DATA statement used (Total process time): real time 9.48 seconds
13 data _null_ ; 14 set sex (where = (sex ne "F")) ; 15 run ; NOTE: There were 0 observations read from the data set WORK.SEX. WHERE sex not = 'F'; NOTE: DATA statement used (Total process time): real time 1.85 seconds cpu time 1.85 seconds
16 data _null_ ; 17 set sex ; where sex ne "F" ; 18 run ; NOTE: There were 0 observations read from the data set WORK.SEX. WHERE sex not = 'F'; NOTE: DATA statement used (Total process time): real time 1.86 seconds
19 data _null_ ; 20 set sex ; if sex eq "F" ; 21 run ; NOTE: There were 100000000 observations read from the data set WORK.SEX. NOTE: DATA statement used (Total process time): real time 3.52 seconds cpu time 3.52 seconds 22 data _null_ ; 23 set sex ; if sex ne "F" ; 24 run ; NOTE: There were 100000000 observations read from the data set WORK.SEX. NOTE: DATA statement used (Total process time): real time 3.47 seconds
I've also repeated the test for 100 million records, and the results are, relatively speaking, the same. I recall many spears broken over WHERE over IF on SAS-L of yore and elsewhere; yet in the end, it's merely the balance between the time needed to evaluate a subsetting condition (IF is faster than WHERE the more, the more complex the condition is) and the amount of work needed to move stuff from the buffer (depends on how often WHERE evaluates the condition false and the record length).
Paul D.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.