BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
avepo
Fluorite | Level 6

Hi ,

I need to create one dataset (only female)  from multiple datasets ( all contains same variables ) I  received. They are large files and I will be using same code to subset. What is the most efficient way to do this?

 

I thought of two ways to do this 

 

method1 : subset each dataset and concatenate 10 smaller datasets

data want1; (repeat 10 times)

set have1;

if female;

run 

 

data want_final

set want1 want2 want3 want4 want5 want6 want7 want8 want9 want10;

run;

 

method2; subset all at once

 

data final;

set have1 have2 have3 have4 have5 have6 have7 have8 have9 have10;

if female;

run;

 

I am actually not sure if method2 is even correct way to subset data.

 

I am  open to any suggestions. Thanks!!

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
novinosrin
Tourmaline | Level 20

method 2 works well;

data w;
set sashelp.class sashelp.class;
if sex='F';
run;

View solution in original post

4 REPLIES 4
novinosrin
Tourmaline | Level 20

method 2 works well;

data w;
set sashelp.class sashelp.class;
if sex='F';
run;
ballardw
Super User

Method 2 should work.

If the individual datasets were large it might be worth subsetting each one using a where dataset option similar to:

data final;
   set 
      have1  (where=(female=1))
      have2  (where=(female=1))
      have3  (where=(female=1))
      have4  (where=(female=1))
      have5  (where=(female=1))
      have6  (where=(female=1))
      have7  (where=(female=1))
      have8  (where=(female=1))
      have9  (where=(female=1))
      have10 (where=(female=1))
   ;
run;

I may be a little more specific if female is numeric and has values other than zero but the general approach is valid.

 

I get to use this sort of logic with data sets from different layouts where the combination of variables is somewhat different in some sets such as a code value was changed from "3" to "03". Moderately complex where options can be applied but function calls may be problematic.

data_null__
Jade | Level 19

could you do this...

 

data final;
   set have1-have10(where=(female eq 1));
   run;

Is there performance advantage to where data set option vs. where statement?

 

hashman
Ammonite | Level 13

@data_null__:

 

I don''t think so. In the test below, the difference in the run times for the two cases are within the margin of confounding variations dictated by competing processes other than SAS. On the other hand, in both cases there's a great deal of difference between whether the subsetting condition is true or false because if true, the record has to be moved from the buffer - and otherwise it has not.

 

Also, IF evaluates this simple condition about 3 times faster than WHERE does it in the buffer, which is why IF works much faster if this advantage is not offset by not having to move many records from the buffer (if WHERE evaluates the condition false). The advantage of IF grows with the complexity of the condition to be evaluated.

 

Conversely, the advantage of WHERE grows with the relative number of unmoved records plus the record length, since the longer the record, the more work it takes to move it. Not surprisingly, IF performs about the same regardless of whether the condition is true or false, as it works only after every record has been already moved, anyway.  

 

1    data sex ;
2      retain sex "F" ;
3      do _n_ = 1 to 1E8 ;
4        output ;
5      end ;
6    run ;
NOTE: The data set WORK.SEX has 100000000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           3.22 seconds

7 data _null_ ; 8 set sex (where = (sex eq "F")) ; 9 run ; NOTE: There were 100000000 observations read from the data set WORK.SEX. WHERE sex='F'; NOTE: DATA statement used (Total process time): real time 9.38 seconds
10 data _null_ ; 11 set sex ; where sex eq "F" ; 12 run ; NOTE: There were 100000000 observations read from the data set WORK.SEX. WHERE sex='F'; NOTE: DATA statement used (Total process time): real time 9.48 seconds
13 data _null_ ; 14 set sex (where = (sex ne "F")) ; 15 run ; NOTE: There were 0 observations read from the data set WORK.SEX. WHERE sex not = 'F'; NOTE: DATA statement used (Total process time): real time 1.85 seconds cpu time 1.85 seconds
16 data _null_ ; 17 set sex ; where sex ne "F" ; 18 run ; NOTE: There were 0 observations read from the data set WORK.SEX. WHERE sex not = 'F'; NOTE: DATA statement used (Total process time): real time 1.86 seconds
19 data _null_ ; 20 set sex ; if sex eq "F" ; 21 run ; NOTE: There were 100000000 observations read from the data set WORK.SEX. NOTE: DATA statement used (Total process time): real time 3.52 seconds cpu time 3.52 seconds 22 data _null_ ; 23 set sex ; if sex ne "F" ; 24 run ; NOTE: There were 100000000 observations read from the data set WORK.SEX. NOTE: DATA statement used (Total process time): real time 3.47 seconds

I've also repeated the test for 100 million records, and the results are, relatively speaking, the same. I recall many spears broken over WHERE over IF on SAS-L of yore and elsewhere; yet in the end, it's merely the balance between the time needed to evaluate a subsetting condition (IF is faster than WHERE the more, the more complex the condition is) and the amount of work needed to move stuff from the buffer (depends on how often WHERE evaluates the condition false and the record length).

 

Paul D.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 4 replies
  • 1000 views
  • 3 likes
  • 5 in conversation