DATA Step, Macro, Functions and more

Subsetting from multiple dataset

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 11
Accepted Solution

Subsetting from multiple dataset

Hi ,

I need to create one dataset (only female)  from multiple datasets ( all contains same variables ) I  received. They are large files and I will be using same code to subset. What is the most efficient way to do this?

 

I thought of two ways to do this 

 

method1 : subset each dataset and concatenate 10 smaller datasets

data want1; (repeat 10 times)

set have1;

if female;

run 

 

data want_final

set want1 want2 want3 want4 want5 want6 want7 want8 want9 want10;

run;

 

method2; subset all at once

 

data final;

set have1 have2 have3 have4 have5 have6 have7 have8 have9 have10;

if female;

run;

 

I am actually not sure if method2 is even correct way to subset data.

 

I am  open to any suggestions. Thanks!!

 

 


Accepted Solutions
Solution
2 weeks ago
Super User
Posts: 2,068

Re: Subsetting from multiple dataset

method 2 works well;

data w;
set sashelp.class sashelp.class;
if sex='F';
run;

View solution in original post


All Replies
Solution
2 weeks ago
Super User
Posts: 2,068

Re: Subsetting from multiple dataset

method 2 works well;

data w;
set sashelp.class sashelp.class;
if sex='F';
run;
Super User
Posts: 13,942

Re: Subsetting from multiple dataset

Method 2 should work.

If the individual datasets were large it might be worth subsetting each one using a where dataset option similar to:

data final;
   set 
      have1  (where=(female=1))
      have2  (where=(female=1))
      have3  (where=(female=1))
      have4  (where=(female=1))
      have5  (where=(female=1))
      have6  (where=(female=1))
      have7  (where=(female=1))
      have8  (where=(female=1))
      have9  (where=(female=1))
      have10 (where=(female=1))
   ;
run;

I may be a little more specific if female is numeric and has values other than zero but the general approach is valid.

 

I get to use this sort of logic with data sets from different layouts where the combination of variables is somewhat different in some sets such as a code value was changed from "3" to "03". Moderately complex where options can be applied but function calls may be problematic.

Respected Advisor
Posts: 3,867

Re: Subsetting from multiple dataset

could you do this...

 

data final;
   set have1-have10(where=(female eq 1));
   run;

Is there performance advantage to where data set option vs. where statement?

 

Frequent Contributor
Posts: 112

Re: Subsetting from multiple dataset

Posted in reply to data_null__

@data_null__:

 

I don''t think so. In the test below, the difference in the run times for the two cases are within the margin of confounding variations dictated by competing processes other than SAS. On the other hand, in both cases there's a great deal of difference between whether the subsetting condition is true or false because if true, the record has to be moved from the buffer - and otherwise it has not.

 

Also, IF evaluates this simple condition about 3 times faster than WHERE does it in the buffer, which is why IF works much faster if this advantage is not offset by not having to move many records from the buffer (if WHERE evaluates the condition false). The advantage of IF grows with the complexity of the condition to be evaluated.

 

Conversely, the advantage of WHERE grows with the relative number of unmoved records plus the record length, since the longer the record, the more work it takes to move it. Not surprisingly, IF performs about the same regardless of whether the condition is true or false, as it works only after every record has been already moved, anyway.  

 

1    data sex ;
2      retain sex "F" ;
3      do _n_ = 1 to 1E8 ;
4        output ;
5      end ;
6    run ;
NOTE: The data set WORK.SEX has 100000000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           3.22 seconds

7 data _null_ ; 8 set sex (where = (sex eq "F")) ; 9 run ; NOTE: There were 100000000 observations read from the data set WORK.SEX. WHERE sex='F'; NOTE: DATA statement used (Total process time): real time 9.38 seconds
10 data _null_ ; 11 set sex ; where sex eq "F" ; 12 run ; NOTE: There were 100000000 observations read from the data set WORK.SEX. WHERE sex='F'; NOTE: DATA statement used (Total process time): real time 9.48 seconds
13 data _null_ ; 14 set sex (where = (sex ne "F")) ; 15 run ; NOTE: There were 0 observations read from the data set WORK.SEX. WHERE sex not = 'F'; NOTE: DATA statement used (Total process time): real time 1.85 seconds cpu time 1.85 seconds
16 data _null_ ; 17 set sex ; where sex ne "F" ; 18 run ; NOTE: There were 0 observations read from the data set WORK.SEX. WHERE sex not = 'F'; NOTE: DATA statement used (Total process time): real time 1.86 seconds
19 data _null_ ; 20 set sex ; if sex eq "F" ; 21 run ; NOTE: There were 100000000 observations read from the data set WORK.SEX. NOTE: DATA statement used (Total process time): real time 3.52 seconds cpu time 3.52 seconds 22 data _null_ ; 23 set sex ; if sex ne "F" ; 24 run ; NOTE: There were 100000000 observations read from the data set WORK.SEX. NOTE: DATA statement used (Total process time): real time 3.47 seconds

I've also repeated the test for 100 million records, and the results are, relatively speaking, the same. I recall many spears broken over WHERE over IF on SAS-L of yore and elsewhere; yet in the end, it's merely the balance between the time needed to evaluate a subsetting condition (IF is faster than WHERE the more, the more complex the condition is) and the amount of work needed to move stuff from the buffer (depends on how often WHERE evaluates the condition false and the record length).

 

Paul D.

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 4 replies
  • 89 views
  • 3 likes
  • 5 in conversation