topic Re: Extracting subset of observations without merging in SAS Data Management

Extracting subset of observations without merging

Timmay — Tue, 02 Feb 2016 00:00:30 GMT

Hi, so I have two SAS datasets, one of which is approximately 780GB and the other with approximately 5700 observations. In the smaller dataset, the 5700 observations have a unique identifier also present in the 780GB dataset. I was wondering if there was a way to extract all the observations in the large dataset that match the identifiers from the smaller dataset (there are multiple observations for each identifier in this dataset). I have tried merging the two using an inner join; however, it took SAS over 72 hours to complete this. Is there a way to reference the variable in the smaller dataset in a where statement or any other way to trim down the length of time it takes to perform this function? Thank you in advance.

Re: Extracting subset of observations without merging

SASKiwi — Tue, 02 Feb 2016 01:27:56 GMT

Sounds like the ideal candidate for a hash table solution:

data Want;
  attrib ID length = $8; * Define key variable.;
  drop rc;
* Load the small dataset as a hash table.;
  if _n_ = 1 then do;
    declare hash h(dataset: "WORK.Small");
    h.defineKey("ID");
    h.defineDone();
    call missing(ID);
  end;
* Read in large table;
	set Large;  
  rc = h.find(); * Search in data for match.; 
  if rc = 0; * Keep only rows where ID matches.;  
run;

Re: Extracting subset of observations without merging

LinusH — Tue, 02 Feb 2016 05:45:49 GMT

If this type of query will occur frequently it would be wise to index on the id column, and use SQL join. This will prevent a full table scan and sorting.

Re: Extracting subset of observations without merging

Kurt_Bremser — Tue, 02 Feb 2016 09:29:31 GMT

With 5700 obs in the smaller dataset it should be possible to create a format that yields "yes" for any of the 5700 identifiers, and "no" for all other values.

Then you only need to do a single sequential pass through the large dataset and look if put(identifier,format.) = "yes".

Re: Extracting subset of observations without merging

RW9 — Tue, 02 Feb 2016 10:12:09 GMT

Would this not also yeild one pass through the data&colon;

data _null_;
  set small_dataset end=last;
  if _n_=1 then call execute('data want;  set big_dataset; select(id);');
  call execute(' when ('||strip(id)||') output;');
  if last then call execute(' otherwise; end; run;');
run;

I.e. a big datastep is generated from the small one, that code is then run on the dataset once.

Re: Extracting subset of observations without merging

Kurt_Bremser — Tue, 02 Feb 2016 11:22:49 GMT

@RW9 wrote:

Would this not also yeild one pass through the data&colon;
data _null_;
  set small_dataset end=last;
  if _n_=1 then call execute('data want;  set big_dataset; select(id);');
  call execute(' when ('||strip(id)||') output;');
  if last then call execute(' otherwise; end; run;');
run;
I.e. a big datastep is generated from the small one, that code is then run on the dataset once.

Nice. Still another method to avoid the merge.

Re: Extracting subset of observations without merging

SASKiwi — Tue, 02 Feb 2016 18:54:23 GMT

Love the way there are so many methods for processing big data in SAS. It would be interesting to see some of these compared.

Re: Extracting subset of observations without merging

Kurt_Bremser — Wed, 03 Feb 2016 07:37:02 GMT

I guess that anything that allows you to only perform one sequential read through a big dataset will only be bottlenecked by the I/O throughput, therefore being equal in terms of performance. What then guides the decision will be personal programming preferences and which method is most easily unterstood by the next one to maintain the code.

Re: Extracting subset of observations without merging

RW9 — Wed, 03 Feb 2016 09:32:08 GMT

Just to note, in my opinion from:

"What then guides the decision will be personal programming preferences and which method is most easily unterstood by the next one to maintain the code."

The personal programming preference is very low on the list. In terms of importance:

1) Documentation - Functional Design Spec, Testing document, User Guides etc.

2) Good programming practice

3) Company standards

4) Personal preference

A programmers afterlife consists of meeting all those programmers who have had to use your code after you, be prepared 🐵

Re: Extracting subset of observations without merging

Michelle — Thu, 04 Feb 2016 22:55:28 GMT

@RW9 said:

A programmers afterlife consists of meeting all those programmers who have had to use your code after you, be prepared )

Ahahaha, I am laughing so hard I am crying. Or maybe I am just crying... The struggle is real.