topic Re: Efficient ways to extract data in SAS Procedures

Efficient ways to extract data

Kiteulf — Wed, 08 Sep 2021 09:06:14 GMT

I have a dataset A that contains about 25 000 records

I have a dataset B that contains about 14 million records

I want some data in dataset B, where A.ID = B.ID, however it takes ages to pull the data.

So far I am using/tried

1. Proc sql : where B.ID in(select ID from A)

(where A only contains distinct numbers of ID)

2.A macro where the smallest A.ID is written and I pull the B.IDs that are greater than that one into a temp file and using that for extraction.

I wish there were a way to write all the A.IDs to a macro and use that, as it looks like my SAS server is having trouble with two tables at the same time

Re: Efficient ways to extract data

Kurt_Bremser — Wed, 08 Sep 2021 09:25:39 GMT

This is THE task for which you use a hash object:

data want;
set b; /* big dataset */
if _n_ = 1
then do;
  declare hash a (dataset:"a"); /* small dataset */
  a.definekey("id");
  a.definedone();
end;
if a.check() = 0; /* found an entry */
run;

No sorting is done (except in-memory for the small table), and the hash object discards duplicates in A by default.

Re: Efficient ways to extract data

Kiteulf — Wed, 08 Sep 2021 09:47:13 GMT

Could I use : to choose the column b and c from the big dataset?

data want;
set b; /* big dataset */
if _n_ = 1
then do;
  declare hash a (dataset:"a"); /* small dataset */
  a.definekey("id");
  a.definedata ("id", "b", "c") ; 
  a.definedone();
end;
if a.check() = 0; /* found an entry */
run;

Re: Efficient ways to extract data

Kurt_Bremser — Wed, 08 Sep 2021 10:01:48 GMT

My code will include ALL variables from the big dataset B and filter on ID's found in dataset A.

the DEFINEDATA() method is used to retrieve values from the hash table to the PDV, in which case you also must use FIND() instead of CHECK().

The code you posted would only be valid if dataset A contained the variables b and c, and you wanted those variables added to those retrieved from dataset B; you must also define those variables in the PDV (usually done with a LENGTH statement); additionally, replace CHECK() with FIND().

So please be more specific: which variables are contained in which dataset?

Re: Efficient ways to extract data

Kiteulf — Wed, 08 Sep 2021 10:05:01 GMT

A contains ID

B contains

ID, DATE, AMOUNT

pluss a lot more that I don't care about.

Re: Efficient ways to extract data

Kurt_Bremser — Wed, 08 Sep 2021 10:08:35 GMT

So if you only want those three variables and filter on A, my code would need an additional KEEP= dataset option:

data want;
set b (keep=id date amount); /* big dataset */
if _n_ = 1
then do;
  declare hash a (dataset:"a"); /* small dataset */
  a.definekey("id");
  a.definedone();
end;
if a.check() = 0; /* found an entry */
run;