topic Re: Selecting "NOT" Data ("Not Join") From a Dataset in SAS Programming

Selecting "NOT" Data ("Not Join") From a Dataset

BCNAV — Thu, 19 Apr 2018 17:17:26 GMT

Sorry for the weird title....here goes

I have a large dataset (call it A here) with a ton of clients, all who have a unique ID in variable client_id

I have another dataset (call it B here) that has one variable in it called client_id as well

I want to be able to join A and B so that the result has all client_ids that DO NOT include those in B

Is there an easy way to do this? I though of using :

data egtask.test;

set IBS_COMB.DATA;

where client_id not in("N02240","N79761");

run;

But if the list is long it would take a while to enter, and if the list changes this is not efficient.

Thanks

-Bill

Re: Selecting "NOT" Data ("Not Join") From a Dataset

Astounding — Thu, 19 Apr 2018 17:56:35 GMT

SQL makes this easier:

proc sql;

create table want as select client_id from a

where client_id not in (select client_id from b);

quit;

Re: Selecting "NOT" Data ("Not Join") From a Dataset

antonbcristina — Thu, 19 Apr 2018 18:04:17 GMT

You're on the right track, except that instead of hard coding the IDs you're better off using them directly from the data set. The merge below makes use of the IN= dataset option which creates a temporary variable (I named it b).

b=1 if dataset dataB contributes to a merged observation

b=0 if dataset dataB does not contribute to the merged observation

By selecting only the observations where b is not equal to 1, you get all IDs that are not in dataB.

data allAs;

merge dataA dataB(in=b);

by client_id;

if b^=1;

run;

Re: Selecting "NOT" Data ("Not Join") From a Dataset

ChrisNZ — Thu, 19 Apr 2018 23:45:57 GMT

Just for reference, a join is usually faster than a nested select clause.

                                              
data A; do CLIENT_ID= 1 to 1e8; output; end; run;

data B; do CLIENT_ID= 1 to 1e6; output; end; run;
      
proc sql;                    * real time   42.04 seconds;
  create table WANT as 
  select CLIENT_ID 
  from A
  where CLIENT_ID not in (select CLIENT_ID from B);
quit;

proc sql;                    * real time   31.78 seconds;
  create table WANT as 
  select a.CLIENT_ID 
  from A
  left join B
  on a.CLIENT_ID =b.CLIENT_ID
  where b.CLIENT_ID is missing;
quit;

data WANT;                   * real time   14.32 seconds;
  merge A B(in=B);
  by CLIENT_ID;
  if ^B;
run;

Re: Selecting "NOT" Data ("Not Join") From a Dataset

Patrick — Sat, 21 Apr 2018 03:26:12 GMT

@BCNAV

If all data resides within SAS then I normally like to use a hash lookup as this avoids the need for any sorting of the large table.

Below code is untested but should be o.k.

data want;
  if _n_=1 then
    do;
      dcl hash h1(dataset:'B(keep=client_id)')
      h1.defineKey('client_id');
      h1.defineDone();
    end;
  set a;
  if h1.check() ne 0;
run;

Re: Selecting "NOT" Data ("Not Join") From a Dataset

Tom — Sat, 21 Apr 2018 12:06:22 GMT

Not sure that any solution is going to same much time for this problem.

You have a LARGE datasets and you want to eliminate a small number of records. The result is still going to be a LARGE dataset.

Just the I/O alone to duplicate your large dataset will take time.

Re: Selecting "NOT" Data ("Not Join") From a Dataset

Patrick — Sat, 21 Apr 2018 14:12:40 GMT

@Tom

If the hash approach is viable then there should be significantly less I/O as no sorting is required.

The "extreme" approach would be to use a modify with a hash lookup and only delete the records logically. That would of course be much faster especially if the number of records to be deleted is only a rather small percentage of the total records in the table.