Sorry for the weird title....here goes
I have a large dataset (call it A here) with a ton of clients, all of whom have a unique ID in the variable client_id.
I have another dataset (call it B here) that has one variable in it called client_id as well
I want to join A and B so that the result has all the client_ids from A that are NOT in B.
Is there an easy way to do this? I thought of using:
data egtask.test;
set IBS_COMB.DATA;
where client_id not in("N02240","N79761");
run;
But if the list is long it would take a while to enter, and if the list changes this is not efficient.
Thanks
-Bill
SQL makes this easier:
proc sql;
create table want as select client_id from a
where client_id not in (select client_id from b);
quit;
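If you need all of A's columns rather than just the ID (assuming A carries more variables than client_id), the same idea with select * would be:
proc sql;
create table want as select * from a
where client_id not in (select client_id from b);
quit;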
You're on the right track, except that instead of hard-coding the IDs you're better off reading them directly from the data set. The merge below makes use of the IN= dataset option, which creates a temporary variable (I named it b).
b=1 if dataset dataB contributes to a merged observation
b=0 if dataset dataB does not contribute to the merged observation
By selecting only the observations where b is not equal to 1, you get all IDs that are not in dataB.
data allAs;
merge dataA dataB(in=b);
by client_id;
if b^=1;
run;
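One caveat: MERGE with a BY statement requires both data sets to be sorted (or indexed) by client_id, so if they are not already in that order you would first need something like:
proc sort data=dataA; by client_id; run;
proc sort data=dataB; by client_id; run;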
Just for reference, a join is usually faster than a nested select clause.
data A; do CLIENT_ID= 1 to 1e8; output; end; run;
data B; do CLIENT_ID= 1 to 1e6; output; end; run;
proc sql; * real time 42.04 seconds;
create table WANT as
select CLIENT_ID
from A
where CLIENT_ID not in (select CLIENT_ID from B);
quit;
proc sql; * real time 31.78 seconds;
create table WANT as
select a.CLIENT_ID
from A
left join B
on a.CLIENT_ID =b.CLIENT_ID
where b.CLIENT_ID is missing;
quit;
data WANT; * real time 14.32 seconds;
merge A B(in=B);
by CLIENT_ID;
if ^B;
run;
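It is worth noting that the DO loops above produce A and B already in CLIENT_ID order, so no PROC SORT cost is included in the MERGE timing; with unsorted real data that sorting cost would need to be added to the third approach.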
If all the data resides within SAS, I normally like to use a hash lookup, as this avoids any sorting of the large table.
The code below is untested but should be OK.
data want;
  if _n_=1 then
    do;
      /* load the client_ids from B into an in-memory hash table */
      dcl hash h1(dataset:'B(keep=client_id)');
      h1.defineKey('client_id');
      h1.defineDone();
    end;
  set a;
  /* keep only the rows whose client_id is not found in the hash, i.e. not in B */
  if h1.check() ne 0;
run;
I'm not sure that any solution is going to save much time for this problem.
You have a LARGE dataset and you want to eliminate a small number of records. The result is still going to be a LARGE dataset.
Just the I/O alone to duplicate your large dataset will take time.
If the hash approach is viable then there should be significantly less I/O as no sorting is required.
The "extreme" approach would be to use MODIFY with a hash lookup and only delete the records logically. That would of course be much faster, especially if the number of records to be deleted is only a small percentage of the total records in the table.
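A rough, untested sketch of that idea (assuming the list of IDs in B fits comfortably in memory) could look like:
data A;
  if _n_=1 then do;
    /* load the client_ids to be removed into an in-memory hash table */
    dcl hash h1(dataset:'B(keep=client_id)');
    h1.defineKey('client_id');
    h1.defineDone();
  end;
  modify A;
  /* mark matching rows as deleted in place; no copy of A is written */
  if h1.check()=0 then remove;
run;
With most engines REMOVE only flags the observations as deleted, so the file does not shrink until the data set is rebuilt.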