Selecting "NOT" Data ("Not Join") From a Dataset

Accepted Solution Solved
Reply
Frequent Contributor
Posts: 92
Accepted Solution

Selecting "NOT" Data ("Not Join") From a Dataset

[ Edited ]

Sorry for the weird title....here goes

 

I have a large dataset (call it A here) with a ton of clients, all who have a unique ID in variable client_id

 

I have another dataset (call it B here) that has one variable in it called client_id as well

 

I want to be able to join A and B so that the result has all client_ids that DO NOT include those in B

 

Is there an easy way to do this? I though of using :

 

data egtask.test;

set IBS_COMB.DATA;

where client_id not in("N02240","N79761");

run;

 

But if the list is long it would take a while to enter, and if the list changes this is not efficient.

 

Thanks

-Bill

 

 


Accepted Solutions
Solution
‎05-09-2018 10:01 AM
Super User
Posts: 6,781

Re: Selecting "NOT" Data ("Not Join") From a Dataset

SQL makes this easier:

 

proc sql;

create table want as select client_id from a

where client_id not in (select client_id from b);

quit;

View solution in original post


All Replies
Solution
‎05-09-2018 10:01 AM
Super User
Posts: 6,781

Re: Selecting "NOT" Data ("Not Join") From a Dataset

SQL makes this easier:

 

proc sql;

create table want as select client_id from a

where client_id not in (select client_id from b);

quit;

SAS Employee
Posts: 24

Re: Selecting "NOT" Data ("Not Join") From a Dataset

You're on the right track, except that instead of hard coding the IDs you're better off using them directly from the data set. The merge below makes use of the IN= dataset option which creates a temporary variable (I named it b). 

b=1 if dataset dataB contributes to a merged observation

b=0 if dataset dataB does not contribute to the merged observation

 

By selecting only the observations where b is not equal to 1, you get all IDs that are not in dataB.

 

data allAs;

   merge dataA dataB(in=b);

   by client_id;

   if b^=1;

 

run;

PROC Star
Posts: 2,369

Re: Selecting "NOT" Data ("Not Join") From a Dataset

[ Edited ]

Just for reference, a join is usually faster than a nested select clause.

 

                                              
data A; do CLIENT_ID= 1 to 1e8; output; end; run;

data B; do CLIENT_ID= 1 to 1e6; output; end; run;
      
proc sql;                    * real time   42.04 seconds;
  create table WANT as 
  select CLIENT_ID 
  from A
  where CLIENT_ID not in (select CLIENT_ID from B);
quit;

proc sql;                    * real time   31.78 seconds;
  create table WANT as 
  select a.CLIENT_ID 
  from A
  left join B
  on a.CLIENT_ID =b.CLIENT_ID
  where b.CLIENT_ID is missing;
quit;

data WANT;                   * real time   14.32 seconds;
  merge A B(in=B);
  by CLIENT_ID;
  if ^B;
run;

 

 

 

Respected Advisor
Posts: 4,736

Re: Selecting "NOT" Data ("Not Join") From a Dataset

@BCNAV

If all data resides within SAS then I normally like to use a hash lookup as this avoids the need for any sorting of the large table.

 

Below code is untested but should be o.k.

data want;
  if _n_=1 then
    do;
      dcl hash h1(dataset:'B(keep=client_id)')
      h1.defineKey('client_id');
      h1.defineDone();
    end;
  set a;
  if h1.check() ne 0;
run;
Super User
Super User
Posts: 8,115

Re: Selecting "NOT" Data ("Not Join") From a Dataset

Not sure that any solution is going to same much time for this problem.

You have a LARGE datasets and you want to eliminate a small number of records. The result is still going to be a LARGE dataset.

Just the I/O alone to duplicate your large dataset will take time.

Respected Advisor
Posts: 4,736

Re: Selecting "NOT" Data ("Not Join") From a Dataset

@Tom

If the hash approach is viable then there should be significantly less I/O as no sorting is required. 

The "extreme" approach would be to use a modify with a hash lookup and only delete the records logically. That would of course be much faster especially if the number of records to be deleted is only a rather small percentage of the total records in the table.

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 6 replies
  • 187 views
  • 3 likes
  • 6 in conversation