🔒 This topic is solved and locked.
BCNAV
Quartz | Level 8

Sorry for the weird title....here goes

 

I have a large dataset (call it A here) with a ton of clients, all of whom have a unique ID in the variable client_id.

 

I have another dataset (call it B here) that also has a single variable called client_id.

 

I want to join A and B so that the result contains all client_ids from A that do NOT appear in B.

 

Is there an easy way to do this? I thought of using:

 

data egtask.test;
  set IBS_COMB.DATA;
  where client_id not in ("N02240","N79761");
run;

 

But if the list is long it would take a while to enter, and if the list changes this approach is not efficient.

 

Thanks

-Bill

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Astounding
PROC Star

SQL makes this easier:

 

proc sql;
  create table want as
  select client_id
  from a
  where client_id not in (select client_id from b);
quit;


6 REPLIES

antonbcristina
SAS Employee

You're on the right track, except that instead of hard-coding the IDs you're better off reading them directly from the data set. The merge below uses the IN= data set option, which creates a temporary variable (I named it b); note that both data sets must be sorted by client_id.

b=1 if data set dataB contributes to the merged observation

b=0 if data set dataB does not contribute to the merged observation

By selecting only the observations where b is not equal to 1, you get all IDs that are not in dataB.

 

data allAs;
  merge dataA dataB(in=b);
  by client_id;
  if b ^= 1;
run;

ChrisNZ
Tourmaline | Level 20

Just for reference, a join is usually faster than a nested select clause.

 

                                              
data A; do CLIENT_ID= 1 to 1e8; output; end; run;

data B; do CLIENT_ID= 1 to 1e6; output; end; run;
      
proc sql;                    * real time   42.04 seconds;
  create table WANT as 
  select CLIENT_ID 
  from A
  where CLIENT_ID not in (select CLIENT_ID from B);
quit;

proc sql;                    * real time   31.78 seconds;
  create table WANT as 
  select a.CLIENT_ID 
  from A
  left join B
  on a.CLIENT_ID =b.CLIENT_ID
  where b.CLIENT_ID is missing;
quit;

data WANT;                   * real time   14.32 seconds;
  merge A B(in=B);
  by CLIENT_ID;
  if ^B;
run;

 

 

 

Patrick
Opal | Level 21

@BCNAV

If all data resides within SAS then I normally like to use a hash lookup as this avoids the need for any sorting of the large table.

 

The code below is untested but should be OK.

data want;
  if _n_=1 then
    do;
      dcl hash h1(dataset:'B(keep=client_id)');
      h1.defineKey('client_id');
      h1.defineDone();
    end;
  set a;
  if h1.check() ne 0;
run;
Tom
Super User

Not sure that any solution is going to save much time for this problem.

You have a LARGE dataset and you want to eliminate a small number of records. The result is still going to be a LARGE dataset.

Just the I/O alone to duplicate your large dataset will take time.

Patrick
Opal | Level 21

@Tom

If the hash approach is viable then there should be significantly less I/O as no sorting is required. 

The "extreme" approach would be to use MODIFY with a hash lookup and only delete the records logically. That would of course be much faster, especially if the number of records to be deleted is only a rather small percentage of the total records in the table.
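A rough, untested sketch of that MODIFY-plus-hash idea (assuming both A and B contain client_id): the hash flags which rows to drop, and REMOVE marks them deleted in place, so the large table A is never rewritten.

data a;
  if _n_=1 then
    do;
      /* load the small list of IDs to delete into memory */
      dcl hash h1(dataset:'b(keep=client_id)');
      h1.defineKey('client_id');
      h1.defineDone();
    end;
  modify a;
  /* check()=0 means this client_id exists in B: delete the row in place */
  if h1.check()=0 then remove;
run;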


Discussion stats
  • 6 replies
  • 3969 views
  • 3 likes
  • 6 in conversation