Sorry for the weird title....here goes
I have a large dataset (call it A here) with a ton of clients, all of whom have a unique ID in the variable client_id
I have another dataset (call it B here) that has one variable in it, also called client_id
I want to be able to join A and B so that the result contains only the client_ids that are NOT in B
Is there an easy way to do this? I thought of using:
data egtask.test;
set IBS_COMB.DATA;
where client_id not in("N02240","N79761");
run;
But if the list is long it would take a while to enter, and if the list changes this is not efficient.
Thanks
-Bill
SQL makes this easier:
proc sql;
create table want as select client_id from a
where client_id not in (select client_id from b);
quit;
You're on the right track, but instead of hard-coding the IDs you're better off reading them directly from the data set. The merge below makes use of the IN= dataset option, which creates a temporary variable (I named it b):
b=1 if dataset dataB contributes to the merged observation
b=0 if dataset dataB does not contribute to the merged observation
By keeping only the observations where b is not equal to 1, you get all IDs that are not in dataB. Note that both datasets must be sorted by client_id before the merge.
data allAs;
merge dataA dataB(in=b);
by client_id;
if b^=1;
run;
Just for reference, a join is usually faster than a nested select clause.
data A; do CLIENT_ID= 1 to 1e8; output; end; run;
data B; do CLIENT_ID= 1 to 1e6; output; end; run;
proc sql; * real time 42.04 seconds;
create table WANT as
select CLIENT_ID
from A
where CLIENT_ID not in (select CLIENT_ID from B);
quit;
proc sql; * real time 31.78 seconds;
create table WANT as
select a.CLIENT_ID
from A
left join B
on a.CLIENT_ID =b.CLIENT_ID
where b.CLIENT_ID is missing;
quit;
data WANT; * real time 14.32 seconds;
merge A B(in=B);
by CLIENT_ID;
if ^B;
run;
If all data resides within SAS then I normally like to use a hash lookup as this avoids the need for any sorting of the large table.
The code below is untested but should be OK.
data want;
if _n_=1 then
do;
dcl hash h1(dataset:'B(keep=client_id)');
h1.defineKey('client_id');
h1.defineDone();
end;
set a;
if h1.check() ne 0;
run;
I'm not sure that any solution is going to save much time for this problem.
You have a LARGE dataset and you want to eliminate a small number of records. The result is still going to be a LARGE dataset.
The I/O alone to duplicate your large dataset will take time.
If the hash approach is viable then there should be significantly less I/O as no sorting is required.
The "extreme" approach would be to use a modify with a hash lookup and only delete the records logically. That would of course be much faster especially if the number of records to be deleted is only a rather small percentage of the total records in the table.