Sorry for the weird title....here goes
I have a large dataset (call it A here) with a ton of clients, all of whom have a unique ID in the variable client_id.
I have another dataset (call it B here) that has one variable in it called client_id as well
I want to join A and B so that the result has all the client_ids from A that are NOT in B.
Is there an easy way to do this? I thought of using:
data egtask.test;
set IBS_COMB.DATA;
where client_id not in("N02240","N79761");
run;
But if the list is long it would take a while to enter, and if the list changes this is not efficient.
Thanks
-Bill
SQL makes this easier:
proc sql;
create table want as select client_id from a
where client_id not in (select client_id from b);
quit;
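If you need all of A's columns rather than just the ID (assuming A carries more variables than client_id), the same idea with select * would be:
proc sql;
create table want as select * from a
where client_id not in (select client_id from b);
quit;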
You're on the right track, except that instead of hard-coding the IDs you're better off reading them directly from the data set. The merge below makes use of the IN= dataset option, which creates a temporary variable (I named it b).
b=1 if dataset dataB contributes to a merged observation
b=0 if dataset dataB does not contribute to the merged observation
By selecting only the observations where b is not equal to 1, you get all IDs that are not in dataB.
data allAs;
merge dataA dataB(in=b);
by client_id;
if b^=1;
run;
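One caveat: MERGE with a BY statement requires both data sets to be sorted (or indexed) by client_id, so if they are not already in that order you would first need something like:
proc sort data=dataA; by client_id; run;
proc sort data=dataB; by client_id; run;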
Just for reference, a join is usually faster than a nested select clause.
data A; do CLIENT_ID= 1 to 1e8; output; end; run;
data B; do CLIENT_ID= 1 to 1e6; output; end; run;
proc sql; * real time 42.04 seconds;
create table WANT as
select CLIENT_ID
from A
where CLIENT_ID not in (select CLIENT_ID from B);
quit;
proc sql; * real time 31.78 seconds;
create table WANT as
select a.CLIENT_ID
from A
left join B
on a.CLIENT_ID =b.CLIENT_ID
where b.CLIENT_ID is missing;
quit;
data WANT; * real time 14.32 seconds;
merge A B(in=B);
by CLIENT_ID;
if ^B;
run;
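It is worth noting that the DO loops above produce A and B already in CLIENT_ID order, so no PROC SORT cost is included in the MERGE timing; with unsorted real data that sorting cost would need to be added to the third approach.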
If all the data resides within SAS, I normally like to use a hash lookup, as this avoids any sorting of the large table.
The code below is untested but should be OK.
data want;
  if _n_=1 then
    do;
      /* load the client_ids from B into an in-memory hash table */
      dcl hash h1(dataset:'B(keep=client_id)');
      h1.defineKey('client_id');
      h1.defineDone();
    end;
  set a;
  /* keep only the rows whose client_id is not found in the hash, i.e. not in B */
  if h1.check() ne 0;
run;
I'm not sure that any solution is going to save much time for this problem.
You have a LARGE dataset and you want to eliminate a small number of records. The result is still going to be a LARGE dataset.
Just the I/O alone to duplicate your large dataset will take time.
If the hash approach is viable then there should be significantly less I/O as no sorting is required.
The "extreme" approach would be to use MODIFY with a hash lookup and only delete the records logically. That would of course be much faster, especially if the number of records to be deleted is only a small percentage of the total records in the table.
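A rough, untested sketch of that idea (assuming the list of IDs in B fits comfortably in memory) could look like:
data A;
  if _n_=1 then do;
    /* load the client_ids to be removed into an in-memory hash table */
    dcl hash h1(dataset:'B(keep=client_id)');
    h1.defineKey('client_id');
    h1.defineDone();
  end;
  modify A;
  /* mark matching rows as deleted in place; no copy of A is written */
  if h1.check()=0 then remove;
run;
With most engines REMOVE only flags the observations as deleted, so the file does not shrink until the data set is rebuilt.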