04-19-2018 01:16 PM - edited 04-19-2018 01:17 PM
Sorry for the weird title....here goes
I have a large dataset (call it A here) with a ton of clients, each of whom has a unique ID in the variable client_id.
I have another dataset (call it B here) that has one variable in it called client_id as well
I want to join A and B so that the result contains all client_ids from A that DO NOT appear in B.
Is there an easy way to do this? I thought of using:
where client_id not in("N02240","N79761");
But if the list is long it would take a while to enter, and if the list changes this approach is not efficient.
04-19-2018 02:04 PM
You're on the right track, except that instead of hard-coding the IDs you're better off reading them directly from the data set. The merge below makes use of the IN= dataset option, which creates a temporary variable (I named it b).
b=1 if dataset dataB contributes to a merged observation
b=0 if dataset dataB does not contribute to the merged observation
By selecting only the observations where b is not equal to 1, you get all IDs that are not in dataB.
merge dataA dataB(in=b);
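For completeness, the full step might look like this (my assumption: both data sets are sorted by client_id, which a match-merge requires):

data want;
   merge dataA dataB(in=b);  * b=1 when dataB contributes to the observation;
   by client_id;
   if b ne 1;                * keep only client_ids that are not in dataB;
run;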
04-19-2018 07:39 PM - edited 04-19-2018 07:45 PM
Just for reference, a join is usually faster than a nested select clause.
data A;
   do CLIENT_ID = 1 to 1e8;
      output;
   end;
run;

data B;
   do CLIENT_ID = 1 to 1e6;
      output;
   end;
run;

proc sql;                       * real time 42.04 seconds;
   create table WANT as
   select CLIENT_ID
   from A
   where CLIENT_ID not in (select CLIENT_ID from B);
quit;

proc sql;                       * real time 31.78 seconds;
   create table WANT as
   select a.CLIENT_ID
   from A left join B
     on a.CLIENT_ID = b.CLIENT_ID
   where b.CLIENT_ID is missing;
quit;

data WANT;                      * real time 14.32 seconds;
   merge A B(in=B);
   by CLIENT_ID;
   if ^B;
run;
04-20-2018 11:26 PM
If all the data resides within SAS then I normally like to use a hash lookup, as this avoids any sorting of the large table.
The code below is untested but should be O.K.
data want;
   if _n_=1 then do;
      dcl hash h1(dataset:'B(keep=client_id)');
      h1.defineKey('client_id');
      h1.defineDone();
   end;
   set a;
   if h1.check() ne 0;  * keep only IDs not found in B;
run;
04-21-2018 08:06 AM
Not sure that any solution is going to save much time for this problem.
You have a LARGE dataset and you want to eliminate a small number of records. The result is still going to be a LARGE dataset.
Just the I/O alone to duplicate your large dataset will take time.
04-21-2018 10:12 AM
If the hash approach is viable then there should be significantly less I/O as no sorting is required.
The "extreme" approach would be to use a modify with a hash lookup and only delete the records logically. That would of course be much faster especially if the number of records to be deleted is only a rather small percentage of the total records in the table.