Solved: How to remove all the duplicated ids from one dataset according to uni...

SeaMoon_168 · Posted 01-20-2025 07:01 PM

Here are two simple sample datasets. Data 1 has duplicated IDs A B and C.

id	weight
A	30
A	36
B	21
B	25
B	30
C	40
C	41
C	43
C	40

Data 2 has unique id

id

A

I want to remove all duplicated ids (A) in Data 1 based on the unique ids in Data 2 (A). So the result is

id	weight
B	21
B	25
B	30
C	40
C	41
C	43
C	40

This is a very simple example. However, my Data 1 may have thousands of duplicated IDs and Data 2 may have hundreds of unique IDs. Many thanks!

Tom · Posted 01-20-2025 07:12 PM

Not sure how the duplicates play into this. Sounds like you just want the observations from the first dataset where the ID is NOT in the second dataset.

If they are sorted it is a simple merge.

data want;
  merge data1 data2(in=in2);
  by id;
  if not in2;
run;

If not then it might be simpler to use PROC SQL (since it will sort for you).

proc sql;
create table want as 
select * from data1
where id not in (select id from data2)
;
quit;

View solution in original post

Tom · Posted 01-20-2025 07:12 PM

Not sure how the duplicates play into this. Sounds like you just want the observations from the first dataset where the ID is NOT in the second dataset.

If they are sorted it is a simple merge.

data want;
  merge data1 data2(in=in2);
  by id;
  if not in2;
run;

If not then it might be simpler to use PROC SQL (since it will sort for you).

proc sql;
create table want as 
select * from data1
where id not in (select id from data2)
;
quit;

mkeintz · Posted 01-21-2025 12:58 PM

I dataset1 is not sorted, and you want to preserve the original order of dataset1, then you can code a single-step solution by storing dataset2 in a hash object.

data want;
  set dataset1;
  if _n_=1 then do;
    declare hash h (dataset:'dataset2');
      h.definekey('id');
      h.definedone();
  end;
  if h.check()^=0 then output;  /*If not found in the hash object*/
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

How to remove all the duplicated ids from one dataset according to unique ids in another data

Re: How to remove all the duplicated ids from one dataset according to unique ids in another data

Re: How to remove all the duplicated ids from one dataset according to unique ids in another data

Re: How to remove all the duplicated ids from one dataset according to unique ids in another data

How to remove all the duplicated ids from one dataset according to unique ids in another data

Re: How to remove all the duplicated ids from one dataset according to unique ids in another data

Re: How to remove all the duplicated ids from one dataset according to unique ids in another data

Re: How to remove all the duplicated ids from one dataset according to unique ids in another data

Registration is open

SAS Training: Just a Click Away