CatPaws
Calcite | Level 5

I have two data sets with the same variables, but different observations. I need to know if any observations in data set 1 are in data set 2. How do I do this? Do I merge them first?

5 REPLIES
Astounding
PROC Star

A few details would be helpful.

 

Could data set 1 contain two identical observations?  How would you like to handle that?

 

Do you need to identify observations that are 100% identical, or just largely identical?

 

CatPaws
Calcite | Level 5
Data set 1 would not have identical observations within the data set. I need to identify observations that are 100% identical between the data sets. For example, I need to know if there is an observation in data set 1 that is also in data set 2, or vice versa.
LinusH
Tourmaline | Level 20

PROC COMPARE is one option.

Another option is to put your full observation in one variable and convert it to a hash, using MD5 or SHA.

Based on that you can use either data step merge or SQL inner join.
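
To illustrate the digest idea, here is a minimal sketch, not a definitive implementation. The data set names ds1, ds2, ds1_keyed, ds2_keyed and in_both and the variable row_key are made up for the example, and the inputs are assumed to have the same variables as sashelp.class. Note that CATX skips blank arguments, so observations with missing character values could produce the same string; an MD5 collision is also theoretically possible.

```sas
/* Hypothetical inputs: two overlapping subsets of sashelp.class */
data ds1;
  set sashelp.class;
  if mod(_n_,2)=1;
run;

data ds2;
  set sashelp.class;
  if _n_ le 10;
run;

/* Add a 32-character hex digest of the whole observation.
   Numeric values are written with HEX16. so that no precision
   is lost when they are converted to text. */
data ds1_keyed;
  set ds1;
  length row_key $32;
  row_key = put(md5(catx('|', name, sex, put(age,hex16.),
                         put(height,hex16.), put(weight,hex16.))), $hex32.);
run;

data ds2_keyed;
  set ds2;
  length row_key $32;
  row_key = put(md5(catx('|', name, sex, put(age,hex16.),
                         put(height,hex16.), put(weight,hex16.))), $hex32.);
run;

/* Observations whose digests match are treated as present in both */
proc sql;
  create table in_both as
  select x.name, x.sex, x.age, x.height, x.weight
  from ds1_keyed as x
       inner join ds2_keyed as y
       on x.row_key = y.row_key;
quit;
```

With only a handful of variables the digest adds little over joining on the variables directly; it is more useful when the observations are wide and a single short key is easier to join on.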

Data never sleeps
Ksharp
Super User
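/* Sample data: have1 is a copy of sashelp.class */
data have1;
  set sashelp.class;
run;

/* have2 is the same data plus one extra observation with a changed name */
data have2;
  set sashelp.class end=last;
  output;
  if last then do;
    name='xxxx';
    output;
  end;
run;

/* INTERSECT keeps only the rows that appear in both query results */
proc sql;
  create table obs_in_both as
  select * from have1
  intersect
  select * from have2
  ;
quit;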
mkeintz
PROC Star

This will output all observations in B that match any observation in A, which satisfies your criterion as long as neither dataset has duplicates, and A and B have the same variables. 

 

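/* Sample data: split sashelp.class so that A and B overlap */
data a b;
  set sashelp.class;
  if mod(_n_,3)=0 then output a b;      /* in both A and B */
  else if mod(_n_,3)=1 then output a;   /* only in A */
  else output b;                        /* only in B */
run;

/* Keep each observation of B that matches an observation of A on all variables */
data both;
  set b;
  if _n_=1 then do;
    declare hash ha (dataset:'a');      /* load A into a hash object once */
      ha.definekey(all:'Y');            /* every variable is part of the key */
      ha.definedone();
  end;
  if ha.find()=0;                       /* subsetting IF: 0 means the key was found */
run;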
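
If either data set might contain repeated observations, one possible way to meet the no-duplicates condition is to de-duplicate on all variables first. This is only a sketch: it assumes the data sets A and B created above already exist, and a_nodup and b_nodup are hypothetical output names.

```sas
/* Assumes data sets A and B from the step above already exist.
   a_nodup and b_nodup are hypothetical names chosen for this sketch. */
proc sort data=a out=a_nodup nodupkey;
  by _all_;   /* all variables form the sort key, so exact duplicates are dropped */
run;

proc sort data=b out=b_nodup nodupkey;
  by _all_;
run;
```

The hash step can then read b_nodup and load a_nodup instead of the original data sets.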