Solved: Join and remove duplicates

ifti_ch2002 · Posted 05-17-2020 11:07 AM

I have following data

data have1;
infile datalines missover; 
   input IDnumber @3 date  mmddyy10.;
   datalines;  
1 4-7-2003 
1 4-7-2009 
1 4-7-2010 
1 4-7-2011 
1 4-7-2012 
1 4-7-2013 
1 4-7-2014 
1 4-7-2015 
1 4-7-2016 
1 4-7-2018 
2 4-7-2005 
2 4-7-2011 
2 4-7-2012 
2 4-7-2013 
2 4-7-2014 
2 4-7-2015 
2 4-7-2016 
2 4-7-2017 
2 4-7-2018 
2 4-7-2019 
2 4-7-2020 
;       
run;

And

data have2;
infile datalines missover; 
   input IDnumber @3 date  mmddyy10.;
   datalines;  
1 4-7-2003 
2 4-7-2005 
;       
run;

I need the ouput as following

data want;
infile datalines missover; 
   input IDnumber @3 date  mmddyy10.;
   datalines;  
1 4-7-2009 
1 4-7-2010 
1 4-7-2011 
1 4-7-2012 
1 4-7-2013 
1 4-7-2014 
1 4-7-2015 
1 4-7-2016 
1 4-7-2018 
2 4-7-2011 
2 4-7-2012 
2 4-7-2013 
2 4-7-2014 
2 4-7-2015 
2 4-7-2016 
2 4-7-2017 
2 4-7-2018 
2 4-7-2019 
2 4-7-2020 
;       
run;

I need to check by id number, if ID matches in both tables, and if the data in have1 has not exact match in have 2.

ed_sas_member · Posted 05-17-2020 11:13 AM

Hi @ifti_ch2002

Please try this to remove observations of have2 from have1:

data want;
	merge have1 (in=x) have2 (in=y);
	by IDnumber date;
	if x and not y;
run;

Best,

View solution in original post

ed_sas_member · Posted 05-17-2020 11:13 AM

Hi @ifti_ch2002

Please try this to remove observations of have2 from have1:

data want;
	merge have1 (in=x) have2 (in=y);
	by IDnumber date;
	if x and not y;
run;

Best,

ed_sas_member · Posted 05-17-2020 11:17 AM

Another option could be the use of a SET operator in PROC SQL:

proc sql;
	create table want as
	select IDnumber, date from have1
	except
	select IDnumber, date from have2;
quit;

or a hash:

data want;
	if _n_=1 then do;
		declare hash h (dataset:'have2');
		h.definekey('IDnumber', 'date');
		h.definedata('IDnumber', 'date');
		h.definedone();
	end;
	set have1;
	if h.find() ne 0 then output;
run;

Best,

ifti_ch2002 · Posted 05-17-2020 11:35 AM

Thanks, I spent hours on Join.

ed_sas_member · Posted 05-17-2020 11:36 AM

You're welcome @ifti_ch2002

novinosrin · Posted 05-17-2020 03:57 PM

Hi @ifti_ch2002 The effort -"Thanks, I spent hours on Join" is not a bad idea nor a wrong approach. The intricate details to pay attention to in my humble opinion is the risk of a cartesian join triggered by an inequality operator. However, in your case that concern doesn't surface for the reason, the combination of an equality and inequality operator makes it rather straight forward look-up as the SQL optimizer determines the best join algorithm principally on the equality operator.

@ChrisNZ aka Mr Perfect eloquently writes the distinction between JOINS,Subqueries going above and beyond from a performance standpoint and similar practical usage have often been noticed in posts by @PGStats aka Pierre with no peers who is above all. Please do peek into their posts when you have the time in grasping the concepts.

Kind Regards!



data have1;
infile datalines missover; 
   input IDnumber @3 date  mmddyy10.;
   format date mmddyy10.;
   datalines;  
1 4-7-2003 
1 4-7-2009 
1 4-7-2010 
1 4-7-2011 
1 4-7-2012 
1 4-7-2013 
1 4-7-2014 
1 4-7-2015 
1 4-7-2016 
1 4-7-2018 
2 4-7-2005 
2 4-7-2011 
2 4-7-2012 
2 4-7-2013 
2 4-7-2014 
2 4-7-2015 
2 4-7-2016 
2 4-7-2017 
2 4-7-2018 
2 4-7-2019 
2 4-7-2020 
;       
run; 

 

data have2;
infile datalines missover; 
   input IDnumber @3 date  mmddyy10.;
     format date mmddyy10.;
   datalines;  
1 4-7-2003 
2 4-7-2005 
;       
run; 

proc sql;
create table  want as
select a.*
from have1 a inner join have2 b
on a.IDnumber=b.IDnumber
and a.date ne b.date
order by a.IDnumber, a.date;
quit;

proc print noobs;run;

Ksharp · Posted 05-18-2020 07:54 AM


data have1;
infile datalines missover; 
   input IDnumber @3 date  mmddyy10.;
   format date mmddyy10.;
   datalines;  
1 4-7-2003 
1 4-7-2009 
1 4-7-2010 
1 4-7-2011 
1 4-7-2012 
1 4-7-2013 
1 4-7-2014 
1 4-7-2015 
1 4-7-2016 
1 4-7-2018 
2 4-7-2005 
2 4-7-2011 
2 4-7-2012 
2 4-7-2013 
2 4-7-2014 
2 4-7-2015 
2 4-7-2016 
2 4-7-2017 
2 4-7-2018 
2 4-7-2019 
2 4-7-2020 
;       
run; 

 

data have2;
infile datalines missover; 
   input IDnumber @3 date  mmddyy10.;
     format date mmddyy10.;
   datalines;  
1 4-7-2003 
2 4-7-2005 
;       
run; 

proc sql;
create table  want as
select * from have1 
except
select * from have2;
quit;

Join and remove duplicates

Re: Join and remove duplicates

Re: Join and remove duplicates

Re: Join and remove duplicates

Re: Join and remove duplicates

Re: Join and remove duplicates

Re: Join and remove duplicates

Re: Join and remove duplicates

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away