topic Re: Join and remove duplicates in SAS Programming

Join and remove duplicates

ifti_ch2002 — Sun, 17 May 2020 15:07:42 GMT

I have following data

data have1;
infile datalines missover; 
   input IDnumber @3 date  mmddyy10.;
   datalines;  
1 4-7-2003 
1 4-7-2009 
1 4-7-2010 
1 4-7-2011 
1 4-7-2012 
1 4-7-2013 
1 4-7-2014 
1 4-7-2015 
1 4-7-2016 
1 4-7-2018 
2 4-7-2005 
2 4-7-2011 
2 4-7-2012 
2 4-7-2013 
2 4-7-2014 
2 4-7-2015 
2 4-7-2016 
2 4-7-2017 
2 4-7-2018 
2 4-7-2019 
2 4-7-2020 
;       
run;

And

data have2;
infile datalines missover; 
   input IDnumber @3 date  mmddyy10.;
   datalines;  
1 4-7-2003 
2 4-7-2005 
;       
run;

I need the ouput as following

data want;
infile datalines missover; 
   input IDnumber @3 date  mmddyy10.;
   datalines;  
1 4-7-2009 
1 4-7-2010 
1 4-7-2011 
1 4-7-2012 
1 4-7-2013 
1 4-7-2014 
1 4-7-2015 
1 4-7-2016 
1 4-7-2018 
2 4-7-2011 
2 4-7-2012 
2 4-7-2013 
2 4-7-2014 
2 4-7-2015 
2 4-7-2016 
2 4-7-2017 
2 4-7-2018 
2 4-7-2019 
2 4-7-2020 
;       
run;

I need to check by id number, if ID matches in both tables, and if the data in have1 has not exact match in have 2.

Re: Join and remove duplicates

ed_sas_member — Sun, 17 May 2020 15:25:51 GMT

Hi @ifti_ch2002

Please try this to remove observations of have2 from have1:

data want;
	merge have1 (in=x) have2 (in=y);
	by IDnumber date;
	if x and not y;
run;

Best,

Re: Join and remove duplicates

ed_sas_member — Sun, 17 May 2020 15:17:35 GMT

Another option could be the use of a SET operator in PROC SQL:

proc sql;
	create table want as
	select IDnumber, date from have1
	except
	select IDnumber, date from have2;
quit;

or a hash:

data want;
	if _n_=1 then do;
		declare hash h (dataset:'have2');
		h.definekey('IDnumber', 'date');
		h.definedata('IDnumber', 'date');
		h.definedone();
	end;
	set have1;
	if h.find() ne 0 then output;
run;

Best,

Re: Join and remove duplicates

ifti_ch2002 — Sun, 17 May 2020 15:35:49 GMT

Thanks, I spent hours on Join.

Re: Join and remove duplicates

ed_sas_member — Sun, 17 May 2020 15:36:56 GMT

You're welcome @ifti_ch2002

Re: Join and remove duplicates

novinosrin — Sun, 17 May 2020 20:02:10 GMT

Hi @ifti_ch2002 The effort -"Thanks, I spent hours on Join" is not a bad idea nor a wrong approach. The intricate details to pay attention to in my humble opinion is the risk of a cartesian join triggered by an inequality operator. However, in your case that concern doesn't surface for the reason, the combination of an equality and inequality operator makes it rather straight forward look-up as the SQL optimizer determines the best join algorithm principally on the equality operator.

@ChrisNZ aka Mr Perfect eloquently writes the distinction between JOINS,Subqueries going above and beyond from a performance standpoint and similar practical usage have often been noticed in posts by @PGStats aka Pierre with no peers who is above all. Please do peek into their posts when you have the time in grasping the concepts.

Kind Regards!



data have1;
infile datalines missover; 
   input IDnumber @3 date  mmddyy10.;
   format date mmddyy10.;
   datalines;  
1 4-7-2003 
1 4-7-2009 
1 4-7-2010 
1 4-7-2011 
1 4-7-2012 
1 4-7-2013 
1 4-7-2014 
1 4-7-2015 
1 4-7-2016 
1 4-7-2018 
2 4-7-2005 
2 4-7-2011 
2 4-7-2012 
2 4-7-2013 
2 4-7-2014 
2 4-7-2015 
2 4-7-2016 
2 4-7-2017 
2 4-7-2018 
2 4-7-2019 
2 4-7-2020 
;       
run; 

 

data have2;
infile datalines missover; 
   input IDnumber @3 date  mmddyy10.;
     format date mmddyy10.;
   datalines;  
1 4-7-2003 
2 4-7-2005 
;       
run; 

proc sql;
create table  want as
select a.*
from have1 a inner join have2 b
on a.IDnumber=b.IDnumber
and a.date ne b.date
order by a.IDnumber, a.date;
quit;

proc print noobs;run;

Re: Join and remove duplicates

Ksharp — Mon, 18 May 2020 11:54:12 GMT


data have1;
infile datalines missover; 
   input IDnumber @3 date  mmddyy10.;
   format date mmddyy10.;
   datalines;  
1 4-7-2003 
1 4-7-2009 
1 4-7-2010 
1 4-7-2011 
1 4-7-2012 
1 4-7-2013 
1 4-7-2014 
1 4-7-2015 
1 4-7-2016 
1 4-7-2018 
2 4-7-2005 
2 4-7-2011 
2 4-7-2012 
2 4-7-2013 
2 4-7-2014 
2 4-7-2015 
2 4-7-2016 
2 4-7-2017 
2 4-7-2018 
2 4-7-2019 
2 4-7-2020 
;       
run; 

 

data have2;
infile datalines missover; 
   input IDnumber @3 date  mmddyy10.;
     format date mmddyy10.;
   datalines;  
1 4-7-2003 
2 4-7-2005 
;       
run; 

proc sql;
create table  want as
select * from have1 
except
select * from have2;
quit;