Solved: Re: Merge tables, but delete duplicates

SURIM · Posted 12-03-2019 02:10 AM

Hi!

This question probably has an easy answer, but I can't get my head around it.

I want to merge table A and table B, by a key id.

There will be a lot of duplicate rows in both tables.

If a key id is in table A, I want to delete all key id:s in table B in the merge. If the key id is only in table A or only in table B, they should stay.

Example:

Table A		Table B
key_id	date	key_id	date
12345	2018-07-15	12345	2017-08-14
12345	2018-12-06	12345	2018-01-09
84881	2019-01-01	44566	2018-03-05

Expected outcome after merge:

Table_all
key_id	date
12345	2018-07-15
12345	2018-12-06
84881	2019-01-01
44566	2018-03-05

Thoughts?

Kind Regards,

Susannah

Shmuel · Posted 12-03-2019 02:59 AM

Let's do it in steps as you defined it:

1) "If a key id is in table A, I want to delete all key id:s in table B in the merge" = "If the key id is only in table B, they should stay.":

proc sql;
   create table want1 as select *
   from tableB where key_id not in (select distict key_id from tableA)
   order by key_d;

2) Key_id in table A and not in table B

/* continued to previous code */

create table want2 as select * 
from tableA where key_id not in (select distinct key_id from tableB)
order by key_d;

3) final step:

quit;   /* quit sql */

data want;
 merge want1 want2;
  by key_id;
run;

View solution in original post

Patrick · Posted 12-03-2019 02:19 AM

So with the sample data you've posted how would the desired result look like? A table with 2 columns and 0 rows?

Please try to post sample data which covers all use cases and then show us the desired result based on the sample data.

Also: Please post sample data as working data step code so we have something to work with without having to prepare everything first before actually dealing with your question. That is of course only if you'd like to get working and tested code from us and not only a description of how to do things.

SURIM · Posted 12-03-2019 02:39 AM

Hello! I've updated with expected outcome. Thanks!

Kurt_Bremser · Posted 12-03-2019 02:49 AM

@SURIM wrote:
Hello! I've updated with expected outcome. Thanks!

The expected outcome does not match your rule AT ALL. Please review your logic. My code does what your logic says.

See Maxim 42.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

SURIM · Posted 12-03-2019 02:57 AM

Thanks for your post, I have updated the logic. Sorry about that I almost always use merge as a left join. No need to shame us SAS newbies. I am just trying to learn.

Kurt_Bremser · Posted 12-03-2019 03:09 AM

According to your revised logic:

"If a key id is in table A, I want to delete all key id:s in table B in the merge. If the key id is only in table A or only in table B, they should stay."

key_id 12345 should be removed, as it is present in both tables.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Kurt_Bremser · Posted 12-03-2019 02:22 AM

proc sql;
create table want as
  select * from table_b
  where table_b.key_id not in (select distinct key_id from table_a)
;
quit;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

SURIM · Posted 12-03-2019 02:41 AM

I tried this, but it deleted all of the id:s that were in both tables. That is why I think this is tricky...

Patrick · Posted 12-03-2019 02:45 AM

The logic implemented by @Kurt_Bremser does what you describe (at least how I understand your description). If that's not what you want then please reformulate the logic required and show us the desired result based on the sample data so we have a better chance to understand what you need.

Jagadishkatam · Posted 12-03-2019 02:22 AM

Please try the below untested code

proc sort data=tableA;
by key_id;
run;

proc sort data=tableB;
by key_id;
run;

data all;
merge tableA(in=a) tableB(in=b);
by key_id;
if b and not a;
run;

Thanks,
Jag

SURIM · Posted 12-03-2019 02:55 AM

Sorry, I have updated with new logic. I think this might work if only changed a little bit.

Patrick · Posted 12-03-2019 03:09 AM

@SURIM

For the sample data you've posted a simple merge will return the desired result.

data have1;
  input key_id date :yymmdd10.;
  format date yymmdd10.;
datalines;	
12345 2018-07-15 
12345 2018-12-06 
84881 2019-01-01 
;

data have2;
  input key_id date :yymmdd10.;
  format date yymmdd10.;
datalines;	
12345 2017-08-14
12345 2018-01-09
44566 2018-03-05
;

data want;
  merge have2 (in=in2) have1 (in=in1);
  by key_id;
run;

proc print;
run;

However: Things won't work with above code if for the same key there are more rows in Have2 than in Have1. Is this possible?

...and that's why I'm asking for representative sample data as this allows us to provide code to you even if your description of the required logic isn't perfect.

So, is below possible in your data? And if possible do I understand right that you still only would want 2 rows for key 12345 as a result?

data have1;
  input key_id date :yymmdd10.;
  format date yymmdd10.;
datalines;	
12345 2018-07-15 
12345 2018-12-06 
84881 2019-01-01 
;

data have2;
  input key_id date :yymmdd10.;
  format date yymmdd10.;
datalines;	
12345 2017-08-14
12345 2018-01-09
12345 2019-01-01
44566 2018-03-05
;

...and here the code if above is a possible use case.

proc sql; 
  create table want as
    select *
    from have1
    union corr all
    select * 
    from have2 h2
    where not exists
      ( select *
        from have1 h1
        where h1.key_id = h2.key_id
      )
  ;
quit;

s_lassen · Posted 12-03-2019 03:22 AM

One way to get only the original rows from a, even if there are more rows in table B than in table A could be this:

data have1;
  input key_id date :yymmdd10.;
  format date yymmdd10.;
datalines;	
12345 2018-07-15 
12345 2018-12-06 
84881 2019-01-01 
;run;

data have2;
  input key_id date :yymmdd10.;
  format date yymmdd10.;
datalines;	
12345 2017-08-14
12345 2018-01-09
12345 2018-01-10
44566 2018-03-05
;run;

data want;
  a=0; 
  merge have1(in=a) have2(in=b);
  by key_id;
  if first.key_id then do;
    if a then want='a';
	else want='b';
	end;
  retain want;
  if want='a' then
    if a;
  drop want;
run;

Shmuel · Posted 12-03-2019 02:59 AM

Let's do it in steps as you defined it:

1) "If a key id is in table A, I want to delete all key id:s in table B in the merge" = "If the key id is only in table B, they should stay.":

proc sql;
   create table want1 as select *
   from tableB where key_id not in (select distict key_id from tableA)
   order by key_d;

2) Key_id in table A and not in table B

/* continued to previous code */

create table want2 as select * 
from tableA where key_id not in (select distinct key_id from tableB)
order by key_d;

3) final step:

quit;   /* quit sql */

data want;
 merge want1 want2;
  by key_id;
run;

Shmuel · Posted 12-03-2019 03:03 AM

May be you prefer
order by key_id and date ?!

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!