topic Re: PROC SQL SELF JOIN remove duplicates in SAS Programming

PROC SQL SELF JOIN remove duplicates

r-lee-484 — Thu, 20 Jul 2023 19:57:49 GMT

Please see Example 1 and Result Output Page 1 in this link: https://support.sas.com/resources/papers/proceedings/proceedings/sugi29/065-29.pdf

The following SQL procedure provides Result Output on page 1:

proc sql;
select a.*,
from abuse a, abuse b
where a.o_ssn=b.v_ssn and
a.v_ssn=b.o_ssn and
a.date=b.date;
quit;

I would like to keep only cases 1 and 5 instead of cases 1,2,5,6. What code would I need to write to do that?

Thanks for the help!

Re: PROC SQL SELF JOIN remove duplicates

ballardw — Thu, 20 Jul 2023 20:04:57 GMT

The order of output from SQL is not guaranteed. You should provide some actual variable values indicating which observations you want.

Better is to provide some rule(s) related to the content of a data set. I can write code that will keep just those observations but isn't extensible to any other project or data set.

Re: PROC SQL SELF JOIN remove duplicates

r-lee-484 — Thu, 20 Jul 2023 20:13:11 GMT

I have a dataset with about 5,000,000 observations. I won't be able to individually select which observations to keep. Is there a way I can code for a rule to just keep 1 set of matches?

Re: PROC SQL SELF JOIN remove duplicates

ballardw — Thu, 20 Jul 2023 20:17:13 GMT

@r-lee-484 wrote:

I have a dataset with about 5,000,000 observations. I won't be able to individually select which observations to keep. Is there a way I can code for a rule to just keep 1 set of matches?

I repeat: what are the rules involved with which observations you want to keep.

I set of matches to what?

Perhaps it is time to provide an example of data similar to what you currently have what you want for the output given that starting data.

There are many ways to get "1 set" of somethings. How a "set" is defined is critical in any of the processes.

Re: PROC SQL SELF JOIN remove duplicates

HB — Thu, 20 Jul 2023 22:00:19 GMT

Try this:


data abuse;
	infile cards dlm=" 	";
	input case o_ssn:$11. v_ssn:$11. date:$10.;
datalines;
1 000-00-0001 000-00-0002 01/01/2000
2 000-00-0002 000-00-0001 01/01/2000
3 000-00-0003 000-00-0004 01/01/2000
4 000-00-0005 000-00-0006 01/01/2000
5 000-00-0007 000-00-0008 01/01/2000
6 000-00-0008 000-00-0007 01/01/2000
;
run;


proc sql;
create table no_reverse_match as 
select 
a.case, a.o_ssn, a.v_ssn

from 
abuse a inner join abuse b
on a.o_ssn = b.v_ssn and a.v_ssn = b.o_ssn and a.o_ssn < a.v_ssn 

where  
a.date=b.date;
quit;

Re: PROC SQL SELF JOIN remove duplicates

r-lee-484 — Tue, 25 Jul 2023 15:00:50 GMT

Thank you very much, that code works to remove reverse matches and to retain 1 set of matches.

On another note, what code should I write to filter out any paired matches? From the dataset "abuse" I only want to keep cases 3 and 4, and I don't want to keep case 1,2,5,6?

Thanks for your help!

Re: PROC SQL SELF JOIN remove duplicates

mkeintz — Wed, 26 Jul 2023 01:39:13 GMT

@r-lee-484 wrote:

Thank you very much, that code works to remove reverse matches and to retain 1 set of matches.

On another note, what code should I write to filter out any paired matches? From the dataset "abuse" I only want to keep cases 3 and 4, and I don't want to keep case 1,2,5,6?

In other words, you want every BUT the paired matches, correct? Then you could just use the EXCEPT set operator:


proc sql;
create table unmatched_cases  as 
  select c.case, c.o_ssn, c.v_ssn  
  from abuse as c
  except

  select  a.case, a.o_ssn, a.v_ssn
  from 
    abuse a inner join abuse b
    on a.o_ssn = b.v_ssn 
       and a.v_ssn = b.o_ssn 
  where  
    a.date=b.date;
quit;

There is also straightforward DATA step code for this task:

data un_matched_cases;
  set abuse;
  if _n_=1 then do;
    declare hash h (dataset:'abuse (keep=date o_ssn v_ssn)');
      h.definekey('date','o_ssn','v_ssn');
      h.definedone();
  end;
  if h.check(key:date,key:v_ssn,key:o_ssn)^=0;
run;

Note order of keys for the hash object h is DATE,O_SSN,V_SSN. That means the check method - h.check(...) - for "reverse match" must present the arguments ordered as DATE,V_SSN,O_SSN.

Re: PROC SQL SELF JOIN remove duplicates

r-lee-484 — Wed, 26 Jul 2023 13:54:15 GMT

This works thank you!