Ins_Analytic
Fluorite | Level 6

I have a fairly large table (30 fields by ~10M+ rows).  I need to find duplicates based on a grouping of 20 of the fields.  I was able to do this with:

 

proc sql;
     /* tag every row with a constant so SUM() can count rows per group */
     create table count as
     select *, 1 as line
     from orig_data
;
     /* SUM(line) is remerged back onto every row, giving the group size */
     create table dup_id as
     select *, sum(line) as linesum
     from count
     group by {list 20 fields}
;
quit;

 

Any row with linesum > 1 is flagged as a duplicate on the 20-field grouping, and I can check those rows out.
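For instance (a sketch only, reusing the DUP_ID table and LINESUM column from above), the flagged rows could be pulled with:

proc sql;
     /* sketch: keep only rows whose 20-field group occurs more than once */
     create table dups_only as
     select *
     from dup_id
     where linesum > 1
;
quit;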

 

My issue is that this GROUP BY does not take the other 10 fields in the table into account at all.  For 9 of them that is fine, but I need to identify groups where one particular field (PIN) differs while the 20 grouping fields match.  So essentially, I need a way to identify rows where those 20 fields are identical but PIN is different.

 

I think I can accomplish this with a sort and a data step using FIRST. logic, but I wanted to see whether that is the preferred approach.  Is there another way?

 

Thanks in advance.

8 REPLIES
Urban_Science
Quartz | Level 8

I'm not a fan of clever solutions, but here I go... you could try abusing PROC SQL by combining its remerging of summary statistics with a HAVING clause.

proc sql;
   create table want as 
       select {20 columns}
          , PID
          , count(*) as numRows
       from have
       group by {20 columns}
       having numRows > 1
   ;
quit;

That should return a table with every set of 20 column values that is duplicated across multiple rows.  It will throw a warning about PID being in the SELECT statement but not in the GROUP BY, but it should still run.

Ins_Analytic
Fluorite | Level 6

Thank you for your response, I appreciate you looking into this problem! 

 

I am still getting the final grouping with the same PINs, unfortunately.  For my results, I'm looking for some way to identify any lines that are identical based on the group of 20 columns but have different PINs.

 

Thanks!

Urban_Science
Quartz | Level 8 (Accepted Solution)

You could also break it into multiple processing steps.

  1. Group by the 20 variables, counting the number of distinct PIDs in each set
  2. Inner join that table back to the original on the 20 variables, keeping only groups where the number of PIDs is > 1

 

proc sql;
   create table countGroups as 
       select {20 columns}
          , count(distinct PID) as numPIDs
       from have
       group by {20 columns}
   ;
quit;

proc sql;
   create table want as
       select t1.*
       from have as t1
       inner join countGroups as t2
          on  t1.column1  = t2.column1
          and t1.column2  = t2.column2
          ...
          and t1.column20 = t2.column20
       where t2.numPIDs > 1
   ;
quit;
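
As a minimal sketch of how the two steps fit together (the mock table and the GRP1/GRP2 column names and PID values below are made up for illustration; they stand in for the real 20 grouping columns):

/* Minimal sketch of the two-step approach on a tiny mock table,
   with two grouping columns (GRP1, GRP2) in place of the real 20. */
data have;
  input GRP1 GRP2 PID;
  datalines;
1 1 101
1 1 101
2 1 101
2 1 102
3 1 101
;
run;

proc sql;
   /* step 1: count distinct PIDs per group */
   create table countGroups as
       select GRP1, GRP2, count(distinct PID) as numPIDs
       from have
       group by GRP1, GRP2
   ;
   /* step 2: pull back the detail rows for groups with more than one PID */
   create table want as
       select t1.*
       from have as t1
       inner join countGroups as t2
          on  t1.GRP1 = t2.GRP1
          and t1.GRP2 = t2.GRP2
       where t2.numPIDs > 1
   ;
quit;
/* WANT should contain only the two GRP1=2 rows, which match on the
   grouping columns but carry different PIDs. */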

 

ChrisNZ
Tourmaline | Level 20

Unsure I fully understand your need. Like this?

data HAVE;
  GR1=1; GR20=1; PIN=1; output;
  GR1=1; GR20=1; PIN=1; output;
  GR1=2; GR20=1; PIN=1; output;
  GR1=2; GR20=1; PIN=2; output;
  GR1=3; GR20=1; PIN=1; output;
run;
proc sql;
  create table WANT as
  select GR1, GR20, count(distinct PIN) as NBPINS
  from HAVE
  group by GR1, GR20
  having count(*) > 1;
quit;
GR1 GR20 NBPINS
1 1 1
2 1 2

 

Ins_Analytic
Fluorite | Level 6

I need to find your top 2 lines, where everything is the same.  I need to know that CID (part of the 20 grouping fields) has a duplicate on those 20 fields with the same PIN as its duplicate.  So I need to extract the CID of the lines that are duplicates on the 20-field GROUP BY and share the same PIN.
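
For instance, sketched against ChrisNZ's small HAVE example above (GR1 and GR20 standing in for the 20 grouping fields, one of which would be CID in the real data), that could look like:

proc sql;
  /* sketch: groups that repeat on the grouping fields AND share a single PIN */
  create table same_pin_dups as
  select GR1, GR20   /* in the real data: CID plus the other grouping fields */
       , count(*) as n_rows
  from HAVE
  group by GR1, GR20
  having count(*) > 1
     and count(distinct PIN) = 1;
quit;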

Patrick
Opal | Level 21

@Ins_Analytic

If your source data is stored in a SAS table and not in a database, then the code below should be OK.

proc sort data=have out=inter;
  by <col01 to col20> pin;
run;

data want;
  set inter;
  by <col01 to col20> pin;
  /* rows with identical col1 to col20 */
  if not (first.<col20> and last.<col20>) then
    do;
      /* rows whose PIN is unique within the "identical" group */
      if first.pin and last.pin then output;
    end;
run;
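
To make the FIRST./LAST. logic concrete, here is the same pattern applied to ChrisNZ's two-column HAVE example above (GR1 and GR20 standing in for the 20 real grouping columns):

/* Illustration only: GR1 and GR20 stand in for the 20 real grouping columns */
proc sort data=HAVE out=INTER;
  by GR1 GR20 PIN;
run;

data WANT;
  set INTER;
  by GR1 GR20 PIN;
  /* the GR1/GR20 combination occurs more than once */
  if not (first.GR20 and last.GR20) then
    do;
      /* within that group, keep rows whose PIN occurs only once,
         i.e. the PIN differs from its duplicates */
      if first.PIN and last.PIN then output;
    end;
run;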
Reeza
Super User

Something like this maybe? Test it on some small data sets and let me know if it works. 

 

proc sql;
create table want as
select <group 20 variables>
     , case when max(pin) ne min(pin) then 'Different Pins'
            else 'Same Pins'
       end as status
from have
group by <group 20 variables>
having count(*) > 1;
quit;
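
If the detail rows are wanted rather than the group-level flag, a possible follow-up (column names var1 ... var20 are placeholders) is to join the flagged groups back to HAVE, much like Urban_Science's second step:

proc sql;
/* sketch: pull the original rows for groups flagged 'Different Pins' */
create table flagged_rows as
select h.*
from have as h
inner join want as w
on h.var1 = w.var1
/* ...repeat for the remaining grouping variables... */
and h.var20 = w.var20
where w.status = 'Different Pins';
quit;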

 

Ins_Analytic
Fluorite | Level 6

I honestly think any of these solutions could have worked if I was savvy enough.  For the structure of the data, this is the one that worked best for me. 

 

Thank you everyone for the suggestions, I really appreciate the help.
