Ins_Analytic
Fluorite | Level 6

I have a fairly large table (30 fields by ~10M+ rows).  I need to find duplicates based on a grouping of 20 of the fields.  I was able to do this with:

 

proc sql;
     /* tag every row with a constant so SUM() can count rows per group */
     create table count as
     select *, 1 as line
     from orig_data
;
     /* SUM(line) is remerged back onto every row, giving the group size */
     create table dup_id as
     select *, sum(line) as linesum
     from count
     group by {list 20 fields}
;
quit;

 

Any row with linesum > 1 is flagged as a duplicate on the 20-field grouping, and I can check those rows out.
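For instance (a sketch only, reusing the DUP_ID table and LINESUM column from above), the flagged rows could be pulled with:

proc sql;
     /* sketch: keep only rows whose 20-field group occurs more than once */
     create table dups_only as
     select *
     from dup_id
     where linesum > 1
;
quit;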

 

My issue is that this GROUP BY does not take the other 10 fields in the table into account at all.  For 9 of them that is fine, but I need to identify groups where one particular field (PIN) differs while the 20 grouping fields match.  So essentially, I need a way to identify rows where those 20 fields are identical but PIN is different.

 

I think I can accomplish this with a sort and a data step using FIRST. logic, but I wanted to see whether that is the preferred approach.  Is there another way?

 

Thanks in advance.

8 REPLIES
Urban_Science
Quartz | Level 8

I'm not a fan of clever solutions, but here I go... you could try abusing PROC SQL by combining its remerging of summary statistics with a HAVING clause.

proc sql;
   create table want as 
       select {20 columns}
          , PID
          , count(*) as numRows
       from have
       group by {20 columns}
       having numRows > 1
   ;
quit;

That should return a table with every set of 20 column values that is duplicated across multiple rows.  It will throw a warning about PID being in the SELECT statement but not in the GROUP BY, but it should still run.

Ins_Analytic
Fluorite | Level 6

Thank you for your response, I appreciate you looking into this problem! 

 

I am still getting the final grouping with the same PINs, unfortunately.  For my results, I'm looking for some way to identify any lines that are identical based on the group of 20 columns but have different PINs.

 

Thanks!

Urban_Science
Quartz | Level 8 (Accepted Solution)

You could also break it into multiple processing steps.

  1. Group by the 20 variables, counting the number of distinct PIDs in each set
  2. Inner join that table back to the original on the 20 variables, keeping only groups where the number of PIDs is > 1

 

proc sql;
   create table countGroups as 
       select {20 columns}
          , count(distinct PID) as numPIDs
       from have
       group by {20 columns}
   ;
quit;

proc sql;
   create table want as
       select t1.*
       from have as t1
       inner join countGroups as t2
          on  t1.column1  = t2.column1
          and t1.column2  = t2.column2
          ...
          and t1.column20 = t2.column20
       where t2.numPIDs > 1
   ;
quit;
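
As a minimal sketch of how the two steps fit together (the mock table and the GRP1/GRP2 column names and PID values below are made up for illustration; they stand in for the real 20 grouping columns):

/* Minimal sketch of the two-step approach on a tiny mock table,
   with two grouping columns (GRP1, GRP2) in place of the real 20. */
data have;
  input GRP1 GRP2 PID;
  datalines;
1 1 101
1 1 101
2 1 101
2 1 102
3 1 101
;
run;

proc sql;
   /* step 1: count distinct PIDs per group */
   create table countGroups as
       select GRP1, GRP2, count(distinct PID) as numPIDs
       from have
       group by GRP1, GRP2
   ;
   /* step 2: pull back the detail rows for groups with more than one PID */
   create table want as
       select t1.*
       from have as t1
       inner join countGroups as t2
          on  t1.GRP1 = t2.GRP1
          and t1.GRP2 = t2.GRP2
       where t2.numPIDs > 1
   ;
quit;
/* WANT should contain only the two GRP1=2 rows, which match on the
   grouping columns but carry different PIDs. */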

 

ChrisNZ
Tourmaline | Level 20

Unsure I fully understand your need. Like this?

data HAVE;
  GR1=1; GR20=1; PIN=1; output;
  GR1=1; GR20=1; PIN=1; output;
  GR1=2; GR20=1; PIN=1; output;
  GR1=2; GR20=1; PIN=2; output;
  GR1=3; GR20=1; PIN=1; output;
run;
proc sql;
  create table WANT as
  select GR1, GR20, count(distinct PIN) as NBPINS
  from HAVE
  group by GR1, GR20
  having count(*) > 1;
quit;
GR1 GR20 NBPINS
1 1 1
2 1 2

 

Ins_Analytic
Fluorite | Level 6

I need to find your top 2 lines, where everything is the same.  I need to know that CID (part of the 20 grouping fields) has a duplicate on those 20 fields with the same PIN as its duplicate.  So I need to extract the CID of the lines that are duplicates on the 20-field GROUP BY and share the same PIN.
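
For instance, sketched against ChrisNZ's small HAVE example above (GR1 and GR20 standing in for the 20 grouping fields, one of which would be CID in the real data), that could look like:

proc sql;
  /* sketch: groups that repeat on the grouping fields AND share a single PIN */
  create table same_pin_dups as
  select GR1, GR20   /* in the real data: CID plus the other grouping fields */
       , count(*) as n_rows
  from HAVE
  group by GR1, GR20
  having count(*) > 1
     and count(distinct PIN) = 1;
quit;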

Patrick
Opal | Level 21

@Ins_Analytic

If your source data is stored in a SAS table and not in a database, then the code below should be OK.

proc sort data=have out=inter;
  by <col01 to col20> pin;
run;

data want;
  set inter;
  by <col01 to col20> pin;
  /* rows with identical col1 to col20 */
  if not (first.<col20> and last.<col20>) then
    do;
      /* rows whose PIN is unique within the "identical" group */
      if first.pin and last.pin then output;
    end;
run;
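
To make the FIRST./LAST. logic concrete, here is the same pattern applied to ChrisNZ's two-column HAVE example above (GR1 and GR20 standing in for the 20 real grouping columns):

/* Illustration only: GR1 and GR20 stand in for the 20 real grouping columns */
proc sort data=HAVE out=INTER;
  by GR1 GR20 PIN;
run;

data WANT;
  set INTER;
  by GR1 GR20 PIN;
  /* the GR1/GR20 combination occurs more than once */
  if not (first.GR20 and last.GR20) then
    do;
      /* within that group, keep rows whose PIN occurs only once,
         i.e. the PIN differs from its duplicates */
      if first.PIN and last.PIN then output;
    end;
run;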
Reeza
Super User

Something like this maybe? Test it on some small data sets and let me know if it works. 

 

proc sql;
create table want as
select <group 20 variables>
     , case when max(pin) ne min(pin) then 'Different Pins'
            else 'Same Pins'
       end as status
from have
group by <group 20 variables>
having count(*) > 1;
quit;
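
If the detail rows are wanted rather than the group-level flag, a possible follow-up (column names var1 ... var20 are placeholders) is to join the flagged groups back to HAVE, much like Urban_Science's second step:

proc sql;
/* sketch: pull the original rows for groups flagged 'Different Pins' */
create table flagged_rows as
select h.*
from have as h
inner join want as w
on h.var1 = w.var1
/* ...repeat for the remaining grouping variables... */
and h.var20 = w.var20
where w.status = 'Different Pins';
quit;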

 

Ins_Analytic
Fluorite | Level 6

I honestly think any of these solutions could have worked if I was savvy enough.  For the structure of the data, this is the one that worked best for me. 

 

Thank you everyone for the suggestions, I really appreciate the help.
