SAS Programming

DATA Step, Macro, Functions and more
Ins_Analytic
Fluorite | Level 6

I have a fairly large table (30 fields by ~10M+ rows).  I need to find duplicates based on a grouping of 20 of the fields.  I was able to do this with:

 

proc sql;
     create table count as
     select *, 1 as line
     from orig_data
     ;
     create table dup_id as
     select *, sum(line) as linesum
     from count
     group by {list 20 fields}
     ;
quit;

 

Any rows with linesum > 1 are flagged as having duplicates based on the 20-field grouping, and I can check them out.
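For reference, pulling just the flagged rows is one more short step (a sketch using the table and column names above):

```sas
/* keep only the rows whose 20-field group occurs more than once */
data dups;
  set dup_id;
  if linesum > 1;
run;
```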

 

My issue is that this GROUP BY does not exclude any of the other 10 fields from the result. For 9 of them that is fine, but I need to identify groupings where one particular field (PIN) differs across rows that match on the 20 fields above. So essentially, I need a way to identify rows where those 20 fields are identical but PIN is different.

 

I think I can accomplish this with a sort and a DATA step using FIRST. logic, but I wanted to see whether that is the preferred approach. Is there another way?

 

Thanks in advance.

1 ACCEPTED SOLUTION

Accepted Solutions
Urban_Science
Quartz | Level 8

You could also break it into multiple processing steps.

  1. Group by the 20 variables, counting the number of distinct PIDs in each set
  2. Inner join that table back to the original on the 20 variables, keeping groups where the number of PIDs is > 1

 

proc sql;
   create table countGroups as 
       select {20 columns}
          , count(distinct PID) as numPIDs
       from have
       group by {20 columns}
   ;
quit;

proc sql;
   create table want as 
       select t1.*
       from have as t1
       inner join countGroups as t2
          on t1.column1 = t2.column1
             and t1.column2 = t2.column2
             ...
             and t1.column20 = t2.column20
       where t2.numPIDs > 1
   ;
quit;

 


8 REPLIES
Urban_Science
Quartz | Level 8

I'm not a fan of clever solutions, but here I go... you could try abusing PROC SQL by combining remerged summary statistics with a HAVING clause.

proc sql;
   create table want as 
       select {20 columns}
          , PID
          , count(*) as numRows
       from have
       group by {20 columns}
       having numRows > 1
   ;
quit;

That should return a table with all sets of the 20 columns that are duplicated across multiple rows. It will throw a warning about PID being in the SELECT statement but not in the GROUP BY, but it should still run.

Ins_Analytic
Fluorite | Level 6

Thank you for your response, I appreciate you looking into this problem! 

 

I am still getting the final grouping with the same PINs, unfortunately. For my results, I'm looking for a way to identify any rows that are identical on the group of 20 columns but have different PINs.

 

Thanks!

ChrisNZ
Tourmaline | Level 20

I'm not sure I fully understand your need. Like this?

data HAVE;
  GR1=1; GR20=1; PIN=1; output;
  GR1=1; GR20=1; PIN=1; output;
  GR1=2; GR20=1; PIN=1; output;
  GR1=2; GR20=1; PIN=2; output;
  GR1=3; GR20=1; PIN=1; output;
run;
proc sql;
  create table WANT as
  select GR1, GR20, count(distinct PIN) as NBPINS
  from HAVE
  group by GR1, GR20
  having count(*) > 1;
quit;
GR1  GR20  NBPINS
  1     1       1
  2     1       2

 

Ins_Analytic
Fluorite | Level 6

In your example, I need to find the top 2 lines, where everything is the same. I need to know that CID (part of the 20 grouping fields) has a duplicate on those 20 fields with the same PIN as its duplicate. So I need to extract the CID of the lines that are duplicates on the 20-field GROUP BY with the same PIN.

Patrick
Opal | Level 21

@Ins_Analytic

If your source data is stored in a SAS table and not a database, then the code below should be OK.

proc sort data=have out=inter;
  by <col01 to col20> pin;
run;

data want;
  set inter;
  by <col01 to col20> pin;
  /* rows with identical col01 to col20 */
  if not (first.<col20> and last.<col20>) then
    do;
      /* rows with a unique PIN within those "identical" rows */
      if first.pin and last.pin then output;
    end;
run;
Reeza
Super User

Something like this maybe? Test it on some small data sets and let me know if it works. 

 

proc sql;
create table want as
select <group 20 variables>,
       case when max(pin) ne min(pin) then 'Different Pins'
            else 'Same Pins'
       end as status
from have
group by <group 20 variables>
having count(*) > 1;
quit;

 

Ins_Analytic
Fluorite | Level 6

I honestly think any of these solutions could have worked if I were savvy enough. For the structure of my data, this is the one that worked best for me.

 

Thank you everyone for the suggestions, I really appreciate the help.

