I have a fairly large table (30 fields by ~10M+ rows). I need to find duplicates based on a grouping of 20 of the fields. I was able to do this with:
proc sql;
create table count as
select *, 1 as line
from orig_data
;
create table dup_id as
select *, sum(line) as linesum
from count
group by {list 20 fields}
;
quit;
Any rows with linesum>1 are flagged to have dups based on the 20 field grouping and I could check them out.
My issue is that this group by statement does not exclude any of the other 10 fields from the table, within the grouping. For 9 of them that is fine, but I need identify a grouping where one particular field (PIN) is different from the rest of the grouping above. So essentially, I need a way to identify where those 20 fields are identical but PIN is different.
I think I can accomplish this with a sort and data step using first. logic, but I wanted to see if this is the preferred approach? Is there another way?
Thanks in advance.
You could also break it into multiple processing steps.
proc sql;
create table countGroups as
select {20 columns}
, count(distinct PID) as numPIDs
from have
group by {20 columns}
;
quit;
proc sql;
create table want as
select t1.*
from have as t1
inner join countGroups as t2
on t1.column1 = t2.column1
and t1.column2 = t2.column2
...
and t1.column20 = t2.column20
where t2.numPIDs > 1
;
quit;
I'm not a fan of clever solutions, but here I go... you could try abusing proc sql with the combination of merging statistics with a having clause.
proc sql;
create table want as
select {20 columns}
, PID
, count(*) as numRows
from have
group by {20 columns}
having numRows > 1
;
quit;
That should return a table with all sets of 20 columns that are duplicated across multiple columns. It will throw a warning about having PID in the select statement and not in the group by, but it should still run.
Thank you for your response, I appreciate you looking into this problem!
I am still getting the final grouping with the same PINs, unfortunately. For my results, I'm looking for some way to identify any lines that are identical based on the group of 20 columns but have different PINs.
Thanks!
You could also break it into multiple processing steps.
proc sql;
create table countGroups as
select {20 columns}
, count(distinct PID) as numPIDs
from have
group by {20 columns}
;
quit;
proc sql;
create table want as
select t1.*
from have as t1
inner join countGroups as t2
on t1.column1 = t2.column1
and t1.column2 = t2.column2
...
and t1.column20 = t2.column20
where t2.numPIDs > 1
;
quit;
Unsure I fully understand your need. Like this?
data HAVE;
GR1=1; GR20=1; PIN=1; output;
GR1=1; GR20=1; PIN=1; output;
GR1=2; GR20=1; PIN=1; output;
GR1=2; GR20=1; PIN=2; output;
GR1=3; GR20=1; PIN=1; output;
run;
proc sql;
create table WANT as
select GR1, GR20, count(distinct PIN) as NBPINS
from HAVE
group by GR1, GR20
having count(*) > 1;
quit;
GR1 | GR20 | NBPINS |
1 | 1 | 1 |
2 | 1 | 2 |
I need to find your top 2 lines where everything is the same. I need to know that CID (part of 20 grouping fields) has a duplicate on those 20 fields, with the same PIN as it's duplicate. So I need to extract the CID of the lines with duplicates on the 20 field group by, with the same PIN.
If your source data is stored in a SAS table and not a data base then below code should be o.k.
proc sort data=have out=inter;
by <col01 to col20> pin;
run;
data want;
set inter;
by <col01 to col20> pin;
/* rows with identical col1 to col20 */
if not (first.<col20> and last.<col20>) then
do;
/* rows with unique pin's within "identical" rows */
if first.pin and last.pin then output;
end;
run;
Something like this maybe? Test it on some small data sets and let me know if it works.
proc sql;
create table want as
select <group 20 variables> , case when max(pin) ne min(pin) then 'Different Pins' else 'Same Pins' end as status
from have
group by <group 20 variables>
having count(*) > 1;
quit;
I honestly think any of these solutions could have worked if I was savvy enough. For the structure of the data, this is the one that worked best for me.
Thank you everyone for the suggestions, I really appreciate the help.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.