- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I have a fairly large table (30 fields by ~10M+ rows). I need to find duplicates based on a grouping of 20 of the fields. I was able to do this with:
proc sql;
create table count as
select *, 1 as line
from orig_data
;
create table dup_id as
select *, sum(line) as linesum
from count
group by {list 20 fields}
;
quit;
Any rows with linesum>1 are flagged to have dups based on the 20 field grouping and I could check them out.
My issue is that this group by statement does not exclude any of the other 10 fields from the table, within the grouping. For 9 of them that is fine, but I need identify a grouping where one particular field (PIN) is different from the rest of the grouping above. So essentially, I need a way to identify where those 20 fields are identical but PIN is different.
I think I can accomplish this with a sort and data step using first. logic, but I wanted to see if this is the preferred approach? Is there another way?
Thanks in advance.
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You could also break it into multiple processing steps.
- Group by the 20 variables counting the number of PID's in each set
- Inner join that table to the original on the 20 variables, where the number of PID's is > 1
proc sql;
create table countGroups as
select {20 columns}
, count(distinct PID) as numPIDs
from have
group by {20 columns}
;
quit;proc sql; create table want as select t1.* from have as t1 inner join countGroups as t2 on t1.column1 = t2.column1 and t1.column2 = t2.column2 ... and t1.column20 = t2.column20 where t2.numPIDs > 1;quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I'm not a fan of clever solutions, but here I go... you could try abusing proc sql with the combination of merging statistics with a having clause.
proc sql;
create table want as
select {20 columns}
, PID
, count(*) as numRows
from have
group by {20 columns}
having numRows > 1
;
quit;
That should return a table with all sets of 20 columns that are duplicated across multiple columns. It will throw a warning about having PID in the select statement and not in the group by, but it should still run.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thank you for your response, I appreciate you looking into this problem!
I am still getting the final grouping with the same PINs, unfortunately. For my results, I'm looking for some way to identify any lines that are identical based on the group of 20 columns but have different PINs.
Thanks!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You could also break it into multiple processing steps.
- Group by the 20 variables counting the number of PID's in each set
- Inner join that table to the original on the 20 variables, where the number of PID's is > 1
proc sql;
create table countGroups as
select {20 columns}
, count(distinct PID) as numPIDs
from have
group by {20 columns}
;
quit;proc sql; create table want as select t1.* from have as t1 inner join countGroups as t2 on t1.column1 = t2.column1 and t1.column2 = t2.column2 ... and t1.column20 = t2.column20 where t2.numPIDs > 1;quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Unsure I fully understand your need. Like this?
data HAVE;
GR1=1; GR20=1; PIN=1; output;
GR1=1; GR20=1; PIN=1; output;
GR1=2; GR20=1; PIN=1; output;
GR1=2; GR20=1; PIN=2; output;
GR1=3; GR20=1; PIN=1; output;
run;
proc sql;
create table WANT as
select GR1, GR20, count(distinct PIN) as NBPINS
from HAVE
group by GR1, GR20
having count(*) > 1;
quit;
GR1 | GR20 | NBPINS |
1 | 1 | 1 |
2 | 1 | 2 |
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I need to find your top 2 lines where everything is the same. I need to know that CID (part of 20 grouping fields) has a duplicate on those 20 fields, with the same PIN as it's duplicate. So I need to extract the CID of the lines with duplicates on the 20 field group by, with the same PIN.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If your source data is stored in a SAS table and not a data base then below code should be o.k.
proc sort data=have out=inter;
by <col01 to col20> pin;
run;
data want;
set inter;
by <col01 to col20> pin;
/* rows with identical col1 to col20 */
if not (first.<col20> and last.<col20>) then
do;
/* rows with unique pin's within "identical" rows */
if first.pin and last.pin then output;
end;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Something like this maybe? Test it on some small data sets and let me know if it works.
proc sql;
create table want as
select <group 20 variables> , case when max(pin) ne min(pin) then 'Different Pins' else 'Same Pins' end as status
from have
group by <group 20 variables>
having count(*) > 1;
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I honestly think any of these solutions could have worked if I was savvy enough. For the structure of the data, this is the one that worked best for me.
Thank you everyone for the suggestions, I really appreciate the help.