Duplicates by group

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 19
Accepted Solution

Duplicates by group

Hi there,

    This seems like a very simple problem, but I can't seem to resolve it on my own.  I need to delete a group when the groups values are duplicates of another.

 

data have;
input var1 var2 var3 $;
cards;
1 1 A
2 1 B
3 1 C
4 2 A
5 2 M
6 2 T
7 3 A
8 3 B
9 3 C
10 4 S
11 4 V
12 4 Z
;
run;
Want:
Var 1 Var2 Var3
1 1 A
2 1 B
3 1 C
4 2 A
5 2 M
6 2 T
10 4 S
11 4 V
12 4 Z
 
How can I delete an entire group if the values between two groups match?  (If all values in var3 = all values in var3 by var2 then delete last.var2).
 
 
I’ve tried sql:
proc sql;
create table want as
select * from have
group by var3
having var2 = min(var2)
order by var1;
quit;
 
And it deletes the duplicate var2 group fine, but it also deletes any duplicate value of var3.
 
I’ve also tried proc sort with a data step including nodups and first and last, but I could never get it to come close to my desired output.
 
Where am I going wrong?

Accepted Solutions
Solution
‎04-13-2016 10:26 AM
Super User
Super User
Posts: 7,727

Re: Duplicates by group

Ah, I see.  The way I do it below, is to transpose up the data within each group, sort nodupkey the cat'd columns to get a disintct list, and then remove observations not having a group in the nodupkeylist.  Well, you will see the logic if you run it step by step.  Will have a think if there is an easier syntax for it.

data have;
input var1 var2 var3 $;
cards;
1 1 A
2 1 B
3 1 C
4 2 A
5 2 M
6 2 T
7 3 A
8 3 B
9 3 C
10 4 S
11 4 V
12 4 Z
;
run;

proc sort data=have out=have2;
by var2 var3;
run;

proc transpose data=have2 out=t_have;
by var2;
var var3;
run;

data t_have;
set t_have (drop=_name_);
all=catx(',',of col:);
run;

proc sort data=t_have nodupkey;
by all;
run;

proc sql;
delete from HAVE
where VAR2 not in (select VAR2 from T_HAVE);
quit;

One good thing about this approach is it removes data from the original table, leaving sort as is.

View solution in original post


All Replies
Occasional Contributor
Posts: 19

Re: Duplicates by group

I forgot to mention, I am using SAS 9.2.
Super User
Super User
Posts: 7,727

Re: Duplicates by group

Sort the data, then drop not first records, then sort back into original order - note I set a variable N to keep original order:

data have;
  input var1 var2 var3 $;
  n=_n_;
cards;
1 1 A
2 1 B
3 1 C
4 2 A
5 2 M
6 2 T
7 3 A
8 3 B
9 3 C
10 4 S
11 4 V
12 4 Z
;
run;

proc sort data=have out=want;
  by var3 var2 var1;
run;

data want;
  set want;
  by var3 var2 var1;
  if not(first.var3) then delete;
run;

proc sort data=want;
  by n;
run;
Occasional Contributor
Posts: 19

Re: Duplicates by group

Thanks the quick response RW9! I applied your solution and am getting results that not only remove the duplicate group from var2 but also any duplicate var3 values. I will have var3 values that duplicate a lot so need to keep those. Is there a way to restrict the deletion to just the duplicated group?
Super User
Super User
Posts: 7,727

Re: Duplicates by group

Sorry, I can only work with what you post here, show the problem in data.  What defines the term group in your statement "Is there a way to restrict the deletion to just the duplicated group"?  I had thought by your original post that var3 indicated the group, and thus any subsequent appearances of that variable would be considered "duplicates".

Occasional Contributor
Posts: 19

Re: Duplicates by group

Sorry I wasn't clear.  Var2 is the variable that contains the groups (i.e. 1, 2, 3 and 4).  Var3 are the values for each group (i.e. Group 1 contains the values A, B and C).  I am hoping to find a solution that identifies that groups 1 and 3 both have the same set of values (A, B, C) and to then delete group 3. 

 

I hope that helps.  Does "show the problem in data" mean you would prefer to work with a table that has more relevant data instead of example data?  I can put something together if that's the case.

Solution
‎04-13-2016 10:26 AM
Super User
Super User
Posts: 7,727

Re: Duplicates by group

Ah, I see.  The way I do it below, is to transpose up the data within each group, sort nodupkey the cat'd columns to get a disintct list, and then remove observations not having a group in the nodupkeylist.  Well, you will see the logic if you run it step by step.  Will have a think if there is an easier syntax for it.

data have;
input var1 var2 var3 $;
cards;
1 1 A
2 1 B
3 1 C
4 2 A
5 2 M
6 2 T
7 3 A
8 3 B
9 3 C
10 4 S
11 4 V
12 4 Z
;
run;

proc sort data=have out=have2;
by var2 var3;
run;

proc transpose data=have2 out=t_have;
by var2;
var var3;
run;

data t_have;
set t_have (drop=_name_);
all=catx(',',of col:);
run;

proc sort data=t_have nodupkey;
by all;
run;

proc sql;
delete from HAVE
where VAR2 not in (select VAR2 from T_HAVE);
quit;

One good thing about this approach is it removes data from the original table, leaving sort as is.

Occasional Contributor
Posts: 19

Re: Duplicates by group

Wow thank you RW9!  That is a very clever solution!  I applied it to my data and it works like a charm.  

 

If you have time to explain, how did "of col:" in the following code segment remove the duplicated group from var2?  It's fascinating.

data t_have;
set t_have (drop=_name_);
all=catx(',',of colSmiley Happy;
run;

 

 

 

 

Super User
Super User
Posts: 7,727

Re: Duplicates by group

You will normally see the of used in conjunction with array statements, so thats probably a good place to start.  Say we have:
array col{5} abc def asd iop yui;

What this is doing is creating a reference to those variables: abc def asd iop yui.

This reference is col and has 5 elements.

Now if I do:

catx(',',of col{*});

This wil concatenate all values in the array col with commas.

 

Much the same in the way I have used it in my code.  There is a special syntax for specifying a range of variables with the same prefix, so in my example each variable I am interested in is prefixed with COL, so I can refer to all of them as: COL:

I can refer to this in the same way as if it was an array, so we end up with:

all=catx(',',of COLSmiley Happy;

Or to use long notation:

all=catx(',',COL1,COL2,COL3...);

 

Occasional Contributor
Posts: 19

Re: Duplicates by group

Thanks for the explanation RW9.

Trusted Advisor
Posts: 1,116

Re: Duplicates by group

Maybe I've overlooked something, but it seems to me that the data step creating variable ALL could be omitted. Variable ALL could be replaced by variable list COL: in the subsequent PROC SORT step.

PROC Star
Posts: 1,146

Re: Duplicates by group

I'm pondering it, but I DO NOT consider this a simple problem! Hopefully, more soon.

 

Tom

Occasional Contributor
Posts: 19

Re: Duplicates by group

That's actually very validating to hear Tom!
☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 12 replies
  • 734 views
  • 2 likes
  • 4 in conversation