Solved: Duplicates by group

sas-inquirer · Posted 04-13-2016 08:56 AM

Hi there,

This seems like a very simple problem, but I can't seem to resolve it on my own. I need to delete a group when the groups values are duplicates of another.

data have;
input var1 var2 var3 $;
cards;
1 1 A
2 1 B
3 1 C
4 2 A
5 2 M
6 2 T
7 3 A
8 3 B
9 3 C
10 4 S
11 4 V
12 4 Z
;
run;

Want:

Var 1 Var2 Var3

1 1 A

2 1 B

3 1 C

4 2 A

5 2 M

6 2 T

10 4 S

11 4 V

12 4 Z

How can I delete an entire group if the values between two groups match? (If all values in var3 = all values in var3 by var2 then delete last.var2).

I’ve tried sql:

proc sql;

create table want as

select * from have

group by var3

having var2 = min(var2)

order by var1;

quit;

And it deletes the duplicate var2 group fine, but it also deletes any duplicate value of var3.

I’ve also tried proc sort with a data step including nodups and first and last, but I could never get it to come close to my desired output.

Where am I going wrong?

RW9 · Posted 04-13-2016 10:05 AM

Ah, I see. The way I do it below, is to transpose up the data within each group, sort nodupkey the cat'd columns to get a disintct list, and then remove observations not having a group in the nodupkeylist. Well, you will see the logic if you run it step by step. Will have a think if there is an easier syntax for it.

data have;
 input var1 var2 var3 $;
cards;
1 1 A
2 1 B
3 1 C
4 2 A
5 2 M
6 2 T
7 3 A
8 3 B
9 3 C
10 4 S
11 4 V
12 4 Z
;
run;

proc sort data=have out=have2;
 by var2 var3;
run;

proc transpose data=have2 out=t_have;
 by var2;
 var var3;
run;

data t_have;
 set t_have (drop=_name_);
 all=catx(',',of col:);
run;

proc sort data=t_have nodupkey;
 by all;
run;

proc sql;
 delete from HAVE
 where VAR2 not in (select VAR2 from T_HAVE);
quit;

One good thing about this approach is it removes data from the original table, leaving sort as is.

View solution in original post

sas-inquirer · Posted 04-13-2016 08:57 AM

I forgot to mention, I am using SAS 9.2.

RW9 · Posted 04-13-2016 09:02 AM

Sort the data, then drop not first records, then sort back into original order - note I set a variable N to keep original order:

data have;
  input var1 var2 var3 $;
  n=_n_;
cards;
1 1 A
2 1 B
3 1 C
4 2 A
5 2 M
6 2 T
7 3 A
8 3 B
9 3 C
10 4 S
11 4 V
12 4 Z
;
run;

proc sort data=have out=want;
  by var3 var2 var1;
run;

data want;
  set want;
  by var3 var2 var1;
  if not(first.var3) then delete;
run;

proc sort data=want;
  by n;
run;

sas-inquirer · Posted 04-13-2016 09:08 AM

Thanks the quick response RW9! I applied your solution and am getting results that not only remove the duplicate group from var2 but also any duplicate var3 values. I will have var3 values that duplicate a lot so need to keep those. Is there a way to restrict the deletion to just the duplicated group?

RW9 · Posted 04-13-2016 09:39 AM

Sorry, I can only work with what you post here, show the problem in data. What defines the term group in your statement "Is there a way to restrict the deletion to just the duplicated group"? I had thought by your original post that var3 indicated the group, and thus any subsequent appearances of that variable would be considered "duplicates".

sas-inquirer · Posted 04-13-2016 09:48 AM

Sorry I wasn't clear. Var2 is the variable that contains the groups (i.e. 1, 2, 3 and 4). Var3 are the values for each group (i.e. Group 1 contains the values A, B and C). I am hoping to find a solution that identifies that groups 1 and 3 both have the same set of values (A, B, C) and to then delete group 3.

I hope that helps. Does "show the problem in data" mean you would prefer to work with a table that has more relevant data instead of example data? I can put something together if that's the case.

RW9 · Posted 04-13-2016 10:05 AM

Ah, I see. The way I do it below, is to transpose up the data within each group, sort nodupkey the cat'd columns to get a disintct list, and then remove observations not having a group in the nodupkeylist. Well, you will see the logic if you run it step by step. Will have a think if there is an easier syntax for it.

data have;
 input var1 var2 var3 $;
cards;
1 1 A
2 1 B
3 1 C
4 2 A
5 2 M
6 2 T
7 3 A
8 3 B
9 3 C
10 4 S
11 4 V
12 4 Z
;
run;

proc sort data=have out=have2;
 by var2 var3;
run;

proc transpose data=have2 out=t_have;
 by var2;
 var var3;
run;

data t_have;
 set t_have (drop=_name_);
 all=catx(',',of col:);
run;

proc sort data=t_have nodupkey;
 by all;
run;

proc sql;
 delete from HAVE
 where VAR2 not in (select VAR2 from T_HAVE);
quit;

One good thing about this approach is it removes data from the original table, leaving sort as is.

sas-inquirer · Posted 04-13-2016 10:26 AM

Wow thank you RW9! That is a very clever solution! I applied it to my data and it works like a charm.

If you have time to explain, how did "of col:" in the following code segment remove the duplicated group from var2? It's fascinating.

data t_have;
set t_have (drop=_name_);
all=catx(',',of col:);
run;

RW9 · Posted 04-13-2016 10:32 AM

You will normally see the of used in conjunction with array statements, so thats probably a good place to start. Say we have:
array col{5} abc def asd iop yui;

What this is doing is creating a reference to those variables: abc def asd iop yui.

This reference is col and has 5 elements.

Now if I do:

catx(',',of col{*});

This wil concatenate all values in the array col with commas.

Much the same in the way I have used it in my code. There is a special syntax for specifying a range of variables with the same prefix, so in my example each variable I am interested in is prefixed with COL, so I can refer to all of them as: COL:

I can refer to this in the same way as if it was an array, so we end up with:

all=catx(',',of COL:);

Or to use long notation:

all=catx(',',COL1,COL2,COL3...);

sas-inquirer · Posted 04-13-2016 10:59 AM

Thanks for the explanation RW9.

FreelanceReinh · Posted 04-15-2016 05:43 PM

Maybe I've overlooked something, but it seems to me that the data step creating variable ALL could be omitted. Variable ALL could be replaced by variable list COL: in the subsequent PROC SORT step.

TomKari · Posted 04-13-2016 09:11 AM

I'm pondering it, but I DO NOT consider this a simple problem! Hopefully, more soon.

Tom

sas-inquirer · Posted 04-13-2016 09:21 AM

That's actually very validating to hear Tom!

Duplicates by group

Re: Duplicates by group

Re: Duplicates by group

Re: Duplicates by group

Re: Duplicates by group

Re: Duplicates by group

Re: Duplicates by group

Re: Duplicates by group

Re: Duplicates by group

Re: Duplicates by group

Re: Duplicates by group

Re: Duplicates by group

Re: Duplicates by group

Re: Duplicates by group

Registration is open

Call for Content EXTENDED

Registration is open

Call for Content EXTENDED

SAS Training: Just a Click Away