topic Re: Duplicates by group in SAS Programming

Duplicates by group

sas-inquirer — Wed, 13 Apr 2016 12:56:18 GMT

Hi there,

This seems like a very simple problem, but I can't seem to resolve it on my own. I need to delete a group when the groups values are duplicates of another.

data have;
input var1 var2 var3 $;
cards;
1 1 A
2 1 B
3 1 C
4 2 A
5 2 M
6 2 T
7 3 A
8 3 B
9 3 C
10 4 S
11 4 V
12 4 Z
;
run;

Want:

Var 1 Var2 Var3

1 1 A

2 1 B

3 1 C

4 2 A

5 2 M

6 2 T

10 4 S

11 4 V

12 4 Z

How can I delete an entire group if the values between two groups match? (If all values in var3 = all values in var3 by var2 then delete last.var2).

I’ve tried sql:

proc sql;

create table want as

select * from have

group by var3

having var2 = min(var2)

order by var1;

quit;

And it deletes the duplicate var2 group fine, but it also deletes any duplicate value of var3.

I’ve also tried proc sort with a data step including nodups and first and last, but I could never get it to come close to my desired output.

Where am I going wrong?

Re: Duplicates by group

sas-inquirer — Wed, 13 Apr 2016 12:57:42 GMT

I forgot to mention, I am using SAS 9.2.

Re: Duplicates by group

RW9 — Wed, 13 Apr 2016 13:02:02 GMT

Sort the data, then drop not first records, then sort back into original order - note I set a variable N to keep original order:

data have;
  input var1 var2 var3 $;
  n=_n_;
cards;
1 1 A
2 1 B
3 1 C
4 2 A
5 2 M
6 2 T
7 3 A
8 3 B
9 3 C
10 4 S
11 4 V
12 4 Z
;
run;

proc sort data=have out=want;
  by var3 var2 var1;
run;

data want;
  set want;
  by var3 var2 var1;
  if not(first.var3) then delete;
run;

proc sort data=want;
  by n;
run;

Re: Duplicates by group

sas-inquirer — Wed, 13 Apr 2016 13:08:34 GMT

Thanks the quick response RW9! I applied your solution and am getting results that not only remove the duplicate group from var2 but also any duplicate var3 values. I will have var3 values that duplicate a lot so need to keep those. Is there a way to restrict the deletion to just the duplicated group?

Re: Duplicates by group

TomKari — Wed, 13 Apr 2016 13:11:41 GMT

I'm pondering it, but I DO NOT consider this a simple problem! Hopefully, more soon.

Tom

Re: Duplicates by group

sas-inquirer — Wed, 13 Apr 2016 13:21:39 GMT

That's actually very validating to hear Tom!

Re: Duplicates by group

RW9 — Wed, 13 Apr 2016 13:39:39 GMT

Sorry, I can only work with what you post here, show the problem in data. What defines the term group in your statement "Is there a way to restrict the deletion to just the duplicated group"? I had thought by your original post that var3 indicated the group, and thus any subsequent appearances of that variable would be considered "duplicates".

Re: Duplicates by group

sas-inquirer — Wed, 13 Apr 2016 13:48:42 GMT

Sorry I wasn't clear. Var2 is the variable that contains the groups (i.e. 1, 2, 3 and 4). Var3 are the values for each group (i.e. Group 1 contains the values A, B and C). I am hoping to find a solution that identifies that groups 1 and 3 both have the same set of values (A, B, C) and to then delete group 3.

I hope that helps. Does "show the problem in data" mean you would prefer to work with a table that has more relevant data instead of example data? I can put something together if that's the case.

Re: Duplicates by group

RW9 — Wed, 13 Apr 2016 14:05:02 GMT

Ah, I see. The way I do it below, is to transpose up the data within each group, sort nodupkey the cat'd columns to get a disintct list, and then remove observations not having a group in the nodupkeylist. Well, you will see the logic if you run it step by step. Will have a think if there is an easier syntax for it.

data have;
 input var1 var2 var3 $;
cards;
1 1 A
2 1 B
3 1 C
4 2 A
5 2 M
6 2 T
7 3 A
8 3 B
9 3 C
10 4 S
11 4 V
12 4 Z
;
run;

proc sort data=have out=have2;
 by var2 var3;
run;

proc transpose data=have2 out=t_have;
 by var2;
 var var3;
run;

data t_have;
 set t_have (drop=_name_);
 all=catx(',',of col:);
run;

proc sort data=t_have nodupkey;
 by all;
run;

proc sql;
 delete from HAVE
 where VAR2 not in (select VAR2 from T_HAVE);
quit;

One good thing about this approach is it removes data from the original table, leaving sort as is.

Re: Duplicates by group

sas-inquirer — Wed, 13 Apr 2016 14:26:42 GMT

Wow thank you RW9! That is a very clever solution! I applied it to my data and it works like a charm.

If you have time to explain, how did "of col:" in the following code segment remove the duplicated group from var2? It's fascinating.

data t_have;
set t_have (drop=_name_);
all=catx(',',of col:);
run;

Re: Duplicates by group

RW9 — Wed, 13 Apr 2016 14:32:50 GMT

You will normally see the of used in conjunction with array statements, so thats probably a good place to start. Say we have:
array col{5} abc def asd iop yui;

What this is doing is creating a reference to those variables: abc def asd iop yui.

This reference is col and has 5 elements.

Now if I do:

catx(',',of col{*});

This wil concatenate all values in the array col with commas.

Much the same in the way I have used it in my code. There is a special syntax for specifying a range of variables with the same prefix, so in my example each variable I am interested in is prefixed with COL, so I can refer to all of them as: COL:

I can refer to this in the same way as if it was an array, so we end up with:

all=catx(',',of COL:);

Or to use long notation:

all=catx(',',COL1,COL2,COL3...);

Re: Duplicates by group

sas-inquirer — Wed, 13 Apr 2016 14:59:46 GMT

Thanks for the explanation RW9.

Re: Duplicates by group

FreelanceReinh — Fri, 15 Apr 2016 21:43:12 GMT

Maybe I've overlooked something, but it seems to me that the data step creating variable ALL could be omitted. Variable ALL could be replaced by variable list COL: in the subsequent PROC SORT step.