Hi there,
This seems like a very simple problem, but I can't seem to resolve it on my own. I need to delete a group when the groups values are duplicates of another.
data have;
input var1 var2 var3 $;
cards;
1 1 A
2 1 B
3 1 C
4 2 A
5 2 M
6 2 T
7 3 A
8 3 B
9 3 C
10 4 S
11 4 V
12 4 Z
;
run;
Ah, I see. The way I do it below, is to transpose up the data within each group, sort nodupkey the cat'd columns to get a disintct list, and then remove observations not having a group in the nodupkeylist. Well, you will see the logic if you run it step by step. Will have a think if there is an easier syntax for it.
data have;
input var1 var2 var3 $;
cards;
1 1 A
2 1 B
3 1 C
4 2 A
5 2 M
6 2 T
7 3 A
8 3 B
9 3 C
10 4 S
11 4 V
12 4 Z
;
run;
proc sort data=have out=have2;
by var2 var3;
run;
proc transpose data=have2 out=t_have;
by var2;
var var3;
run;
data t_have;
set t_have (drop=_name_);
all=catx(',',of col:);
run;
proc sort data=t_have nodupkey;
by all;
run;
proc sql;
delete from HAVE
where VAR2 not in (select VAR2 from T_HAVE);
quit;
One good thing about this approach is it removes data from the original table, leaving sort as is.
Sort the data, then drop not first records, then sort back into original order - note I set a variable N to keep original order:
data have; input var1 var2 var3 $; n=_n_; cards; 1 1 A 2 1 B 3 1 C 4 2 A 5 2 M 6 2 T 7 3 A 8 3 B 9 3 C 10 4 S 11 4 V 12 4 Z ; run; proc sort data=have out=want; by var3 var2 var1; run; data want; set want; by var3 var2 var1; if not(first.var3) then delete; run; proc sort data=want; by n; run;
Sorry, I can only work with what you post here, show the problem in data. What defines the term group in your statement "Is there a way to restrict the deletion to just the duplicated group"? I had thought by your original post that var3 indicated the group, and thus any subsequent appearances of that variable would be considered "duplicates".
Sorry I wasn't clear. Var2 is the variable that contains the groups (i.e. 1, 2, 3 and 4). Var3 are the values for each group (i.e. Group 1 contains the values A, B and C). I am hoping to find a solution that identifies that groups 1 and 3 both have the same set of values (A, B, C) and to then delete group 3.
I hope that helps. Does "show the problem in data" mean you would prefer to work with a table that has more relevant data instead of example data? I can put something together if that's the case.
Ah, I see. The way I do it below, is to transpose up the data within each group, sort nodupkey the cat'd columns to get a disintct list, and then remove observations not having a group in the nodupkeylist. Well, you will see the logic if you run it step by step. Will have a think if there is an easier syntax for it.
data have;
input var1 var2 var3 $;
cards;
1 1 A
2 1 B
3 1 C
4 2 A
5 2 M
6 2 T
7 3 A
8 3 B
9 3 C
10 4 S
11 4 V
12 4 Z
;
run;
proc sort data=have out=have2;
by var2 var3;
run;
proc transpose data=have2 out=t_have;
by var2;
var var3;
run;
data t_have;
set t_have (drop=_name_);
all=catx(',',of col:);
run;
proc sort data=t_have nodupkey;
by all;
run;
proc sql;
delete from HAVE
where VAR2 not in (select VAR2 from T_HAVE);
quit;
One good thing about this approach is it removes data from the original table, leaving sort as is.
Wow thank you RW9! That is a very clever solution! I applied it to my data and it works like a charm.
If you have time to explain, how did "of col:" in the following code segment remove the duplicated group from var2? It's fascinating.
data t_have;
set t_have (drop=_name_);
all=catx(',',of col:);
run;
You will normally see the of used in conjunction with array statements, so thats probably a good place to start. Say we have:
array col{5} abc def asd iop yui;
What this is doing is creating a reference to those variables: abc def asd iop yui.
This reference is col and has 5 elements.
Now if I do:
catx(',',of col{*});
This wil concatenate all values in the array col with commas.
Much the same in the way I have used it in my code. There is a special syntax for specifying a range of variables with the same prefix, so in my example each variable I am interested in is prefixed with COL, so I can refer to all of them as: COL:
I can refer to this in the same way as if it was an array, so we end up with:
all=catx(',',of COL:);
Or to use long notation:
all=catx(',',COL1,COL2,COL3...);
Thanks for the explanation RW9.
Maybe I've overlooked something, but it seems to me that the data step creating variable ALL could be omitted. Variable ALL could be replaced by variable list COL: in the subsequent PROC SORT step.
I'm pondering it, but I DO NOT consider this a simple problem! Hopefully, more soon.
Tom
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.