topic Drop duplicate and original in SAS Enterprise Guide

Drop duplicate and original

Sean_OConnor — Wed, 04 Jan 2017 09:17:00 GMT

Folks,

I've been searching online for a solution to my query, but have yet to find one. In my orginal dataset I had a number of duplicate individuals (however, the information contained within the variables were different). Thus, I originally took out the two columns of information on the individual and aggregated it into one row.

However, now I would like to merge it back into my original dataset but firstly I would like to drop all duplicates along with the original. For ease here is a visual example.

id x1 x2 x3

1 5 6 5

1 10 2 3

2 6 5 85

3 63 2 11

4 4 68 95

id x1 x2 x3

2 6 5 85

3 63 2 11

4 4 68 95

Any advice is welcome.

Re: Drop duplicate and original

jefreytag — Wed, 04 Jan 2017 10:12:14 GMT

data dupremove;
	input id x1 x2 x3;
	datalines;
2 6 5 85
1 5 6 5
3 4 7 9
9 9 8 8
4 9 23 234
1 10 2 3
23 5 42 3
3 63 2 11
4 4 68 95
;
run;

proc sort data=dupremove;
	by id;
run;

data dupremove;
	set dupremove;
	by id;

	if not first.id=last.id then
		delete;
run;

proc print data=dupremove;
run;

Here you are!

The trick after sorting is the "if not first.id=last.id then delete;" line. This only outputs lines where only a unique id exists. If the first.id and the last.id variable are not 1 it means there is more than one line with the same id. Note you have to have the "by id" in the data step.

Re: Drop duplicate and original

Astounding — Wed, 04 Jan 2017 12:15:02 GMT

I'm afraid that solution won't work when you start with 3 or more observations for the same ID. It's simpler and more reliable to code:

if first.id=0 or last.id=0 then delete;

If you feel more clever than that, you can try:

if not first.id * last.id;

Re: Drop duplicate and original

jefreytag — Wed, 04 Jan 2017 12:22:57 GMT

Absolutely right, missed that logic. Corrected example with enhanced test data:

data dupremove;
	input id x1 x2 x3;
	datalines;
2 6 5 85
1 5 6 5
3 4 7 9
9 9 8 8
3 8 8 9
3 8 9 9
4 9 23 234
1 10 2 3
23 5 42 3
3 63 2 11
4 4 68 95
;
run;

proc sort data=dupremove;
	by id;
run;

data dupremove;
	set dupremove;
	by id;

if first.id=0 or last.id=0 then
		delete;
run;

proc print data=dupremove;
run;