DATA Step, Macro, Functions and more

Removing Duplicates by multiple key variables

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 14
Accepted Solution

Removing Duplicates by multiple key variables

Hello,

 

I have a dataset that has some duplicate records. I am trying to removing duplicates which are identified as unique combinations of a case id and individual id. However, when I use the nodupkey option, it is removing hundreds of records instead of the 11 records with unique combinations of these two variables. All the documentation says that this should work. So I tried using the last. option, but it didn't work either. Neither of the options below are removing unique combinations only. 

 

Can someone help me figure out what I'm doing wrong?

 

PROC SORT data=one out=two NODUPKEY;
	by caseid indvid;
run;

DATA nodup;
set one; by caseid indvid;
if last.indvid;
run;

Accepted Solutions
Solution
‎09-26-2017 02:30 PM
Occasional Contributor
Posts: 14

Re: Removing Duplicates by multiple key variables

I found my own solution. I tried it just before I posted and it didn't work - syntax error. This works.

DATA nodups;
	set one;
	by caseid indvid;
	if first.caseid and first.indvid then;
	else output;
run;

View solution in original post


All Replies
Solution
‎09-26-2017 02:30 PM
Occasional Contributor
Posts: 14

Re: Removing Duplicates by multiple key variables

I found my own solution. I tried it just before I posted and it didn't work - syntax error. This works.

DATA nodups;
	set one;
	by caseid indvid;
	if first.caseid and first.indvid then;
	else output;
run;
Super User
Super User
Posts: 7,844

Re: Removing Duplicates by multiple key variables

[ Edited ]

That code is NOT removing duplicate records. Instead it will just REMOVE one record per value of CASEID.  Note there is no need to include FIRST.INDVID in your code as is does not add anything.  The value of FIRST.INDVID will always be true when FIRST.CASEID is true since by definition if it is the first record for this group of CASEID values it also the first the set of INDVID values nested within the values of CASEID.

 

So all that adding INDVID to the BY statement does in your data step is to make sure that when there are multiple records for a value of CASEID you will REMVOE the record that has the minimum value of INDVID.

Super User
Posts: 13,000

Re: Removing Duplicates by multiple key variables

Concrete data examples for start and desired end always help.

 

From the number of records that are removed it sounds like you have multiple records of caseid indvid with other variables that are not duplicated and you want to keep. Perhaps you should try NODUPREC instead of NoDupKey.

See this example:

 

data junk;
   do x= 1 to 3;
   do y= 10 to 13;
   output;
   output;
   end;
   end;

run;

proc sort data=junk out=junksorted noduprec;
   by x;
run;
☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 3 replies
  • 550 views
  • 0 likes
  • 3 in conversation