BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
MillerEL
Obsidian | Level 7

Hello,

 

I have a dataset that has some duplicate records. I am trying to removing duplicates which are identified as unique combinations of a case id and individual id. However, when I use the nodupkey option, it is removing hundreds of records instead of the 11 records with unique combinations of these two variables. All the documentation says that this should work. So I tried using the last. option, but it didn't work either. Neither of the options below are removing unique combinations only. 

 

Can someone help me figure out what I'm doing wrong?

 

PROC SORT data=one out=two NODUPKEY;
	by caseid indvid;
run;

DATA nodup;
set one; by caseid indvid;
if last.indvid;
run;
1 ACCEPTED SOLUTION

Accepted Solutions
MillerEL
Obsidian | Level 7

I found my own solution. I tried it just before I posted and it didn't work - syntax error. This works.

DATA nodups;
	set one;
	by caseid indvid;
	if first.caseid and first.indvid then;
	else output;
run;

View solution in original post

6 REPLIES 6
MillerEL
Obsidian | Level 7

I found my own solution. I tried it just before I posted and it didn't work - syntax error. This works.

DATA nodups;
	set one;
	by caseid indvid;
	if first.caseid and first.indvid then;
	else output;
run;
Tom
Super User Tom
Super User

That code is NOT removing duplicate records. Instead it will just REMOVE one record per value of CASEID.  Note there is no need to include FIRST.INDVID in your code as is does not add anything.  The value of FIRST.INDVID will always be true when FIRST.CASEID is true since by definition if it is the first record for this group of CASEID values it also the first the set of INDVID values nested within the values of CASEID.

 

So all that adding INDVID to the BY statement does in your data step is to make sure that when there are multiple records for a value of CASEID you will REMVOE the record that has the minimum value of INDVID.

MillerEL
Obsidian | Level 7

That's odd, because the code does exactly what I want it to do when tested. It keeps the first record with a unique individ for each caseid (and, by default removing the rest). I tested it extensively, it is doing what it is supposed to.  

Tom
Super User Tom
Super User

Perhaps you mean something different than what it sounds like you are saying?

Here is some sample data.

data one;
  input caseid indvid row_id;
cards;
1 1 1
1 1 2
1 2 3
1 3 4
2 1 5
2 2 6
2 3 7
3 1 8
4 1 9
;

Your original request was

I am trying to removing duplicates which are identified as unique combinations of a case id and individual id. 

Which I would take to mean you want to only delete  rows 1 and 2 since those are the only two observations that have the exact same values of CASEID and INDVID .

data eliminate_dups;
  set one;
  by caseid indvid;
  if (first.indvid and last.indvid) then output;
run;

 

But if you run your logic on that file

DATA nodups;
	set one;
	by caseid indvid;
	if first.caseid and first.indvid then;
	else output;
run;

It removes rows 1, 5, 8 and 9 and keeps rows 2,3,4,6 and 7.

 

 

 

MillerEL
Obsidian | Level 7

I'll look into it, thanks!

ballardw
Super User

Concrete data examples for start and desired end always help.

 

From the number of records that are removed it sounds like you have multiple records of caseid indvid with other variables that are not duplicated and you want to keep. Perhaps you should try NODUPREC instead of NoDupKey.

See this example:

 

data junk;
   do x= 1 to 3;
   do y= 10 to 13;
   output;
   output;
   end;
   end;

run;

proc sort data=junk out=junksorted noduprec;
   by x;
run;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 17352 views
  • 1 like
  • 3 in conversation