DATA Step, Macro, Functions and more

Identify duplicate based on three variables

Accepted Solution Solved
Reply
Regular Contributor
Posts: 194
Accepted Solution

Identify duplicate based on three variables

G'day.  Any help you can give will be much appreciated:

I have 3 variables (all character) I need to use to determine if there are any duplicates.

Table Structure

ID             Child   Group     ...OTHER VARIABLES

S001        001      P

S001        001      R

S002        001      R

S002        001      P

S002        002      P

S002       003       P

S003       001      R

S003      002       R

S003      003       R

S003      001      P

How do you determine if there are duplicates based on the three variables?  I have old code which identified based on the first two variables (ID and Child which I modified below

to include the third variable but I'm getting the error message that count has two many variables:

ERROR 72-185: The COUNT function call has too many arguments.

data errors;


set edit;


concatvbl=id||child||group;

keep id child group concatvbl ;

proc sort data=krserrors;


by concatvbl;


data errors;


set errors;


by concatvbl;


if first.concatvbl then duplicate=0; else duplicate=count(concatvbl, id, child, group);


data errors;


set errors;


if duplicate ne 0;


;



proc print;


var id child group duplicate;


title2 'Duplicate Errors';


run;


Accepted Solutions
Solution
‎09-04-2014 09:27 PM
Trusted Advisor
Posts: 1,137

Re: Identify duplicate based on three variables

chris suggested a good approache to delete the duplicate records.

after checking the code mentioned by you, the reason for the error is there are too many arguments in the count function. The purpose of count function is to count the number of similar words in the string, in your case the count function has many arguments, like you are asking the function to search for similar words that are in concatvbl by using id , child, group. However you have to mention only one variable of the three. the correct syntax is count(concatvbl, id)

data edit;

input ID$             Child   Group$     ;

cards;

S001        001      P

S001        001      R

S002        001      R

S002        001      P

S002        002      P

S002       003       P

S003       001      R

S003      002       R

S003      003       R

S003      001      P

;

run;

data errors;

set edit;

concatvbl=id||child||group;

keep id child group concatvbl ;

run;

proc sort data=errors;

by concatvbl;

run;

data errors2;

set errors;

by concatvbl;

if first.concatvbl then duplicate=0;

else duplicate=count(concatvbl, id);

run;

data errors3;

set errors2;

if duplicate ne 0;

run;

proc print;

var id child group duplicate;

title2 'Duplicate Errors';

run;

after correction, the code executes well. Please check.

Also another approach to detect the duplicated records on three variables is by using retain statement

data errors2;

set errors;

retain duplicate;

by concatvbl;

if first.concatvbl then duplicate=0;

else duplicate=duplicate+1;

run;

where the duplicate records have incremental values.

Thanks,

Jag

Thanks,
Jag

View solution in original post


All Replies
Super Contributor
Posts: 440

Re: Identify duplicate based on three variables

There's a much simpler way of doing this with Proc Sort and the dupout option which writes duplicate key values to a new file as below (I've added a record with duplicate keys at observation 2 to show it in action)

data edit;

  input id $ child $ group $ other;

  datalines;

S001 001 P 1

S001 001 P 2

S001 001 R 2

S002 001 R 3

S002 001 P 4

S002 002 P 5

S002 003 P 6

S003 001 R 7

S003 002 R 8

S003 003 R 9

S003 001 P 0

;

run;

proc sort data=edit out=edit_unique dupout=dups nodupkey;

by id child group;

run;

proc print data=dups;

run;

Chris

Regular Contributor
Posts: 194

Re: Identify duplicate based on three variables

Posted in reply to ChrisBrooks

Thanks! I need to look at the duplicates before deleting but will keep this in mind for other other projects!

Solution
‎09-04-2014 09:27 PM
Trusted Advisor
Posts: 1,137

Re: Identify duplicate based on three variables

chris suggested a good approache to delete the duplicate records.

after checking the code mentioned by you, the reason for the error is there are too many arguments in the count function. The purpose of count function is to count the number of similar words in the string, in your case the count function has many arguments, like you are asking the function to search for similar words that are in concatvbl by using id , child, group. However you have to mention only one variable of the three. the correct syntax is count(concatvbl, id)

data edit;

input ID$             Child   Group$     ;

cards;

S001        001      P

S001        001      R

S002        001      R

S002        001      P

S002        002      P

S002       003       P

S003       001      R

S003      002       R

S003      003       R

S003      001      P

;

run;

data errors;

set edit;

concatvbl=id||child||group;

keep id child group concatvbl ;

run;

proc sort data=errors;

by concatvbl;

run;

data errors2;

set errors;

by concatvbl;

if first.concatvbl then duplicate=0;

else duplicate=count(concatvbl, id);

run;

data errors3;

set errors2;

if duplicate ne 0;

run;

proc print;

var id child group duplicate;

title2 'Duplicate Errors';

run;

after correction, the code executes well. Please check.

Also another approach to detect the duplicated records on three variables is by using retain statement

data errors2;

set errors;

retain duplicate;

by concatvbl;

if first.concatvbl then duplicate=0;

else duplicate=duplicate+1;

run;

where the duplicate records have incremental values.

Thanks,

Jag

Thanks,
Jag
Regular Contributor
Posts: 194

Re: Identify duplicate based on three variables

Posted in reply to Jagadishkatam

Thanks!  This works and I appreciate your sharing with me the 'retain' statement.

Trusted Advisor
Posts: 3,215

Re: Identify duplicate based on three variables

de-duplication is part of sort functionality Base SAS(R) 9.4 Procedures Guide, Third Edition

---->-- ja karman --<-----
Super User
Super User
Posts: 7,070

Re: Identify duplicate based on three variables

There is no need to create the combined variable concatvbl to locate duplicates.

When you have multiple BY variables to test for exact duplicates you just need to test the FIRST. and LAST flags for the last of the BY variables.

data have ;

input id $ child $ group $ other ;

cards;

S001 001 P .

S001 001 R .

S002 001 R .

S002 001 P .

S002 002 P .

S002 003 P .

S003 001 R .

S003 001 R 1

S003 002 R .

S003 003 R .

S003 001 P .

run;

proc sort;

  by id child group;

run;

data duplicates ;

  set have ;

  by id child group ;

  if not (first.group and last.group);

  put (_all_) (Smiley Happy;

run;

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 6 replies
  • 687 views
  • 3 likes
  • 5 in conversation