Solved: Re: Identify duplicate based on three variables

jcis7 · Posted 09-04-2014 06:30 PM

G'day. Any help you can give will be much appreciated:

I have 3 variables (all character) I need to use to determine if there are any duplicates.

Table Structure

ID Child Group ...OTHER VARIABLES

S001 001 P

S001 001 R

S002 001 R

S002 001 P

S002 002 P

S002 003 P

S003 001 R

S003 002 R

S003 003 R

S003 001 P

How do you determine if there are duplicates based on the three variables? I have old code which identified based on the first two variables (ID and Child which I modified below

to include the third variable but I'm getting the error message that count has two many variables:

ERROR 72-185: The COUNT function call has too many arguments.

data errors;

set edit;

concatvbl=id||child||group;

keep id child group concatvbl ;

proc sort data=krserrors;

by concatvbl;

data errors;

set errors;

by concatvbl;

if first.concatvbl then duplicate=0; else duplicate=count(concatvbl, id, child, group);

data errors;

set errors;

if duplicate ne 0;

;

proc print;

var id child group duplicate;

title2 'Duplicate Errors';

run;

Jagadishkatam · Posted 09-04-2014 09:27 PM

chris suggested a good approache to delete the duplicate records.

after checking the code mentioned by you, the reason for the error is there are too many arguments in the count function. The purpose of count function is to count the number of similar words in the string, in your case the count function has many arguments, like you are asking the function to search for similar words that are in concatvbl by using id , child, group. However you have to mention only one variable of the three. the correct syntax is count(concatvbl, id)

data edit;

input ID$ Child Group$ ;

cards;

S001 001 P

S001 001 R

S002 001 R

S002 001 P

S002 002 P

S002 003 P

S003 001 R

S003 002 R

S003 003 R

S003 001 P

;

run;

data errors;

set edit;

concatvbl=id||child||group;

keep id child group concatvbl ;

run;

proc sort data=errors;

by concatvbl;

run;

data errors2;

set errors;

by concatvbl;

if first.concatvbl then duplicate=0;

else duplicate=count(concatvbl, id);

run;

data errors3;

set errors2;

if duplicate ne 0;

run;

proc print;

var id child group duplicate;

title2 'Duplicate Errors';

run;

after correction, the code executes well. Please check.

Also another approach to detect the duplicated records on three variables is by using retain statement

data errors2;

set errors;

retain duplicate;

by concatvbl;

if first.concatvbl then duplicate=0;

else duplicate=duplicate+1;

run;

where the duplicate records have incremental values.

Thanks,

Jag

Thanks,
Jag

View solution in original post

ChrisBrooks · Posted 09-04-2014 07:39 PM

There's a much simpler way of doing this with Proc Sort and the dupout option which writes duplicate key values to a new file as below (I've added a record with duplicate keys at observation 2 to show it in action)

data edit;

input id $ child $ group $ other;

datalines;

S001 001 P 1

S001 001 P 2

S001 001 R 2

S002 001 R 3

S002 001 P 4

S002 002 P 5

S002 003 P 6

S003 001 R 7

S003 002 R 8

S003 003 R 9

S003 001 P 0

;

run;

proc sort data=edit out=edit_unique dupout=dups nodupkey;

by id child group;

run;

proc print data=dups;

run;

Chris

jcis7 · Posted 09-09-2014 02:06 PM

Thanks! I need to look at the duplicates before deleting but will keep this in mind for other other projects!

Jagadishkatam · Posted 09-04-2014 09:27 PM

chris suggested a good approache to delete the duplicate records.

after checking the code mentioned by you, the reason for the error is there are too many arguments in the count function. The purpose of count function is to count the number of similar words in the string, in your case the count function has many arguments, like you are asking the function to search for similar words that are in concatvbl by using id , child, group. However you have to mention only one variable of the three. the correct syntax is count(concatvbl, id)

data edit;

input ID$ Child Group$ ;

cards;

S001 001 P

S001 001 R

S002 001 R

S002 001 P

S002 002 P

S002 003 P

S003 001 R

S003 002 R

S003 003 R

S003 001 P

;

run;

data errors;

set edit;

concatvbl=id||child||group;

keep id child group concatvbl ;

run;

proc sort data=errors;

by concatvbl;

run;

data errors2;

set errors;

by concatvbl;

if first.concatvbl then duplicate=0;

else duplicate=count(concatvbl, id);

run;

data errors3;

set errors2;

if duplicate ne 0;

run;

proc print;

var id child group duplicate;

title2 'Duplicate Errors';

run;

after correction, the code executes well. Please check.

Also another approach to detect the duplicated records on three variables is by using retain statement

data errors2;

set errors;

retain duplicate;

by concatvbl;

if first.concatvbl then duplicate=0;

else duplicate=duplicate+1;

run;

where the duplicate records have incremental values.

Thanks,

Jag

Thanks,
Jag

jcis7 · Posted 09-09-2014 02:07 PM

Thanks! This works and I appreciate your sharing with me the 'retain' statement.

jakarman · Posted 09-05-2014 04:35 AM

de-duplication is part of sort functionality Base SAS(R) 9.4 Procedures Guide, Third Edition

---->-- ja karman --<-----

Tom · Posted 09-09-2014 03:29 PM

There is no need to create the combined variable concatvbl to locate duplicates.

When you have multiple BY variables to test for exact duplicates you just need to test the FIRST. and LAST flags for the last of the BY variables.

data have ;

input id $ child $ group $ other ;

cards;

S001 001 P .

S001 001 R .

S002 001 R .

S002 001 P .

S002 002 P .

S002 003 P .

S003 001 R .

S003 001 R 1

S003 002 R .

S003 003 R .

S003 001 P .

run;

proc sort;

by id child group;

run;

data duplicates ;

set have ;

by id child group ;

if not (first.group and last.group);

put (_all_) (:);

run;

Classroom Training Available!