BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
jcis7
Pyrite | Level 9

G'day.  Any help you can give will be much appreciated:

I have 3 variables (all character) I need to use to determine if there are any duplicates.

Table Structure

ID             Child   Group     ...OTHER VARIABLES

S001        001      P

S001        001      R

S002        001      R

S002        001      P

S002        002      P

S002       003       P

S003       001      R

S003      002       R

S003      003       R

S003      001      P

How do you determine if there are duplicates based on the three variables?  I have old code which identified based on the first two variables (ID and Child which I modified below

to include the third variable but I'm getting the error message that count has two many variables:

ERROR 72-185: The COUNT function call has too many arguments.

data errors;


set edit;


concatvbl=id||child||group;

keep id child group concatvbl ;

proc sort data=krserrors;


by concatvbl;


data errors;


set errors;


by concatvbl;


if first.concatvbl then duplicate=0; else duplicate=count(concatvbl, id, child, group);


data errors;


set errors;


if duplicate ne 0;


;



proc print;


var id child group duplicate;


title2 'Duplicate Errors';


run;

1 ACCEPTED SOLUTION

Accepted Solutions
Jagadishkatam
Amethyst | Level 16

chris suggested a good approache to delete the duplicate records.

after checking the code mentioned by you, the reason for the error is there are too many arguments in the count function. The purpose of count function is to count the number of similar words in the string, in your case the count function has many arguments, like you are asking the function to search for similar words that are in concatvbl by using id , child, group. However you have to mention only one variable of the three. the correct syntax is count(concatvbl, id)

data edit;

input ID$             Child   Group$     ;

cards;

S001        001      P

S001        001      R

S002        001      R

S002        001      P

S002        002      P

S002       003       P

S003       001      R

S003      002       R

S003      003       R

S003      001      P

;

run;

data errors;

set edit;

concatvbl=id||child||group;

keep id child group concatvbl ;

run;

proc sort data=errors;

by concatvbl;

run;

data errors2;

set errors;

by concatvbl;

if first.concatvbl then duplicate=0;

else duplicate=count(concatvbl, id);

run;

data errors3;

set errors2;

if duplicate ne 0;

run;

proc print;

var id child group duplicate;

title2 'Duplicate Errors';

run;

after correction, the code executes well. Please check.

Also another approach to detect the duplicated records on three variables is by using retain statement

data errors2;

set errors;

retain duplicate;

by concatvbl;

if first.concatvbl then duplicate=0;

else duplicate=duplicate+1;

run;

where the duplicate records have incremental values.

Thanks,

Jag

Thanks,
Jag

View solution in original post

6 REPLIES 6
ChrisBrooks
Ammonite | Level 13

There's a much simpler way of doing this with Proc Sort and the dupout option which writes duplicate key values to a new file as below (I've added a record with duplicate keys at observation 2 to show it in action)

data edit;

  input id $ child $ group $ other;

  datalines;

S001 001 P 1

S001 001 P 2

S001 001 R 2

S002 001 R 3

S002 001 P 4

S002 002 P 5

S002 003 P 6

S003 001 R 7

S003 002 R 8

S003 003 R 9

S003 001 P 0

;

run;

proc sort data=edit out=edit_unique dupout=dups nodupkey;

by id child group;

run;

proc print data=dups;

run;

Chris

jcis7
Pyrite | Level 9

Thanks! I need to look at the duplicates before deleting but will keep this in mind for other other projects!

Jagadishkatam
Amethyst | Level 16

chris suggested a good approache to delete the duplicate records.

after checking the code mentioned by you, the reason for the error is there are too many arguments in the count function. The purpose of count function is to count the number of similar words in the string, in your case the count function has many arguments, like you are asking the function to search for similar words that are in concatvbl by using id , child, group. However you have to mention only one variable of the three. the correct syntax is count(concatvbl, id)

data edit;

input ID$             Child   Group$     ;

cards;

S001        001      P

S001        001      R

S002        001      R

S002        001      P

S002        002      P

S002       003       P

S003       001      R

S003      002       R

S003      003       R

S003      001      P

;

run;

data errors;

set edit;

concatvbl=id||child||group;

keep id child group concatvbl ;

run;

proc sort data=errors;

by concatvbl;

run;

data errors2;

set errors;

by concatvbl;

if first.concatvbl then duplicate=0;

else duplicate=count(concatvbl, id);

run;

data errors3;

set errors2;

if duplicate ne 0;

run;

proc print;

var id child group duplicate;

title2 'Duplicate Errors';

run;

after correction, the code executes well. Please check.

Also another approach to detect the duplicated records on three variables is by using retain statement

data errors2;

set errors;

retain duplicate;

by concatvbl;

if first.concatvbl then duplicate=0;

else duplicate=duplicate+1;

run;

where the duplicate records have incremental values.

Thanks,

Jag

Thanks,
Jag
jcis7
Pyrite | Level 9

Thanks!  This works and I appreciate your sharing with me the 'retain' statement.

jakarman
Barite | Level 11

de-duplication is part of sort functionality Base SAS(R) 9.4 Procedures Guide, Third Edition

---->-- ja karman --<-----
Tom
Super User Tom
Super User

There is no need to create the combined variable concatvbl to locate duplicates.

When you have multiple BY variables to test for exact duplicates you just need to test the FIRST. and LAST flags for the last of the BY variables.

data have ;

input id $ child $ group $ other ;

cards;

S001 001 P .

S001 001 R .

S002 001 R .

S002 001 P .

S002 002 P .

S002 003 P .

S003 001 R .

S003 001 R 1

S003 002 R .

S003 003 R .

S003 001 P .

run;

proc sort;

  by id child group;

run;

data duplicates ;

  set have ;

  by id child group ;

  if not (first.group and last.group);

  put (_all_) (:);

run;

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 11120 views
  • 3 likes
  • 5 in conversation