G'day. Any help you can give will be much appreciated:
I have 3 variables (all character) I need to use to determine if there are any duplicates.
Table Structure
ID Child Group ...OTHER VARIABLES
S001 001 P
S001 001 R
S002 001 R
S002 001 P
S002 002 P
S002 003 P
S003 001 R
S003 002 R
S003 003 R
S003 001 P
How do you determine if there are duplicates based on the three variables? I have old code which identified based on the first two variables (ID and Child which I modified below
to include the third variable but I'm getting the error message that count has two many variables:
ERROR 72-185: The COUNT function call has too many arguments.
data errors;
set edit;
concatvbl=id||child||group;
keep id child group concatvbl ;
proc sort data=krserrors;
by concatvbl;
data errors;
set errors;
by concatvbl;
if first.concatvbl then duplicate=0; else duplicate=count(concatvbl, id, child, group);
data errors;
set errors;
if duplicate ne 0;
;
proc print;
var id child group duplicate;
title2 'Duplicate Errors';
run;
chris suggested a good approache to delete the duplicate records.
after checking the code mentioned by you, the reason for the error is there are too many arguments in the count function. The purpose of count function is to count the number of similar words in the string, in your case the count function has many arguments, like you are asking the function to search for similar words that are in concatvbl by using id , child, group. However you have to mention only one variable of the three. the correct syntax is count(concatvbl, id)
data edit;
input ID$ Child Group$ ;
cards;
S001 001 P
S001 001 R
S002 001 R
S002 001 P
S002 002 P
S002 003 P
S003 001 R
S003 002 R
S003 003 R
S003 001 P
;
run;
data errors;
set edit;
concatvbl=id||child||group;
keep id child group concatvbl ;
run;
proc sort data=errors;
by concatvbl;
run;
data errors2;
set errors;
by concatvbl;
if first.concatvbl then duplicate=0;
else duplicate=count(concatvbl, id);
run;
data errors3;
set errors2;
if duplicate ne 0;
run;
proc print;
var id child group duplicate;
title2 'Duplicate Errors';
run;
after correction, the code executes well. Please check.
Also another approach to detect the duplicated records on three variables is by using retain statement
data errors2;
set errors;
retain duplicate;
by concatvbl;
if first.concatvbl then duplicate=0;
else duplicate=duplicate+1;
run;
where the duplicate records have incremental values.
Thanks,
Jag
There's a much simpler way of doing this with Proc Sort and the dupout option which writes duplicate key values to a new file as below (I've added a record with duplicate keys at observation 2 to show it in action)
data edit;
input id $ child $ group $ other;
datalines;
S001 001 P 1
S001 001 P 2
S001 001 R 2
S002 001 R 3
S002 001 P 4
S002 002 P 5
S002 003 P 6
S003 001 R 7
S003 002 R 8
S003 003 R 9
S003 001 P 0
;
run;
proc sort data=edit out=edit_unique dupout=dups nodupkey;
by id child group;
run;
proc print data=dups;
run;
Chris
Thanks! I need to look at the duplicates before deleting but will keep this in mind for other other projects!
chris suggested a good approache to delete the duplicate records.
after checking the code mentioned by you, the reason for the error is there are too many arguments in the count function. The purpose of count function is to count the number of similar words in the string, in your case the count function has many arguments, like you are asking the function to search for similar words that are in concatvbl by using id , child, group. However you have to mention only one variable of the three. the correct syntax is count(concatvbl, id)
data edit;
input ID$ Child Group$ ;
cards;
S001 001 P
S001 001 R
S002 001 R
S002 001 P
S002 002 P
S002 003 P
S003 001 R
S003 002 R
S003 003 R
S003 001 P
;
run;
data errors;
set edit;
concatvbl=id||child||group;
keep id child group concatvbl ;
run;
proc sort data=errors;
by concatvbl;
run;
data errors2;
set errors;
by concatvbl;
if first.concatvbl then duplicate=0;
else duplicate=count(concatvbl, id);
run;
data errors3;
set errors2;
if duplicate ne 0;
run;
proc print;
var id child group duplicate;
title2 'Duplicate Errors';
run;
after correction, the code executes well. Please check.
Also another approach to detect the duplicated records on three variables is by using retain statement
data errors2;
set errors;
retain duplicate;
by concatvbl;
if first.concatvbl then duplicate=0;
else duplicate=duplicate+1;
run;
where the duplicate records have incremental values.
Thanks,
Jag
Thanks! This works and I appreciate your sharing with me the 'retain' statement.
de-duplication is part of sort functionality Base SAS(R) 9.4 Procedures Guide, Third Edition
There is no need to create the combined variable concatvbl to locate duplicates.
When you have multiple BY variables to test for exact duplicates you just need to test the FIRST. and LAST flags for the last of the BY variables.
data have ;
input id $ child $ group $ other ;
cards;
S001 001 P .
S001 001 R .
S002 001 R .
S002 001 P .
S002 002 P .
S002 003 P .
S003 001 R .
S003 001 R 1
S003 002 R .
S003 003 R .
S003 001 P .
run;
proc sort;
by id child group;
run;
data duplicates ;
set have ;
by id child group ;
if not (first.group and last.group);
put (_all_) (:);
run;
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.