- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
G'day. Any help you can give will be much appreciated:
I have 3 variables (all character) I need to use to determine if there are any duplicates.
Table Structure
ID Child Group ...OTHER VARIABLES
S001 001 P
S001 001 R
S002 001 R
S002 001 P
S002 002 P
S002 003 P
S003 001 R
S003 002 R
S003 003 R
S003 001 P
How do you determine if there are duplicates based on the three variables? I have old code which identified based on the first two variables (ID and Child which I modified below
to include the third variable but I'm getting the error message that count has two many variables:
ERROR 72-185: The COUNT function call has too many arguments.
data errors;
set edit;
concatvbl=id||child||group;
keep id child group concatvbl ;
proc sort data=krserrors;
by concatvbl;
data errors;
set errors;
by concatvbl;
if first.concatvbl then duplicate=0; else duplicate=count(concatvbl, id, child, group);
data errors;
set errors;
if duplicate ne 0;
;
proc print;
var id child group duplicate;
title2 'Duplicate Errors';
run;
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
chris suggested a good approache to delete the duplicate records.
after checking the code mentioned by you, the reason for the error is there are too many arguments in the count function. The purpose of count function is to count the number of similar words in the string, in your case the count function has many arguments, like you are asking the function to search for similar words that are in concatvbl by using id , child, group. However you have to mention only one variable of the three. the correct syntax is count(concatvbl, id)
data edit;
input ID$ Child Group$ ;
cards;
S001 001 P
S001 001 R
S002 001 R
S002 001 P
S002 002 P
S002 003 P
S003 001 R
S003 002 R
S003 003 R
S003 001 P
;
run;
data errors;
set edit;
concatvbl=id||child||group;
keep id child group concatvbl ;
run;
proc sort data=errors;
by concatvbl;
run;
data errors2;
set errors;
by concatvbl;
if first.concatvbl then duplicate=0;
else duplicate=count(concatvbl, id);
run;
data errors3;
set errors2;
if duplicate ne 0;
run;
proc print;
var id child group duplicate;
title2 'Duplicate Errors';
run;
after correction, the code executes well. Please check.
Also another approach to detect the duplicated records on three variables is by using retain statement
data errors2;
set errors;
retain duplicate;
by concatvbl;
if first.concatvbl then duplicate=0;
else duplicate=duplicate+1;
run;
where the duplicate records have incremental values.
Thanks,
Jag
Jag
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
There's a much simpler way of doing this with Proc Sort and the dupout option which writes duplicate key values to a new file as below (I've added a record with duplicate keys at observation 2 to show it in action)
data edit;
input id $ child $ group $ other;
datalines;
S001 001 P 1
S001 001 P 2
S001 001 R 2
S002 001 R 3
S002 001 P 4
S002 002 P 5
S002 003 P 6
S003 001 R 7
S003 002 R 8
S003 003 R 9
S003 001 P 0
;
run;
proc sort data=edit out=edit_unique dupout=dups nodupkey;
by id child group;
run;
proc print data=dups;
run;
Chris
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks! I need to look at the duplicates before deleting but will keep this in mind for other other projects!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
chris suggested a good approache to delete the duplicate records.
after checking the code mentioned by you, the reason for the error is there are too many arguments in the count function. The purpose of count function is to count the number of similar words in the string, in your case the count function has many arguments, like you are asking the function to search for similar words that are in concatvbl by using id , child, group. However you have to mention only one variable of the three. the correct syntax is count(concatvbl, id)
data edit;
input ID$ Child Group$ ;
cards;
S001 001 P
S001 001 R
S002 001 R
S002 001 P
S002 002 P
S002 003 P
S003 001 R
S003 002 R
S003 003 R
S003 001 P
;
run;
data errors;
set edit;
concatvbl=id||child||group;
keep id child group concatvbl ;
run;
proc sort data=errors;
by concatvbl;
run;
data errors2;
set errors;
by concatvbl;
if first.concatvbl then duplicate=0;
else duplicate=count(concatvbl, id);
run;
data errors3;
set errors2;
if duplicate ne 0;
run;
proc print;
var id child group duplicate;
title2 'Duplicate Errors';
run;
after correction, the code executes well. Please check.
Also another approach to detect the duplicated records on three variables is by using retain statement
data errors2;
set errors;
retain duplicate;
by concatvbl;
if first.concatvbl then duplicate=0;
else duplicate=duplicate+1;
run;
where the duplicate records have incremental values.
Thanks,
Jag
Jag
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks! This works and I appreciate your sharing with me the 'retain' statement.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
de-duplication is part of sort functionality Base SAS(R) 9.4 Procedures Guide, Third Edition
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
There is no need to create the combined variable concatvbl to locate duplicates.
When you have multiple BY variables to test for exact duplicates you just need to test the FIRST. and LAST flags for the last of the BY variables.
data have ;
input id $ child $ group $ other ;
cards;
S001 001 P .
S001 001 R .
S002 001 R .
S002 001 P .
S002 002 P .
S002 003 P .
S003 001 R .
S003 001 R 1
S003 002 R .
S003 003 R .
S003 001 P .
run;
proc sort;
by id child group;
run;
data duplicates ;
set have ;
by id child group ;
if not (first.group and last.group);
put (_all_) (:);
run;