Hello,
I'm hoping someone can point me in the right direction.
I have a dataset containing up to 10 test types (in no particular order) and the corresponding test dates and test results. Unfortunately, something went wrong when our participants were uploading the data, and many tests were uploaded multiple times. I need to identify the records with duplicate entries (where the same test/date/result was entered more than once), and it would be good if I could identify how many duplicates (a count) for each record.
As an example, I have:
patID | testID_1 | date_1 | result_1 | testID_2 | date_2 | result_2 | testID_3 | date_3 | result_3 | testID_4 | date_4 | result_4 |
1243 | 1 | 01/01/20 | P | 2 | 01/12/20 | N | 1 | 01/01/20 | P | 1 | 01/01/20 | P |
6495 | 4 | 05/23/20 | N | 1 | 03/04/20 | P | 1 | 03/13/20 | N | 6 | 10/12/20 | P |
4712 | 3 | 07/23/20 | P | 3 | 07/23/20 | P | 1 | 06/15/20 | P | 6 | 05/17/20 | N |
3021 | 2 | 06/09/20 | N | 1 | 06/12/20 | N | 4 | 12/01/20 | P | 1 | 06/12/20 | N |
8537 | 5 | 04/01/20 | P | 5 | 04/01/20 | P | 1 | 07/18/20 | N | 3 | 06/05/20 | P |
I want:
patID | testID_1 | date_1 | result_1 | testID_2 | date_2 | result_2 | testID_3 | date_3 | result_3 | testID_4 | date_4 | result_4 | dup_cnt |
1243 | 1 | 01/01/20 | P | 2 | 01/12/20 | N | 1 | 01/01/20 | P | 1 | 01/01/20 | P | 3 |
6495 | 4 | 05/23/20 | N | 1 | 03/04/20 | P | 1 | 03/13/20 | N | 6 | 10/12/20 | P | 0 |
4712 | 3 | 07/23/20 | P | 3 | 07/23/20 | P | 1 | 06/15/20 | P | 6 | 05/17/20 | N | 2 |
3021 | 2 | 06/09/20 | N | 1 | 06/12/20 | N | 4 | 12/01/20 | P | 1 | 06/12/20 | N | 2 |
8537 | 5 | 04/01/20 | P | 5 | 04/01/20 | P | 1 | 07/18/20 | N | 3 | 06/05/20 | P | 2 |
I've created concatenated strings for each test (e.g., 1_01/01/20_P or 1_01/12/20_N) for comparison; however, that's still 10 variables that have to be compared in combination. I started to try proc compare, but I would have to list out 45 combinations (I think) and I'm still not sure how to get the count for each record.
Are there better, more efficient approaches? There are lots of posts on finding duplicates across rows, but I'm struggling to find anything on identifying duplicates across columns within a row.
Can anyone point me in the right direction?
Thank you,
-EM
Code for example "have" table above if anyone wants it.
data test;
input patID testID_1 date_1 :$10. result_1 $
      testID_2 date_2 :$10. result_2 $
      testID_3 date_3 :$10. result_3 $
      testID_4 date_4 :$10. result_4 $;
/* :$10. keeps the full 10-character dates from being truncated to the default length of 8 */
datalines;
1243 1 01/01/2022 P 2 01/12/2022 N 1 01/01/2022 P 1 01/01/2022 P
6495 4 05/23/2022 N 1 03/04/2022 P 1 03/13/2022 N 6 10/12/2022 P
4712 3 07/23/2022 P 3 07/23/2022 P 1 06/15/2022 P 6 05/17/2022 N
3021 2 06/09/2022 N 1 06/12/2022 N 4 12/01/2022 P 1 06/12/2022 N
8537 5 04/01/2022 P 5 04/01/2022 P 1 07/18/2022 N 3 06/05/2022 P
;
run;
data test_comp; set test;
test_str_1=catx('_', OF testID_1--result_1);
test_str_2=catx('_', OF testID_2--result_2);
test_str_3=catx('_', OF testID_3--result_3);
test_str_4=catx('_', OF testID_4--result_4);
run;
Thank you for your response. I posted this over 3 weeks ago, so I have already found a solution (provided below).
data test;
input patID testID_1 date_1 $ result_1 $
testID_2 date_2 $ result_2 $
testID_3 date_3 $ result_3 $
testID_4 date_4 $ result_4 $;
datalines;
1243 1 01/01/2022 P 2 01/12/2022 N 1 01/01/2022 P 1 01/01/2022 P
6495 4 05/23/2022 N 1 03/04/2022 P 1 03/13/2022 N 6 10/12/2022 P
4712 3 07/23/2022 P 3 07/23/2022 P 1 06/15/2022 P 6 05/17/2022 N
3021 2 06/09/2022 N 1 06/12/2022 N 4 12/01/2022 P 1 06/12/2022 N
8537 5 04/01/2022 P 5 04/01/2022 P 1 07/18/2022 N 3 06/05/2022 P
;
run;
data test;
set test;
test_str_1=catx('_', OF testID_1--result_1);
test_str_2=catx('_', OF testID_2--result_2);
test_str_3=catx('_', OF testID_3--result_3);
test_str_4=catx('_', OF testID_4--result_4);
n+1;
run;
/*Restructure Data*/
DATA redo; SET test;
ARRAY test_str{*} test_str: ;
DO i=1 TO dim(test_str);
compare=test_str{i};
OUTPUT;
END;
KEEP patID compare n; /* keep n for the ORDER BY in the next step */
RUN;
/*Identify and Count Duplicates*/
PROC SQL;
DELETE FROM redo
WHERE compare = '';
CREATE TABLE dups (drop = n)
AS SELECT *, count(*) AS count
FROM redo
GROUP BY patID, compare
ORDER BY n;
QUIT;
PROC SORT DATA=dups OUT=dropdup NODUPKEY ;
WHERE count > 1;
BY patID compare count;
RUN;
PROC TRANSPOSE DATA=dropdup OUT=final (drop=_name_) ;
VAR compare;
BY patID count;
RUN;
I think you are almost there with the second DATA step; you just need to sort and count:
data want;
set test;
test_str_1=catx('_', OF testID_1--result_1);
test_str_2=catx('_', OF testID_2--result_2);
test_str_3=catx('_', OF testID_3--result_3);
test_str_4=catx('_', OF testID_4--result_4);
array tests(*) test_str:;
call sortc(of tests(*));
dup_cnt=0;
do _N_=1 to dim(tests)-1;
dup_cnt+tests(_N_)=tests(_N_+1);
end;
/* count the first element of a duplicate run: add 1 if any duplicates were found */
dup_cnt+dup_cnt>0;
drop test_str:;
run;
On a second note, there may be a problem in my code if there are two pairs of duplicates in a record: DUP_CNT will be set to 3, and my guess is that you want 4. The easy solution (if that's acceptable) would be to drop the final statement that increments DUP_CNT when it is greater than zero, and just say that the first record in a set of duplicates does not count as a duplicate.
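In case it helps anyone later, here is an untested sketch of the other variant: counting every member of each run of duplicates (so two pairs give 4, and three copies of one value give 3, matching the "want" table above). It reuses the same sorted-array idea from my step above.

```sas
/* Untested sketch: after sorting, count each element that equals a
   neighbor, so every member of a run of duplicates is counted
   (two pairs -> 4, three copies -> 3). */
data want2;
   set test;
   test_str_1=catx('_', of testID_1--result_1);
   test_str_2=catx('_', of testID_2--result_2);
   test_str_3=catx('_', of testID_3--result_3);
   test_str_4=catx('_', of testID_4--result_4);
   array tests(*) test_str:;
   call sortc(of tests(*));
   dup_cnt=0;
   do i=1 to dim(tests);
      if tests(i)=' ' then continue;        /* skip empty test slots */
      if i>1 and tests(i)=tests(i-1) then dup_cnt+1;
      else if i<dim(tests) and tests(i)=tests(i+1) then dup_cnt+1;
   end;
   drop i test_str:;
run;
```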
Almost every data set that has things "wide" like this, with date_1-date_n, testId_1-testid_n, result_1-result_n is poorly designed for use with SAS.
A structure with PatId, Test, Date and Result as the variables in a "long" format would be easier. And if part of this is to reduce duplicate entries then a SORT with NODUPKEY would get rid of the duplicates.
data test;
   input patID testID_1 date_1 :mmddyy10. result_1 $
         testID_2 date_2 :mmddyy10. result_2 $
         testID_3 date_3 :mmddyy10. result_3 $
         testID_4 date_4 :mmddyy10. result_4 $;
   format date: mmddyy10.;
datalines;
1243 1 01/01/2022 P 2 01/12/2022 N 1 01/01/2022 P 1 01/01/2022 P
6495 4 05/23/2022 N 1 03/04/2022 P 1 03/13/2022 N 6 10/12/2022 P
4712 3 07/23/2022 P 3 07/23/2022 P 1 06/15/2022 P 6 05/17/2022 N
3021 2 06/09/2022 N 1 06/12/2022 N 4 12/01/2022 P 1 06/12/2022 N
8537 5 04/01/2022 P 5 04/01/2022 P 1 07/18/2022 N 3 06/05/2022 P
;
run;

data long;
   set test;
   array t(*) testid: ;
   array d(*) date_: ;
   array r(*) result_: ;
   do i=1 to dim(t);
      test=t[i];
      date=d[i];
      result=r[i];
      output;
   end;
   keep patid test date result;
   format date mmddyy10.;
run;

/* one way to count */
proc summary data=long nway;
   class patid test date result;
   output out=testcount (drop=_type_);
run;

/* get the unique results */
proc sort data=long nodupkey dupout=theduplicates;
   by patid test date result;
run;
Note: I used an actual date informat and format, as storing dates as anything other than true SAS date values is suboptimal for almost any purpose.
The approach for creating the LONG data set doesn't need any changes even if you have 50 sets of values on one observation, as long as the variable names are consistent with the name-prefix variable list: testid: refers to ALL variables whose names start with testid.
I recognize that this code has already been accepted as a solution. But it is a LOT of steps for what could be a very short single data step.
Simply make an array of test strings as has been done in the solution, but add two elements to the array, a blank test string, and a test string of hexadecimal FF's. So the test string array is two elements larger than the number of tests.
Sort the array (so the blank(s) will sort low, and the 'FF'x will sort high).
Loop through the sorted array, counting only the elements that are not blank and not unique (i.e., equal to at least one of their neighbors):
data test;
input patID testID_1 date_1 :$10. result_1 $
      testID_2 date_2 :$10. result_2 $
      testID_3 date_3 :$10. result_3 $
      testID_4 date_4 :$10. result_4 $; /* :$10. avoids truncating the 10-character dates */
datalines;
1243 1 01/01/2022 P 2 01/12/2022 N 1 01/01/2022 P 1 01/01/2022 P
6495 4 05/23/2022 N 1 03/04/2022 P 1 03/13/2022 N 6 10/12/2022 P
4712 3 07/23/2022 P 3 07/23/2022 P 1 06/15/2022 P 6 05/17/2022 N
3021 2 06/09/2022 N 1 06/12/2022 N 4 12/01/2022 P 1 06/12/2022 N
8537 5 04/01/2022 P 5 04/01/2022 P 1 07/18/2022 N 3 06/05/2022 P
;
run;
data want (drop=i _:);
set test;
array _strng {0:5} $30 ; /*Lower bound is 0, upper bound is 1 more than N of tests*/
array tid {*} testid_: ;
array dat {*} date_: ;
array res {*} result_: ;
/*element _strng{0} will be left blank, and element {5} will be 'FF'x */
_strng{5}= repeat('FF'x,29); /*Assign highest sort value */
do i=1 to 4;
_strng{i}=catx('!',tid{i},dat{i},res{i});
end;
call sortc(of _strng{*});
ndupes=0;
do i=1 to 4;
if _strng{i}=' ' then continue; /*Ignore blanks*/
else if _strng{i-1}<_strng{i}<_strng{i+1} then continue; /*Ignore unique values */
else ndupes=ndupes+1;
end;
run;