Hello,
I'm hoping someone can point me in the right direction.
I have a dataset containing up to 10 test types (in no particular order) along with the corresponding test dates and test results. Unfortunately, something went wrong when our participants were uploading the data, and many tests were uploaded multiple times. I need to identify the records with duplicate entries (where the same test/date/result combination was entered more than once), and ideally get a count of the duplicates for each record.
As an example, I have:
patID | testID_1 | date_1 | result_1 | testID_2 | date_2 | result_2 | testID_3 | date_3 | result_3 | testID_4 | date_4 | result_4 |
1243 | 1 | 01/01/20 | P | 2 | 01/12/20 | N | 1 | 01/01/20 | P | 1 | 01/01/20 | P |
6495 | 4 | 05/23/20 | N | 1 | 03/04/20 | P | 1 | 03/13/20 | N | 6 | 10/12/20 | P |
4712 | 3 | 07/23/20 | P | 3 | 07/23/20 | P | 1 | 06/15/20 | P | 6 | 05/17/20 | N |
3021 | 2 | 06/09/20 | N | 1 | 06/12/20 | N | 4 | 12/01/20 | P | 1 | 06/12/20 | N |
8537 | 5 | 04/01/20 | P | 5 | 04/01/20 | P | 1 | 07/18/20 | N | 3 | 06/05/20 | P |
I want:
patID | testID_1 | date_1 | result_1 | testID_2 | date_2 | result_2 | testID_3 | date_3 | result_3 | testID_4 | date_4 | result_4 | dup_cnt |
1243 | 1 | 01/01/20 | P | 2 | 01/12/20 | N | 1 | 01/01/20 | P | 1 | 01/01/20 | P | 3 |
6495 | 4 | 05/23/20 | N | 1 | 03/04/20 | P | 1 | 03/13/20 | N | 6 | 10/12/20 | P | 0 |
4712 | 3 | 07/23/20 | P | 3 | 07/23/20 | P | 1 | 06/15/20 | P | 6 | 05/17/20 | N | 2 |
3021 | 2 | 06/09/20 | N | 1 | 06/12/20 | N | 4 | 12/01/20 | P | 1 | 06/12/20 | N | 2 |
8537 | 5 | 04/01/20 | P | 5 | 04/01/20 | P | 1 | 07/18/20 | N | 3 | 06/05/20 | P | 2 |
I've created concatenated strings for each test (e.g., 1_01/01/20_P or 1_01/12/20_N) for comparison; however, that's still 10 variables that have to be compared in combination. I started to try PROC COMPARE, but I would have to list out 45 combinations (I think), and I'm still not sure how to get the count for each record.
Are there better, more efficient approaches? There are lots of posts about finding duplicates across rows, but I'm struggling to find information on identifying duplicates across columns within a single row.
Can anyone point me in the right direction?
Thank you,
-EM
Code for the example "have" table above, if anyone wants it.
data test;
input patID testID_1 date_1 $ result_1 $
testID_2 date_2 $ result_2 $
testID_3 date_3 $ result_3 $
testID_4 date_4 $ result_4 $;
datalines;
1243 1 01/01/2022 P 2 01/12/2022 N 1 01/01/2022 P 1 01/01/2022 P
6495 4 05/23/2022 N 1 03/04/2022 P 1 03/13/2022 N 6 10/12/2022 P
4712 3 07/23/2022 P 3 07/23/2022 P 1 06/15/2022 P 6 05/17/2022 N
3021 2 06/09/2022 N 1 06/12/2022 N 4 12/01/2022 P 1 06/12/2022 N
8537 5 04/01/2022 P 5 04/01/2022 P 1 07/18/2022 N 3 06/05/2022 P
;
run;
data test_comp; set test;
test_str_1=catx('_', OF testID_1--result_1);
test_str_2=catx('_', OF testID_2--result_2);
test_str_3=catx('_', OF testID_3--result_3);
test_str_4=catx('_', OF testID_4--result_4);
run;
Accepted Solutions
Thank you for your response. I posted this over 3 weeks ago, so I have already found a solution (provided below).
data test;
input patID testID_1 date_1 $ result_1 $
testID_2 date_2 $ result_2 $
testID_3 date_3 $ result_3 $
testID_4 date_4 $ result_4 $;
datalines;
1243 1 01/01/2022 P 2 01/12/2022 N 1 01/01/2022 P 1 01/01/2022 P
6495 4 05/23/2022 N 1 03/04/2022 P 1 03/13/2022 N 6 10/12/2022 P
4712 3 07/23/2022 P 3 07/23/2022 P 1 06/15/2022 P 6 05/17/2022 N
3021 2 06/09/2022 N 1 06/12/2022 N 4 12/01/2022 P 1 06/12/2022 N
8537 5 04/01/2022 P 5 04/01/2022 P 1 07/18/2022 N 3 06/05/2022 P
;
run;
data test;
set test;
test_str_1=catx('_', OF testID_1--result_1);
test_str_2=catx('_', OF testID_2--result_2);
test_str_3=catx('_', OF testID_3--result_3);
test_str_4=catx('_', OF testID_4--result_4);
n+1;
run;
/*Restructure Data*/
DATA redo; SET test;
ARRAY test_str{*} test_str: ;
DO i=1 TO dim(test_str);
compare=test_str{i};
OUTPUT;
END;
KEEP patID compare n; /* keep n so the SQL step can ORDER BY it */
RUN;
/*Identify and Count Duplicates*/
PROC SQL;
DELETE FROM redo
WHERE compare = '';
CREATE TABLE dups (drop = n)
AS SELECT *, count(*) AS count
FROM redo
GROUP BY patID, compare
ORDER BY n;
QUIT;
PROC SORT DATA=dups OUT=dropdup NODUPKEY ;
WHERE count > 1;
BY patID compare count;
RUN;
PROC TRANSPOSE DATA=dropdup OUT=final (drop=_name_) ;
VAR compare;
BY patID count;
RUN;
I think you are almost there with the second data step; you just need to sort and count:
data want;
set test;
test_str_1=catx('_', OF testID_1--result_1);
test_str_2=catx('_', OF testID_2--result_2);
test_str_3=catx('_', OF testID_3--result_3);
test_str_4=catx('_', OF testID_4--result_4);
array tests(*) test_str:;
call sortc(of tests(*));
dup_cnt=0;
do _N_=1 to dim(tests)-1;
dup_cnt+tests(_N_)=tests(_N_+1);
end;
/* add 1 if dup_cnt is positive, so the first copy counts as a duplicate too */
dup_cnt+dup_cnt>0;
drop test_str:;
run;
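For anyone following along in another language, the sort-then-compare-adjacent logic in that data step can be sketched in Python (a rough analogy, not part of the SAS answer):

```python
def count_dup_entries(tests):
    """Count how many entries in a row belong to a duplicated
    test/date/result combination, counting every copy (a value
    entered three times contributes 3 to the total)."""
    # Sort so identical entries become adjacent, as CALL SORTC does.
    s = sorted(tests)
    dup_cnt = 0
    for i in range(len(s) - 1):
        # Each match with the next element counts one extra copy.
        dup_cnt += s[i] == s[i + 1]
    # Add 1 for the first copy if any duplicates were found at all.
    dup_cnt += dup_cnt > 0
    return dup_cnt

row_1243 = ["1_01/01/20_P", "2_01/12/20_N", "1_01/01/20_P", "1_01/01/20_P"]
print(count_dup_entries(row_1243))  # 3: the testID 1 entry appears three times
```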
On a second note, there may be a problem in my code if there are two pairs of duplicates in a record. In that case, DUP_CNT will be set to 3, and my guess is that you want 4. The easy solution (if that's acceptable) would be to drop the final statement that increases DUP_CNT if it is >1, and just say that the first record in a set of duplicates does not count as a duplicate.
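To make the two conventions concrete, here is a small Python sketch (hypothetical helper names, not SAS) contrasting them on a row containing two pairs of duplicates:

```python
from collections import Counter

def dup_cnt_all_copies(tests):
    """Count every entry that belongs to a duplicated combination
    (two pairs of duplicates -> 4)."""
    counts = Counter(tests)
    return sum(n for n in counts.values() if n > 1)

def dup_cnt_extra_copies(tests):
    """Count only the redundant copies beyond the first of each
    combination (two pairs of duplicates -> 2)."""
    counts = Counter(tests)
    return sum(n - 1 for n in counts.values() if n > 1)

two_pairs = ["A", "A", "B", "B"]
print(dup_cnt_all_copies(two_pairs))    # 4
print(dup_cnt_extra_copies(two_pairs))  # 2
```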
Almost every data set that has things "wide" like this, with date_1-date_n, testId_1-testid_n, result_1-result_n is poorly designed for use with SAS.
A structure with PatId, Test, Date and Result as the variables in a "long" format would be easier. And if part of this is to reduce duplicate entries then a SORT with NODUPKEY would get rid of the duplicates.
data test;
input patID testID_1 date_1 :mmddyy10. result_1 $
      testID_2 date_2 :mmddyy10. result_2 $
      testID_3 date_3 :mmddyy10. result_3 $
      testID_4 date_4 :mmddyy10. result_4 $;
format date: mmddyy10.;
datalines;
1243 1 01/01/2022 P 2 01/12/2022 N 1 01/01/2022 P 1 01/01/2022 P
6495 4 05/23/2022 N 1 03/04/2022 P 1 03/13/2022 N 6 10/12/2022 P
4712 3 07/23/2022 P 3 07/23/2022 P 1 06/15/2022 P 6 05/17/2022 N
3021 2 06/09/2022 N 1 06/12/2022 N 4 12/01/2022 P 1 06/12/2022 N
8537 5 04/01/2022 P 5 04/01/2022 P 1 07/18/2022 N 3 06/05/2022 P
;
run;
data long;
set test;
array t(*) testid: ;
array d(*) date_: ;
array r(*) result_: ;
do i=1 to dim(t);
test=t[i];
date=d[i];
result=r[i];
output;
end;
keep patid test date result;
format date mmddyy10.;
run;
/* one way to count */
proc summary data=long nway;
class patid test date result;
output out=testcount (drop=_type_);
run;
/* get the unique results */
proc sort data=long nodupkey dupout=theduplicates;
by patid test date result;
run;
Note: I used an actual date informat and format, since storing dates as anything other than true SAS date values is suboptimal for almost any purpose.
The approach for creating the LONG data set doesn't even need any changes if you have 50 sets of values on one observation, as long as the variable names are consistent with the list-building colon (:). So testid: uses ALL variables whose names start with testid.
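For readers more comfortable outside SAS, the wide-to-long restructuring and per-key counting can be sketched in Python (a rough analogy with made-up data structures, not the SAS code itself):

```python
from collections import Counter

# One wide record: patID plus four (test, date, result) triples, as in the HAVE data.
wide = {"patID": 1243,
        "tests": [(1, "01/01/2022", "P"), (2, "01/12/2022", "N"),
                  (1, "01/01/2022", "P"), (1, "01/01/2022", "P")]}

# Restructure to long: one (patID, test, date, result) row per triple.
long_rows = [(wide["patID"],) + t for t in wide["tests"]]

# Count occurrences per key -- roughly what PROC SUMMARY does with
# CLASS patid test date result.
counts = Counter(long_rows)
for key, n in counts.items():
    print(key, n)

# Keeping one row per key is roughly PROC SORT with NODUPKEY.
unique_rows = list(counts)
```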
I recognize that this code has already been accepted as a solution. But it is a LOT of steps for what could be a very short single data step.
Simply make an array of test strings as has been done in the solution, but add two elements to the array: a blank test string and a test string of hexadecimal FF's. So the test string array is two elements larger than the number of tests.
Sort the array (so the blank(s) will sort low, and the 'FF'x will sort high).
Loop through the sorted array, counting only the elements that are not blank and are not unique (i.e., that are equal to at least one of their neighbors):
data test;
input patID testID_1 date_1 $ result_1 $
testID_2 date_2 $ result_2 $
testID_3 date_3 $ result_3 $
testID_4 date_4 $ result_4 $;
datalines;
1243 1 01/01/2022 P 2 01/12/2022 N 1 01/01/2022 P 1 01/01/2022 P
6495 4 05/23/2022 N 1 03/04/2022 P 1 03/13/2022 N 6 10/12/2022 P
4712 3 07/23/2022 P 3 07/23/2022 P 1 06/15/2022 P 6 05/17/2022 N
3021 2 06/09/2022 N 1 06/12/2022 N 4 12/01/2022 P 1 06/12/2022 N
8537 5 04/01/2022 P 5 04/01/2022 P 1 07/18/2022 N 3 06/05/2022 P
;
run;
data want (drop=i _:);
set test;
array _strng {0:5} $30 ; /*Lower bound is 0, upper bound is 1 more than N of tests*/
array tid {*} testid_: ;
array dat {*} date_: ;
array res {*} result_: ;
/*element _strng{0} will be left blank, and element {5} will be 'FF'x */
_strng{5}= repeat('FF'x,29); /*Assign highest sort value */
do i=1 to 4;
_strng{i}=catx('!',tid{i},dat{i},res{i});
end;
call sortc(of _strng{*});
ndupes=0;
do i=1 to 4;
if _strng{i}=' ' then continue; /*Ignore blanks*/
else if _strng{i-1}<_strng{i}<_strng{i+1} then continue; /*Ignore unique values */
else ndupes=ndupes+1;
end;
run;