BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
ErinLM
Obsidian | Level 7

Hello,

 

I'm hoping someone can point me in the right direction.

 

I have a dataset containing up to 10 test types (in no particular order) and the corresponding test dates and test results. Unfortunately, something went wrong when our participants were uploading the data, and many tests were uploaded multiple times. I need to identify the records with duplicate entries (where the same test/date/result was entered more than once), and it would be good if I could identify how many duplicates (a count) for each record. 

 

As an example, I have: 

 

patIDtestID_1date_1result_1testID_2date_2result_2testID_3date_3result_3testID_4date_4result_4
1243101/01/20P201/12/20N101/01/20P101/01/20P
6495405/23/20N103/04/20P103/13/20N610/12/20P
4712307/23/20P307/23/20P106/15/20P605/17/20N
3021206/09/20N106/12/20N412/01/20P106/12/20N
8537504/01/20P504/01/20P107/18/20N306/05/20P

 

I want: 

patIDtestID_1date_1result_1testID_2date_2result_2testID_3date_3result_3testID_4date_4result_4dup_cnt
1243101/01/20P201/12/20N101/01/20P101/01/20P3
6495405/23/20N103/04/20P103/13/20N610/12/20P0
4712307/23/20P307/23/20P106/15/20P605/17/20N2
3021206/09/20N106/12/20N412/01/20P106/12/20N2
8537504/01/20P504/01/20P107/18/20N306/05/20P2

 

I've created concatenated strings for each test (e.g., 1_01/01/20_P or 1_01/12/20_N) for comparison; however, that's still 10 variables that have to be compared in combination. I started to try proc compare, but I would have to list out 45 combinations (I think) and I'm still not sure how to get the count for each record. 

 

Are there better, more efficient, approaches to do this? There are lots of posts on looking for duplicates across rows, but I'm struggling to find information on identifying duplicates across columns, but within the row.

 

Can anyone point me in the right direction? 

 

Thank you,

-EM

 

Code for example "have" table above if anyone wants it.

data test;
input patID testID_1 date_1 $ result_1 $
		 	testID_2 date_2 $ result_2 $
		 	testID_3 date_3 $ result_3 $
		 	testID_4 date_4 $ result_4 $;
datelines;
1243 1 01/01/2022 P 2 01/12/2022 N 1 01/01/2022 P 1 01/01/2022 P
6495 4 05/23/2022 N 1 03/04/2022 P 1 03/13/2022 N 6 10/12/2022 P
4712 3 07/23/2022 P 3 07/23/2022 P 1 06/15/2022 P 6 05/17/2022 N
3021 2 06/09/2022 N 1 06/12/2022 N 4 12/01/2022 P 1 06/12/2022 N
8537 5 04/01/2022 P	5 04/01/2022 P 1 07/18/2022 N 3 06/05/2022 P
;
run;

data test_comp; set test;
	test_str_1=catx('_', OF testID_1--result_1);
	test_str_2=catx('_', OF testID_2--result_2);
	test_str_3=catx('_', OF testID_3--result_3);
	test_str_4=catx('_', OF testID_4--result_4);
run;

 

1 ACCEPTED SOLUTION

Accepted Solutions
ErinLM
Obsidian | Level 7

Thank you for your response. I posted this over 3 weeks ago, so I have already found a solution (provided below).

data test;
input patID testID_1 date_1 $ result_1 $

                      testID_2 date_2 $ result_2 $

                      testID_3 date_3 $ result_3 $

                      testID_4 date_4 $ result_4 $;

datalines;
1243 1 01/01/2022 P 2 01/12/2022 N 1 01/01/2022 P 1 01/01/2022 P
6495 4 05/23/2022 N 1 03/04/2022 P 1 03/13/2022 N 6 10/12/2022 P
4712 3 07/23/2022 P 3 07/23/2022 P 1 06/15/2022 P 6 05/17/2022 N
3021 2 06/09/2022 N 1 06/12/2022 N 4 12/01/2022 P 1 06/12/2022 N
8537 5 04/01/2022 P 5 04/01/2022 P 1 07/18/2022 N 3 06/05/2022 P
;		
run;

data test;
  set test;
  	    test_str_1=catx('_', OF testID_1--result_1);
       	test_str_2=catx('_', OF testID_2--result_2);
       	test_str_3=catx('_', OF testID_3--result_3);
       	test_str_4=catx('_', OF testID_4--result_4);
 n+1;
run;



	/*Restructure Data*/
	DATA redo; SET test;

		ARRAY test_str{*} test_str: ;
	 
		DO i=1 TO dim(test_str);
		 	compare=test_str{i};
			OUTPUT;
		END;
		
		KEEP patID compare; 
	RUN;

	/*Identify and Count Duplicates*/

	PROC SQL; 
		DELETE FROM work.restrct
		WHERE compare = '';
		CREATE TABLE dups (drop = n)
			AS SELECT *, count(*) AS count
				FROM redo
					GROUP BY patID, compare
					ORDER BY n;
	QUIT;

	PROC SORT DATA=dups OUT=dropdup NODUPKEY ; 
		WHERE count > 1;
		BY patID compare count;
	RUN;

	PROC TRANSPOSE DATA=dropdup OUT=final (drop=_name_) ; 
	  VAR compare;
	  BY patID count;
	RUN;

View solution in original post

5 REPLIES 5
s_lassen
Meteorite | Level 14

I think you are almost there with the second datastep, just need to sort and count:

data want; 
    set test;
	test_str_1=catx('_', OF testID_1--result_1);
	test_str_2=catx('_', OF testID_2--result_2);
	test_str_3=catx('_', OF testID_3--result_3);
	test_str_4=catx('_', OF testID_4--result_4);
	array tests(*) test_str:;
	call sortc(of tests(*));
	dup_cnt=0;
	do _N_=1 to dim(tests)-1;
	  dup_cnt+tests(_N_)=tests(_N_+1);
	  end;
	/* add 1 if value is positive */
	dup_cnt+dup_cnt>0;
	drop test_str:;
run;
s_lassen
Meteorite | Level 14

On a second note, there may be a problem in my code if there are two pairs of duplicates in a record. In that case, DUP_CNT will be set to 3, and my guess is that you want 4. The easy solution (if that's acceptable) would be to drop the final statement that increases DUP_CNT if it is >1, and just say that the first record in a set of duplicates does not count as a duplicate.

ballardw
Super User

Almost every data set that has things "wide" like this, with date_1-date_n, testId_1-testid_n, result_1-result_n is poorly designed for use with SAS.

A structure with PatId, Test, Date and Result as the variables in a "long" format would be easier. And if part of this is to reduce duplicate entries then a SORT with NODUPKEY would get rid of the duplicates.

 

data test;
input patID testID_1 date_1 :mmddyy10. result_1 $
		 	testID_2 date_2 :mmddyy10. result_2 $
		 	testID_3 date_3 :mmddyy10. result_3 $
		 	testID_4 date_4 :mmddyy10. result_4 $;
format date: mmddyy10.;
datelines;
1243 1 01/01/2022 P 2 01/12/2022 N 1 01/01/2022 P 1 01/01/2022 P
6495 4 05/23/2022 N 1 03/04/2022 P 1 03/13/2022 N 6 10/12/2022 P
4712 3 07/23/2022 P 3 07/23/2022 P 1 06/15/2022 P 6 05/17/2022 N
3021 2 06/09/2022 N 1 06/12/2022 N 4 12/01/2022 P 1 06/12/2022 N
8537 5 04/01/2022 P	5 04/01/2022 P 1 07/18/2022 N 3 06/05/2022 P
;
run;

data long;
  set test;
  array t(*) testid: ;
  array d(*) date_: ;
  array r(*) result_: ;
  do i=1 to dim(t);
    test=t[i];
    date=d[i];
    result=r[i];
    output;
  end;
  keep patid test date result;
  format date mmddyy10.;
run;

/* one way to count*/
proc summary data=long nway;
   class patid test date result;
   output out=testcount (drop=_type_);
run;

/* get the unique results*/
proc sort data=long nodupkey dupout=theduplicates;
   by patid test date result;
run;  

Note: I used an actual date informat and format as having dates as other than actual date values is not optimal for almost any purpose.

The approach for creating the LONG data set doesn't even need any changes if you have 50 sets of values on one observation as long as the variable names are consistent with the use of the list building :  So testid: uses ALL values of variables whose names start with Testid.

 

ErinLM
Obsidian | Level 7

Thank you for your response. I posted this over 3 weeks ago, so I have already found a solution (provided below).

data test;
input patID testID_1 date_1 $ result_1 $

                      testID_2 date_2 $ result_2 $

                      testID_3 date_3 $ result_3 $

                      testID_4 date_4 $ result_4 $;

datalines;
1243 1 01/01/2022 P 2 01/12/2022 N 1 01/01/2022 P 1 01/01/2022 P
6495 4 05/23/2022 N 1 03/04/2022 P 1 03/13/2022 N 6 10/12/2022 P
4712 3 07/23/2022 P 3 07/23/2022 P 1 06/15/2022 P 6 05/17/2022 N
3021 2 06/09/2022 N 1 06/12/2022 N 4 12/01/2022 P 1 06/12/2022 N
8537 5 04/01/2022 P 5 04/01/2022 P 1 07/18/2022 N 3 06/05/2022 P
;		
run;

data test;
  set test;
  	    test_str_1=catx('_', OF testID_1--result_1);
       	test_str_2=catx('_', OF testID_2--result_2);
       	test_str_3=catx('_', OF testID_3--result_3);
       	test_str_4=catx('_', OF testID_4--result_4);
 n+1;
run;



	/*Restructure Data*/
	DATA redo; SET test;

		ARRAY test_str{*} test_str: ;
	 
		DO i=1 TO dim(test_str);
		 	compare=test_str{i};
			OUTPUT;
		END;
		
		KEEP patID compare; 
	RUN;

	/*Identify and Count Duplicates*/

	PROC SQL; 
		DELETE FROM work.restrct
		WHERE compare = '';
		CREATE TABLE dups (drop = n)
			AS SELECT *, count(*) AS count
				FROM redo
					GROUP BY patID, compare
					ORDER BY n;
	QUIT;

	PROC SORT DATA=dups OUT=dropdup NODUPKEY ; 
		WHERE count > 1;
		BY patID compare count;
	RUN;

	PROC TRANSPOSE DATA=dropdup OUT=final (drop=_name_) ; 
	  VAR compare;
	  BY patID count;
	RUN;
mkeintz
PROC Star

I recognize that this code has already been accepted as a solution.  But it is a LOT of steps for what could be a very short single data step.   

 

Simply make an array of test strings as has been done in the solution, but add two elements to the array, a blank test string, and a test string of hexadecimal FF's.  So the test string array is two elements larger than the number of tests.

 

Sort the array (so the blank(s) will sort low, and the 'FF'x will sort high).

 

Loop through the sorted array, counting only elements that are not blank and are not unique (i.e. not equal to either of its neighbors):

 

data test;
input patID testID_1 date_1 $ result_1 $
		 	testID_2 date_2 $ result_2 $
		 	testID_3 date_3 $ result_3 $
		 	testID_4 date_4 $ result_4 $;
datalines;
1243 1 01/01/2022 P 2 01/12/2022 N 1 01/01/2022 P 1 01/01/2022 P
6495 4 05/23/2022 N 1 03/04/2022 P 1 03/13/2022 N 6 10/12/2022 P
4712 3 07/23/2022 P 3 07/23/2022 P 1 06/15/2022 P 6 05/17/2022 N
3021 2 06/09/2022 N 1 06/12/2022 N 4 12/01/2022 P 1 06/12/2022 N
8537 5 04/01/2022 P	5 04/01/2022 P 1 07/18/2022 N 3 06/05/2022 P
run;


data want (drop=i _:);
  set test;
  array _strng {0:5} $30  ;  /*Lower bound is 0, upper bound is 1 more than N of tests*/

  array tid {*} testid_: ;
  array dat {*} date_:   ;
  array res {*} result_: ;

  /*element _strng{0} will be left blank, and element {5} will be 'FF'x */
  _strng{5}=  repeat('FF'x,29);  /*Assign highest sort value */

  do i=1 to 4;
    _strng{i}=catx('!',tid{i},dat{i},res{i});
  end;
  call sortc(of _strng{*});

  ndupes=0;
  do i=1 to 4;
    if _strng{i}=' ' then continue;                           /*Ignore blanks*/
    else if _strng{i-1}<_strng{i}<_strng{i+1} then continue;  /*Ignore unique values */
    else ndupes=ndupes+1;
  end;
run;

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 5 replies
  • 1100 views
  • 0 likes
  • 4 in conversation