Re: Simplistic code for identifying multiple duplicate date pairs

carmong · Posted 02-10-2022 02:05 PM

I am in the process of writing code to identify how many duplicate date pairs a record has. These records contain multiple admission and discharge dates and I have written the following code below, but is there a more simplistic code that will identify the number of duplicate date pairs that a record has?

Sample data

Obs

ICU_AdmitDate

ICU_DischargeDate

ICU_AdmitDate2

ICU_DischargeDate2

ICU_AdmitDate3

ICU_DischargeDate3

ICU_AdmitDate4

ICU_DischargeDate4

ICU_AdmitDate5

ICU_DischargeDate5

ICU_AdmitDate6

ICU_DischargeDate6

1

1-Jul-21

2-Jul-21

3-Jul-21

4-Jul-21

5-Jul-21

6-Jul-21

19-Aug-21

8

2-Jul-21

3-Jul-21

4-Jul-21

5-Jul-21

6-Jul-21

7-Jul-21

31-Dec-20

15-Jan-21

31-Dec-20

15-Jan-21

31-Dec-20

15-Jan-21

9

3-Jul-21

4-Jul-21

5-Jul-21

6-Jul-21

7-Jul-21

8-Jul-21

11-Nov-21

12

4-Jul-21

5-Jul-21

6-Jul-21

7-Jul-21

8-Jul-21

9-Jul-21

3-Nov-21

4-Nov-21

15

5-Jul-21

6-Jul-21

7-Jul-21

8-Jul-21

9-Jul-21

10-Jul-21

31-Jul-21

5-Aug-21

Code to identify duplicates

data icutwo; set icutwo;

if ICU_AdmitDate = ICU_AdmitDate2 and ICU_DischargeDate = ICU_DischargeDate2 then dup = '1';

else dup = 'no';

run;

data icutthre; set icutthre;

if ICU_AdmitDate2 = ICU_AdmitDate3 and ICU_DischargeDate2 = ICU_DischargeDate3 then dup2 = '2';

else dup2 = 'no';

data icufour; set icufour;

if ICU_AdmitDate3 = ICU_AdmitDate4 and ICU_DischargeDate3 = ICU_DischargeDate4 then dup3 = '3';

else dup3 = 'no';

data icufive; set icufive;

if ICU_AdmitDate4 = ICU_AdmitDate5 and ICU_DischargeDate4 = ICU_DischargeDate5 then dup4 = '4';

else dup4 = 'no';

data icusix; set icusix;

if ICU_AdmitDate5 = ICU_AdmitDate6 and ICU_DischargeDate5 = ICU_DischargeDate6 then dup5 = '5';

else dup5 = 'no';

run;

Code to count duplicate date pairs (is there a simpler way to write this)

if dup = '1' and dup2 = . and dup3 = . and dup4 = . and dup5 = . then dupnum = 1;

if dup = '1' and dup2 = '2' and dup3 = . and dup4 = . and dup5 = . then dupnum = 2;

if dup = '1' and dup2 = '2' and dup3 = '3' and dup4 = . and dup5 = . then dupnum = 3;

if dup = '1' and dup2 = '2' and dup3 = '3' and dup4 = '4' and dup5 = . then dupnum = 4;

if dup = '1' and dup2 = '2' and dup3 = '3' and dup4 = '4' and dup5 = '5' then dupnum = 5;

PaigeMiller · Posted 02-10-2022 02:26 PM

This seems to be a job for ARRAYs here. ARRAYs allow you to perform the same action on many columns, which is what you are trying to do. You would help yourself greatly to learn about ARRAYs.

If I understand you properly, this code should work. This code is UNTESTED, as I cannot test my program on data in screen captures. The proper way to provide data, which allows us to send you TESTED code, is by following these instructions.

data want;
    set have;
    length dup $ 3
    dup='no';
    array admit icu_admitdate:;
    array discharge icu_dischargedate:;
    do i = 2 to dim (admit);
         if admit(i-1)=admit(i) and discharge(i-1)=discharge(i) then dup=put(i,3.);
    end;
    drop i;
run;

--
Paige Miller

carmong · Posted 02-11-2022 01:27 PM

Thank you very much for your suggestion. Do you have any recommendations of supplemental material to help with learning arrays? I have the little SAS book, but would like more indepth practice.

PaigeMiller · Posted 02-11-2022 01:30 PM

Several examples here:

https://documentation.sas.com/doc/en/pgmmvacdc/9.4/lestmtsref/p08do6szetrxe2n136ush727sbuo.htm

--
Paige Miller

carmong · Posted 02-11-2022 06:37 PM

I tried re-writting the array but my results did not come out right. I attempted to follow the directions from the Jedi SAS tricks, but I didn't understand how to convert that over to my dataset.

data icu1dup;
dup = 0;
set icu1;
array admit {*} ICU_AdmitDate ICU_AdmitDate2 ICU_AdmitDate3 ICU_AdmitDate4 ICU_AdmitDate5 ICU_AdmitDate6;
array discharge {*}ICU_DischargeDate ICU_DischargeDate2 ICU_DischargeDate3 ICU_DischargeDate4 ICU_DischargeDate5 ICU_DischargeDate6;
do i = 2 to dim (admit);
if admit(i-1)=admit(i) and discharge(i-1)=discharge(i) then
do;
dup = i + 1;
leave;
end;
end;
run;

Tom · Posted 02-11-2022 07:48 PM

Define "not come out right".

Did it not run? Show the error message.

Not get what you wanted? How is it different than what you wanted?

When posting logs or code make sure to use the Insert Code or Insert SAS Code icon on the forum editor to get a pop-up text box where you can paste and/or edit the text. This will preserve the formatting. If you just paste it into the body of you message directly it will be re-flowed into paragraphs by the HTML rendered.

carmong · Posted 02-11-2022 08:05 PM

The results did not come out as expected. If there were no duplicate pairs I would expect dup to = 0. Posted below is an output sample:

Obs	ICU_AdmitDate	ICU_DischargeDate	ICU_AdmitDate2	ICU_DischargeDate2	ICU_AdmitDate3	ICU_DischargeDate3	ICU_AdmitDate4	ICU_DischargeDate4	ICU_AdmitDate5	ICU_DischargeDate5	ICU_AdmitDate6	ICU_DischargeDate6	dup
1	4-Apr-21	7-Apr-21	10-Apr-21	10-Apr-21	.	.	.	.	.	.	.	.	5
2	18-Oct-21	25-Oct-21	25-Oct-21	5-Nov-21	5-Nov-21	27-Nov-21	.	.	.	.	.	.	6
3	29-Jul-21	10-Sep-21	.	.	.	.	.	.	.	.	.	.	4
4	27-Jun-20	14-Jul-20	.	.	.	.	.	.	.	.	.	.	4
5	26-Dec-20	28-Dec-20	.	.	.	.	.	.	.	.	.	.	4
6	20-Jan-21	20-Jan-21	21-Jan-21	2-Feb-21	.	.	.	.	.	.	.	.	5
7	30-Mar-20	11-Apr-20	.	.	.	.	.	.	.	.	.	.	4
8	23-Apr-21	25-Apr-21	.	.	.	.	.	.	.	.	.	.	4
9	17-Jan-21	5-Feb-21	.	.	.	.	.	.	.	.	.	.	4
10	29-Dec-20	29-Dec-20	12-Jan-21	22-Jan-21	.	.	.	.	.	.	.	.	5

Tom · Posted 02-11-2022 08:14 PM

And the results you wanted were what?

You have a lot of duplicate pairs. All of those missing values are duplicates based on your current logic.

By the way why is your data organized like that anyway? Why do have so many admit dates per observation?

Why not just have one observation per date pair? Then you can use use BY group processing to locate duplicate observations.

carmong · Posted 02-14-2022 12:34 PM

some observations have multiple admit and discharge dates, this is the way the database is set up to count the number of admissions because some patients will have more than one in a certain time period.

Tom · Posted 02-14-2022 12:45 PM

@carmong wrote:
some observations have multiple admit and discharge dates, this is the way the database is set up to count the number of admissions because some patients will have more than one in a certain time period.

It is simple to ignore the missing values. If all of the missing values are at the "end" then just change the upper bound on your DO loop. Instead of using the dimension of the array use the count of non-missing values by using the N() function. So if your array is named ADM then the number of non-missing values in it is N(of adm[*]). If there are "holes" in the array then change the logic that is detecting duplicates to ignore duplicates that are missing values.

If the patients have more than one stay then you can just create multiple observations for that patient. Then there is no need to set a limit on how many stays they can have. Perhaps if you check into where you are getting the file they actually already have it in that form and they had to do work to roll the multiple observations into the single observation you are getting now.

Quentin · Posted 02-14-2022 01:46 PM

Yes, please listen to @Tom. Even if the database is set up in this wide format (which would be a very poor decision by the database designer), you as an analyst are always free to reshape the data in SAS to make your work possible / faster / easier / etc.

If you transpose this into long narrow format with just three variables (PatientID, ICU_AdmitDate, ICU_DischargeDate), this task will become MUCH easier to code, and about 99% of the tasks you will want to do with the data will also become much easier.

In many cases, it is faster/better/safer to re-shape data into a usable structure rather than write a lot of code to deal with data in a poor structure.

This problem is not unusual. Many people coming to analytics without training in data modeling, naturally think in terms of "one row per patient." But once you free yourself of that constraint, the programming becomes much easier. Even if your boss or somebody asks your for an analysis and they specify their request in terms of one record per patient, looping over admissions, you as the programmer can understand that they do that because they are thinking in terms of Excel, and you can implement it differently.

PaigeMiller · Posted 02-12-2022 06:10 AM

I attempted to follow the directions from the Jedi SAS tricks, but I didn't understand how to convert that over to my dataset.

@carmong

Twice in this thread, you tell us it doesn't work, but you don't tell us WHY it doesn't work and you don't tell us or show us WHAT you did and you don't tell us WHAT happened. Telling us something didn't work and provide no other information never is useful. The follow-up question is always: why didn't it work and then SHOW US what did you do and then SHOW US what happened. From now on, without us asking, you need to tell us these things.

--
Paige Miller

carmong · Posted 02-14-2022 01:47 PM

The array code (posted below) that I used below did not produce an error, but I am trying to determine where I went wrong in altering the code because the results of the dup variable did not come out as desired (posted below) so I must have not altered the code correctly. What I am trying to gather is how many duplicate date pairs (for example does admit 1 and discharge 1 = admit 2 and discharge 2, does admit 2 and discharge 2 = admit 3 and discharge 3 ex). If admit 1 and discharge 1 was the same as admit 2 and discharge 2, then I would expect my dup variable calculated in the array to = 1. If admit 1 and discharge 1 were the same as admit 2 and discharge 2 which was also the same as admit 3 and discharge 3 then I would expect my dup variable to = 2. I have posted a small part of the output and a column of what I would want the dup variable to be.

Code:

data icu1dup;
dup = 0;
set icu1;
array admit {*} ICU_AdmitDate ICU_AdmitDate2 ICU_AdmitDate3 ICU_AdmitDate4 ICU_AdmitDate5 ICU_AdmitDate6;
array discharge {*}ICU_DischargeDate ICU_DischargeDate2 ICU_DischargeDate3 ICU_DischargeDate4 ICU_DischargeDate5 ICU_DischargeDate6;
do i = 2 to dim (admit);
if admit(i-1)=admit(i) and discharge(i-1)=discharge(i) then
do;
dup = i + 1;
leave;
end;
end;
run;

Log:

NOTE: There were 24900 observations read from the data set WORK.ICU.
NOTE: The data set WORK.ICU1 has 23556 observations and 264 variables.
NOTE: DATA statement used (Total process time):
real time 0.96 seconds
cpu time 0.24 seconds

15 data icu1dup;
16 dup = 0;
17 set icu1;
18 array admit {*} ICU_AdmitDate ICU_AdmitDate2 ICU_AdmitDate3 ICU_AdmitDate4 ICU_AdmitDate5
18 ! ICU_AdmitDate6;
19 array discharge {*}ICU_DischargeDate ICU_DischargeDate2 ICU_DischargeDate3
19 ! ICU_DischargeDate4 ICU_DischargeDate5 ICU_DischargeDate6;
20 do i = 2 to dim (admit);
21 if admit(i-1)=admit(i) and discharge(i-1)=discharge(i) then
22 do;
23 dup = i + 1;
24 leave;
25 end;
26 end;
27 run;

NOTE: There were 23556 observations read from the data set WORK.ICU1.
NOTE: The data set WORK.ICU1DUP has 23556 observations and 266 variables.
NOTE: DATA statement used (Total process time):
real time 0.70 seconds
cpu time 0.32 seconds

Output (small portion)

Obs	ICU_AdmitDate	ICU_DischargeDate	ICU_AdmitDate2	ICU_DischargeDate2	ICU_AdmitDate3	ICU_DischargeDate3	ICU_AdmitDate4	ICU_DischargeDate4	ICU_AdmitDate5	ICU_DischargeDate5	ICU_AdmitDate6	ICU_DischargeDate6	dup	what I would expect
127	7-Jul-20	8-Jul-20	08-Jul-20	08-Jul-20	08-Jul-20	09-Jul-20	09-Jul-20	10-Jul-20	.	.	.	.	7	0
131	6-Oct-21	6-Oct-21	06-Oct-21	11-Oct-21	.	.	.	.	.	.	.	.	5	0
132	23-Nov-20	23-Nov-20	23-Nov-20	23-Nov-20	23-Nov-20	23-Nov-20	23-Nov-20	23-Nov-20	23-Nov-20	23-Nov-20	23-Nov-20	24-Nov-20	3	6
134	23-Jul-20	23-Jul-20	.	23-Jul-20	.	24-Jul-20	.	.	.	.	.	.	6	0

Registration is open

SAS Training: Just a Click Away