BookmarkSubscribeRSS Feed
carmong
Obsidian | Level 7

I am in the process of writing code to identify how many duplicate date pairs a record has. These records contain multiple admission and discharge dates and I have written the following code below, but is there a more simplistic code that will identify the number of duplicate date pairs that a record has?

 

Sample data

ObsICU_AdmitDateICU_DischargeDateICU_AdmitDate2ICU_DischargeDate2ICU_AdmitDate3ICU_DischargeDate3ICU_AdmitDate4ICU_DischargeDate4ICU_AdmitDate5ICU_DischargeDate5ICU_AdmitDate6ICU_DischargeDate6
11-Jul-212-Jul-213-Jul-214-Jul-215-Jul-216-Jul-2119-Aug-2119-Aug-2119-Aug-2119-Aug-2119-Aug-2119-Aug-21
82-Jul-213-Jul-214-Jul-215-Jul-216-Jul-217-Jul-2131-Dec-2015-Jan-2131-Dec-2015-Jan-2131-Dec-2015-Jan-21
93-Jul-214-Jul-215-Jul-216-Jul-217-Jul-218-Jul-2111-Nov-2111-Nov-2111-Nov-2111-Nov-2111-Nov-2111-Nov-21
124-Jul-215-Jul-216-Jul-217-Jul-218-Jul-219-Jul-213-Nov-214-Nov-214-Nov-214-Nov-214-Nov-214-Nov-21
155-Jul-216-Jul-217-Jul-218-Jul-219-Jul-2110-Jul-2131-Jul-215-Aug-215-Aug-215-Aug-215-Aug-215-Aug-21

 

Code to identify duplicates

data icutwo; set icutwo;

if ICU_AdmitDate = ICU_AdmitDate2 and ICU_DischargeDate = ICU_DischargeDate2 then dup = '1';

else dup = 'no';

run;

data icutthre; set icutthre;

if ICU_AdmitDate2 = ICU_AdmitDate3 and ICU_DischargeDate2 = ICU_DischargeDate3 then dup2 = '2';

else dup2 = 'no';

data icufour; set icufour;

if ICU_AdmitDate3 = ICU_AdmitDate4 and ICU_DischargeDate3 = ICU_DischargeDate4 then dup3 = '3';

else dup3 = 'no';

data icufive; set icufive;

if ICU_AdmitDate4 = ICU_AdmitDate5 and ICU_DischargeDate4 = ICU_DischargeDate5 then dup4 = '4';

else dup4 = 'no'; 

data icusix; set icusix;

if ICU_AdmitDate5 = ICU_AdmitDate6 and ICU_DischargeDate5 = ICU_DischargeDate6 then dup5 = '5';

else dup5 = 'no';

run;

 

Code to count duplicate date pairs (is there a simpler way to write this)

if dup = '1' and dup2 = . and dup3 = . and dup4 = . and dup5 = . then dupnum = 1;

if dup = '1' and dup2 = '2' and dup3 = . and dup4 = . and dup5 = . then dupnum = 2;

if dup = '1' and dup2 = '2' and dup3 = '3' and dup4 = . and dup5 = . then dupnum = 3;

if dup = '1' and dup2 = '2' and dup3 = '3' and dup4 = '4' and dup5 = . then dupnum = 4;

if dup = '1' and dup2 = '2' and dup3 = '3' and dup4 = '4' and dup5 = '5' then dupnum = 5;

 

 

 

 

 

12 REPLIES 12
PaigeMiller
Diamond | Level 26

This seems to be a job for ARRAYs here. ARRAYs allow you to perform the same action on many columns, which is what you are trying to do. You would help yourself greatly to learn about ARRAYs.


If I understand you properly, this code should work. This code is UNTESTED, as I cannot test my program on data in screen captures. The proper way to provide data, which allows us to send you TESTED code, is by following these instructions.

 

data want;
    set have;
    length dup $ 3
    dup='no';
    array admit icu_admitdate:;
    array discharge icu_dischargedate:;
    do i = 2 to dim (admit);
         if admit(i-1)=admit(i) and discharge(i-1)=discharge(i) then dup=put(i,3.);
    end;
    drop i;
run;

 

--
Paige Miller
carmong
Obsidian | Level 7

Thank you very much for your suggestion. Do you have any recommendations of supplemental material to help with learning arrays? I have the little SAS book, but would like more indepth practice.

carmong
Obsidian | Level 7

I tried re-writting the array but my results did not come out right. I attempted to follow the directions from the Jedi SAS tricks, but I didn't understand how to convert that over to my dataset. 

 

data icu1dup;
dup = 0;
set icu1;
array admit {*} ICU_AdmitDate ICU_AdmitDate2 ICU_AdmitDate3 ICU_AdmitDate4 ICU_AdmitDate5 ICU_AdmitDate6;
array discharge {*}ICU_DischargeDate ICU_DischargeDate2 ICU_DischargeDate3 ICU_DischargeDate4 ICU_DischargeDate5 ICU_DischargeDate6;
do i = 2 to dim (admit);
if admit(i-1)=admit(i) and discharge(i-1)=discharge(i) then
do;
dup = i + 1;
leave;
end;
end;
run;

Tom
Super User Tom
Super User

Define "not come out right".

Did it not run?  Show the error message.

Not get what you wanted? How is it different than what you wanted?

 

When posting logs or code make sure to use the Insert Code or Insert SAS Code icon on the forum editor to get a pop-up text box where you can paste and/or edit the text.  This will preserve the formatting.  If you just paste it into the body of you message directly it will be re-flowed into paragraphs by the HTML rendered.

carmong
Obsidian | Level 7

The results did not come out as expected. If there were no duplicate pairs I would expect dup to = 0. Posted below is an output sample:

 

ObsICU_AdmitDateICU_DischargeDateICU_AdmitDate2ICU_DischargeDate2ICU_AdmitDate3ICU_DischargeDate3ICU_AdmitDate4ICU_DischargeDate4ICU_AdmitDate5ICU_DischargeDate5ICU_AdmitDate6ICU_DischargeDate6dup
14-Apr-217-Apr-2110-Apr-2110-Apr-21........5
218-Oct-2125-Oct-2125-Oct-215-Nov-215-Nov-2127-Nov-21......6
329-Jul-2110-Sep-21..........4
427-Jun-2014-Jul-20..........4
526-Dec-2028-Dec-20..........4
620-Jan-2120-Jan-2121-Jan-212-Feb-21........5
730-Mar-2011-Apr-20..........4
823-Apr-2125-Apr-21..........4
917-Jan-215-Feb-21..........4
1029-Dec-2029-Dec-2012-Jan-2122-Jan-21........5
Tom
Super User Tom
Super User

And the results you wanted were what?

You have a lot of duplicate pairs.  All of those missing values are duplicates based on your current logic.

 

By the way why is your data organized like that anyway?  Why do have so many admit dates per observation?

Why not just have one observation per date pair? Then you can use use BY group processing to locate duplicate observations.

carmong
Obsidian | Level 7
some observations have multiple admit and discharge dates, this is the way the database is set up to count the number of admissions because some patients will have more than one in a certain time period.
Tom
Super User Tom
Super User

@carmong wrote:
some observations have multiple admit and discharge dates, this is the way the database is set up to count the number of admissions because some patients will have more than one in a certain time period.

It is simple to ignore the missing values.  If all of the missing values are at the "end" then just change the upper bound on your DO loop.  Instead of using the dimension of the array use the count of non-missing values by using the N() function.  So if your array is named ADM then the number of non-missing values in it is N(of adm[*]).  If there are "holes" in the array then change the logic that is detecting duplicates to ignore duplicates that are missing values.

 

If the patients have more than one stay then you can just create multiple observations for that patient. Then there is no need to set a limit on how many stays they can have.   Perhaps if you check into where you are getting the file they actually already have it in that form and they had to do work to roll the multiple observations into the single observation you are getting now.

Quentin
Super User

Yes, please listen to @Tom.  Even if the database is set up in this wide format (which would be a very poor decision by the database designer), you as an analyst are always free to reshape the data in SAS to make your work possible / faster / easier / etc.

 

If you transpose this into long narrow format with just three variables (PatientID, ICU_AdmitDate, ICU_DischargeDate), this task will become MUCH easier to code, and about 99% of the tasks you will want to do with the data will also become much easier.  

 

In many cases, it is faster/better/safer to re-shape data into a usable structure rather than write a lot of code to deal with data in a poor structure.

 

This problem is not unusual. Many people coming to analytics without training in data modeling, naturally think in terms of "one row per patient." But once you free yourself of that constraint, the programming becomes much easier.  Even if your boss or somebody asks your for an analysis and they specify their request in terms of one record per patient, looping over admissions, you as the programmer can understand that they do that because they are thinking in terms of Excel, and you can implement it differently.

PaigeMiller
Diamond | Level 26

I attempted to follow the directions from the Jedi SAS tricks, but I didn't understand how to convert that over to my dataset.

 

@carmong 

Twice in this thread, you tell us it doesn't work, but you don't tell us WHY it doesn't work and you don't tell us or show us WHAT you did and you don't tell us WHAT happened. Telling us something didn't work and provide no other information never is useful. The follow-up question is always: why didn't it work and then SHOW US what did you do and then SHOW US what happened. From now on, without us asking, you need to tell us these things.

--
Paige Miller
carmong
Obsidian | Level 7

The array code (posted below) that I used below did not produce an error, but I am trying to determine where I went wrong in altering the code because the results of the dup variable did not come out as desired (posted below) so I must have not altered the code correctly. What I am trying to gather is how many duplicate date pairs (for example does admit 1 and discharge 1 = admit 2 and discharge 2, does admit 2 and discharge 2 = admit 3 and discharge 3 ex). If admit 1 and discharge 1 was the same as admit 2 and discharge 2, then I would expect my dup variable calculated in the array to = 1. If admit 1 and discharge 1 were the same as admit 2 and discharge 2 which was also the same as admit 3 and discharge 3 then I would expect my dup variable to = 2. I have posted a small part of the output and a column of what I would want the dup variable to be.  

 

Code:

data icu1dup;
dup = 0;
set icu1;
array admit {*} ICU_AdmitDate ICU_AdmitDate2 ICU_AdmitDate3 ICU_AdmitDate4 ICU_AdmitDate5 ICU_AdmitDate6;
array discharge {*}ICU_DischargeDate ICU_DischargeDate2 ICU_DischargeDate3 ICU_DischargeDate4 ICU_DischargeDate5 ICU_DischargeDate6;
do i = 2 to dim (admit);
if admit(i-1)=admit(i) and discharge(i-1)=discharge(i) then
do;
dup = i + 1;
leave;
end;
end;
run;

 

Log:

NOTE: There were 24900 observations read from the data set WORK.ICU.
NOTE: The data set WORK.ICU1 has 23556 observations and 264 variables.
NOTE: DATA statement used (Total process time):
real time 0.96 seconds
cpu time 0.24 seconds


15 data icu1dup;
16 dup = 0;
17 set icu1;
18 array admit {*} ICU_AdmitDate ICU_AdmitDate2 ICU_AdmitDate3 ICU_AdmitDate4 ICU_AdmitDate5
18 ! ICU_AdmitDate6;
19 array discharge {*}ICU_DischargeDate ICU_DischargeDate2 ICU_DischargeDate3
19 ! ICU_DischargeDate4 ICU_DischargeDate5 ICU_DischargeDate6;
20 do i = 2 to dim (admit);
21 if admit(i-1)=admit(i) and discharge(i-1)=discharge(i) then
22 do;
23 dup = i + 1;
24 leave;
25 end;
26 end;
27 run;

NOTE: There were 23556 observations read from the data set WORK.ICU1.
NOTE: The data set WORK.ICU1DUP has 23556 observations and 266 variables.
NOTE: DATA statement used (Total process time):
real time 0.70 seconds
cpu time 0.32 seconds

 

Output (small portion)

 

ObsICU_AdmitDateICU_DischargeDateICU_AdmitDate2ICU_DischargeDate2ICU_AdmitDate3ICU_DischargeDate3ICU_AdmitDate4ICU_DischargeDate4ICU_AdmitDate5ICU_DischargeDate5ICU_AdmitDate6ICU_DischargeDate6dupwhat I would expect
1277-Jul-208-Jul-2008-Jul-2008-Jul-2008-Jul-2009-Jul-2009-Jul-2010-Jul-20....70
1316-Oct-216-Oct-2106-Oct-2111-Oct-21........50
13223-Nov-2023-Nov-2023-Nov-2023-Nov-2023-Nov-2023-Nov-2023-Nov-2023-Nov-2023-Nov-2023-Nov-2023-Nov-2024-Nov-2036
13423-Jul-2023-Jul-20.23-Jul-20.24-Jul-20......60

 

 

 

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 12 replies
  • 2909 views
  • 1 like
  • 4 in conversation