Hi all,
I am looking for help in identifying duplicate organisms within 14 days of each other.
Here is the description of my data set.
I have a blood cultures data set with ID_No (Unique for each patient), Adm_No (unique for each admission), Test_No (unique for each blood culture), Culture_Date (unique and tied to each accession number) and PrintName1- PrintName 5 (Number of organisms isolated for each culture ranging from 1 to 5, individual organism listed under PrintName 1- PrintName 5, not necessary that each culture has 5 organisms).
I need to figure out duplicate organisms within 14 days of each other for a given patient and count only non-duplicate organisms. For now, my data set is in short form, one row per blood culture test (i.e. one row for each Test_no). How do I compare first blood culture test for a patients with subsequent ones to see if they were duplicates, comparing PrintName 1 through PrintName5? However, many patients have only single culture, and for these patients PrintName 1-5 are empty.
Thanks for your help in advance!
SM
Sample data in the form of a working data step please.
Help us help you.
And also a data set showing how you want the results to look like.
Some addition items to consider:
1) Does "duplicate" mean that all organisms match? Any match? More than one if more than one is present but not all match?
2) Spelling. Have you checked to see that the names of these organisms are spelled the same consistently? "Matching" usually implies values are the same and if you have different people doing data entry you may have quite different spellings if one uses "E Coli", another "Escherichia coli" and a third "E coli". I am sure you can think of potential others.
3) Does duplicate mean Printname1 has to be the same as Printname1 or can Printname1 match Printname3 at the other time point as a duplicate?
4) before proceeding at all you want to make sure the dates are SAS date values, numeric with an appropriate format for people to recognize the values. There are several functions that are specifically designed to calculate intervals between dates
You may also want to provide some things like how many of these per patient are involved. Some solutions that may work with only two or three records can fall apart quickly if there are 100s.
Thank you for the response. I have attached an excel sheet. I Highlighted a couple of examples, Pt_ID 3 and 4, the second test for Pt_ID 3 and Pt_ID 4 are duplicates and I need to tag them as duplicates.
*IDENTIFY THE 14-DAY BLOCKS (I USED ACTUAL DATES INSTEAD OF HOURS);
data blood_cultures_blocks;
format block 3.0 lag_dt datetime.;
set Inp_BC_Postv;
by Pt_ID;
lag_dt = lag(collection_date);
retain lag_dt;
if first.Pt_ID then do;
block = 1;
lag_dt = .;
end;
else if datepart(collection_date) - datepart(lag_dt) > 14 then block + 1;
run;
proc sort data=blood_cultures_blocks;
by PT_ID block;
run;
data blood_cultures_collapsed;
set blood_cultures_blocks;
by Pt_ID block;
lag_dt = lag(collection_date);
array bugs[30] $30;
array print_name[5] $30 print_name1-print_name5;
retain bugs lag_dt;
if first.Pt_ID then do;
call missing(of bugs[*]);
a = 1;
end;
if datepart(collection_date) - datepart(lag_dt) > 14 then do;
call missing(of bugs[*]);
end;
do i = 1 to 5;
if print_name[i] not in bugs then do;
bugs[a] = print_name[i];
a + 1;
end;
end;
run;
1) Does "duplicate" mean that all organisms match? Any match? More than one if more than one is present but not all match? For this, Any match is what I am looking for, i.e. if 1st Test for Pt1 has Staphylococcus and Streptococcus, and the 2nd Test for the same pt has Streptococcus within 14 days, then 2nd test is a duplicate test. However, if second test as a separate organism E. Coli along with Streptococcus, then I would consider it as a separate test.
2) Spelling. Have you checked to see that the names of these organisms are spelled the same consistently? "Matching" usually implies values are the same and if you have different people doing data entry you may have quite different spellings if one uses "E Coli", another "Escherichia coli" and a third "E coli". I am sure you can think of potential others. Spell check and other naming conventions are taken care off.
3) Does duplicate mean Printname1 has to be the same as Printname1 or can Printname1 match Printname3 at the other time point as a duplicate? Printname1 can match Printname3. I need to check all combinations from printname1-5.
4) before proceeding at all you want to make sure the dates are SAS date values, numeric with an appropriate format for people to recognize the values. There are several functions that are specifically designed to calculate intervals between dates. I have Date_Culture as date variable only, so looking for 14 days difference (not worried about exact DateTime).
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.