I am working on a long form dataset with multiple lab tests (labeled LOINC1, LOINC 2…), result dates, etc. as shown in the snapshot.
LOINCS can be repeated for an observation, meaning, same tests might have been performed multiple times. A particular observation might have 5 repeated tests or all unique tests. Another one might have two different tests repeated 3 times each, etc.
If same test was performed within 3 months interval (for example), I want to de-duplicate the tests FOR THAT OBSERVATION. But if the result dates are 4 months apart for same tests, I want both to stay. Any suggestions to tackle this would be much appreciated!
Thank you LinusH! an "Observation" is a patient's unique ID. the data was transposed from long to wide, not wide to long. so, in it's orginal form, the data looked like..
ID LOINCS_1 RESULT_DATE
123 987 01-SEPT-XXXX
123 945 07-DEC-XXXX
134 945 21-OCT-XXXX
186 756 23-JAN-XXXX
186 654 12-JAN-XXXX
234 654 14-FEB-XXXX
234 987 27-MAR-XXXX
data in this form was transposed to a wider format as in the earlier attachment.
Hope that clarifies..?
The simplest way is using proc means + idgroup option.
data have;
input (ID LOINCS_1 RESULT_DATE) (: $40.);
cards;
123 987 01-SEPT-XXXX
123 945 07-DEC-XXXX
134 945 21-OCT-XXXX
186 756 23-JAN-XXXX
186 654 12-JAN-XXXX
234 654 14-FEB-XXXX
234 987 27-MAR-XXXX
;
run;
proc sql;
select max(n) into : n
from (select count(*) as n from have group by id)
quit;
proc summary data=have;
by id;
output out=want idgroup(out[&n] (LOINCS_1 RESULT_DATE)=);
quit;
If you have big data, you could check MERGE skill:
http://support.sas.com/resources/papers/proceedings15/2785-2015.pdf
What was the rational for the original transformation?
I think processing your rules in a long format would be easier?
Sort by ID, test and date, thenu se the dif or lag function to get the time from last test to determine if it's a duplicate record.
For the analysis that I am doing, it was important for me to see records per ID (per patient), and I thought wider format would be useful. Specifically, I had to look at statistics like # of cases- "with ONLY test A and B", "Only test B", Test B and C", "Only B and C" etc., and it sounded easier for me to pull from wider table than long one.
If it works, I don't have problems going back to the long format (again. But something i probably should have clarified before- the labs are NOT true duplicates. I want to treat like duplicates if it is the same test performed within certain duration (like 3 months).
how would you duplicate using the long format? any quick hints/sample codes? dif/lag fucntions?
thank you
/*1*/
proc sort data=have;
by ID test date;
run;
/*2*/
data want;
set have;
by id test;
retain count 0;
prev_test=lag(date);
if first.test then do;
prev_test=.;
count=0;
end;
if date-prev_date > 90 then count+1;
run;
Duplicates will be grouped and you can take the first per group. It's untested but should get you started.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.