BookmarkSubscribeRSS Feed
TEJ
Calcite | Level 5 TEJ
Calcite | Level 5


I am working on a long form dataset with multiple lab tests (labeled LOINC1, LOINC 2…), result dates, etc. as shown in the snapshot.
LOINCS can be repeated for an observation, meaning, same tests might have been performed multiple times. A particular observation might have 5 repeated tests or all unique tests. Another one might have two different tests repeated 3 times each, etc.
If same test was performed within 3 months interval (for example), I want to de-duplicate the tests FOR THAT OBSERVATION. But if the result dates are 4 months apart for same tests, I want both to stay. Any suggestions to tackle this would be much appreciated!

6 REPLIES 6
LinusH
Tourmaline | Level 20
So the question areises: what is an observation? Can't see any clear primary key there?
Given that the data was stored normalized (transposed from wide to long) an with a key, the operation would be simpler.
Data never sleeps
TEJ
Calcite | Level 5 TEJ
Calcite | Level 5

Thank you LinusH! an "Observation" is a patient's unique ID. the data was transposed from long to wide, not wide to long. so, in it's orginal form, the data looked like..

 

ID                             LOINCS_1               RESULT_DATE

123                           987                         01-SEPT-XXXX

123                           945                         07-DEC-XXXX

134                           945                         21-OCT-XXXX

186                           756                         23-JAN-XXXX

186                           654                         12-JAN-XXXX

234                           654                         14-FEB-XXXX

234                           987                         27-MAR-XXXX

 

data in this form was transposed to a wider format as in the earlier attachment.

 

Hope that clarifies..?

Ksharp
Super User

The simplest way is using proc means + idgroup option.

data have;
input (ID                             LOINCS_1               RESULT_DATE) (: $40.);
cards;
123                           987                         01-SEPT-XXXX
123                           945                         07-DEC-XXXX
134                           945                         21-OCT-XXXX
186                           756                         23-JAN-XXXX
186                           654                         12-JAN-XXXX
234                           654                         14-FEB-XXXX
234                           987                         27-MAR-XXXX
;
run;

proc sql;
 select max(n) into : n
  from (select count(*) as n from have group by id)
quit;
proc summary data=have;
by id;
output out=want idgroup(out[&n] (LOINCS_1   RESULT_DATE)=);
quit;

If you have big data, you could check MERGE skill:

 

http://support.sas.com/resources/papers/proceedings15/2785-2015.pdf

Reeza
Super User

What was the rational for the original transformation?

I think processing your rules in a long format would be easier?

Sort by ID, test and date, thenu se the dif or lag function to get the time from last test to determine if it's a duplicate record.

 

 

TEJ
Calcite | Level 5 TEJ
Calcite | Level 5

For the analysis that I am doing, it was important for me to see records per ID (per patient), and I thought wider format would be useful. Specifically, I had to look at statistics like # of cases- "with ONLY test A and B", "Only test B", Test B and C", "Only B and C" etc., and it sounded easier for me to pull from wider table than long one.

If it works, I don't have problems going back to the long format (again.  But something i probably should have clarified before- the labs are NOT true duplicates. I want to treat like duplicates if it is the same test performed within certain duration (like 3 months).

 

how would you duplicate using the long format? any quick hints/sample codes? dif/lag fucntions?

 

thank you

Reeza
Super User
  1. Sort Long data by ID/Test/Date
  2. Within each by group test for duplicates.
/*1*/
proc sort data=have;
by ID test date;
run;

/*2*/
data want;
set have;
by id test;
retain count 0;

prev_test=lag(date);
if first.test then do;
prev_test=.;
count=0;
end;

if date-prev_date > 90 then count+1;

run;

Duplicates will be grouped and you can take the first per group. It's untested but should get you started.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 6 replies
  • 1669 views
  • 0 likes
  • 4 in conversation