Re: de-duplicating

TEJ · Posted 03-14-2016 02:36 PM

I am working on a long form dataset with multiple lab tests (labeled LOINC1, LOINC 2…), result dates, etc. as shown in the snapshot.
LOINCS can be repeated for an observation, meaning, same tests might have been performed multiple times. A particular observation might have 5 repeated tests or all unique tests. Another one might have two different tests repeated 3 times each, etc.
If same test was performed within 3 months interval (for example), I want to de-duplicate the tests FOR THAT OBSERVATION. But if the result dates are 4 months apart for same tests, I want both to stay. Any suggestions to tackle this would be much appreciated!

LinusH · Posted 03-14-2016 02:57 PM

So the question areises: what is an observation? Can't see any clear primary key there?
Given that the data was stored normalized (transposed from wide to long) an with a key, the operation would be simpler.

Data never sleeps

TEJ · Posted 03-14-2016 03:09 PM

Thank you LinusH! an "Observation" is a patient's unique ID. the data was transposed from long to wide, not wide to long. so, in it's orginal form, the data looked like..

ID LOINCS_1 RESULT_DATE

123 987 01-SEPT-XXXX

123 945 07-DEC-XXXX

134 945 21-OCT-XXXX

186 756 23-JAN-XXXX

186 654 12-JAN-XXXX

234 654 14-FEB-XXXX

234 987 27-MAR-XXXX

data in this form was transposed to a wider format as in the earlier attachment.

Hope that clarifies..?

Ksharp · Posted 03-14-2016 09:11 PM

The simplest way is using proc means + idgroup option.

data have;
input (ID                             LOINCS_1               RESULT_DATE) (: $40.);
cards;
123                           987                         01-SEPT-XXXX
123                           945                         07-DEC-XXXX
134                           945                         21-OCT-XXXX
186                           756                         23-JAN-XXXX
186                           654                         12-JAN-XXXX
234                           654                         14-FEB-XXXX
234                           987                         27-MAR-XXXX
;
run;

proc sql;
 select max(n) into : n
  from (select count(*) as n from have group by id)
quit;
proc summary data=have;
by id;
output out=want idgroup(out[&n] (LOINCS_1   RESULT_DATE)=);
quit;

If you have big data, you could check MERGE skill:

http://support.sas.com/resources/papers/proceedings15/2785-2015.pdf

Reeza · Posted 03-14-2016 03:49 PM

What was the rational for the original transformation?

I think processing your rules in a long format would be easier?

Sort by ID, test and date, thenu se the dif or lag function to get the time from last test to determine if it's a duplicate record.

TEJ · Posted 03-14-2016 04:46 PM

For the analysis that I am doing, it was important for me to see records per ID (per patient), and I thought wider format would be useful. Specifically, I had to look at statistics like # of cases- "with ONLY test A and B", "Only test B", Test B and C", "Only B and C" etc., and it sounded easier for me to pull from wider table than long one.

If it works, I don't have problems going back to the long format (again. But something i probably should have clarified before- the labs are NOT true duplicates. I want to treat like duplicates if it is the same test performed within certain duration (like 3 months).

how would you duplicate using the long format? any quick hints/sample codes? dif/lag fucntions?

thank you

Reeza · Posted 03-14-2016 04:53 PM

Sort Long data by ID/Test/Date
Within each by group test for duplicates.

/*1*/
proc sort data=have;
by ID test date;
run;

/*2*/
data want;
set have;
by id test;
retain count 0;

prev_test=lag(date);
if first.test then do;
prev_test=.;
count=0;
end;

if date-prev_date > 90 then count+1;

run;

Duplicates will be grouped and you can take the first per group. It's untested but should get you started.

de-duplicating