Help using Base SAS procedures

de-duplicating

Reply
Occasional Contributor TEJ
Occasional Contributor
Posts: 5

de-duplicating


I am working on a long form dataset with multiple lab tests (labeled LOINC1, LOINC 2…), result dates, etc. as shown in the snapshot.
LOINCS can be repeated for an observation, meaning, same tests might have been performed multiple times. A particular observation might have 5 repeated tests or all unique tests. Another one might have two different tests repeated 3 times each, etc.
If same test was performed within 3 months interval (for example), I want to de-duplicate the tests FOR THAT OBSERVATION. But if the result dates are 4 months apart for same tests, I want both to stay. Any suggestions to tackle this would be much appreciated!

Super User
Posts: 5,426

Re: de-duplicating

So the question areises: what is an observation? Can't see any clear primary key there?
Given that the data was stored normalized (transposed from wide to long) an with a key, the operation would be simpler.
Data never sleeps
Occasional Contributor TEJ
Occasional Contributor
Posts: 5

Re: de-duplicating

Thank you LinusH! an "Observation" is a patient's unique ID. the data was transposed from long to wide, not wide to long. so, in it's orginal form, the data looked like..

 

ID                             LOINCS_1               RESULT_DATE

123                           987                         01-SEPT-XXXX

123                           945                         07-DEC-XXXX

134                           945                         21-OCT-XXXX

186                           756                         23-JAN-XXXX

186                           654                         12-JAN-XXXX

234                           654                         14-FEB-XXXX

234                           987                         27-MAR-XXXX

 

data in this form was transposed to a wider format as in the earlier attachment.

 

Hope that clarifies..?

Super User
Posts: 10,023

Re: de-duplicating

The simplest way is using proc means + idgroup option.

data have;
input (ID                             LOINCS_1               RESULT_DATE) (: $40.);
cards;
123                           987                         01-SEPT-XXXX
123                           945                         07-DEC-XXXX
134                           945                         21-OCT-XXXX
186                           756                         23-JAN-XXXX
186                           654                         12-JAN-XXXX
234                           654                         14-FEB-XXXX
234                           987                         27-MAR-XXXX
;
run;

proc sql;
 select max(n) into : n
  from (select count(*) as n from have group by id)
quit;
proc summary data=have;
by id;
output out=want idgroup(out[&n] (LOINCS_1   RESULT_DATE)=);
quit;

If you have big data, you could check MERGE skill:

 

http://support.sas.com/resources/papers/proceedings15/2785-2015.pdf

Super User
Posts: 19,772

Re: de-duplicating

What was the rational for the original transformation?

I think processing your rules in a long format would be easier?

Sort by ID, test and date, thenu se the dif or lag function to get the time from last test to determine if it's a duplicate record.

 

 

Occasional Contributor TEJ
Occasional Contributor
Posts: 5

Re: de-duplicating

For the analysis that I am doing, it was important for me to see records per ID (per patient), and I thought wider format would be useful. Specifically, I had to look at statistics like # of cases- "with ONLY test A and B", "Only test B", Test B and C", "Only B and C" etc., and it sounded easier for me to pull from wider table than long one.

If it works, I don't have problems going back to the long format (again.  But something i probably should have clarified before- the labs are NOT true duplicates. I want to treat like duplicates if it is the same test performed within certain duration (like 3 months).

 

how would you duplicate using the long format? any quick hints/sample codes? dif/lag fucntions?

 

thank you

Super User
Posts: 19,772

Re: de-duplicating

  1. Sort Long data by ID/Test/Date
  2. Within each by group test for duplicates.
/*1*/
proc sort data=have;
by ID test date;
run;

/*2*/
data want;
set have;
by id test;
retain count 0;

prev_test=lag(date);
if first.test then do;
prev_test=.;
count=0;
end;

if date-prev_date > 90 then count+1;

run;

Duplicates will be grouped and you can take the first per group. It's untested but should get you started.

Ask a Question
Discussion stats
  • 6 replies
  • 476 views
  • 0 likes
  • 4 in conversation