Re: Limiting duplicates in Proc SQL

chuakp · Posted 03-12-2016 10:36 PM

I'm working with a health insurance claims database and am trying to identify instances in which there was a drug claim for an antibiotic that occurred either on the same day as a claim with a diagnosis code of bronchitis or within the 3 days after of a claim with a diagnsois code of bronchitis. I am working with two tables in PROC SQL, the table antibiotics (which has all drug claims for antibiotics for people in my sample) and the table bronchitis, which has all claims with a diagnosis of bronchitis for people in my sample. If there was more than one claim with a diagnosis of bronchitis on the same day for a particular person, I consider that to be one instance of a bronchitis diagnosis, and I only want to allow a maximum of one antibiotic per instance of a bronchitis diagnosis.

Here is my code:

proc sql;
create table antibiotics_bronchitis as
select L.claimid, L.servicedate, L.drugname, R.claimid as claimid_bronchitis, R.servicedate as svcdate_bronchitis,
from work.antibiotics as L
inner join
work.bronchitis as R
on L.enrolid=R.enrolid
where L.servicedate-3 <= servicedate_bronchitis <= L.servicedate
order by L.claimid;
quit;

Using this code, I get many observations with duplicate claimid when a particular drug claim (say one that occurred January 4, 2015) is matched with more than one claim with a diagnosis of bronchitis with a service date of January 1, January 2, January 3, or January 4 (e.g.,this would happen if there were two claims with a diagnosis of bronchitis of January 1, one claim with a diagnosis of bronchitis on January 1 and one claim with a diagnosis of bronchitis on January 2, etc.). Using the "distinct" term in the select statement doesn't do anything because observations with duplicate claimid will have different claimid_bronchitis. I could just use proc sort NODUPKEY by claimid after this code, but I was wondering if there is an alternative method within SQL. Thanks.

Reeza · Posted 03-13-2016 12:13 AM

Untested, but what about having clause to equal the minimum difference? You probably will also need a group by.

L.service_date - r.bronchitis_date as dif,

...

Group by a.claimid

...

having dif=min(dif)

chuakp · Posted 03-13-2016 01:31 PM

Thanks for the suggestion. This doesn't seem to achieve what I want to do. Perhaps this is not easy to achieve in PROC SQL and I just need to use an additional step. Are there are concerns with using

proc sort data = antibiotics_bronchitis nodupkey; by claimid; run;

This should eliminate any duplicate drug claims that happened to be associated with more than one claim with a diagnosis of bronchitis on the same day or in the previous three days.

Reeza · Posted 03-13-2016 03:29 PM

No issues, but sometimes you need a prior sort so that the single record you get is the one you want. Maybe you want the smallest or largest diff.

This isn't easy to do in SQL but it's possible. The other option is a subquery but I don't have data to test anything right now.

RW9 · Posted 03-14-2016 05:29 AM

Hi,

With any solution here, you will need to work out a rule as to what observation from the joining table should be kept. Its difficult to say, maybe post some test data - a few data elements of table as a datastep - and what the output should look like - and specify why a certain rwo should be pulled out, this is the key part here. Once you know hwich record you want to pull out then you should be able to simply sub-query the input dataset to filter for this record.

Limiting duplicates in Proc SQL