DATA Step, Macro, Functions and more

Data cleaning based on rules: time period duration and time spans in between tied to certain dates

Reply
Super Contributor
Posts: 331

Data cleaning based on rules: time period duration and time spans in between tied to certain dates

[ Edited ]

I'm working with insurance claim data where patients treatment information can be affected by insurance enrollments. Losing insurance but somehow managing to get it back is common, needless to say. If i use data as is, I will have problems such as: patients that appear no treatment received were in actuality happened due to loss of insurance thus not in my claim data. Many other rules can be thought of but I'd like to stick to below few major rules and clean the data accordingly. Please help write SAS code reflecting these rules to a final clean data where my patients treatment information could be sought with minimal influence of insurance status. Perfect doesn't exist. But would like to weed out those observations looking obviously unreliable. @art297 @ChrisNZ @PGStats @Ksharp

 

Eligibility rules:

 

- Patients must have insurance => 4 months on a continuous basis where times in between enrollments are up to 30 days but not greater (John had 4 months of continuous insurance enrollment but lost it at the end of 4th month but got his insurance back after 29 days. And similar scenarios were repeated. I would keep John's observation in my data even if he lost his insurance (continued longer then 4 months many times) but always managed to get insurance back within less than a month every time he lose his insurance).

 

- Insurance start date must be tied to diagnosis date. Aka, patients must have insurance started within one month since date of diagnosis

 

- With the exception when:

Above conditions is not met because insurance was ended due to death

 

Data attached to this post.

where variables are: 

'begn:' date for beginning of insurance;

'endn:' date for end of insurance

alive: (1)-alive, (0)-death

date of death and date of diagnosis are self explanatory.

proc import datafile="...\insurance.csv" out=insurance
dbms=csv replace;
getnames=yes;
run;

My attempt which failed to tie the insurance dates to 'date of diagnosis' and 'date of death' and whether time in between two insurances greater or less than a month.

 

/*whether patients insurance lasted only a month or less*/
data insurance1; set insurance; 
if sum(end1-beg1)<31 and sum(end2-beg2)<31 and sum(end3-beg3)<31 and sum(end4-beg4)<31 and sum(end5-beg5)<31
and sum(end6-beg6)<31 and sum(end7-beg7)<31 and sum(end8-beg8)<31 and sum(end9-beg9)<31 and sum(end10-beg10)<31
and sum(end11-beg11)<31 and sum(end12-beg12)<31 and sum(end13-beg13)<31 then remove=1;
if sum(end1-beg1)>=31 and sum(end2-beg2)>=31 and sum(end3-beg3)>=31 and sum(end4-beg4)>=31 and sum(end5-beg5)>=31
and sum(end6-beg6)>=31 and sum(end7-beg7)>=31 and sum(end8-beg8)>=31 and sum(end9-beg9)>=31 and sum(end10-beg10)>=31
and sum(end11-beg11)>=31 and sum(end12-beg12)>=31 and sum(end13-beg13)>=31 then remove=0;

/*whether patients insurance continued => 4 months**/
if sum(end1-beg1)<124 or sum(end2-beg2)<124 or sum(end3-beg3)<124 or sum(end4-beg4)<124 or sum(end5-beg5)<124
or sum(end6-beg6)<124 or sum(end7-beg7)<124 or sum(end8-beg8)<124 or sum(end9-beg9)<124 or sum(end10-beg10)<124
or sum(end11-beg11)<124 or sum(end12-beg12)<124 or sum(end13-beg13)<124 then remove=1;
if sum(end1-beg1)>=124 or sum(end2-beg2)>=124 or sum(end3-beg3)>=124 or sum(end4-beg4)>=124 or sum(end5-beg5)>=124
or sum(end6-beg6)>=124 or sum(end7-beg7)>=124 or sum(end8-beg8)>=124 or sum(end9-beg9)>=124 or sum(end10-beg10)>=124
or sum(end11-beg11)>=124 or sum(end12-beg12)>=124 or sum(end13-beg13)>=124 then remove=0;
if remove=0 then output; 

 

PROC Star
Posts: 2,311

Re: Data cleaning based on rules: time period duration and time spans in between tied to certain dat

What do you want to happen if sum(end1-beg1)<31 and sum(end2-beg2)>=31 ? You don't test for this scenario at the moment.
Super Contributor
Posts: 331

Re: Data cleaning based on rules: time period duration and time spans in between tied to certain dat

Good catch indeed. I will keep the case but will check whether time period between insurance is not longer than a month. Can you please help me have my rules reflected in the code? I'll appreciate that so much.
Super User
Posts: 13,300

Re: Data cleaning based on rules: time period duration and time spans in between tied to certain dat

[ Edited ]

I think there may be some concern about the entire logic.

 

For instance here:

if sum(end1-beg1)<124 or sum(end2-beg2)<124 or sum(end3-beg3)<124 or sum(end4-beg4)<124 or sum(end5-beg5)<124
or sum(end6-beg6)<124 or sum(end7-beg7)<124 or sum(end8-beg8)<124 or sum(end9-beg9)<124 or sum(end10-beg10)<124
or sum(end11-beg11)<124 or sum(end12-beg12)<124 or sum(end13-beg13)<124 then remove=1;
if sum(end1-beg1)>=124 or sum(end2-beg2)>=124 or sum(end3-beg3)>=124 or sum(end4-beg4)>=124 or sum(end5-beg5)>=124
or sum(end6-beg6)>=124 or sum(end7-beg7)>=124 or sum(end8-beg8)>=124 or sum(end9-beg9)>=124 or sum(end10-beg10)>=124
or sum(end11-beg11)>=124 or sum(end12-beg12)>=124 or sum(end13-beg13)>=124 then remove=0;
if remove=0 then output; 

If sum(end1-beg1)<124  and sum(end11-beg11)>=124 are both true then what should remove actually be?

 

 

And what the heck is SUM doing in there at all? It is only adding characters. If the values are dates then If (end1-beg1)<124 is sufficient.

 

It is also time to learn to use arrays.

 

Also you should be aware that every record where sum(end1-beg1)<31 is also less than 124 and every record sum(end1-beg1)>=124 is also >=31. So assignment to a single "remove" variable is very likely not behaving as intended. So as a minimum you may want to consider a 31 day variable and a 124 day variable separately.

 

A better description of what the inclusion exclusion rule is may help. I'm not going to look a the entire data example to try to figure out rules.

Ask a Question
Discussion stats
  • 3 replies
  • 175 views
  • 1 like
  • 3 in conversation