Re: Calculating differences in dates multiple observations

Wolverine · Posted 04-24-2018 02:20 PM

I'm testing a case definition that's looking for patients who have at least 3 separate healthcare visits during the year, and all 3 visits have to be at least 30 days apart. Each visit is stored in a separate record. So a person who had 6 visits during the year would have 6 records in the file.

I tested a similar definition, but it only required 2 visits. I accomplished that by doing a PROC SQL join that merged the data file with itself. This created a many-to-many merge. Then it was just a matter of finding all records where the difference in dates was 30 days or more, flagging them, and selecting all people who had at least one flag. But when I tried this approach with 3 dates, the file created by the many-to-many-to-many merge was too big and took too long to create to be practical.

I should also note that I need to calculate true positives, false positives, true negatives, false negatives, sensitivity, and specificity. So I can't simply drop cases that don't fit the criteria, because those cases will need to be classified later. Instead, all cases have to be flagged to indicate if they qualify for the definition or not.

Here is the code I used for the 2-visit version:

/*Create A and B versions of claim file, use PROC SQL to create a many-to-many merge.  This will generate 
a file with all possible combinations of claims dates.*/

PROC SQL;
	Create Table temp.SCA_ICD10_def&defnumb._mm 
	as Select a.encrypted_id, a.Med_def1516171920_count,
		a.DxSCA1520_max as DxSCA1520_max_A, 
		b.DxSCA1520_max as DxSCA1520_max_B, 
		a.CLAIM_SERVICE_FROM_DATE as CLAIM_DATE_A, 
		b.CLAIM_SERVICE_FROM_DATE as CLAIM_DATE_B
	From temp.SCA_ICD10_def&defnumb._line_rec as a JOIN temp.SCA_ICD10_def&defnumb._line_rec as b
	ON a.encrypted_id = b.encrypted_id
	order by a.encrypted_id, a.CLAIM_SERVICE_FROM_DATE;
QUIT;


/****Apply Case Definition &defnumb. - At least 2 visits at least 30 days apart****/
DATA temp.SCA_ICD10_def&defnumb._dates; SET temp.SCA_ICD10_def&defnumb._mm;

/*Convert claim dates to raw numbers so number of days will calculate correctly even across months*/
CLAIM_DATE_A_n=input(put(CLAIM_DATE_A,yymmddn8.), 8.);
CLAIM_DATE_B_n=input(put(CLAIM_DATE_B,yymmddn8.), 8.);

CLAIM_DATE_A_n2=input(put(CLAIM_DATE_A_n,$8.),yymmdd8.);
CLAIM_DATE_B_n2=input(put(CLAIM_DATE_B_n,$8.),yymmdd8.);

CLAIM_DATE_diff = CLAIM_DATE_A_n2 - CLAIM_DATE_B_n2;

IF (CLAIM_DATE_diff => 30 AND Med_def1516171920_count => 2 AND DxSCA1520_max_A = 1 AND DxSCA1520_max_B = 1) 
	THEN def&defnumb._Med_flag = 1;	ELSE def&defnumb._Med_flag = 0;
RUN;  

	PROC SQL;
		Create table temp.SCA_ICD10_def&defnumb._uniqID
			as Select encrypted_id,
			max(def&defnumb._Med_flag) as def&defnumb._Med_case
		From temp.SCA_ICD10_def&defnumb._dates
		Group by encrypted_id;
	QUIT;

ballardw · Posted 04-24-2018 04:14 PM

A brief example input data set, preferably in the form of a data step so we can generate data and test code against it, and the desired output for that code would be enlightening.

Clarification is needed because of the possible interpretations of "at least 30 days apart". If I have 6 visits and they are all at a 20 days interval would the 1st, 3rd and 5th records match your requirement (the would each be 40 days apart) or do you require consideration of sequential visits? If date1 and date3 are 30 days apart and date1 and date4 are 30 days apart but date3 is not 30 days from date4 does the set date1, date3 and date4 meet your requirements? If they don't what happens if there is not a third date?

You have a moderate number of other boundary values to consider as well I suspect.

Since ORDER is a big portion of this request it may well be that SQL is not the best solution.

I might start with something like this to get a set of dates 30 or more days apart for each ID, but reducing to your actual dates would require some answers to the above questions.

proc sql;
   create table days30plus as
   select distinct a.identification, a.date as date1, b.date as date2
   from have as a
        left join
        have as b
        on a.identification=b.identification
   where b.date - a.date ge 30
   order by a.identification, a.date
 ;
quit;

I included the Distinct because it seems that healthcare visit data often has multiple visits for single dates.

lizzdream12 · Posted 04-28-2019 12:30 AM

Hi Ballardw,

I have a question that hope you can help with.

Context: The data contains four variables: id, A1C value, date (in months) of A1C measurement, and A1C count. One id has >=3 A1C values. I sorted the data by id and descending date so that the most recent A1C comes up first. I am trying to keep the most recent 3 A1C values that are at least measured 3 months apart for a specific id. So for example, if a patient has 4 A1C values, but the 3rd one is measured within 3 months from the previous A1C (The 2nd one), then it should be deleted or kept separately. and the 4th A1C's gap from the 2nd one need to be checked and see if it is >=3 months. If yes, then the 1st, 2nd, and 4th A1C value will be kept for that id for further calculation.

Each id can have varied counts of A1C values. How can I remove the observations not meeting the criteria and calculate the gap for the next A1C date?

Thanks in advance!

art297 · Posted 04-24-2018 06:09 PM

I agree with @ballardw, but also question why you have multiple procs when it could all be done in one proc sql call.

I, too, would prefer knowing your rules, but here is a simple example:

%let defnumb=3;
libname temp '/folders/myfolders/temp';

data temp.SCA_ICD10_def&defnumb._line_rec;
  input encrypted_id Med_def1516171920_count CLAIM_SERVICE_FROM_DATE $;
  cards;
1 1 20180101
1 2 20180120
1 3 20180128
1 4 20180228
1 5 20180315
1 6 20180330
;
proc sql;
   create table days30plus as
     select distinct a.encrypted_id, 
         a.Med_def1516171920_count as count1,
         b.Med_def1516171920_count as count2,
         c.Med_def1516171920_count as count3,
         input(put(a.CLAIM_SERVICE_FROM_DATE,$8.),yymmdd8.) as date1,
         input(put(b.CLAIM_SERVICE_FROM_DATE,$8.),yymmdd8.) as date2,
         input(put(c.CLAIM_SERVICE_FROM_DATE,$8.),yymmdd8.) as date3
       from temp.SCA_ICD10_def&defnumb._line_rec a
         left join temp.SCA_ICD10_def&defnumb._line_rec b
           on a.encrypted_id=b.encrypted_id
            left join temp.SCA_ICD10_def&defnumb._line_rec c
              on b.encrypted_id=c.encrypted_id
                where calculated date2 - calculated date1 ge 30 and
                     calculated date3 - calculated date2 ge 30
                  order by a.encrypted_id, date1
 ;
quit;

Art, CEO, AnalystFinder.com

Wolverine · Posted 04-25-2018 08:59 AM

I spoke to the researcher, and the current thinking is that it doesn't matter if there are intermediate visits or not. As long as there are visits that are more than 30 days apart, it is a qualifying case. The explanation is that we just want to make sure that the patient is getting care at multiple times throughout the year, rather just having a few visits in a short period of time.

Unfortunately, the output file is still much too large with either of these approaches. It uses all 2.5TB of empty disc space and then SAS gives me an "out of resources" error. As I mentioned before, a WHERE clause that discards records isn't advisable because I need to retain those cases to classify them as true positive, false positive, etc later on.

What if we flattened the file? I've never done that in SAS, but it seems like it might work. The problem is that the output file could potentially have 365 date columns, if there was a person in the data who had a visit every day. In reality, I think the max is more like 100 visits in a year. I don't know how long it would take to produce that file, but at least it should be much smaller than the current approach.

Of course, then the problem becomes calculating all the potential differences between all those date variables. Would that be done with some sort of do-loop or macro?

art297 · Posted 04-25-2018 09:45 AM

How big is the file you are trying analyze? Also, is it already in ID order and, if not, do you have the disk space needed to sort it?

It would definitely make it easier for people to suggest a solution if you provide an example dataset.

Art, CEO, AnalystFinder.com

Wolverine · Posted 04-25-2018 10:49 AM

It was too large to post, so I converted it to a text file. I also had to modify the file a bit to make sure it doesn't have any identifying information.

art297 · Posted 04-25-2018 02:52 PM

Since you're running out of space using the sql approach, if your data are already grouped by encrypted_id then you might be able to use something like the following code. It only outputs one record per encrypted_id, with each record only containing the encrypted_id and the flag variable. I set the temporary array to maintain up to 360 dates as one of your encrypted_ids had 357 records:

data want (keep=encrypted_id flag);
  set have;
  by encrypted_id notsorted;
  array x[360] _temporary_;
  retain x;
  if first.encrypted_id then counter=1;
  else counter+1;
  x(counter)=CLAIM_DATE_A_n2;
  if last.encrypted_id then do;
    flag=0;
    if counter gt 2 then do;
      do i = 1 to counter-2;
        do j = (i + 1) to counter-1 ;
          do k = (j + 1) to counter ;
            if x(j)-x(i) ge 30 and x(k)-x(j) ge 30 then do;
              flag=1;
              leave;
            end;
          end;
          if flag then leave;
        end;
        if flag then leave;
      end;
    end;
    output;
 end;
run;

Art, CEO, AnalystFinder.com

Wolverine · Posted 04-30-2018 09:07 AM

I was able to use the Proc SQL approach by using where clauses that reduce the number of redundant date matches while still retaining and flagging all cases. And it still runs reasonably fast. Here is the final code:

proc sql;
   create table temp.SCA_ICD10_def&defnumb._mm as
     select distinct a.encrypted_id, a.Med_defOPQRS_count,
         a.DxSCAOS_max_A,
         b.DxSCAOS_max_B,
         c.DxSCAOS_max_C,
         a.CLAIM_DATE_A_n2,
         b.CLAIM_DATE_B_n2,
         c.CLAIM_DATE_C_n2
       from temp.SCA_ICD10_def&defnumb._line_rec a
         left join temp.SCA_ICD10_def&defnumb._line_rec b
           on a.encrypted_id=b.encrypted_id
            left join temp.SCA_ICD10_def&defnumb._line_rec c
              on b.encrypted_id=c.encrypted_id
                where c.CLAIM_DATE_C_n2 >= b.CLAIM_DATE_B_n2 and
                     b.CLAIM_DATE_B_n2 >= a.CLAIM_DATE_A_n2
                  order by a.encrypted_id, CLAIM_DATE_A_n2;
quit;

Classroom Training Available!