DATA Step, Macro, Functions and more

Select observations with evenly distributed dates across a long time interval

Reply
New Contributor
Posts: 2

Select observations with evenly distributed dates across a long time interval

Hi everyone, 

I'm dealing with a longitudinal dataset with about 10-year follow-up. 

Each subject has a unique ID, dates of clinic visits, test results. Each subject had multiple clinic visits (different dates).The data looks as:

ID      Date            Result

001   Jan01/10      Pos

001   Mar28/12     Pos

001   Jan30/13      Neg

001   May04/12     Pos

001   Jun20/10      Neg

002   Jan01/10      Neg

002   Mar28/12     Neg

002   Feb30/13      Neg

002   May04/12     Neg

002   Jun20/10      Neg

003   Oct01/14      Neg

003   Mar28/12      Pos

003   Jan30/14      Neg

004   May04/12     Pos

005   Jun29/16      Neg

.....

I am trying to select observations across their entire follow-up time with evenly distributed length of time. i.e. if a subject has a 5-year follow-up with multiple visits, I'd like to select the most evenly distributed observations across the 5 years. Let's say I need select 3 time points, ideally it'd be the the first date, the last date (first and last dates are required), and the 2.5(half of 5-year) date (or any dates close to the 2.5 point). I'm able to filter the first and last dates by using if first. / if last. statements, but having problems with the time points in between. 

Thanks all ahead 

 

Super Contributor
Posts: 478

Re: Select observations with evenly distributed dates across a long time interval

Can you be little specific, provide the sample data you have and the result you want

Thanks,
Suryakiran
Super User
Posts: 6,543

Re: Select observations with evenly distributed dates across a long time interval

[ Edited ]

So if a person had 10 visits, is it acceptable to take the 1st, 5th, and 10th, or do we need to measure the interval between the first and last visits to find a midpoint based on the dates?

 

If the 1st, 5th, and 10th will do, here's a way.  It assumes your data is in order (leaving that part to you, since the dates look like they might contain character strings):

 

proc freq data=have;

tables id / noprint out=visit_counts;

run;

 

data want;

merge have visit_counts (keep=id count);

by id;

if first.patient then do;

   visit_num=1;

   output;

end;

else do;

   visit_num + 1;

   if visit_num = ceil(count/2) then output;

end;

else if last.patient then output;

drop visit_num count;

run;

New Contributor
Posts: 2

Re: Select observations with evenly distributed dates across a long time interval

Posted in reply to Astounding

We need to estimate the time interval. Some subjects had tons of visits within the first year, then less over the follow-up. So I need to consider the interval to find the evenly distributed visits in terms of "time". 

Super User
Posts: 6,543

Re: Select observations with evenly distributed dates across a long time interval

[ Edited ]

For time intervals, you need to fix the data first.  These dates are not in a form that SAS can treat as dates.  That's a topic you will need to read up on ... the solution below will handle that but won't become clear until you understand how SAS handles dates.

 

data temp;

length date $ 9;

set have;

date = cats( substr(date,4,2), substr(date,1,3), '20', substr(date,7,2));

visit_sasdate = input(date, date9.);

run;

proc sort data=temp;

   by id visit_sasdate;

run;

proc summary data=temp;

   by id;

   var visit_sasdate;

   output out=daterange (keep=id mindate maxdate) min=mindate max=maxdate;

run;

data want;

merge temp daterange;

by id;

if first.id then do;

   midpoint_flag=0;

   output;

end;

else if visit_sasdate >= (mindate + maxdate) / 2 then do;

   if midpoint_flag=0 then output;

   midpoint_flag + 1;

end;

else if last.id then output;

drop midpoint_flag mindate maxdate;

run;

 

It's untested code, so might need a tweak here or there.

 

Technically, this always selects the "midpoint visit" as the earliest visit from the second half of the date range.  It is conceivable that the last visit in the first half of the date range might be closer to the actual midpoint.  But it seemed like this would be close enough.

 

Super User
Posts: 13,084

Re: Select observations with evenly distributed dates across a long time interval

Posted in reply to Astounding

I would be tempted to see how well the median value of the date fits in here:

proc summary data=temp qntldef=2;
   by id;
   var visit_sasdate;
   output out=daterange (keep=id mindate maxdate mediandate) min=mindate max=maxdate median=mediandate;

run;
Trusted Advisor
Posts: 1,294

Re: Select observations with evenly distributed dates across a long time interval

You want the observed dates that most closely match NT target dates from NO observed dates, where

  1. target date T=1 is the first observed date (OBS_DAT1) for a given id
  2. target date T=NT is the last observed date (OBS_DATno) for a given id
  3. target date T=2 through NT-1 is the observed date closest to the pseudo target of equally-spaced dates:
        OBS_DAT1 + (OBS_DATno-OBS_DAT1)*(t-1)/(nt-1)

This can be addressed by a double DO loop separated by an internal step.

  1. Loop 1 through all the observed dates, and record them in an array
  2. With results of step 1
    1. Calculate initial target dates (1 through NT)
    2. Revise the internal target dates 2 (initialized via #3 above) through NT-1 to the closest observed dates
    3. Store these revised dates in an array
  3. Loop 2: re-read the records of the given ID, and keep only those with dates matching the revised target dates.

 

The code below (untested) follows the logic above:

 

%let max_obs=30;
%let n_targets=4;

data want (drop=o t);
  array obs_dats{&max_obs} _temporary_;
  array targ_dats{&n_targets} _temporary_;

  ** Populate the obs_dats array **;
  do o=1 by 1 until (last.id);
    set have;
    by id;
    obs_dats{o}=date;
  end;

  ** Populate targ_dats array **;
  targ_dats{1}=obs_dats{1};
  targ_dats{&n_targets}=obs_dats{o};
  do t=2 to &n_targets-1;
    do o=1 by 1 until (obs_dats{o}<=targ_dats{t}<=obs_dats{o+1});
    end;
    if targ_dats{t} >= mean(obs_dats{o},obs_dats{o+1}) then targ_dats{t}=obs_dats{o+1};
    else targ_dats{t}=obs_dats{o};
  end;

  ** Now re-read and keep dates matching target dates **;
  do until (last.id);
    set have;
    by id;
    if whichn(date,of targ_dats{*}) then output;
  end;
run;

 

  • Set macrovar MAX_OBS to the maximum expected number of dates for any id
  • Set macrovar N_TARGETS to the number of equally-spaced observations wanted.
Esteemed Advisor
Posts: 5,408

Re: Select observations with evenly distributed dates across a long time interval

I divided the date range into bins and required that the best date (closest to the center of the bin) be Inside the bin. It should work for any number of bins > 1.

 

data have;
input ID      DateStr $  Result $;
/* Grrrr! */
date = input(cats(substr(dateStr,4,2),substr(dateStr,1,3),substr(dateStr,7)),date7.);
format date yymmdd10.;
drop dateStr;
datalines;
001   Jan01/10      Pos
001   Mar28/12     Pos
001   Jan30/13      Neg
001   May04/12     Pos
001   Jun20/10      Neg
002   Jan01/10      Neg
002   Mar28/12     Neg
002   Feb28/13      Neg
002   May04/12     Neg
002   Jun20/10      Neg
003   Oct01/14      Neg
003   Mar28/12      Pos
003   Jan30/14      Neg
004   May04/12     Pos
005   Jun29/16      Neg
006   Jan01/10      Pos
006   Mar28/12     Pos
;

%let nPoints=3;

proc means data=have noprint;
by id;
var date;
output out=ranges min= max= range= / autoname;
run;

data bins;
set ranges;
if date_range = 0 then do;
    binStart = date_min;
    binEnd = date_min;
    binMid = date_min;
    output;
    end;
else do;
    width = date_range / (&nPoints-1);
    do binStart = date_min - 0.5*width to date_max - 0.5*width by width;
        binEnd = binStart + width;
        binMid = ( binStart + binEnd ) / 2;
        output;
        end;
    end;
format bin: yymmdd10.;
keep id bin: ;
run;

proc sql;
create table cases as
select unique
    b.id,
    round(b.binMid) as midDate format=yymmdd10.,
    a.date,
    a.result
from have as a right join
    bins as b on a.id=b.id and a.date between b.binStart and b.binEnd
group by b.id, b.binMid
having abs(a.date-b.binMid) = min(abs(a.date-b.binMid));
select * from cases;
quit;
                           ID     midDate        date  Result
                     ------------------------------------------
                            1  2010-01-01  2010-01-01  Pos
                            1  2011-07-18  2012-03-28  Pos
                            1  2013-01-30  2013-01-30  Neg
                            2  2010-01-01  2010-01-01  Neg
                            2  2011-08-01  2012-03-28  Neg
                            2  2013-02-28  2013-02-28  Neg
                            3  2012-03-28  2012-03-28  Pos
                            3  2013-06-30  2014-01-30  Neg
                            3  2014-10-01  2014-10-01  Neg
                            4  2012-05-04  2012-05-04  Pos
                            5  2016-06-29  2016-06-29  Neg
                            6  2010-01-01  2010-01-01  Pos
                            6  2011-02-14           .
                            6  2012-03-28  2012-03-28  Pos
PG
Ask a Question
Discussion stats
  • 7 replies
  • 103 views
  • 0 likes
  • 6 in conversation