Re: Select observations with evenly distributed dates across a long ti...

cindyyy · Posted 04-03-2018 01:59 PM

Hi everyone,

I'm dealing with a longitudinal dataset with about 10-year follow-up.

Each subject has a unique ID, dates of clinic visits, test results. Each subject had multiple clinic visits (different dates).The data looks as:

ID Date Result

001 Jan01/10 Pos

001 Mar28/12 Pos

001 Jan30/13 Neg

001 May04/12 Pos

001 Jun20/10 Neg

002 Jan01/10 Neg

002 Mar28/12 Neg

002 Feb30/13 Neg

002 May04/12 Neg

002 Jun20/10 Neg

003 Oct01/14 Neg

003 Mar28/12 Pos

003 Jan30/14 Neg

004 May04/12 Pos

005 Jun29/16 Neg

.....

I am trying to select observations across their entire follow-up time with evenly distributed length of time. i.e. if a subject has a 5-year follow-up with multiple visits, I'd like to select the most evenly distributed observations across the 5 years. Let's say I need select 3 time points, ideally it'd be the the first date, the last date (first and last dates are required), and the 2.5(half of 5-year) date (or any dates close to the 2.5 point). I'm able to filter the first and last dates by using if first. / if last. statements, but having problems with the time points in between.

Thanks all ahead

SuryaKiran · Posted 04-03-2018 02:16 PM

Can you be little specific, provide the sample data you have and the result you want

Thanks,
Suryakiran

Astounding · Posted 04-03-2018 02:48 PM

So if a person had 10 visits, is it acceptable to take the 1st, 5th, and 10th, or do we need to measure the interval between the first and last visits to find a midpoint based on the dates?

If the 1st, 5th, and 10th will do, here's a way. It assumes your data is in order (leaving that part to you, since the dates look like they might contain character strings):

proc freq data=have;

tables id / noprint out=visit_counts;

run;

data want;

merge have visit_counts (keep=id count);

by id;

if first.patient then do;

visit_num=1;

output;

end;

else do;

visit_num + 1;

if visit_num = ceil(count/2) then output;

end;

else if last.patient then output;

drop visit_num count;

run;

cindyyy · Posted 04-03-2018 02:54 PM

We need to estimate the time interval. Some subjects had tons of visits within the first year, then less over the follow-up. So I need to consider the interval to find the evenly distributed visits in terms of "time".

Astounding · Posted 04-03-2018 03:07 PM

For time intervals, you need to fix the data first. These dates are not in a form that SAS can treat as dates. That's a topic you will need to read up on ... the solution below will handle that but won't become clear until you understand how SAS handles dates.

data temp;

length date $ 9;

set have;

date = cats( substr(date,4,2), substr(date,1,3), '20', substr(date,7,2));

visit_sasdate = input(date, date9.);

run;

proc sort data=temp;

by id visit_sasdate;

run;

proc summary data=temp;

by id;

var visit_sasdate;

output out=daterange (keep=id mindate maxdate) min=mindate max=maxdate;

run;

data want;

merge temp daterange;

by id;

if first.id then do;

midpoint_flag=0;

output;

end;

else if visit_sasdate >= (mindate + maxdate) / 2 then do;

if midpoint_flag=0 then output;

midpoint_flag + 1;

end;

else if last.id then output;

drop midpoint_flag mindate maxdate;

run;

It's untested code, so might need a tweak here or there.

Technically, this always selects the "midpoint visit" as the earliest visit from the second half of the date range. It is conceivable that the last visit in the first half of the date range might be closer to the actual midpoint. But it seemed like this would be close enough.

ballardw · Posted 04-03-2018 04:19 PM

I would be tempted to see how well the median value of the date fits in here:

proc summary data=temp qntldef=2;
   by id;
   var visit_sasdate;
   output out=daterange (keep=id mindate maxdate mediandate) min=mindate max=maxdate median=mediandate;

run;

mkeintz · Posted 04-03-2018 05:03 PM

You want the observed dates that most closely match NT target dates from NO observed dates, where

target date T=1 is the first observed date (OBS_DAT1) for a given id
target date T=NT is the last observed date (OBS_DATno) for a given id
target date T=2 through NT-1 is the observed date closest to the pseudo target of equally-spaced dates:
OBS_DAT1 + (OBS_DATno-OBS_DAT1)*(t-1)/(nt-1)

This can be addressed by a double DO loop separated by an internal step.

Loop 1 through all the observed dates, and record them in an array
With results of step 1
1. Calculate initial target dates (1 through NT)
2. Revise the internal target dates 2 (initialized via #3 above) through NT-1 to the closest observed dates
3. Store these revised dates in an array
Loop 2: re-read the records of the given ID, and keep only those with dates matching the revised target dates.

The code below (untested) follows the logic above:

%let max_obs=30;
%let n_targets=4;

data want (drop=o t);
  array obs_dats{&max_obs} _temporary_;
  array targ_dats{&n_targets} _temporary_;

  ** Populate the obs_dats array **;
  do o=1 by 1 until (last.id);
    set have;
    by id;
    obs_dats{o}=date;
  end;

  ** Populate targ_dats array **;
  targ_dats{1}=obs_dats{1};
  targ_dats{&n_targets}=obs_dats{o};
  do t=2 to &n_targets-1;
    do o=1 by 1 until (obs_dats{o}<=targ_dats{t}<=obs_dats{o+1});
    end;
    if targ_dats{t} >= mean(obs_dats{o},obs_dats{o+1}) then targ_dats{t}=obs_dats{o+1};
    else targ_dats{t}=obs_dats{o};
  end;

  ** Now re-read and keep dates matching target dates **;
  do until (last.id);
    set have;
    by id;
    if whichn(date,of targ_dats{*}) then output;
  end;
run;

Set macrovar MAX_OBS to the maximum expected number of dates for any id
Set macrovar N_TARGETS to the number of equally-spaced observations wanted.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

PGStats · Posted 04-04-2018 12:40 AM

I divided the date range into bins and required that the best date (closest to the center of the bin) be Inside the bin. It should work for any number of bins > 1.

data have;
input ID      DateStr $  Result $;
/* Grrrr! */
date = input(cats(substr(dateStr,4,2),substr(dateStr,1,3),substr(dateStr,7)),date7.);
format date yymmdd10.;
drop dateStr;
datalines;
001   Jan01/10      Pos
001   Mar28/12     Pos
001   Jan30/13      Neg
001   May04/12     Pos
001   Jun20/10      Neg
002   Jan01/10      Neg
002   Mar28/12     Neg
002   Feb28/13      Neg
002   May04/12     Neg
002   Jun20/10      Neg
003   Oct01/14      Neg
003   Mar28/12      Pos
003   Jan30/14      Neg
004   May04/12     Pos
005   Jun29/16      Neg
006   Jan01/10      Pos
006   Mar28/12     Pos
;

%let nPoints=3;

proc means data=have noprint;
by id;
var date;
output out=ranges min= max= range= / autoname;
run;

data bins;
set ranges;
if date_range = 0 then do;
    binStart = date_min;
    binEnd = date_min;
    binMid = date_min;
    output;
    end;
else do;
    width = date_range / (&nPoints-1);
    do binStart = date_min - 0.5*width to date_max - 0.5*width by width;
        binEnd = binStart + width;
        binMid = ( binStart + binEnd ) / 2;
        output;
        end;
    end;
format bin: yymmdd10.;
keep id bin: ;
run;

proc sql;
create table cases as
select unique
    b.id,
    round(b.binMid) as midDate format=yymmdd10.,
    a.date,
    a.result
from have as a right join
    bins as b on a.id=b.id and a.date between b.binStart and b.binEnd
group by b.id, b.binMid
having abs(a.date-b.binMid) = min(abs(a.date-b.binMid));
select * from cases;
quit;

                           ID     midDate        date  Result
                     ------------------------------------------
                            1  2010-01-01  2010-01-01  Pos
                            1  2011-07-18  2012-03-28  Pos
                            1  2013-01-30  2013-01-30  Neg
                            2  2010-01-01  2010-01-01  Neg
                            2  2011-08-01  2012-03-28  Neg
                            2  2013-02-28  2013-02-28  Neg
                            3  2012-03-28  2012-03-28  Pos
                            3  2013-06-30  2014-01-30  Neg
                            3  2014-10-01  2014-10-01  Neg
                            4  2012-05-04  2012-05-04  Pos
                            5  2016-06-29  2016-06-29  Neg
                            6  2010-01-01  2010-01-01  Pos
                            6  2011-02-14           .
                            6  2012-03-28  2012-03-28  Pos

PG

Select observations with evenly distributed dates across a long time interval

Re: Select observations with evenly distributed dates across a long time interval

Re: Select observations with evenly distributed dates across a long time interval

Re: Select observations with evenly distributed dates across a long time interval

Re: Select observations with evenly distributed dates across a long time interval

Re: Select observations with evenly distributed dates across a long time interval

Re: Select observations with evenly distributed dates across a long time interval

Re: Select observations with evenly distributed dates across a long time interval

Catch up on SAS Innovate 2026

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away