Hi everyone,
I'm dealing with a longitudinal dataset with about 10-year follow-up.
Each subject has a unique ID, dates of clinic visits, test results. Each subject had multiple clinic visits (different dates).The data looks as:
ID Date Result
001 Jan01/10 Pos
001 Mar28/12 Pos
001 Jan30/13 Neg
001 May04/12 Pos
001 Jun20/10 Neg
002 Jan01/10 Neg
002 Mar28/12 Neg
002 Feb30/13 Neg
002 May04/12 Neg
002 Jun20/10 Neg
003 Oct01/14 Neg
003 Mar28/12 Pos
003 Jan30/14 Neg
004 May04/12 Pos
005 Jun29/16 Neg
.....
I am trying to select observations across their entire follow-up time with evenly distributed length of time. i.e. if a subject has a 5-year follow-up with multiple visits, I'd like to select the most evenly distributed observations across the 5 years. Let's say I need select 3 time points, ideally it'd be the the first date, the last date (first and last dates are required), and the 2.5(half of 5-year) date (or any dates close to the 2.5 point). I'm able to filter the first and last dates by using if first. / if last. statements, but having problems with the time points in between.
Thanks all ahead
Can you be little specific, provide the sample data you have and the result you want
So if a person had 10 visits, is it acceptable to take the 1st, 5th, and 10th, or do we need to measure the interval between the first and last visits to find a midpoint based on the dates?
If the 1st, 5th, and 10th will do, here's a way. It assumes your data is in order (leaving that part to you, since the dates look like they might contain character strings):
proc freq data=have;
tables id / noprint out=visit_counts;
run;
data want;
merge have visit_counts (keep=id count);
by id;
if first.patient then do;
visit_num=1;
output;
end;
else do;
visit_num + 1;
if visit_num = ceil(count/2) then output;
end;
else if last.patient then output;
drop visit_num count;
run;
We need to estimate the time interval. Some subjects had tons of visits within the first year, then less over the follow-up. So I need to consider the interval to find the evenly distributed visits in terms of "time".
For time intervals, you need to fix the data first. These dates are not in a form that SAS can treat as dates. That's a topic you will need to read up on ... the solution below will handle that but won't become clear until you understand how SAS handles dates.
data temp;
length date $ 9;
set have;
date = cats( substr(date,4,2), substr(date,1,3), '20', substr(date,7,2));
visit_sasdate = input(date, date9.);
run;
proc sort data=temp;
by id visit_sasdate;
run;
proc summary data=temp;
by id;
var visit_sasdate;
output out=daterange (keep=id mindate maxdate) min=mindate max=maxdate;
run;
data want;
merge temp daterange;
by id;
if first.id then do;
midpoint_flag=0;
output;
end;
else if visit_sasdate >= (mindate + maxdate) / 2 then do;
if midpoint_flag=0 then output;
midpoint_flag + 1;
end;
else if last.id then output;
drop midpoint_flag mindate maxdate;
run;
It's untested code, so might need a tweak here or there.
Technically, this always selects the "midpoint visit" as the earliest visit from the second half of the date range. It is conceivable that the last visit in the first half of the date range might be closer to the actual midpoint. But it seemed like this would be close enough.
I would be tempted to see how well the median value of the date fits in here:
proc summary data=temp qntldef=2; by id; var visit_sasdate; output out=daterange (keep=id mindate maxdate mediandate) min=mindate max=maxdate median=mediandate; run;
You want the observed dates that most closely match NT target dates from NO observed dates, where
This can be addressed by a double DO loop separated by an internal step.
The code below (untested) follows the logic above:
%let max_obs=30;
%let n_targets=4;
data want (drop=o t);
array obs_dats{&max_obs} _temporary_;
array targ_dats{&n_targets} _temporary_;
** Populate the obs_dats array **;
do o=1 by 1 until (last.id);
set have;
by id;
obs_dats{o}=date;
end;
** Populate targ_dats array **;
targ_dats{1}=obs_dats{1};
targ_dats{&n_targets}=obs_dats{o};
do t=2 to &n_targets-1;
do o=1 by 1 until (obs_dats{o}<=targ_dats{t}<=obs_dats{o+1});
end;
if targ_dats{t} >= mean(obs_dats{o},obs_dats{o+1}) then targ_dats{t}=obs_dats{o+1};
else targ_dats{t}=obs_dats{o};
end;
** Now re-read and keep dates matching target dates **;
do until (last.id);
set have;
by id;
if whichn(date,of targ_dats{*}) then output;
end;
run;
I divided the date range into bins and required that the best date (closest to the center of the bin) be Inside the bin. It should work for any number of bins > 1.
data have;
input ID DateStr $ Result $;
/* Grrrr! */
date = input(cats(substr(dateStr,4,2),substr(dateStr,1,3),substr(dateStr,7)),date7.);
format date yymmdd10.;
drop dateStr;
datalines;
001 Jan01/10 Pos
001 Mar28/12 Pos
001 Jan30/13 Neg
001 May04/12 Pos
001 Jun20/10 Neg
002 Jan01/10 Neg
002 Mar28/12 Neg
002 Feb28/13 Neg
002 May04/12 Neg
002 Jun20/10 Neg
003 Oct01/14 Neg
003 Mar28/12 Pos
003 Jan30/14 Neg
004 May04/12 Pos
005 Jun29/16 Neg
006 Jan01/10 Pos
006 Mar28/12 Pos
;
%let nPoints=3;
proc means data=have noprint;
by id;
var date;
output out=ranges min= max= range= / autoname;
run;
data bins;
set ranges;
if date_range = 0 then do;
binStart = date_min;
binEnd = date_min;
binMid = date_min;
output;
end;
else do;
width = date_range / (&nPoints-1);
do binStart = date_min - 0.5*width to date_max - 0.5*width by width;
binEnd = binStart + width;
binMid = ( binStart + binEnd ) / 2;
output;
end;
end;
format bin: yymmdd10.;
keep id bin: ;
run;
proc sql;
create table cases as
select unique
b.id,
round(b.binMid) as midDate format=yymmdd10.,
a.date,
a.result
from have as a right join
bins as b on a.id=b.id and a.date between b.binStart and b.binEnd
group by b.id, b.binMid
having abs(a.date-b.binMid) = min(abs(a.date-b.binMid));
select * from cases;
quit;
ID midDate date Result ------------------------------------------ 1 2010-01-01 2010-01-01 Pos 1 2011-07-18 2012-03-28 Pos 1 2013-01-30 2013-01-30 Neg 2 2010-01-01 2010-01-01 Neg 2 2011-08-01 2012-03-28 Neg 2 2013-02-28 2013-02-28 Neg 3 2012-03-28 2012-03-28 Pos 3 2013-06-30 2014-01-30 Neg 3 2014-10-01 2014-10-01 Neg 4 2012-05-04 2012-05-04 Pos 5 2016-06-29 2016-06-29 Neg 6 2010-01-01 2010-01-01 Pos 6 2011-02-14 . 6 2012-03-28 2012-03-28 Pos
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.