Hello,
I have a listing of patients and their dates of use of drugs.
For example,
for patient n°1 Drug A I have 2 lines, while in fact this patient never stopped his drug consumption, so instead of having
first line start=22/10/2015 - end=08/11/2015 and
second line start=09/11/2015 - end=16/12/2015,
I would like in my final table to have only 1 line: start=22/10/2015 - end=16/12/2015.
However, for patient n°1 Drug B, the treatment is NOT continuous (as he was without the drug on 05/11/2015, so for Drug B I need to keep both 2 lines in my final table. Please help, I tried with retain, with LAG... but without success... Thank you.
my initial table:
patientid | name | date_start | date_end |
1 | Drug A | 22.10.2015 | 08.11.2015 |
1 | Drug A | 09.11.2015 | 16.12.2015 |
1 | Drug B | 22.10.2015 | 04.11.2015 |
1 | Drug B | 06.11.2015 | . |
and my final table would be:
patientid | name | date_start | date_end |
1 | Drug A | 22.10.2015 | 16.12.2015 |
1 | Drug B | 22.10.2015 | 04.11.2015 |
1 | Drug B | 06.11.2015 | . |
Your initial comment about treatment of missing date values (apparently you never have missing values for DATE_START, only for DATE_END, is that right?) referred to introducing some sort of max date. I initially took this to mean some sort of study-wide maximum date, but I don't see any "max date" rule being applied here.
I do see instances of tied, consecutive, START_DATE values, in which the first record has a missing value for DATE_END and the last has a valid value for DATE_END. So, if you are saying that the only instances in which a missing DATE_END is to be replaced by a valid value is when there are consecutive records with the same DATE_START, then you have a fairly straightforward problem.
Namely when you encounter a missing DATE_END value in the case of consecutive identical DATE_START records, you can assign DATE_END=_NEXT_START-1, which just supports continuing the loop to the next record, where the real DATE_END becomes available. Here's my suggestion with a single new statement:
data have;
infile cards expandtabs truncover;
input patientid name & $10. (date_start date_end) (:ddmmyy10.);
format date_start date_end ddmmyy10.;
datalines;
1 drug A 23.06.2015 .
1 drug A 23.06.2015 25.09.2015
1 drug A 25.09.2015 .
1 drug A 25.09.2015 12.11.2015
1 drug A 13.11.2015 22.12.2015
2 drug B 12.11.2015 25.11.2015
2 drug B 26.11.2015 .
2 drug B 26.11.2015 22.06.2016
2 drug B 12.11.2016 25.11.2016
3 drug C 12.11.2015 25.11.2015
3 drug C 26.11.2015 .
3 drug C 12.11.2016 25.11.2016
4 drug D 12.11.2015 25.11.2015
4 drug D 26.11.2015 .
4 drug D 26.11.2015 22.06.2016
4 drug D 12.11.2016 25.11.2016
run;
data want (drop=_:);
do until ((date_end+1 < _next_start) or last.name=1);
set have;
by patientid name date_start;
merge have have (firstobs=2 keep=date_start rename=(date_start=_next_start));
_min_start=min(_min_start,date_start);
if date_end=. and last.date_start=0 then date_end=_next_start-1;
end;
date_start=_min_start;
run;
The "if date_end=. and last.date_start=0" condition is a test for missing date_end, but the record can not be the last one for a given date_start. And the BY statement has been changed to
by patientid name date_start.
HI @Aljosa Keeping it simple in 2 steps
data have;
input patientid name & $10. (date_start date_end) (:ddmmyy10.);
format date_start date_end ddmmyy10.;
cards;
1 Drug A 22.10.2015 08.11.2015
1 Drug A 09.11.2015 16.12.2015
1 Drug B 22.10.2015 04.11.2015
1 Drug B 06.11.2015 .
;
data temp;
set have;
by patientid name;
retain _d;
if first.name then _d=date_end;
else if date_start-_d>1 then _d=date_start;
format _d ddmmyy10.;
run;
proc sql;
create table want as
select patientid,name,min(date_start) as date_start format=ddmmyy10.,min(date_end) as date_end format=ddmmyy10.
from temp
group by patientid,name,_d;
quit;
Sure, can offer a complex solution but I don't see why.
Dear @novinosrin ,
I doesn't work good. If I have :
patientid | name | date_start | date_end |
1 | Drug A | 22.10.2015 | 08.11.2015 |
1 | Drug A | 09.11.2015 | . |
Instead of having this :
patientid | name | date_start | date_end |
1 | Drug A | 22.10.2015 | . |
...I have this:
patientid | name | date_start | date_end |
1 | Drug A | 22.10.2015 | 08.11.2015 |
Is there something to change in program so it can work? Thank you.
Kind regards.
Aljosa
Good morning @Aljosa , Please try this change and let me know. Thank you!
data have;
input patientid name & $10. (date_start date_end) (:ddmmyy10.);
format date_start date_end ddmmyy10.;
cards;
1 Drug A 22.10.2015 08.11.2015
1 Drug A 09.11.2015 16.12.2015
1 Drug B 22.10.2015 04.11.2015
1 Drug B 06.11.2015 .
2 Drug A 22.10.2015 08.11.2015
2 Drug A 09.11.2015 .
;
/*Step 1*/
data temp;
set have;
by patientid name;
retain _d;
if first.name then _d=date_end;
else if date_start-_d>1 then _d=date_start;
format _d ddmmyy10.;
run;
/*Step 2*/
data want;
set temp;
by patientid name _d;
if last._d;
drop _d;
run;
Hi @Aljosa Please ignore the previous as I overlooked the date_start , Please try the one below. Sorry about that
data have;
input patientid name & $10. (date_start date_end) (:ddmmyy10.);
format date_start date_end ddmmyy10.;
cards;
1 Drug A 22.10.2015 08.11.2015
1 Drug A 09.11.2015 16.12.2015
1 Drug B 22.10.2015 04.11.2015
1 Drug B 06.11.2015 .
2 Drug A 22.10.2015 08.11.2015
2 Drug A 09.11.2015 .
;
/*Step 1*/
data temp;
set have;
by patientid name;
retain _d _s;
if first.name then do; _d=date_end;_s=date_start;end;
else if date_start-_d>1 then do; _d=date_start;_s=date_start;end;
format _d _s ddmmyy10.;
run;
/*Step 2*/
data want;
set temp;
by patientid name _d;
if last._d;
date_start=_s;
drop _:;
run;
If you do not have a big table .
data have;
infile cards expandtabs truncover;
input patientid name & $10. (date_start date_end) (:ddmmyy10.);
format date_start date_end ddmmyy10.;
cards;
1 Drug A 22.10.2015 08.11.2015
1 Drug A 09.11.2015 16.12.2015
1 Drug B 22.10.2015 04.11.2015
1 Drug B 06.11.2015 .
;
data temp;
set have;
do date=date_start to coalesce(date_end,date_start);
output;
end;
format date ddmmyyp10.;
drop date_: ;
run;
proc sort data=temp nodupkey;
by patientid name date;
run;
data temp;
set temp;
by patientid name;
if first.name or dif(date) ne 1 then group+1;
run;
proc summary data=temp;
by patientid name group ;
var date;
output out=want min=start max=end;
run;
data want;
set want;
if start=end then end=.;
format start end ddmmyyp10.;
run;
proc print;run;
Hi @Aljosa
I once wrote a macro to do this.
If the input records have identical values except the start- and end-values, they are compressed to one record in the output data set if they have consecutive or overlapping intervals. The macro is designed to work fast on big data sets. I have posted it as an attachment because of the size.
data have;
infile cards expandtabs truncover;
input patientid name & $10. (date_start date_end) (:ddmmyy10.);
format date_start date_end ddmmyy10.;
cards;
1 Drug A 22.10.2015 08.11.2015
1 Drug A 09.11.2015 16.12.2015
1 Drug B 22.10.2015 04.11.2015
1 Drug B 06.11.2015 .
;
%compressintervals(have,want,datefirst=date_start,datelast=date_end);
Result:
Assuming the data are sorted, there is a fairly straightforward single-step solution:
Edit note: data assumed to be sorted by PATIENTID NAME DATE_START.
data have;
infile cards expandtabs truncover;
input patientid name & $10. (date_start date_end) (:ddmmyy10.);
format date_start date_end ddmmyy10.;
cards;
1 Drug A 22.10.2015 08.11.2015
1 Drug A 09.11.2015 16.12.2015
1 Drug B 22.10.2015 04.11.2015
1 Drug B 06.11.2015 .
run;
data want (drop=_:);
do until ((date_end+1 ^= _next_start) or last.name=1);
set have;
by patientid name;
_min_start=min(_min_start,date_start);
if eod2=0 then set have (firstobs=2 keep=date_start rename=(date_start=_next_start)) end=eod2;
end;
date_start=_min_start;
run;
The two SET statements are in effect two parallel streams of data, with the second SET starting at observation 2. I.e. the second SET is always one record ahead of the first SET. Since the second SET reads only the DATA_START variable (renamed to _NEXT_START_), you have a convenient way to know when to stop reading succession of records that cover contiguous date ranges.
This kind of task yields well to what I call the key-indexed paintbrush approach: An array with a cell for any possible date value is "painted" with 1's for every date within each start-end interval. After that, the indexes of the beginning and end of each "painted" strip with contiguous 1's indicate the sought-for start and end dates, respectively. For example, in this case:
data have ;
input patientid name & :$10. (date_start date_end) (:ddmmyy10.) ;
format date_start date_end yymmdd10. ;
cards ;
1 Drug A 22.10.2015 08.11.2015
1 Drug A 09.11.2015 16.12.2015
1 Drug B 22.10.2015 04.11.2015
1 Drug B 06.11.2015 .
;
run ;
data want (drop = _:) ;
array k [-138061:380217] _temporary_ ; *thru year 3000 @ 4 Mb of RAM;
do until (last.name) ;
set have ;
by patientid name ;
do _dt = date_start to max (date_start, date_end) ; *max handles missing date_end;
k[_dt] = n (date_end) ;
end ;
_a = min (_a, date_start) ;
_z = max (_z, date_start, date_end) ;
end ;
do _dt = _a to _z ;
if nmiss (k[_dt]) then continue ;
if nmiss (k[_dt - 1]) then date_start = _dt ;
if nmiss (k[_dt + 1]) then do ;
date_end = ifn (k[_dt], _dt, .) ; *output missing if k[_dt]=0;
output ;
end ;
k[_dt] = . ; *clean-up for next iteration of DoW-loop;
end ;
run ;
The lower bound of the array at -138061 is not arbitrary: It's set to the earliest data value SAS can handle. The upper bound is chosen just as "big enough".
The method is somewhat uneconomical; but it has its pluses:
Kind regards
Paul D.
Guru @hashman Very nice. I clearly understood. And to help my DePaul college/present Citizens works mates, a few fans of yours(besides me) , what i did was shorten(date range, however inclusive of OP's sample) the array dims to
array k [20382:20442] k20382-k20442 ;
demonstrate what's happening to makes sense of the loop. They were happy and thankful to you. I'm conveying on their behalf.
Okay now,
*thru year 3000 @ 4 Mb of RAM;
One can't let you escape from this thread without a word or two on how you calculate or determine that. Could I/we get your invaluable time. If this one might deviate the original thread, Can you please plug the points in
I will take notes accordingly. Thank you for letting me bother in advance!
PS
In Philly or in FL right now?
@novinosrin: Thanks for inspecting its guts.
While testing, I was too lazy to recode the array with PDV variables in the min-to-max range, so instead just included PUT K[_dt]= in the DO loop following the DoW.
A pretty accurate estimate of RAM taken by a numeric temp array is 8*dim(K) bytes. So, for about 500,000 items or so, it's about 4 MB. But in this case, I just looked at the log with FULLSTIMER on. For a character temp array, the relationship is much more complex, but in short, the RAM per one byte allocated to an array sharply decreases as the item length grows. Had to do some research into that for this year's SGF bitmap paper in Dallas since it was critical to the issue:
https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2019/3101-2019.pdf
Look at page 12 for the findings.
Kind regards
Paul D.
p.s. Back in FL.
Paul @hashman :
I agree that the key-indexed paintbrush technique is logically obvious, but I'm a bit demoralized to think that the splicing logic - at least in this application - is not acceptably obvious. But before I offer a tweak to improve obviousness, I would note that adjusting the splicing logic to accommodate overlapping ranges is almost as easy as the paint-brush technique. It is just a matter of a changing the UNTIL condition from
((date_end+1 ^= _next_start) or last.name=1)
to
((date_end+1 < _next_start) or last.name=1)
And I might argue that paintbrush technique using key-indexing would not adapt well to continuous time measurement (admittedly not the issue in this topic). It would have to be replaced by some use of a hash object, in which a lot of ordered hash looping would be required to determine overlap.
But in the splicing logic, it would be just a change in the UNTIL condition to:
((date_end < _next_start) or last.name=1)
I suspect, for some viewers of this topic, it is the usage of "IF EOD2=0 then SET …"
if eod2=0 then set have (firstobs=2 keep=…..) end=eod2;
that is a bit off-putting. I agree it is certainly not usage that we are shown in any "learning SAS for beginners" training that I am aware of. But it should be there for "intermediate" learners.
Maybe this is a case where using MERGE would be more user-friendly, as in:
data want2 (drop=_:);
do until ((date_end+1 < _next_start) or last.name=1);
set have;
by patientid name;
merge have have (firstobs=2 keep=date_start rename=(date_start=_next_start));
_min_start=min(_min_start,date_start);
end;
date_start=_min_start;
run;
regards,
Mark
Mark @mkeintz; I second all your sentiments.
Of course, for datetime values (whose universe is 86400 times larger) the paintbrush is ill-suited, unless the min-to-max values cover a rather short interval of time, such as a few days. Using a character bitmap instead of the array can shorten the offensive 86400 factor to 86400/64=1350, but it would introduce a level of complexity I wouldn't even call "intermediate". And surely it's completely off if you have to deal with milliseconds and such when the interval arithmetic/logic wins hands down both in terms of space and speed.
A much better application for paintbrushing than splicing is continuous enrollment (ubiquitous in the managed health care business where it is a prerequisite for the calculation of any HEDIS measure). This is because after the brushing is done, it's a simple matter to determine the number of enrollment gaps and their lengths to see if a measure's requirements are satisfied for every enrolled member. Plus, while calculating CE, we deal only with whole days covering at most a few years.
Kind regards
Paul D.
Hi all,
Thank you for your answers, I am looking to all your programs, I will come back to you as soon as I finish.
I think major issues for most of programs are when we have dots in dates (=missing dates), they must be taken into consideration, as max dates.
Kind regards.
In your example, the missing date is for the DATE_END value in the final record for an ID. Is that always the case for missing dates?
And what is the value you intend to use for such missing dates?
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.