Solved: continuous date

Aljosa · Posted 10-24-2019 09:33 AM

Hello,

I have a listing of patients and their dates of use of drugs.

For example,

for patient n°1 Drug A I have 2 lines, while in fact this patient never stopped his drug consumption, so instead of having

first line start=22/10/2015 - end=08/11/2015 and

second line start=09/11/2015 - end=16/12/2015,

I would like in my final table to have only 1 line: start=22/10/2015 - end=16/12/2015.

However, for patient n°1 Drug B, the treatment is NOT continuous (as he was without the drug on 05/11/2015, so for Drug B I need to keep both 2 lines in my final table. Please help, I tried with retain, with LAG... but without success... Thank you.

my initial table:

patientid	name	date_start	date_end
1	Drug A	22.10.2015	08.11.2015
1	Drug A	09.11.2015	16.12.2015
1	Drug B	22.10.2015	04.11.2015
1	Drug B	06.11.2015	.

and my final table would be:

patientid	name	date_start	date_end
1	Drug A	22.10.2015	16.12.2015
1	Drug B	22.10.2015	04.11.2015
1	Drug B	06.11.2015	.

mkeintz · Posted 10-28-2019 11:56 AM

Your initial comment about treatment of missing date values (apparently you never have missing values for DATE_START, only for DATE_END, is that right?) referred to introducing some sort of max date. I initially took this to mean some sort of study-wide maximum date, but I don't see any "max date" rule being applied here.

I do see instances of tied, consecutive, START_DATE values, in which the first record has a missing value for DATE_END and the last has a valid value for DATE_END. So, if you are saying that the only instances in which a missing DATE_END is to be replaced by a valid value is when there are consecutive records with the same DATE_START, then you have a fairly straightforward problem.

Namely when you encounter a missing DATE_END value in the case of consecutive identical DATE_START records, you can assign DATE_END=_NEXT_START-1, which just supports continuing the loop to the next record, where the real DATE_END becomes available. Here's my suggestion with a single new statement:

data have;
  infile cards expandtabs truncover;
  input patientid	name & $10.	(date_start	date_end) (:ddmmyy10.);
  format date_start	date_end ddmmyy10.;
datalines;
1	drug A	23.06.2015	.
1	drug A	23.06.2015	25.09.2015
1	drug A	25.09.2015	.
1	drug A	25.09.2015	12.11.2015
1	drug A	13.11.2015	22.12.2015
2	drug B	12.11.2015	25.11.2015
2	drug B	26.11.2015	.
2	drug B	26.11.2015	22.06.2016
2	drug B	12.11.2016	25.11.2016
3	drug C	12.11.2015	25.11.2015
3	drug C	26.11.2015	.
3	drug C	12.11.2016	25.11.2016
4	drug D	12.11.2015	25.11.2015
4	drug D	26.11.2015	.
4	drug D	26.11.2015	22.06.2016
4	drug D	12.11.2016	25.11.2016
run;

data want (drop=_:);
  do until ((date_end+1 < _next_start) or last.name=1);
    set have;
    by patientid name  date_start;
    merge have have (firstobs=2 keep=date_start rename=(date_start=_next_start));
    _min_start=min(_min_start,date_start);
    if date_end=. and last.date_start=0 then date_end=_next_start-1;
  end;
  date_start=_min_start;
run;

The "if date_end=. and last.date_start=0" condition is a test for missing date_end, but the record can not be the last one for a given date_start. And the BY statement has been changed to
by patientid name date_start.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

novinosrin · Posted 10-24-2019 10:37 AM

HI @Aljosa Keeping it simple in 2 steps


data have;
input patientid	name & $10.	(date_start	date_end) (:ddmmyy10.);
format date_start	date_end ddmmyy10.;
cards;
1	Drug A	22.10.2015	08.11.2015
1	Drug A	09.11.2015	16.12.2015
1	Drug B	22.10.2015	04.11.2015
1	Drug B	06.11.2015	.
;

data temp;
 set have;
 by patientid name;
 retain _d;
 if first.name then _d=date_end;
 else if date_start-_d>1 then _d=date_start;
 format _d ddmmyy10.;
run;

proc sql;
create table want as
select patientid,name,min(date_start) as date_start format=ddmmyy10.,min(date_end) as date_end format=ddmmyy10.
from temp
group by patientid,name,_d;
quit;

Sure, can offer a complex solution but I don't see why.

Aljosa · Posted 10-25-2019 03:07 AM

Dear @novinosrin ,

I doesn't work good. If I have :

patientid	name	date_start	date_end
1	Drug A	22.10.2015	08.11.2015
1	Drug A	09.11.2015	.

Instead of having this :

patientid	name	date_start	date_end
1	Drug A	22.10.2015	.

...I have this:

patientid	name	date_start	date_end
1	Drug A	22.10.2015	08.11.2015

Is there something to change in program so it can work? Thank you.

Kind regards.

Aljosa

novinosrin · Posted 10-25-2019 07:49 AM

Good morning @Aljosa , Please try this change and let me know. Thank you!


data have;
input patientid	name & $10.	(date_start	date_end) (:ddmmyy10.);
format date_start	date_end ddmmyy10.;
cards;
1	Drug A	22.10.2015	08.11.2015
1	Drug A	09.11.2015	16.12.2015
1	Drug B	22.10.2015	04.11.2015
1	Drug B	06.11.2015	.
2	Drug A	22.10.2015	08.11.2015
2	Drug A	09.11.2015	.
;
/*Step 1*/
data temp;
 set have;
 by patientid name;
 retain _d;
 if first.name then _d=date_end;
 else if date_start-_d>1 then _d=date_start;
 format _d ddmmyy10.;
run;
/*Step 2*/
data want;
 set temp;
 by patientid name _d;
 if last._d;
 drop _d;
run;

novinosrin · Posted 10-25-2019 08:08 AM

Hi @Aljosa Please ignore the previous as I overlooked the date_start , Please try the one below. Sorry about that


data have;
input patientid	name & $10.	(date_start	date_end) (:ddmmyy10.);
format date_start	date_end ddmmyy10.;
cards;
1	Drug A	22.10.2015	08.11.2015
1	Drug A	09.11.2015	16.12.2015
1	Drug B	22.10.2015	04.11.2015
1	Drug B	06.11.2015	.
2	Drug A	22.10.2015	08.11.2015
2	Drug A	09.11.2015	.
;
/*Step 1*/
data temp;
 set have;
 by patientid name;
 retain _d _s;
 if first.name then do; _d=date_end;_s=date_start;end;
 else if date_start-_d>1 then do; _d=date_start;_s=date_start;end;
 format _d _s ddmmyy10.;
run;
/*Step 2*/
data want;
 set temp;
 by patientid name _d;
 if last._d;
 date_start=_s;
 drop _:;
run;

Ksharp · Posted 10-25-2019 09:05 AM

If you do not have a big table .


data have;
infile cards expandtabs truncover;
input patientid	name & $10.	(date_start	date_end) (:ddmmyy10.);
format date_start	date_end ddmmyy10.;
cards;
1	Drug A	 22.10.2015	08.11.2015
1	Drug A	 09.11.2015	16.12.2015
1	Drug B	 22.10.2015	04.11.2015
1	Drug B	 06.11.2015	.
;

data temp;
 set have;
 do date=date_start to coalesce(date_end,date_start);
  output;
 end;
 format date ddmmyyp10.;
 drop date_: ;
run;
proc sort data=temp nodupkey;
by patientid	name date;
run;
data temp;
 set temp;
 by patientid name;
 if first.name or dif(date) ne 1 then group+1;
run;
proc summary data=temp;
by patientid name group ;
var date;
output out=want min=start max=end;
run;
data want;
 set want;
 if start=end then end=.;
 format start end ddmmyyp10.;
run;

proc print;run;

ErikLund_Jensen · Posted 10-25-2019 10:13 AM

Hi @Aljosa

I once wrote a macro to do this.

If the input records have identical values except the start- and end-values, they are compressed to one record in the output data set if they have consecutive or overlapping intervals. The macro is designed to work fast on big data sets. I have posted it as an attachment because of the size.

data have;
infile cards expandtabs truncover;
input patientid	name & $10.	(date_start	date_end) (:ddmmyy10.);
format date_start	date_end ddmmyy10.;
cards;
1	Drug A	 22.10.2015	08.11.2015
1	Drug A	 09.11.2015	16.12.2015
1	Drug B	 22.10.2015	04.11.2015
1	Drug B	 06.11.2015	.
;

%compressintervals(have,want,datefirst=date_start,datelast=date_end);

Result:

mkeintz · Posted 10-25-2019 09:40 PM

Assuming the data are sorted, there is a fairly straightforward single-step solution:

Edit note: data assumed to be sorted by PATIENTID NAME DATE_START.

data have;
  infile cards expandtabs truncover;
  input patientid	name & $10.	(date_start	date_end) (:ddmmyy10.);
  format date_start	date_end ddmmyy10.;
cards;
1	Drug A	 22.10.2015	08.11.2015
1	Drug A	 09.11.2015	16.12.2015
1	Drug B	 22.10.2015	04.11.2015
1	Drug B	 06.11.2015	.
run;
data want (drop=_:);
  do until ((date_end+1 ^= _next_start) or last.name=1);
    set have;
    by patientid name;
    _min_start=min(_min_start,date_start);
    if eod2=0 then set have (firstobs=2 keep=date_start rename=(date_start=_next_start)) end=eod2;
  end;
  date_start=_min_start;
run;

The two SET statements are in effect two parallel streams of data, with the second SET starting at observation 2. I.e. the second SET is always one record ahead of the first SET. Since the second SET reads only the DATA_START variable (renamed to _NEXT_START_), you have a convenient way to know when to stop reading succession of records that cover contiguous date ranges.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

hashman · Posted 10-26-2019 08:38 PM

@Aljosa:

This kind of task yields well to what I call the key-indexed paintbrush approach: An array with a cell for any possible date value is "painted" with 1's for every date within each start-end interval. After that, the indexes of the beginning and end of each "painted" strip with contiguous 1's indicate the sought-for start and end dates, respectively. For example, in this case:

data have ;                                                                                                                             
  input patientid name & :$10. (date_start date_end) (:ddmmyy10.) ;                                                                     
  format date_start date_end yymmdd10. ;                                                                                                
  cards ;                                                                                                                               
1  Drug A  22.10.2015  08.11.2015                                                                                                       
1  Drug A  09.11.2015  16.12.2015                                                                                                       
1  Drug B  22.10.2015  04.11.2015                                                                                                       
1  Drug B  06.11.2015  .                                                                                                                
;                                                                                                                                       
run ;                                                                                                                                   
                                                                                                                                        
data want (drop = _:) ;                                                                                                                 
  array k [-138061:380217] _temporary_ ; *thru year 3000 @ 4 Mb of RAM;                                                                 
  do until (last.name) ;                                                                                                                
    set have ;                                                                                                                          
    by patientid name ;                                                                                                                 
    do _dt = date_start to max (date_start, date_end) ; *max handles missing date_end;                                                  
      k[_dt] = n (date_end) ;                                                                                                           
    end ;                                                                                                                               
    _a = min (_a, date_start) ;                                                                                                         
    _z = max (_z, date_start, date_end) ;                                                                                               
  end ;                                                                                                                                 
  do _dt = _a to _z ;                                                                                                                   
    if nmiss (k[_dt]) then continue ;                                                                                                   
    if nmiss (k[_dt - 1]) then date_start = _dt ;                                                                                       
    if nmiss (k[_dt + 1]) then do ;                                                                                                     
      date_end = ifn (k[_dt], _dt, .) ; *output missing if k[_dt]=0;                                                                    
      output ;                                                                                                                          
    end ;                                                                                                                               
    k[_dt] = . ; *clean-up for next iteration of DoW-loop;                                                                              
  end ;                                                                                                                                 
run ;

The lower bound of the array at -138061 is not arbitrary: It's set to the earliest data value SAS can handle. The upper bound is chosen just as "big enough".

The method is somewhat uneconomical; but it has its pluses:

It's logically obvious and simpler than working with lags and splicing logic (at least to me)
There's no need to write any logic in case of overlaps - they are auto-handled due to the nature of the algorithm
It doesn't matter if the start_date is sorted within [patientid,name] or not

Kind regards

Paul D.

novinosrin · Posted 10-27-2019 10:56 AM

Guru @hashman Very nice. I clearly understood. And to help my DePaul college/present Citizens works mates, a few fans of yours(besides me) , what i did was shorten(date range, however inclusive of OP's sample) the array dims to

array k [20382:20442] k20382-k20442 ;

demonstrate what's happening to makes sense of the loop. They were happy and thankful to you. I'm conveying on their behalf.

Okay now,

*thru year 3000 @ 4 Mb of RAM;

One can't let you escape from this thread without a word or two on how you calculate or determine that. Could I/we get your invaluable time. If this one might deviate the original thread, Can you please plug the points in

https://communities.sas.com/t5/SAS-Programming/what-s-the-limit-to-how-many-elements-variables-a-SAS...

I will take notes accordingly. Thank you for letting me bother in advance!

PS

In Philly or in FL right now?

hashman · Posted 10-27-2019 02:14 PM

@novinosrin: Thanks for inspecting its guts.

While testing, I was too lazy to recode the array with PDV variables in the min-to-max range, so instead just included PUT K[_dt]= in the DO loop following the DoW.

A pretty accurate estimate of RAM taken by a numeric temp array is 8*dim(K) bytes. So, for about 500,000 items or so, it's about 4 MB. But in this case, I just looked at the log with FULLSTIMER on. For a character temp array, the relationship is much more complex, but in short, the RAM per one byte allocated to an array sharply decreases as the item length grows. Had to do some research into that for this year's SGF bitmap paper in Dallas since it was critical to the issue:

https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2019/3101-2019.pdf

Look at page 12 for the findings.

Kind regards

Paul D.

p.s. Back in FL.

mkeintz · Posted 10-27-2019 02:22 PM

Paul @hashman :

I agree that the key-indexed paintbrush technique is logically obvious, but I'm a bit demoralized to think that the splicing logic - at least in this application - is not acceptably obvious. But before I offer a tweak to improve obviousness, I would note that adjusting the splicing logic to accommodate overlapping ranges is almost as easy as the paint-brush technique. It is just a matter of a changing the UNTIL condition from

((date_end+1 ^= _next_start) or last.name=1)

to

((date_end+1 < _next_start) or last.name=1)

And I might argue that paintbrush technique using key-indexing would not adapt well to continuous time measurement (admittedly not the issue in this topic). It would have to be replaced by some use of a hash object, in which a lot of ordered hash looping would be required to determine overlap.

But in the splicing logic, it would be just a change in the UNTIL condition to:

((date_end < _next_start) or last.name=1)

I suspect, for some viewers of this topic, it is the usage of "IF EOD2=0 then SET …"

if eod2=0 then set have (firstobs=2 keep=…..) end=eod2;

that is a bit off-putting. I agree it is certainly not usage that we are shown in any "learning SAS for beginners" training that I am aware of. But it should be there for "intermediate" learners.

Maybe this is a case where using MERGE would be more user-friendly, as in:

data want2 (drop=_:);
  do until ((date_end+1 < _next_start) or last.name=1);
    set have;
    by patientid name;
    merge have have (firstobs=2 keep=date_start rename=(date_start=_next_start));
    _min_start=min(_min_start,date_start);
  end;
  date_start=_min_start;
run;

regards,

Mark

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

hashman · Posted 10-27-2019 02:49 PM

Mark @mkeintz; I second all your sentiments.

Of course, for datetime values (whose universe is 86400 times larger) the paintbrush is ill-suited, unless the min-to-max values cover a rather short interval of time, such as a few days. Using a character bitmap instead of the array can shorten the offensive 86400 factor to 86400/64=1350, but it would introduce a level of complexity I wouldn't even call "intermediate". And surely it's completely off if you have to deal with milliseconds and such when the interval arithmetic/logic wins hands down both in terms of space and speed.

A much better application for paintbrushing than splicing is continuous enrollment (ubiquitous in the managed health care business where it is a prerequisite for the calculation of any HEDIS measure). This is because after the brushing is done, it's a simple matter to determine the number of enrollment gaps and their lengths to see if a measure's requirements are satisfied for every enrolled member. Plus, while calculating CE, we deal only with whole days covering at most a few years.

Kind regards

Paul D.

Aljosa · Posted 10-28-2019 05:32 AM

Hi all,

Thank you for your answers, I am looking to all your programs, I will come back to you as soon as I finish.

I think major issues for most of programs are when we have dots in dates (=missing dates), they must be taken into consideration, as max dates.

Kind regards.

mkeintz · Posted 10-28-2019 08:13 AM

In your example, the missing date is for the DATE_END value in the final record for an ID. Is that always the case for missing dates?

And what is the value you intend to use for such missing dates?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

continuous date

Re: continuous date

Re: continuous date

Re: continuous date

Re: continuous date

Re: continuous date

Re: continuous date

Re: continuous date

Re: continuous date

Re: continuous date

Re: continuous date

Re: continuous date

Re: continuous date

Re: continuous date

Re: continuous date

Re: continuous date

Catch up on SAS Innovate 2026

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away