First, I want to say what a great resource this forum has been for me while dealing with some very (extremely!) messy data. Thank you all for your help. I have one more issue I would like to see if I can find an answer to.
I have records containing an ID, a type, a begin date, and various other indicators, sorted by ID and begin date. For each ID, I would like to collapse runs of consecutive days in type 'A' into one row: keep the most common value of the other indicators (dx1, dx2), move the last begin date of the run into an end_date column, and leave type 'B' rows intact. See below:
data have;
input id type $ begin_date :date9. end_date dx1 $ dx2 $;
format begin_date date9.;
datalines;
1 A 28JAN2019 . a b
1 A 29JAN2019 . a b
1 A 30JAN2019 . a a
1 A 31JAN2019 . b b
1 B 2FEB2019 . a b
2 B 29JAN2019 . b c
2 A 1FEB2019 . b d
2 A 2FEB2019 . b d
2 A 3FEB2019 . a b
2 A 4FEB2019 . a d
2 A 5FEB2019 . b b
;
run;
data want;
input id type $ begin_date :date9. end_date :date9. dx1 $ dx2 $;
format begin_date end_date date9.;
datalines;
1 A 28JAN2019 31JAN2019 a b
1 B 2FEB2019 . a b
2 B 29JAN2019 . b c
2 A 1FEB2019 5FEB2019 b d
;
run;
I’m not sure where even to begin with this one, so any help that can be provided will be much appreciated!
Hi @twerwath
Two questions:
1. Are the possible values of dx1 and dx2 from a finite list? That would make the solution simpler.
2. When two values of dx1 or dx2 have the same count, which value should be chosen for the output?
1) No, unfortunately they are character variables whose values are not from a finite list.
2) If there are two values with the same count, there is no preference - it could be either one or the other.
Hi @twerwath
It takes a number of steps, and something might be done smarter, but I think this works:
data have;
infile datalines truncover;
input id type $ begin_date :date9. end_date dx1 $ dx2 $;
format begin_date date9.;
datalines;
1 A 28JAN2019 . a b
1 A 29JAN2019 . a b
1 A 30JAN2019 . a a
1 A 31JAN2019 . b b
1 B 2FEB2019 . a b
2 B 29JAN2019 . b c
2 A 1FEB2019 . b d
2 A 2FEB2019 . b d
2 A 3FEB2019 . a b
2 A 4FEB2019 . a d
2 A 5FEB2019 . b b
;
run;
* Sort values;
proc sort data=have;
by id type begin_date;
run;
* Mark groups to collapse with an ID;
data w1 (drop=tmp);
set have;
by id type;
retain grpid 0;
tmp = dif(begin_date);
if first.type then grpid = grpid + 1;
if not first.type then do;
if tmp > 1 then grpid = grpid + 1;
end;
run;
* Collapse groups;
data w2 (drop=xdate dx1 dx2);
set w1;
by grpid;
retain xdate;
format xdate end_date date9.;
if first.grpid then xdate = begin_date;
if last.grpid then do;
if xdate ne begin_date then end_date = begin_date;
begin_date = xdate;
output;
end;
run;
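A rough Python analogue of this collapse step, purely for illustration (names are made up; the SAS step above is the actual solution): group rows by grpid, keep the first date as begin_date, and set end_date only when a group spans more than one row.

```python
from datetime import date
from itertools import groupby

def collapse(rows):
    """rows: (grpid, begin_date) pairs, sorted by grpid.
    Keep the first date of each group as begin_date; if the group has
    more than one row, its last date becomes end_date (else None),
    mirroring the xdate logic in the data step above."""
    out = []
    for grpid, grp in groupby(rows, key=lambda r: r[0]):
        dates = [d for _, d in grp]
        end = dates[-1] if len(dates) > 1 else None
        out.append((grpid, dates[0], end))
    return out

rows = [(1, date(2019, 1, 28)), (1, date(2019, 1, 29)),
        (1, date(2019, 1, 30)), (2, date(2019, 2, 2))]
print(collapse(rows))  # grpid 1 spans 28-30 Jan; grpid 2 is a single day, end stays None
```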
* Get value-count of dx1;
proc sql;
create table dx1 as
select distinct grpid, dx1, count(*) as number
from w1
group by grpid, dx1
order by grpid, number desc;
quit;
data dx1s;
set dx1;
by grpid;
if first.grpid;
run;
* Get value-count of dx2;
proc sql;
create table dx2 as
select distinct grpid, dx2, count(*) as number
from w1
group by grpid, dx2
order by grpid, number desc;
quit;
data dx2s;
set dx2;
by grpid;
if first.grpid;
run;
* Merge it all together;
data want (drop=grpid);
merge w2 dx1s(drop=number) dx2s(drop=number);
by grpid;
run;
That's not so easy. Given these data
data have;
input id type $ begin_date: date9. dx1 $ dx2 $;
format begin_date date9.;
datalines;
1 A 28JAN2019 a b
1 A 29JAN2019 a b
1 A 30JAN2019 a a
1 A 31JAN2019 b b
1 B 2FEB2019 a b
2 B 29JAN2019 b c
2 A 1FEB2019 b d
2 A 2FEB2019 b d
2 A 3FEB2019 a b
2 A 4FEB2019 a d
2 A 5FEB2019 b b
;run;
(I dropped the end_date as it is always missing, and made the rest actually work)
You should start by identifying the intervals of consecutive dates:
data intervals;
set have;
by id type notsorted;
interval_no+first.type+(dif(begin_date) ne 1);
run;
The INTERVAL_NO values are not consecutive this way, but who cares? It is the simplest way to do it. We get a new interval at the right places.
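For illustration only, the same counter logic can be sketched in Python (function and variable names are made up; the SAS step above is the real solution):

```python
from datetime import date, timedelta

def number_intervals(rows):
    """Assign an interval number that bumps at each new (id, type) group
    and whenever the gap to the previous date is not exactly one day.
    Mirrors: interval_no + first.type + (dif(begin_date) ne 1)."""
    interval_no = 0
    prev_key, prev_date = None, None
    out = []
    for key, d in rows:  # rows sorted like the SAS BY groups
        first_of_group = key != prev_key
        gap_not_one = prev_date is None or (d - prev_date) != timedelta(days=1)
        # like the SAS sum statement, both conditions can each add 1,
        # so the numbers need not be consecutive -- but they are distinct
        interval_no += int(first_of_group) + int(gap_not_one)
        out.append(interval_no)
        prev_key, prev_date = key, d
    return out

rows = [(("1", "A"), date(2019, 1, 28)),
        (("1", "A"), date(2019, 1, 29)),
        (("1", "A"), date(2019, 1, 31)),  # gap -> new interval
        (("1", "B"), date(2019, 2, 2))]
print(number_intervals(rows))  # -> [2, 2, 3, 5]
```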
Now we have the consecutive dates identified. Next, count the occurrences of each DX1 and DX2 value:
proc sql;
create table count_dx1 as
select interval_no,dx1,count(*) as n
from intervals
group by interval_no,dx1
order by 1,3 desc,2;
create table count_dx2 as
select interval_no,dx2,count(*) as n
from intervals
group by interval_no,dx2
order by 1,3 desc,2;
quit;
Now we have the data points with the most common first (and if there are ties, we take the lowest value). These are the values we want:
data values;
merge count_dx1 count_dx2;
by interval_no;
if first.interval_no;
drop n;
run;
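The selection rule (most frequent value, ties broken by the lowest value) can be sketched in Python, purely as an illustration of what the ORDER BY above achieves:

```python
from collections import Counter

def most_common_value(values):
    """Return the most frequent value; on ties, the lowest value wins,
    matching ORDER BY interval_no, n DESC, dx1."""
    counts = Counter(values)
    # sort by descending count, then ascending value, and take the first
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[0][0]

print(most_common_value(["a", "a", "a", "b"]))  # 'a' (clear majority)
print(most_common_value(["b", "b", "a", "a"]))  # 'a' (tie -> lowest value)
```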
And then for the grand finale:
proc sql;
create table want as select distinct
intervals.id,
intervals.type,
min(intervals.begin_date) as begin_date format=date9.,
case
when count(*)>1 then max(begin_date)
else .
end as end_date format=date9.,
values.dx1,
values.dx2
from intervals join values
on intervals.interval_no=values.interval_no
group by intervals.interval_no
order by 1,3
;
quit;
Which gives you the data you want.
Hi @s_lassen
Thank you for this. I provided two DX variables for simplicity here, but in reality the dataset has around 80 of these DX indicators. So I can see that this code could technically still work, but it would have to be expanded about 40 times over. I can try to reduce it down to around 10 essential DX variables and test it on the real data to see if it works.
@twerwath wrote:
Hi @s_lassen
Thank you for this. I provided two DX variables for simplicity here, but in reality the dataset has around 80 of these DX indicators. So I can see that this code could technically still work, but it would have to be expanded about 40 times over. I can try to reduce it down to around 10 essential DX variables and test it on the real data to see if it works.
So how many "most common" values of DX do you expect?
What should happen if there are ties in the count at the smallest level? For example, suppose you have tests with counts like:
a = 8
b = 6
c = 4
d = 4
for a given id/type/date range and want the "3 most common". Since c and d have the same count, you would need to supply a tie breaker.
This shows a slightly more complicated data set, i.e. it has breaks in dates for an ID and type, and shows a way to get down to a count of tests across multiple DX variables for date ranges.
data have;
input id type $ date :date9. dx1 $ dx2 $ dx3 $ dx4 $;
format date date9.;
datalines;
1 A 28JAN2019 a b d a
1 A 29JAN2019 a b a d
1 A 30JAN2019 a a e c
1 A 31JAN2019 b b a e
1 B 2FEB2019 a b a c
2 B 29JAN2019 b c a c
2 A 1FEB2019 b d a c
2 A 2FEB2019 b d a c
2 A 3FEB2019 a b c c
2 A 4FEB2019 a d c c
2 A 5FEB2019 b b c c
2 B 29MAR2019 a c a c
2 B 30MAR2019 d d a d
;
run;

proc sort data=have;
by id type date;
run;

data daterange;
set have;
by id type date;
retain begin_date end_date;
format begin_date end_date date9.;
datedif = dif(date);
if first.type then do;
   begin_date=date;
   end_date=date;
end;
else if datedif=1 then do; /* consecutive */
   end_date=date;
end;
else if datedif > 1 then do; /* no longer consecutive */
   output;
   begin_date=date;
   end_date=date;
end;
if last.type then do;
   end_date=date;
   output;
end;
keep id type begin_date end_date;
run;

proc transpose data=have out=trans;
by id type date;
/* below will use all variables whose names start with DX.
   If you have other names, but your DX variables are consecutively
   numbered, use Dx1 - Dx80 or whatever instead of DX: */
var dx:;
run;

/* combine daterange info with single test data and get the counts,
   most common first for each id type date range */
proc sql;
create table temp as
   select a.id, a.type, a.col1, count(*) as dxcount,
          b.begin_date, b.end_date
   from trans as a
        left join daterange as b
        on a.id=b.id and a.type=b.type
           and b.begin_date le a.date le b.end_date
   group by a.id, a.type, b.begin_date, b.end_date, a.col1
   order by a.id, a.type, b.begin_date, b.end_date, dxcount desc
;
quit;
The above will have the count of each dx value, in descending frequency, within each id, type and date range.
It will be pretty easy to select x numbers of values and transpose them back to columns to show the x most common for the date range. But you may want to consider how you are using that information. Quite often the long version lends itself better to many sorts of analysis.
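Picking the x most common values with a value-based tie breaker (as in the a=8, b=6, c=4, d=4 example above) could be sketched in Python, purely as an illustration (the function name is made up):

```python
from collections import Counter

def top_k(values, k):
    """Most common k values; ties broken by value so results are stable."""
    counts = Counter(values)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [v for v, _ in ranked[:k]]

# counts: a=8, b=6, c=4, d=4 -- c beats d on the value tie breaker
print(top_k(list("aaaaaaaabbbbbbccccdddd"), 3))  # ['a', 'b', 'c']
```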
I'm pretty sure there is more about the Type B involvement not stated.
Hi @ballardw
In the vast majority of cases, the values of the dx variables will be 99% the same across rows of consecutive days. These are basically daily records of patients' hospital stays, and we want each hospital stay to be one row instead of many. So while some information will be lost, that loss will generally be minimal, and this format will be much more appropriate for our analyses. I doubt there would be many ties, but if so it does not matter which most common value is chosen, since they are likely similar anyway.
Just to clarify: if some of these "other" variables are not named dx, would I need to rename them all to dx1-dx80 sequentially in order for this to run? That would not be difficult to do (and then change the variable names back when done), but I just wanted to make sure this is what you meant before I do.
@twerwath wrote:
Hi @ballardw
In the vast majority of cases, the values of the dx variables will be 99% the same across rows of consecutive days. These are basically daily records of patients' hospital stays, and we want each hospital stay to be one row instead of many. So while some information will be lost, that loss will generally be minimal, and this format will be much more appropriate for our analyses. I doubt there would be many ties, but if so it does not matter which most common value is chosen, since they are likely similar anyway.
Just to clarify: if some of these "other" variables are not named dx, would I need to rename them all to dx1-dx80 sequentially in order for this to run? That would not be difficult to do (and then change the variable names back when done), but I just wanted to make sure this is what you meant before I do.
No, you don't have to rename the variables. There are a couple of ways to make shorthand variable lists, and I pointed out two of them that work with common names.
If all of the variables, or even groups of them, are in sequential columns you can use two dashes between two variable names such as:
variablename -- thename10columnslater
which would use all the variables in those columns. If there aren't common names or a sequential order, you can list the variables by name on the VAR statement in Proc Transpose. You can mix and match these lists if your names lend themselves to that. Suppose the first 8 of your DX variables are actually named DX1 to DX8, then you have several more differently named variables that are in sequential columns, then another block of variables with sequential names that start with DDX (for some esoteric medical reason), then a few isolated variables. You could use a list like:
Var Dx1-Dx8 Blockvar -- EndofSequential DDX: thisvar thatvar anothervar;
Caution: if using a DX1-DXn style list, then all of 1, 2, 3, ..., n have to be present or it will be an error. If there are gaps in the sequence, make sublists, e.g. Dx1-Dx5 Dx7-Dx15 when you don't have a Dx6.
Hint for future questions: if your actual variables do not have "nice" names like DX1 through DX80, then in your example data use names that actually vary a bit, so you don't have to deal with the shortcuts that those of us who get to build our own data rely on, such as similar variables named in sequence.
@s_lassen wrote:
You should start by identifying the intervals of consecutive dates:
data intervals;
set have;
by id type notsorted;
interval_no+first.type+(dif(begin_date) ne 1);
run;
The INTERVAL_NO values are not consecutive this way, but who cares? It is the simplest way to do it. We get a new interval at the right places.
Just change the SUM statement:
interval_no+(first.type or (dif(begin_date) ne 1));
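To see the difference between the two SUM statements, here is an illustrative Python sketch (names are made up): adding the two flags can bump the counter by 2 on one row, while OR-ing them bumps it by at most 1, so the numbers come out consecutive:

```python
def interval_numbers(flags, consecutive):
    """flags: list of (first_of_group, gap_not_one) booleans per row."""
    n, out = 0, []
    for first, gap in flags:
        if consecutive:
            n += int(first or gap)      # interval_no + (first.type or dif ne 1)
        else:
            n += int(first) + int(gap)  # interval_no + first.type + (dif ne 1)
        out.append(n)
    return out

flags = [(True, True), (False, False), (False, True), (True, True)]
print(interval_numbers(flags, consecutive=False))  # [2, 2, 3, 5]
print(interval_numbers(flags, consecutive=True))   # [1, 1, 2, 3]
```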
Hi @twerwath I kinda wrote this in a hurry and haven't tested it extensively, as I'm afraid I need to run to catch the bus home. This should get close; however, I would ask you to test it and let me know of any shortcomings. I will try to fix it tomorrow if I get time.
data have;
input id type $ begin_date :date9. end_date (dx1 dx2) ($);
format begin_date date9.;
datalines;
1 A 28JAN2019 . a b
1 A 29JAN2019 . a b
1 A 30JAN2019 . a a
1 A 31JAN2019 . b b
1 B 2FEB2019 . a b
2 B 29JAN2019 . b c
2 A 1FEB2019 . b d
2 A 2FEB2019 . b d
2 A 3FEB2019 . a b
2 A 4FEB2019 . a d
2 A 5FEB2019 . b b
;
run;
data want;
  if _n_=1 then do;
    dcl hash h(dataset:'have(obs=0)');
    h.definekey('id','type');
    h.definedata(all:'y');
    h.definedone();
  end;
  do until(last.type);
    set have curobs=k;
    by id type notsorted;
    call missing(end_date);
    if n then do;
      if begin_date=n+1 then do;
        n=begin_date;
        h.find();
        end_date=n;
        if last.type then do; output; h.clear(); n=.; end;
        else h.replace();
      end;
      else do;
        h.find();
        output;
        set have point=k;
        h.replace();
        n=begin_date;
      end;
    end;
    else do;
      n=begin_date;
      if last.type then do; output; h.clear(); n=.; end;
      else h.replace();
    end;
  end;
  h.clear();
  format end_date date9.;
  drop n;
run;
proc print noobs;run;
id | type | begin_date | end_date | dx1 | dx2 |
---|---|---|---|---|---|
1 | A | 28JAN2019 | 31JAN2019 | a | b |
1 | B | 02FEB2019 | . | a | b |
2 | B | 29JAN2019 | . | b | c |
2 | A | 01FEB2019 | 05FEB2019 | b | d |
Thank you. This seems to be on the right track but did not work as intended: if there was a string of non-consecutive days for type "A" records, instead of collapsing them it deleted them, while leaving other non-consecutive records intact (as it should).
Hi @twerwath Could you please post a more comprehensive sample and expected results when you have a moment? I could modify the code accordingly.
Oh, also: you want to collapse only type A. Jeez, I overlooked that line.
Hi @twerwath I think I have understood the issue this time around and have made the necessary fix. Please try now-
data have;
input id type $ begin_date :date9. end_date (dx1 dx2) ($);
format begin_date date9.;
datalines;
1 A 26jan2019 . e h
1 A 28JAN2019 . a b
1 A 29JAN2019 . a b
1 A 30JAN2019 . a a
1 A 31JAN2019 . b b
1 A 05feb2019 . q f
1 A 18feb2019 . a b
1 A 19feb2019 . a b
1 A 20feb2019 . a a
1 A 21feb2019 . b b
1 A 24feb2019 . d g
1 A 27feb2019 . e t
1 B 2FEB2019 . a b
2 B 29JAN2019 . b c
2 A 1FEB2019 . b d
2 A 2FEB2019 . b d
2 A 3FEB2019 . a b
2 A 4FEB2019 . a d
2 A 5FEB2019 . b b
2 A 8feb2019 . q j
;
run;
data want;
  if _n_=1 then do;
    dcl hash h(dataset:'have(obs=0)');
    h.definekey('id','type');
    h.definedata(all:'y');
    h.definedone();
  end;
  do until(last.type);
    set have curobs=k;
    by id type notsorted;
    call missing(end_date);
    if n then do;
      if begin_date=n+1 then do;
        n=begin_date;
        h.find();
        end_date=n;
        if last.type then do; output; h.clear(); n=.; end;
        else h.replace();
      end;
      else do;
        h.find();
        output;
        set have point=k;
        n=begin_date;
        if last.type then do; output; h.clear(); n=.; end;
        h.replace();
      end;
    end;
    else do;
      n=begin_date;
      if last.type then do; output; h.clear(); n=.; end;
      else h.replace();
    end;
  end;
  h.clear();
  format end_date date9.;
  drop n;
run;
proc print noobs;run;
id | type | begin_date | end_date | dx1 | dx2 |
---|---|---|---|---|---|
1 | A | 26JAN2019 | . | e | h |
1 | A | 28JAN2019 | 31JAN2019 | a | b |
1 | A | 05FEB2019 | . | q | f |
1 | A | 18FEB2019 | 21FEB2019 | a | b |
1 | A | 24FEB2019 | . | d | g |
1 | A | 27FEB2019 | . | e | t |
1 | B | 02FEB2019 | . | a | b |
2 | B | 29JAN2019 | . | b | c |
2 | A | 01FEB2019 | 05FEB2019 | b | d |
2 | A | 08FEB2019 | . | q | j |