First, I want to say what a great resource this forum has been for me while dealing with some very (extremely!) messy data. Thank you all for your help. I have one more issue I would like to see if I can find an answer to.
I have records containing an ID, a type, a begin date, and various other indicators, sorted by ID and begin date. For each ID, I would like to collapse runs of consecutive days in type 'A' into one row: keep the most common value of the other indicators (dx1, dx2), move the last begin date of the run into an end_date column, and leave type 'B' rows intact. See below:
data have;
input id type $ begin_date :date9. end_date dx1 $ dx2 $;
format begin_date date9.;
datalines;
1 A 28JAN2019 . a b
1 A 29JAN2019 . a b
1 A 30JAN2019 . a a
1 A 31JAN2019 . b b
1 B 2FEB2019 . a b
2 B 29JAN2019 . b c
2 A 1FEB2019 . b d
2 A 2FEB2019 . b d
2 A 3FEB2019 . a b
2 A 4FEB2019 . a d
2 A 5FEB2019 . b b
;
run;
data want;
input id type $ begin_date :date9. end_date :date9. dx1 $ dx2 $;
format begin_date end_date date9.;
datalines;
1 A 28JAN2019 31JAN2019 a b
1 B 2FEB2019 . a b
2 B 29JAN2019 . b c
2 A 1FEB2019 5FEB2019 b d
;
run;
I’m not sure where even to begin with this one, so any help that can be provided will be much appreciated!
Hi @twerwath
Two questions:
1. Are the possible values of dx1 and dx2 from a finite list? That would make the solution simpler.
2. When two values of dx1 or dx2 have the same count, which value should be chosen for the output?
1) No, unfortunately they are character variables whose values are not from a finite list.
2) If there are two values with the same count, there is no preference - it could be either one or the other.
Hi @twerwath
It takes a number of steps, and something might be done smarter, but I think this works:
data have;
infile datalines truncover;
input id type $ begin_date :date9. end_date dx1 $ dx2 $;
format begin_date date9.;
datalines;
1 A 28JAN2019 . a b
1 A 29JAN2019 . a b
1 A 30JAN2019 . a a
1 A 31JAN2019 . b b
1 B 2FEB2019 . a b
2 B 29JAN2019 . b c
2 A 1FEB2019 . b d
2 A 2FEB2019 . b d
2 A 3FEB2019 . a b
2 A 4FEB2019 . a d
2 A 5FEB2019 . b b
;
run;
* Sort values;
proc sort data=have;
by id type begin_date;
run;
* Mark groups to collapse with an ID;
data w1 (drop=tmp);
set have;
by id type;
retain grpid 0;
tmp = dif(begin_date);
if first.type then grpid = grpid + 1;
if not first.type then do;
if tmp > 1 then grpid = grpid + 1;
end;
run;
* Collapse groups;
data w2 (drop=xdate dx1 dx2);
set w1;
by grpid;
retain xdate;
format xdate end_date date9.;
if first.grpid then xdate = begin_date;
if last.grpid then do;
if xdate ne begin_date then end_date = begin_date;
begin_date = xdate;
output;
end;
run;
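A rough Python analogue of this collapse step, purely for illustration (names are made up; the SAS step above is the actual solution): group rows by grpid, keep the first date as begin_date, and set end_date only when a group spans more than one row.

```python
from datetime import date
from itertools import groupby

def collapse(rows):
    """rows: (grpid, begin_date) pairs, sorted by grpid.
    Keep the first date of each group as begin_date; if the group has
    more than one row, its last date becomes end_date (else None),
    mirroring the xdate logic in the data step above."""
    out = []
    for grpid, grp in groupby(rows, key=lambda r: r[0]):
        dates = [d for _, d in grp]
        end = dates[-1] if len(dates) > 1 else None
        out.append((grpid, dates[0], end))
    return out

rows = [(1, date(2019, 1, 28)), (1, date(2019, 1, 29)),
        (1, date(2019, 1, 30)), (2, date(2019, 2, 2))]
print(collapse(rows))  # grpid 1 spans 28-30 Jan; grpid 2 is a single day, end stays None
```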
* Get value-count of dx1;
proc sql;
create table dx1 as
select distinct grpid, dx1, count(*) as number
from w1
group by grpid, dx1
order by grpid, number desc;
quit;
data dx1s;
set dx1;
by grpid;
if first.grpid;
run;
* Get value-count of dx2;
proc sql;
create table dx2 as
select distinct grpid, dx2, count(*) as number
from w1
group by grpid, dx2
order by grpid, number desc;
quit;
data dx2s;
set dx2;
by grpid;
if first.grpid;
run;
* Merge it all together;
data want (drop=grpid);
merge w2 dx1s(drop=number) dx2s(drop=number);
by grpid;
run;
That's not so easy. Given these data
data have;
input id type $ begin_date: date9. dx1 $ dx2 $;
format begin_date date9.;
datalines;
1 A 28JAN2019 a b
1 A 29JAN2019 a b
1 A 30JAN2019 a a
1 A 31JAN2019 b b
1 B 2FEB2019 a b
2 B 29JAN2019 b c
2 A 1FEB2019 b d
2 A 2FEB2019 b d
2 A 3FEB2019 a b
2 A 4FEB2019 a d
2 A 5FEB2019 b b
;run;
(I dropped the end_date as it is always missing, and made the rest actually work)
You should start by identifying the intervals of consecutive dates:
data intervals;
set have;
by id type notsorted;
interval_no+first.type+(dif(begin_date) ne 1);
run;
The INTERVAL_NO values are not consecutive this way, but who cares? It is the simplest way to do it. We get a new interval at the right places.
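For illustration only, the same counter logic can be sketched in Python (function and variable names are made up; the SAS step above is the real solution):

```python
from datetime import date, timedelta

def number_intervals(rows):
    """Assign an interval number that bumps at each new (id, type) group
    and whenever the gap to the previous date is not exactly one day.
    Mirrors: interval_no + first.type + (dif(begin_date) ne 1)."""
    interval_no = 0
    prev_key, prev_date = None, None
    out = []
    for key, d in rows:  # rows sorted like the SAS BY groups
        first_of_group = key != prev_key
        gap_not_one = prev_date is None or (d - prev_date) != timedelta(days=1)
        # like the SAS sum statement, both conditions can each add 1,
        # so the numbers need not be consecutive -- but they are distinct
        interval_no += int(first_of_group) + int(gap_not_one)
        out.append(interval_no)
        prev_key, prev_date = key, d
    return out

rows = [(("1", "A"), date(2019, 1, 28)),
        (("1", "A"), date(2019, 1, 29)),
        (("1", "A"), date(2019, 1, 31)),  # gap -> new interval
        (("1", "B"), date(2019, 2, 2))]
print(number_intervals(rows))  # -> [2, 2, 3, 5]
```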
Now we have the consecutive dates identified. Next, count the occurrences of each DX1 and DX2 value:
proc sql;
create table count_dx1 as
select interval_no,dx1,count(*) as n
from intervals
group by interval_no,dx1
order by 1,3 desc,2;
create table count_dx2 as
select interval_no,dx2,count(*) as n
from intervals
group by interval_no,dx2
order by 1,3 desc,2;
quit;
Now we have the data points with the most common first (and if there are ties, we take the lowest value). These are the values we want:
data values;
merge count_dx1 count_dx2;
by interval_no;
if first.interval_no;
drop n;
run;
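The selection rule (most frequent value, ties broken by the lowest value) can be sketched in Python, purely as an illustration of what the ORDER BY above achieves:

```python
from collections import Counter

def most_common_value(values):
    """Return the most frequent value; on ties, the lowest value wins,
    matching ORDER BY interval_no, n DESC, dx1."""
    counts = Counter(values)
    # sort by descending count, then ascending value, and take the first
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[0][0]

print(most_common_value(["a", "a", "a", "b"]))  # 'a' (clear majority)
print(most_common_value(["b", "b", "a", "a"]))  # 'a' (tie -> lowest value)
```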
And then for the grand finale:
proc sql;
create table want as select distinct
intervals.id,
intervals.type,
min(intervals.begin_date) as begin_date format=date9.,
case
when count(*)>1 then max(begin_date)
else .
end as end_date format=date9.,
values.dx1,
values.dx2
from intervals join values
on intervals.interval_no=values.interval_no
group by intervals.interval_no
order by 1,3
;
quit;
Which gives you the data you want.
Hi @s_lassen
Thank you for this. I provided two DX variables for simplicity here, but in reality the dataset has around 80 of these DX indicators. So I can see that this code could technically still work, but it would have to be expanded about 40 times over. I can try to reduce it down to around 10 essential DX variables and test it on the real data to see if it works.
@twerwath wrote:
Hi @s_lassen
Thank you for this. I provided two DX variables for simplicity here, but in reality the dataset has around 80 of these DX indicators. So I can see that this code could technically still work, but it would have to be expanded about 40 times over. I can try to reduce it down to around 10 essential DX variables and test it on the real data to see if it works.
So how many "most common" values of DX do you expect?
What should happen if there are ties in the count at the smallest level? For example, suppose you have tests with counts like:
a = 8
b = 6
c = 4
d = 4
for a given id/type/date range and want the "3 most common". Since c and d have the same count, you would need to supply a tie breaker.
This shows a slightly more complicated data set, i.e. it has breaks in dates for an ID and type, and shows a way to get down to a count of tests across multiple DX variables for date ranges.
data have;
input id type $ date :date9. dx1 $ dx2 $ dx3 $ dx4 $;
format date date9.;
datalines;
1 A 28JAN2019 a b d a
1 A 29JAN2019 a b a d
1 A 30JAN2019 a a e c
1 A 31JAN2019 b b a e
1 B 2FEB2019 a b a c
2 B 29JAN2019 b c a c
2 A 1FEB2019 b d a c
2 A 2FEB2019 b d a c
2 A 3FEB2019 a b c c
2 A 4FEB2019 a d c c
2 A 5FEB2019 b b c c
2 B 29MAR2019 a c a c
2 B 30MAR2019 d d a d
;
run;

proc sort data=have;
by id type date;
run;

data daterange;
set have;
by id type date;
retain begin_date end_date;
format begin_date end_date date9.;
datedif = dif(date);
if first.type then do;
   begin_date=date;
   end_date=date;
end;
else if datedif=1 then do; /* consecutive */
   end_date=date;
end;
else if datedif > 1 then do; /* no longer consecutive */
   output;
   begin_date=date;
   end_date=date;
end;
if last.type then do;
   end_date=date;
   output;
end;
keep id type begin_date end_date;
run;

proc transpose data=have out=trans;
by id type date;
/* below will use all variables whose names start with DX.
   If you have other names, but your DX variables are consecutively
   numbered, use Dx1 - Dx80 or whatever instead of DX: */
var dx:;
run;

/* combine daterange info with single test data and get the counts,
   most common first for each id type date range */
proc sql;
create table temp as
   select a.id, a.type, a.col1, count(*) as dxcount,
          b.begin_date, b.end_date
   from trans as a
        left join daterange as b
        on a.id=b.id and a.type=b.type
           and b.begin_date le a.date le b.end_date
   group by a.id, a.type, b.begin_date, b.end_date, a.col1
   order by a.id, a.type, b.begin_date, b.end_date, dxcount desc
;
quit;
The above will have the count of each dx value, in descending frequency, within each id, type and date range.
It will be pretty easy to select x numbers of values and transpose them back to columns to show the x most common for the date range. But you may want to consider how you are using that information. Quite often the long version lends itself better to many sorts of analysis.
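Picking the x most common values with a value-based tie breaker (as in the a=8, b=6, c=4, d=4 example above) could be sketched in Python, purely as an illustration (the function name is made up):

```python
from collections import Counter

def top_k(values, k):
    """Most common k values; ties broken by value so results are stable."""
    counts = Counter(values)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [v for v, _ in ranked[:k]]

# counts: a=8, b=6, c=4, d=4 -- c beats d on the value tie breaker
print(top_k(list("aaaaaaaabbbbbbccccdddd"), 3))  # ['a', 'b', 'c']
```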
I'm pretty sure there is more about the Type B involvement not stated.
Hi @ballardw
In the vast majority of cases, the values of the dx variables will be 99% the same across rows of consecutive days. These are basically daily records of patients' hospital stays, and we want each hospital stay to be one row instead of many. So while some information will be lost, that loss will generally be minimal, and this format will be much more appropriate for our analyses. I doubt there would be many ties, but if so it does not matter which most common value is chosen, since they are likely similar anyway.
Just to clarify: if some of these "other" variables are not named dx, would I need to rename them all to dx1-dx80 sequentially in order for this to run? That would not be difficult to do (and then change the variable names back when done), but I just wanted to make sure this is what you meant before I do.
@twerwath wrote:
Hi @ballardw
In the vast majority of cases, the values of the dx variables will be 99% the same across rows of consecutive days. These are basically daily records of patients' hospital stays, and we want each hospital stay to be one row instead of many. So while some information will be lost, that loss will generally be minimal, and this format will be much more appropriate for our analyses. I doubt there would be many ties, but if so it does not matter which most common value is chosen, since they are likely similar anyway.
Just to clarify: if some of these "other" variables are not named dx, would I need to rename them all to dx1-dx80 sequentially in order for this to run? That would not be difficult to do (and then change the variable names back when done), but I just wanted to make sure this is what you meant before I do.
No, you don't have to rename the variables. There are a couple of ways to make shorthand variable lists, and I pointed out two of them that work with common names.
If all of the variables, or even groups of them, are in sequential columns you can use two dashes between two variable names such as:
variablename -- thename10columnslater
which would use all the variables in those columns. If there aren't common names or a sequential order, you can list the variables by name on the VAR statement in Proc Transpose. You can mix and match these lists if your names lend themselves to that. Suppose the first 8 of your DX variables are actually named DX1 to DX8, then you have several more differently named variables that are in sequential columns, then another block of variables with sequential names that start with DDX (for some esoteric medical reason), then a few isolated variables. You could use a list like:
Var Dx1-Dx8 Blockvar -- EndofSequential DDX: thisvar thatvar anothervar;
Caution: if using a DX1-DXn style list, then all of 1, 2, 3, ..., n have to be present or it will be an error. If there are gaps in the sequence, make sublists, e.g. Dx1-Dx5 Dx7-Dx15 when you don't have a Dx6.
Hint for future questions: if your actual variables do not have "nice" names like DX1 through DX80, then in your example data use names that actually vary a bit, so you don't have to deal with the shortcuts that those of us who get to build our own data rely on, such as similar variables named in sequence.
@s_lassen wrote:
You should start by identifying the intervals of consecutive dates:
data intervals;
set have;
by id type notsorted;
interval_no+first.type+(dif(begin_date) ne 1);
run;
The INTERVAL_NO values are not consecutive this way, but who cares? It is the simplest way to do it. We get a new interval at the right places.
Just change the SUM statement:
interval_no+(first.type or (dif(begin_date) ne 1));
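To see the difference between the two SUM statements, here is an illustrative Python sketch (names are made up): adding the two flags can bump the counter by 2 on one row, while OR-ing them bumps it by at most 1, so the numbers come out consecutive:

```python
def interval_numbers(flags, consecutive):
    """flags: list of (first_of_group, gap_not_one) booleans per row."""
    n, out = 0, []
    for first, gap in flags:
        if consecutive:
            n += int(first or gap)      # interval_no + (first.type or dif ne 1)
        else:
            n += int(first) + int(gap)  # interval_no + first.type + (dif ne 1)
        out.append(n)
    return out

flags = [(True, True), (False, False), (False, True), (True, True)]
print(interval_numbers(flags, consecutive=False))  # [2, 2, 3, 5]
print(interval_numbers(flags, consecutive=True))   # [1, 1, 2, 3]
```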
Hi @twerwath I kinda wrote this in a hurry and haven't tested it extensively, as I'm afraid I need to run to catch the bus home. This should get close; however, I would ask you to test it and let me know of any shortcomings. I will try to fix it tomorrow if I get time.
data have;
input id type $ begin_date :date9. end_date (dx1 dx2) ($);
format begin_date date9.;
datalines;
1 A 28JAN2019 . a b
1 A 29JAN2019 . a b
1 A 30JAN2019 . a a
1 A 31JAN2019 . b b
1 B 2FEB2019 . a b
2 B 29JAN2019 . b c
2 A 1FEB2019 . b d
2 A 2FEB2019 . b d
2 A 3FEB2019 . a b
2 A 4FEB2019 . a d
2 A 5FEB2019 . b b
;
run;
data want;
  if _n_=1 then do;
    dcl hash h(dataset:'have(obs=0)');
    h.definekey('id','type');
    h.definedata(all:'y');
    h.definedone();
  end;
  do until(last.type);
    set have curobs=k;
    by id type notsorted;
    call missing(end_date);
    if n then do;
      if begin_date=n+1 then do;
        n=begin_date;
        h.find();
        end_date=n;
        if last.type then do; output; h.clear(); n=.; end;
        else h.replace();
      end;
      else do;
        h.find();
        output;
        set have point=k;
        h.replace();
        n=begin_date;
      end;
    end;
    else do;
      n=begin_date;
      if last.type then do; output; h.clear(); n=.; end;
      else h.replace();
    end;
  end;
  h.clear();
  format end_date date9.;
  drop n;
run;
proc print noobs;run;
id | type | begin_date | end_date | dx1 | dx2 |
---|---|---|---|---|---|
1 | A | 28JAN2019 | 31JAN2019 | a | b |
1 | B | 02FEB2019 | . | a | b |
2 | B | 29JAN2019 | . | b | c |
2 | A | 01FEB2019 | 05FEB2019 | b | d |
Thank you. This seems to be on the right track but did not work as intended: if there was a string of non-consecutive days for type "A" records, instead of collapsing them it deleted them, while leaving other non-consecutive records intact (as it should).
Hi @twerwath Could you please post a more comprehensive sample and expected results when you have a moment? I could modify the code accordingly.
Oh, also: you want to collapse only type A. Jeez, I overlooked that line.
Hi @twerwath I think I have understood the issue this time around and have made the necessary fix. Please try now-
data have;
input id type $ begin_date :date9. end_date (dx1 dx2) ($);
format begin_date date9.;
datalines;
1 A 26jan2019 . e h
1 A 28JAN2019 . a b
1 A 29JAN2019 . a b
1 A 30JAN2019 . a a
1 A 31JAN2019 . b b
1 A 05feb2019 . q f
1 A 18feb2019 . a b
1 A 19feb2019 . a b
1 A 20feb2019 . a a
1 A 21feb2019 . b b
1 A 24feb2019 . d g
1 A 27feb2019 . e t
1 B 2FEB2019 . a b
2 B 29JAN2019 . b c
2 A 1FEB2019 . b d
2 A 2FEB2019 . b d
2 A 3FEB2019 . a b
2 A 4FEB2019 . a d
2 A 5FEB2019 . b b
2 A 8feb2019 . q j
;
run;
data want;
  if _n_=1 then do;
    dcl hash h(dataset:'have(obs=0)');
    h.definekey('id','type');
    h.definedata(all:'y');
    h.definedone();
  end;
  do until(last.type);
    set have curobs=k;
    by id type notsorted;
    call missing(end_date);
    if n then do;
      if begin_date=n+1 then do;
        n=begin_date;
        h.find();
        end_date=n;
        if last.type then do; output; h.clear(); n=.; end;
        else h.replace();
      end;
      else do;
        h.find();
        output;
        set have point=k;
        n=begin_date;
        if last.type then do; output; h.clear(); n=.; end;
        h.replace();
      end;
    end;
    else do;
      n=begin_date;
      if last.type then do; output; h.clear(); n=.; end;
      else h.replace();
    end;
  end;
  h.clear();
  format end_date date9.;
  drop n;
run;
proc print noobs;run;
id | type | begin_date | end_date | dx1 | dx2 |
---|---|---|---|---|---|
1 | A | 26JAN2019 | . | e | h |
1 | A | 28JAN2019 | 31JAN2019 | a | b |
1 | A | 05FEB2019 | . | q | f |
1 | A | 18FEB2019 | 21FEB2019 | a | b |
1 | A | 24FEB2019 | . | d | g |
1 | A | 27FEB2019 | . | e | t |
1 | B | 02FEB2019 | . | a | b |
2 | B | 29JAN2019 | . | b | c |
2 | A | 01FEB2019 | 05FEB2019 | b | d |
2 | A | 08FEB2019 | . | q | j |