Hi,
I'm working with health care utilization data for 2011 to 2015. I need to identify all patients with records for 5 or more ER visits during any 12 month period of the study.
The table I'm using has 1 observation per visit (repeated measures data) so a patient could have multiple observations. I generated this table by selecting only those observations where SERVICE_TYPE = 'ER' and kept the following variables:
PATID - patient ID
APPT_DATE - date of medical event (SAS date)
SERVICE_TYPE - type of medical care (all values = 'ER'; variable retained to check)
PCMID - unique medical provider ID
The table is sorted by PATSSN and APPT_DATE.
Sample data:
PATID APPT_DATE SERVICE_TYPE PCMID
12345 11/19/2013 ER A1B2
12345 03/25/2014 ER S8U7
12345 05/15/2014 ER H5J8
12345 08/14/2014 ER A1B2
12345 11/06/2014 ER D4D5
23456 04/15/2011 ER H5J8
34567 02/11/2012 ER K9L6
34567 04/15/2012 ER B5D4
34567 08/26/2012 ER X4P8
34567 01/14/2013 ER J8D6
34567 02/27/2013 ER G7H5
From this data, PATID = 12345 is the only patient that should be identified as meeting the criteria of having 5 or more ER visits within any 12 month period.
I've been trying to modify PROC SQL syntax from a colleague - - but since her criteria was any 2 visits within a rolling 12 month period, I can't figure out how to make it work.
Any suggestions for the best way to approach this problem?
Thanks!
Assuming your dataset is not big. Pick up the obs whose COUNT is greater than 5 .
data have;
input PATID APPT_DATE : mmddyy10. SERVICE_TYPE $ PCMID $;
format APPT_DATE mmddyy10.;
cards;
12345 11/19/2013 ER A1B2
12345 03/25/2014 ER S8U7
12345 05/15/2014 ER H5J8
12345 08/14/2014 ER A1B2
12345 11/06/2014 ER D4D5
23456 04/15/2011 ER H5J8
34567 02/11/2012 ER K9L6
34567 04/15/2012 ER B5D4
34567 08/26/2012 ER X4P8
34567 01/14/2013 ER J8D6
34567 02/27/2013 ER G7H5
;
run;
proc sql;
create table want as
select *,(select count(*) from have
where PATID=a.PATID and APPT_DATE between intnx('year',a.APPT_DATE,-1,'s') and a.APPT_DATE)
as count
from have as a;
quit;
Hi Ksharp,
I think addition of one more line to your fantastic code will give a better summarized picture:
proc sql;
create table want as
select *,(select count(*) from have
where PATID=a.PATID and APPT_DATE between intnx('year',a.APPT_DATE,-1,'s') and a.APPT_DATE)
as count
from have as a
where calculated count ge 5;
quit;
Regards,
Ksharp,
Thank you for the reply . . . and I apologize for the very long delay in responding.
I'm working with a really large dataset so running your suggested syntax does take a long time.
I do have a question.
I ran the following syntax on the dataset after I de-duplicated by PATID and APPT_DATE. Some patients had multiple ER visits on the same day, but I wanted only those patients with 5+ ER visits on unique days.
proc sql;
create table er_ge5 as
select *,(select count(*) from er_patients
where PATID=a.PATID and APPT_DATE between intnx('year',a.APPT_DATE,-1,'s') and a.APPT_DATE)
as count
from er_patients as a
where calculated count ge 5;
quit;
I then sorted the new table (er_ge5) by PATID and appt_date and confirmed that count = 5 for the 1st event for each patient.
I've provided output for 1 patient. Everything looks fine until the row with count '16' - there are multiple rows with identical counts, but these are records for the same patient.
PATID | APPT_DATE | COUNT |
4444444444 | 12/20/2012 | 5 |
4444444444 | 12/26/2012 | 6 |
4444444444 | 12/27/2012 | 7 |
4444444444 | 1/6/2013 | 8 |
4444444444 | 1/8/2013 | 9 |
4444444444 | 1/12/2013 | 10 |
4444444444 | 1/20/2013 | 11 |
4444444444 | 2/10/2013 | 12 |
4444444444 | 2/23/2013 | 13 |
4444444444 | 3/11/2013 | 14 |
4444444444 | 5/18/2013 | 15 |
4444444444 | 5/30/2013 | 16 |
4444444444 | 8/21/2013 | 16 |
4444444444 | 11/6/2013 | 16 |
4444444444 | 11/17/2013 | 17 |
4444444444 | 12/23/2013 | 15 |
4444444444 | 12/28/2013 | 14 |
4444444444 | 12/30/2013 | 15 |
4444444444 | 1/15/2014 | 13 |
4444444444 | 1/16/2014 | 14 |
4444444444 | 1/17/2014 | 15 |
4444444444 | 1/21/2014 | 15 |
4444444444 | 2/5/2014 | 16 |
4444444444 | 2/7/2014 | 17 |
4444444444 | 2/8/2014 | 18 |
4444444444 | 3/2/2014 | 17 |
4444444444 | 3/5/2014 | 18 |
4444444444 | 3/9/2014 | 19 |
4444444444 | 3/10/2014 | 20 |
4444444444 | 3/23/2014 | 20 |
4444444444 | 4/1/2014 | 21 |
4444444444 | 4/3/2014 | 22 |
4444444444 | 4/16/2014 | 23 |
4444444444 | 5/6/2014 | 24 |
4444444444 | 5/17/2014 | 25 |
4444444444 | 5/22/2014 | 25 |
4444444444 | 6/5/2014 | 25 |
4444444444 | 6/7/2014 | 26 |
I'm not sure why the count variable seems to be having problems. Should I be concerned?
Thanks!
"
Everything looks fine until the row with count '16' - there are multiple rows with identical counts, but these are records for the same patient.
"
I don't know what you mean.
Where is wrong ? Better post some data to demonstrate it .
Do you want unique COUNT for each ID ? That is easy by PROC SORT + NODUPKEY.
If your table is big ,try ARRAY or HashTable:
data have;
input PATID APPT_DATE : mmddyy10. SERVICE_TYPE $ PCMID $;
format APPT_DATE mmddyy10.;
cards;
12345 11/19/2013 ER A1B2
12345 03/25/2014 ER S8U7
12345 05/15/2014 ER H5J8
12345 08/14/2014 ER A1B2
12345 11/06/2014 ER D4D5
23456 04/15/2011 ER H5J8
34567 02/11/2012 ER K9L6
34567 04/15/2012 ER B5D4
34567 08/26/2012 ER X4P8
34567 01/14/2013 ER J8D6
34567 02/27/2013 ER G7H5
;
run;
data want;
array x{99999} _temporary_;
call missing(of x{*});
do i=1 by 1 until(last.PATID);
set have;
by PATID;
x{i}=APPT_DATE;
end;
do until(last.PATID);
set have;
by PATID;
n=0;
do i=intnx('year',APPT_DATE,-1,'s') to APPT_DATE;
if i in x then n+1;
end;
if n ge 5 then output;
end;
drop i;
run;
Fairly efficient way to do this with an array:
data have;
input PATID APPT_DATE : mmddyy10. SERVICE_TYPE $ PCMID $;
format APPT_DATE mmddyy10.;
datalines;
12345 11/19/2013 ER A1B2
12345 03/25/2014 ER S8U7
12345 05/15/2014 ER H5J8
12345 08/14/2014 ER A1B2
12345 11/06/2014 ER D4D5
23456 04/15/2011 ER H5J8
34567 02/11/2012 ER K9L6
34567 04/15/2012 ER B5D4
34567 08/26/2012 ER X4P8
34567 01/14/2013 ER J8D6
34567 02/27/2013 ER G7H5
;
data want;
array d{0:4} _temporary_;
call missing(of d{*});
do i = 1 by 1 until(last.patid);
set have; by patid;
d{mod(i,5)} = appt_date;
if i >=5 then
/* Note d{mod(i+1,5)} is the oldest date in array d */
if intck("YEAR", d{mod(i+1,5)}, appt_date, "CONTINUOUS") < 1 then output;
end;
drop i;
run;
/* Optional : remove multiple entries for same patient */
proc sort data=want nodupkey; by patid; run;
PG Stats,
Thank you so much for the suggested syntax.
This syntax runs much faster on my dataset then K Sharp's PROC SQL syntax, but there seems to be a problem.
I deduplicated my data set by PATID and APPT_DATE (so that each ER visit has a unique APPT_DATE). I then ran your suggested syntax (although I dropped the line to delete 'i'):
data er_patients_count;
array d{0:4} _temporary_;
call missing(of d{*});
do i = 1 by 1 until(last.PATID);
set er_patients; by PATID;
d{mod(i,5)} = APPT_DATE;
if i >=5 then
/* Note d{mod(i+1,5)} is the oldest date in array d */
if intck("YEAR", d{mod(i+1,5)}, APPT_DATE, "CONTINUOUS") < 1 then output;
end;
run;
The biggest concern is that the variable 'i' is not equal to 5 for each patient's 1st appt_date.
I sorted the new table by PATID and APPT_DATE, and created a new variable FIRST_ALL = FIRST.PATID. The syntax and frequency output for my table (N=2,086) follow.
proc freq data = er_patients_count;
where first_all = 1;
tables i;
run;
SAS Output
i | Frequency | Percent | Cumulative | Cumulative |
Frequency | Percent | |||
5 | 1152 | 55.23 | 1152 | 55.23 |
6 | 324 | 15.53 | 1476 | 70.76 |
7 | 205 | 9.83 | 1681 | 80.58 |
8 | 150 | 7.19 | 1831 | 87.78 |
9 | 126 | 6.04 | 1957 | 93.82 |
10 | 65 | 3.12 | 2022 | 96.93 |
11 | 31 | 1.49 | 2053 | 98.42 |
12 | 15 | 0.72 | 2068 | 99.14 |
13 | 10 | 0.48 | 2078 | 99.62 |
14 | 5 | 0.24 | 2083 | 99.86 |
15 | 3 | 0.14 | 2086 | 100 |
Here's sample output for 4 different patients:
i | PATID | APPT_DATE |
8 | 4444444444 | 5/10/2013 |
9 | 4444444444 | 5/26/2013 |
10 | 4444444444 | 6/27/2013 |
11 | 4444444444 | 11/6/2013 |
12 | 4444444444 | 1/21/2014 |
13 | 4444444444 | 2/23/2014 |
15 | 4444444444 | 9/16/2014 |
16 | 4444444444 | 10/10/2014 |
17 | 4444444444 | 11/25/2014 |
18 | 4444444444 | 12/13/2014 |
19 | 4444444444 | 12/21/2014 |
20 | 4444444444 | 12/27/2014 |
5 | 5555555555 | 7/1/2014 |
6 | 5555555555 | 9/4/2014 |
10 | 6666666666 | 11/12/2014 |
9 | 7777777777 | 4/7/2014 |
10 | 7777777777 | 7/1/2014 |
11 | 7777777777 | 7/6/2014 |
12 | 7777777777 | 8/10/2014 |
13 | 7777777777 | 9/6/2014 |
14 | 7777777777 | 9/15/2014 |
15 | 7777777777 | 10/9/2014 |
16 | 7777777777 | 10/30/2014 |
17 | 7777777777 | 11/7/2014 |
18 | 7777777777 | 11/24/2014 |
19 | 7777777777 | 4/17/2015 |
20 | 7777777777 | 7/31/2015 |
Patient 4444444444 - 1st event i = 8, no i= 14
Patient 5555555555 - looks perfect
Patient 6666666666 - only 1 event, i = 10
Patient 7777777777 - 1st event i = 9, but everything else looks perfect
I'm not sure why there's a problem with counting the events.
Any suggestions?
Thanks so much!
The output from the data step is the last visit of a sequence of 5 visits that occured within one year. Every set, even if they overlap, is output that way. If a patient has 7 visits within a year, you will get visits 5 (representing the sequence 1-5), 6 (2-6) and 7 (3-7).
Please describe what you need as output,
PG Stats,
I'm trying to achieve 2 objectives:
1. Identify all patients with 5+ ER visits during any rolling 12-month period between 1 OCT 2013 and 31 OCT 2015.
To accomplish this objective, I need an output variable to indicate that a patient met this basic criteria and should therefore be included in the sample. I'm not concerned with the number of ER visits or anything else.
2. Once I've identified the patients in my sample, I will extract all of their ER records for a larger period of time (1 OCT 2010 to 1 JUN 2016). Some patients have a number of ER visits across multiple years that meet the criteria of 5+ ER visits in any 12-month rolling period when the anchor date can slide from visit 1 to visit 2 to visit 3 and so on.
For example:
Visit 1 (appt_date = 15 JAN 2012)
Visit 2 (appt_date = 1 MAR 2012)
Visit 3 (appt_date = 1 JUN 2012)
Visit 4 (appt_date = 1 AUG 2012)
Visit 5 (appt_date = 1 OCT 2012)
Visit 6 (appt_date = 1 JAN 2013)
Visit 7 (appt_date = 1 APR 2013)
Visit 8 (appt_date = 1 JUL 2013)
Visit 9 (appt_date = 1 SEP 2013)
Visit 10 (appt_date = 1 DEC 2013)
Visit 11 (appt_date = 1 MAR 2014)
Visit 12 (appt_date = 1 MAY 2014)
For this patient, I need to identify 2 dates:
1. The date of the 1st ER visit so I can create a new variable "START_DATE" = 15 JAN 2012
2. The date of the final ER visit so I can create a new variable "END_DATE" = 1 MAY 2014
These 2 new variables will be used to define the patient's period of high use of ER services.
To accomplish this objective, I need an output variable that would flag the 1st ER visit and the final ER visit for the series of qualifying ER visits. Or output the 1st and the final ER records to a new table.
So, given that your "ER records for a larger period of time" are in dataset largerPeriodAppt, you would be looking for something like this:
data fiveER;
array d{0:4} _temporary_;
call missing(of d{*});
do i = 1 by 1 until(last.patid);
set have; by patid;
d{mod(i,5)} = appt_date;
if i >=5 then
/* Note d{mod(i+1,5)} is the oldest date in array d */
first_of5_Appt_date = d{mod(i+1,5)};
if intck("YEAR", first_of5_Appt_date, appt_date, "CONTINUOUS") < 1 then output;
end;
format first_of5_Appt_date yymmdd10.;
drop i;
run;
proc sql;
create table fiveERcriteriaPatients as
select
PATID,
min(APPT_DATE) as START_DATE format=yymmdd10.,
max(APPT_DATE) as END_DATE format=yymmdd10.
from largerPeriodAppt
where PATID in (select PATID from fiveER)
group by PATID;
quit;
(untested)
PG Stats,
I ran your syntax as follows:
data fiveER;
retain patid appt_date first_of5_Appt_date i;
array d{0:4} _temporary_;
call missing(of d{*});
do i = 1 by 1 until(last.patid);
set ER_patients;
by patid;
d{mod(i,5)} = appt_date;
if i >=5 then
/* Note d{mod(i+1,5)} is the oldest date in array d */
first_of5_Appt_date = d{mod(i+1,5)};
if intck("YEAR", first_of5_Appt_date, appt_date, "CONTINUOUS") < 1 then output;
end;
format first_of5_Appt_date yymmdd10.;
run;
It looks like the syntax worked perfectly for patient 1. For each subsequent patient, it seems like the syntax failed to recognize last.PATID, so the value of APPT_DATE from the prior patient was retained.
Here's the output for patients 1, 2, 3, and 4:
PATID | APPT_DATE | FIRST_OF5_APPT_DATE | i |
1111111111 | 11/6/2013 | . | 1 |
1111111111 | 11/17/2013 | . | 2 |
1111111111 | 12/23/2013 | . | 3 |
1111111111 | 12/28/2013 | . | 4 |
1111111111 | 12/30/2013 | 11/6/2013 | 5 |
1111111111 | 1/15/2014 | 11/17/2013 | 6 |
1111111111 | 1/16/2014 | 12/23/2013 | 7 |
1111111111 | 2/5/2014 | 12/28/2013 | 8 |
1111111111 | 2/8/2014 | 12/30/2013 | 9 |
1111111111 | 3/2/2014 | 1/15/2014 | 10 |
1111111111 | 3/10/2014 | 1/16/2014 | 11 |
1111111111 | 4/16/2014 | 2/5/2014 | 12 |
1111111111 | 5/6/2014 | 2/8/2014 | 13 |
1111111111 | 12/9/2013 | 2/8/2014 | 1 |
1111111111 | 12/15/2013 | 2/8/2014 | 2 |
1111111111 | 12/26/2013 | 2/8/2014 | 3 |
1111111111 | 5/23/2014 | 2/8/2014 | 4 |
2222222222 | 11/6/2013 | 2/8/2014 | 1 |
2222222222 | 1/21/2014 | 2/8/2014 | 2 |
2222222222 | 2/23/2014 | 2/8/2014 | 3 |
2222222222 | 8/11/2014 | 2/8/2014 | 4 |
2222222222 | 9/16/2014 | 11/6/2013 | 5 |
2222222222 | 10/10/2014 | 1/21/2014 | 6 |
2222222222 | 11/25/2014 | 2/23/2014 | 7 |
2222222222 | 12/13/2014 | 8/11/2014 | 8 |
2222222222 | 12/27/2014 | 9/16/2014 | 9 |
3333333333 | 12/27/2013 | 9/16/2014 | 1 |
3333333333 | 8/28/2014 | 9/16/2014 | 2 |
3333333333 | 9/22/2014 | 9/16/2014 | 3 |
3333333333 | 11/16/2014 | 9/16/2014 | 4 |
4444444444 | 10/13/2013 | 9/16/2014 | 1 |
4444444444 | 11/11/2013 | 9/16/2014 | 2 |
4444444444 | 11/13/2013 | 9/16/2014 | 3 |
4444444444 | 4/6/2014 | 9/16/2014 | 4 |
I can't do any testing without data.
I've attached an Excel file with 134 records for approximately 20 patients. The only variables included are APPT_DATE and PATID.
Ah! The code was missing a do block. I also added a sorting step, just in case.
proc sort data=have; by patid appt_date; run;
data fiveER;
array d{0:4} _temporary_;
call missing(of d{*});
do i = 1 by 1 until(last.patid);
set have; by patid;
d{mod(i,5)} = appt_date;
if i >=5 then do;
/* Note d{mod(i+1,5)} is the oldest date in array d */
first_of5_Appt_date = d{mod(i+1,5)};
if intck("YEAR", first_of5_Appt_date, appt_date, "CONTINUOUS") < 1 then output;
end;
end;
format first_of5_Appt_date appt_date yymmdd10.;
drop i;
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.