I have several very large data files with administrative medical data. I need to identify patients who had 3 or more healthcare encounters during the year. The problem is that the data has a separate record for each cost incurred. For example, a 2-day inpatient hospital stay might have 20 records -- a record with a charge for medication, another record for IV bag, another record for the room charge, etc. But all 20 of these records have the same admission date. So those 20 records should be combined to represent 1 healthcare encounter for that patient.
Here is some sample data:
ID   | ADMSN_DT | CHARGECODE
1111 | 03012008 | 2345
1111 | 03012008 | 4567
1111 | 03012008 | 5678
1111 | 04012008 | 2345
1111 | 04012008 | 9876
1111 | 09012008 | 2345
2222 | 03012008 | 2345
2222 | 03012008 | 4567
2222 | 03012008 | 5678
2222 | 03012008 | 6789
2222 | 03012008 | 7890
2222 | 03012008 | 0123
2222 | 03012008 | 1212
2222 | 04012008 | 2345
2222 | 04012008 | 9876
The resulting data file should look like this:
ID ADMSN_DT
1111 03012008
1111 04012008
1111 09012008
2222 03012008
2222 04012008
As I stated, these are extremely large files (150GB+), so I'd really like to use Proc SQL for the sake of efficiency. I've tried several variations on the following syntax, but no luck:
Proc SQL;
Create Table mi.medrecords_deduped
as Select Distinct ID,
Count (ADMSN_DT) as ADMSNCount,
From mi.medrecords
Group by ID;
Quit;
Will this do? (NB: untested code.) Count the admissions in a subquery. If you have more than one year's data, you will need to extract the year in the subquery and include it in the WHERE conditions.
Proc SQL;
Create Table mi.medrecords_deduped as
Select Distinct a.ID
, a.ADMSN_DT
From mi.medrecords a
, (Select ID
, Count (Distinct ADMSN_DT) as Admissions
From mi.medrecords
) b
Where a.ID = b.ID
And Admissions >= 3
Order by a.ID
, a.ADMSN_DT
;
Quit;
Richard
Sorry I haven't gotten back to this... unfortunately the project has grown in scope: there are other files that have to be merged in, variable names have to be changed so they match across files, etc. So I had to rewrite what you wrote above:
Proc SQL;
Create Table mi.ipot2007_mi_Meas13_dedup as
Select Distinct a.MSIS_ID
, a.SRVC_BGN_DT
From mi.ipot2007_mi_Meas13_CUT a
, (Select MSIS_ID
, Count (Distinct SRVC_BGN_DT) as SRVC_BGNCount
From mi.ipot2007_mi_Meas13_CUT
) b
Where a.MSIS_ID = b.MSIS_ID
and SRVC_BGNCount >= 3
Order by a.MSIS_ID
, a.SRVC_BGN_DT;
Quit;
title1 'ipot2007_mi_Meas13_dedup';
proc freq;
tables MSIS_ID SRVC_BGN_DT /*SRVC_BGNCount*/;
run;
The problem is that SAS seems to ignore the subquery -- SRVC_BGNCount doesn't even exist in the resulting data file. Likewise, it also ignores the "and SRVC_BGNCount >= 3" part of the Where statement. It doesn't even give me an error, which it should, because that part of the Where statement refers to a variable that doesn't exist in the file. I tried removing it and that didn't change the results at all.
Isn't there any way to get this forum to display plain text without reformatting it??
Wolverine
I had to reformat your SQL before I could answer your question. I assume that the concatenation occurred when you pasted from source code - I've never had that problem.
Anyway: ? The problem is that SAS seems to ignore the subquery -- SRVC_BGNCount doesn't even exist in the resulting data file.
@ Well, if you want the count in the output dataset you need to include it in the outer select statement.
? Likewise, it also ignores the "and SRVC_BGNCount >= 3" part of the Where statement. It doesn't even give me an error, which it should, because that part of the Where statement refers to a variable that doesn't exist in the file.
@ It does not exist in the output file but it is available to SQL during the query
Proc SQL;
Create Table mi.ipot2007_mi_Meas13_dedup as
Select Distinct a.MSIS_ID
, a.SRVC_BGN_DT
, b.SRVC_BGNCount
From mi.ipot2007_mi_Meas13_CUT a
, (Select MSIS_ID,
Count (Distinct SRVC_BGN_DT) as SRVC_BGNCount
From mi.ipot2007_mi_Meas13_CUT
) b
Where a.MSIS_ID = b.MSIS_ID
and SRVC_BGNCount >= 3
Order by a.MSIS_ID
, a.SRVC_BGN_DT
;
Quit ;
With the above version of your code you should be able to test that the count is indeed >= 3 for all records output to the table.
Richard
Ok, I figured out a little bit more about what's going on. SRVC_BGNCount is counting the total number of unique service begin dates (SRVC_BGN_DT) in the entire file -- since the subquery has no Group By, it's eliminating duplicates across the whole file, not just with respect to the ID variable. The value is 365, which is greater than 3, so every record is included in the output file. Sorry about the confusion from the syntax I pasted... the forum software kept reformatting it. I tried pasting it several times and even tried to manually move it around, but to no avail.
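If the subquery is indeed counting across the whole file, the fix would be to add a Group By to the subquery so the distinct-date count is computed per patient. A minimal, untested sketch of that change (same table and variable names as above):

```sas
Proc SQL;
   Create Table mi.ipot2007_mi_Meas13_dedup as
   Select Distinct a.MSIS_ID
        , a.SRVC_BGN_DT
   From mi.ipot2007_mi_Meas13_CUT a
      , (Select MSIS_ID
              , Count(Distinct SRVC_BGN_DT) as SRVC_BGNCount
         From mi.ipot2007_mi_Meas13_CUT
         Group by MSIS_ID   /* this was missing: count dates per patient */
        ) b
   Where a.MSIS_ID = b.MSIS_ID
     and b.SRVC_BGNCount >= 3
   Order by a.MSIS_ID
        , a.SRVC_BGN_DT;
Quit;
```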
If your data sources are SAS tables then another approach for de-dupping is loading the data into a hash table. Depending on your actual data this could perform better.
/* de-dup source keys by loading into hash */
data _null_;
set
have1 (keep=id admsn_dt)
have2 (keep=id admsn_dt)
end=last;
;
if _n_=1 then
do;
declare hash h0(ordered:'y',multidata:'n',hashexp:8);
_rc=h0.defineKey('id','admsn_dt');
_rc=h0.defineData('id','admsn_dt');
_rc=h0.defineDone();
end;
_rc=h0.ref();
/* write de-dupped keys to data set "inter" */
if last then
h0.output(dataset:'inter');
run;
proc sql;
create table want as
select *,count(*) as N_Admissions
from inter
group by id
having count(*)>=3
;
quit;
I tried the hash table approach as well, and that seems to work. However, it doesn't count the IDs properly. For example, if the file "want" looks like this:
1111 03012008
1111 04012008
1111 09012008
2222 03012008
2222 04012008
2222 06172008
3333 03012008
3333 04012008
3333 09012008
3333 09082008
Then the final count should be 3, because there are 3 different IDs that have at least 3 healthcare encounters on separate dates. Instead, the count will be 10, because there are 10 different records.
So how should your final output data set look? So far I've understood that you want one row per id and date.
If so then it's only a question of how you're counting. So far the hash is only used for de-dupping (and thereby reducing data volumes as well). The SQL does all the counting, and it's only a question of how you formulate it.
proc sql;
create table want as
select
*,
count(id) as Total_Admissions,
count(distinct(id)) as Total_Patients
from
(
select *,count(*) as Admissions_Per_Patient
from inter
group by id
having count(*)>=3
)
;
quit;
Or this code version, which does everything in one big SQL -- the approach to take if your data lives in a database:
proc sql;
create table want as
select
*,
count(id) as Total_Admissions,
count(distinct(id)) as Total_Patients
from
(
select *,count(*) as Admissions_Per_Patient
from
(
select id, admsn_dt
from have1
union corr
select id, admsn_dt
from have2
)
group by id
having count(*)>=3
)
;
quit;
Patrick wrote:
So how should your final output data set look like? So far I've understood you want one row per id and date.
Yeah, that wasn't very clear. Part of the problem is that the higher-ups at my work redefined (and expanded) the task after I had already started working on it. I had working syntax that could have done the counting, but I'm not sure how to adapt it to the expanded task. I've also never worked with hash tables before, so I'd have to adapt it to work with that as well.
The final data file should have a single record for each ID. Ideally, I'd like it to have the ID variable and the count variable, so I can spot-check the output. The cumulative frequency for the ID variable in the output would provide the total number of people identified. Based on my example above, the output data file would be
1111 3
2222 3
3333 4
And the frequency table would look something like this (ID, freq, cumulative freq):
1111 1 1
2222 1 2
3333 1 3
Also, I don't really understand why there are 2 "have" tables. I'm not sure which of my file names I should put there.
You wrote "there are other files that have to be merged in, variable names have to be changed so they match across files, etc."
As this is about performance, the one thing you want to minimize is passes through the data. So instead of merging/appending multiple files in an extra data step, you can also do it in the same data step where you use the hash table for de-dupping. That's what I've tried to express in the code I've posted by using 2 source tables "have". I don't know your real data / source tables, so you might have to implement it a bit differently. Just try to minimize read/write operations from/to disk, as this is normally the bottleneck.
As for your final output table, it appears you don't need the count of admissions per patient at all except for result checking. Again, as this is about performance, I wouldn't go for costly operations if not needed. The code below collects all the data you need for your frequency table in a hash table.
You should read in the documentation how hash tables work. In particular, setting a good value for "hashexp" determines how efficiently the hash table can operate -- and setting the value depends on the expected size of the hash.
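For illustration (not from the thread): hashexp is the exponent for the number of internal hash buckets, so the hash uses 2**hashexp buckets, with a maximum of 20. With millions of expected key values you might raise it from the default, e.g.:

```sas
/* 2**16 = 65,536 buckets; a reasonable choice when millions of keys are expected */
declare hash H_ID(ordered:'y', multidata:'n', hashexp:16);
```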
/* de-dup source keys by loading into hash */
data _null_;
set
have1 (keep=id admsn_dt)
have2 (keep=id admsn_dt)
end=last;
;
if _n_=1 then
do;
declare hash H_ID(ordered:'y',multidata:'n',hashexp:7);
_rc=H_ID.defineKey('id');
_rc=H_ID.defineData('id');
_rc=H_ID.defineDone();
end;
_rc=H_ID.ref();
/* write hash to target data set "inter2" */
if last then
H_ID.output(dataset:'inter2');
run;
data want2;
set inter2;
freq=1;
cum_freq=_n_;
run;
And here is the code version which also includes the count of admissions per patient. Comparing with the version above, you can see that there are many more read/write operations -- but most of them happen in memory, so it should still be quite fast.
/* de-dup source keys by loading into hash */
data _null_;
set
have1 (keep=id admsn_dt)
have2 (keep=id admsn_dt)
end=last;
;
length N_Admissions 8;
if _n_=1 then
do;
declare hash H_dedup(multidata:'n',hashexp:7);
_rc=H_dedup.defineKey('id','admsn_dt');
_rc=H_dedup.defineDone();
declare hash H_ID(ordered:'y',multidata:'n',hashexp:7);
_rc=H_ID.defineKey('id');
_rc=H_ID.defineData('id','N_Admissions');
_rc=H_ID.defineDone();
end;
if H_dedup.check() ne 0 then
do;
_rc=H_dedup.add();
if H_ID.find()=0 then
do;
N_Admissions=sum(N_Admissions,1);
_rc=H_ID.replace();
end;
else
do;
N_Admissions=1;
_rc=H_ID.add();
end;
end;
/* write hash to target data set "inter1" */
if last then
H_ID.output(dataset:'inter1');
run;
data want1;
set inter1;
freq=1;
cum_freq=_n_;
run;
There are a lot of restrictions and criteria that have to be met in this syntax, and accordingly, the file merges, record counts, de-duplicating, etc. have to be done in a certain order to accurately identify the correct IDs. So I can't do the merges inside the hash step. I've created a flowchart for the syntax to help me visualize all the steps, and I've been careful about making it as efficient as possible while still finding the correct IDs.
Anyway, I'm using the syntax below and it works great -- it finds the correct IDs (which I verified by spot-checking) and is relatively fast. There is just 1 more issue. Given the size of the files I'm dealing with, I'd prefer to write the temp files to a specific directory, rather than the default work directory. The C:\ drive on this computer is a 256GB SSD, while the D:\ drive is a 6TB RAID 10 array. Is there any way to assign the dataset "inter" to the "mi." data library?
/* de-dup source keys by loading into hash */
data mi.ipot2007_mi_meas13_dedup;
set
mi.ipot2007_mi_meas13_cut (keep=MSIS_ID SRVC_BGN_DT)
end=last;
;
if _n_=1 then
do;
declare hash h0(ordered:'y',multidata:'n',hashexp:8);
_rc=h0.defineKey('MSIS_ID','SRVC_BGN_DT');
_rc=h0.defineData('MSIS_ID','SRVC_BGN_DT');
_rc=h0.defineDone();
end;
_rc=h0.ref();
/* write de-dupped keys to data set "inter" */
if last then
h0.output(dataset:'inter');
/*IS IT POSSIBLE TO ASSIGN A DATA LIBRARY TO 'INTER'?*/
run;
proc sql;
create table mi.ipot2007_mi_Meas13Had3 as
select MSIS_ID,
count(distinct SRVC_BGN_DT) as numsepvisits
from work.inter
group by MSIS_ID
having count(distinct SRVC_BGN_DT) >= 3;
quit;
libname mi 'd:\whateverdirectory';
You can use the normal 2 level syntax to write the hash from memory to disk as you would use for any other SAS table:
h0.output(dataset:'<libref>.inter');
I hope your RAID 10 is fiber-attached so that you don't lose too much performance by writing "inter" to a permanent location.
If you have several source tables then you could also use a union set operator for de-dupping.
proc sql;
create table want as
select *, count(*) as N_Admissions
from
(
select id, admsn_dt
from have1
union corr
select id, admsn_dt
from have2
)
group by id
having count(*) >= 3
;
quit;
Identifying the patients should be fairly straightforward:
proc sql;
create table patients as
select
id,
count(distinct admsn_dt) as Admits
from sample
group by id
having count(distinct ADMSN_DT) >= 3;
quit;
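With the sample data from the top of the thread, this should return only ID 1111 (Admits=3), since 2222 has only 2 distinct admission dates. An untested sketch to check it against a subset of that sample (dates read as character here, just for the test):

```sas
data sample;
   input ID ADMSN_DT $ CHARGECODE $;
   datalines;
1111 03012008 2345
1111 03012008 4567
1111 03012008 5678
1111 04012008 2345
1111 04012008 9876
1111 09012008 2345
2222 03012008 2345
2222 04012008 9876
;
run;

proc sql;
   create table patients as
   select id,
          count(distinct admsn_dt) as Admits
   from sample
   group by id
   having count(distinct admsn_dt) >= 3;
quit;

/* expected: one row -- ID 1111, Admits 3 */
```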