Wolverine
Quartz | Level 8

I have several very large data files with administrative medical data.  I need to identify patients who had 3 or more healthcare encounters during the year.  The problem is that the data has a separate record for each cost incurred.  For example, a 2-day inpatient hospital stay might have 20 records -- a record with a charge for medication, another record for an IV bag, another record for the room charge, etc.  But all 20 of these records have the same admission date, so those 20 records should be combined to represent 1 healthcare encounter for that patient.

Here is some sample data:

ID    ADMSN_DT  CHARGECODE
1111    03012008   2345
1111    03012008   4567
1111    03012008   5678
1111    04012008   2345
1111    04012008   9876
1111    09012008   2345
2222    03012008   2345
2222    03012008   4567
2222    03012008   5678
2222    03012008   6789
2222    03012008   7890
2222    03012008   0123
2222    03012008   1212
2222    04012008   2345
2222    04012008   9876

The resulting data file should look like this:

ID    ADMSN_DT
1111    03012008
1111    04012008
1111    09012008
2222    03012008
2222    04012008

As I stated, these are extremely large files (150GB+), so I'd really like to use Proc SQL for the sake of efficiency.  I've tried several variations on the following syntax, but no luck:

Proc SQL;
    Create Table mi.medrecords_deduped
    as Select Distinct ID,
        Count (ADMSN_DT) as ADMSNCount
    From mi.medrecords
    Group by ID;
Quit;

14 REPLIES
RichardinOz
Quartz | Level 8

Will this do? (NB untested code.)  Count the admissions in a subquery.  If you have more than one year's data, you will need to extract the year in the subquery and include it in the where conditions.

Proc SQL;
    Create Table mi.medrecords_deduped as
        Select Distinct a.ID
             , a.ADMSN_DT
        From mi.medrecords a
           , (Select ID
                   , Count (Distinct ADMSN_DT) as Admissions
              From mi.medrecords
             ) b
        Where a.ID = b.ID
          And Admissions >= 3
        Order by a.ID
             , a.ADMSN_DT
        ;
Quit;
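For the multi-year case mentioned above, a sketch (likewise untested) that restricts both query levels to a single year and groups the count per patient -- it assumes ADMSN_DT is stored as a numeric SAS date, and the year 2008 is just an example:

Proc SQL;
    Create Table mi.medrecords_deduped as
        Select Distinct a.ID
             , a.ADMSN_DT
        From mi.medrecords a
           , (Select ID
                   , Count (Distinct ADMSN_DT) as Admissions
              From mi.medrecords
              Where year(ADMSN_DT) = 2008   /* keep only the year of interest */
              Group by ID                   /* one count per patient */
             ) b
        Where a.ID = b.ID
          And year(a.ADMSN_DT) = 2008
          And b.Admissions >= 3
        Order by a.ID, a.ADMSN_DT
        ;
Quit;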

Richard

Wolverine
Quartz | Level 8

Sorry I haven't gotten back to this... unfortunately the project has become larger in scope; there are other files that have to be merged in, variable names have to be changed so they match across files, etc.  So I had to rewrite what you wrote above:

Proc SQL;
    Create Table mi.ipot2007_mi_Meas13_dedup as
        Select Distinct a.MSIS_ID
             , a.SRVC_BGN_DT
        From mi.ipot2007_mi_Meas13_CUT a
           , (Select MSIS_ID
                   , Count (Distinct SRVC_BGN_DT) as SRVC_BGNCount
              From mi.ipot2007_mi_Meas13_CUT
             ) b
        Where a.MSIS_ID = b.MSIS_ID and SRVC_BGNCount >= 3
        Order by a.MSIS_ID, a.SRVC_BGN_DT;
Quit;

title1 'ipot2007_mi_Meas13_dedup';
proc freq;
    tables MSIS_ID SRVC_BGN_DT /*SRVC_BGNCount*/;
run;

The problem is that SAS seems to ignore the subquery -- SRVC_BGNCount doesn't even exist in the resulting data file.  Likewise, it also ignores the "and SRVC_BGNCount >= 3" part of the Where statement.  It doesn't even give me an error, which it should, because that part of the Where statement refers to a variable that doesn't exist in the file.  I tried removing it and that didn't change the results at all.

Isn't there any way to get this forum to display plain text w/o reformatting it??

RichardinOz
Quartz | Level 8

Wolverine

I had to reformat your SQL before I could answer your question.  I assume that the concatenation occurred when you pasted from source code - I've never had that problem.

Anyway: ? The problem is that SAS seems to ignore the subquery -- SRVC_BGNCount doesn't even exist in the resulting data file.

@ Well, if you want the count in the output dataset you need to include it in the outer select statement.

? Likewise, it also ignores the "and SRVC_BGNCount >= 3" part of the Where statement.  It doesn't even give me an error, which it should, because that part of the Where statement refers to a variable that doesn't exist in the file.

@ It does not exist in the output file, but it is available to SQL during the query.

Proc SQL;
    Create Table mi.ipot2007_mi_Meas13_dedup as
        Select Distinct a.MSIS_ID
             , a.SRVC_BGN_DT
             , b.SRVC_BGNCount
        From mi.ipot2007_mi_Meas13_CUT a
           , (Select MSIS_ID
                   , Count (Distinct SRVC_BGN_DT) as SRVC_BGNCount
              From mi.ipot2007_mi_Meas13_CUT
             ) b
        Where a.MSIS_ID = b.MSIS_ID
          and SRVC_BGNCount >= 3
        Order by a.MSIS_ID
             , a.SRVC_BGN_DT
        ;
Quit;

With the above version of your code you should be able to test that the count is indeed >= 3 for all records output to the table.
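For instance, a one-line check (untested sketch) against the table created above:

proc sql;
    /* should return a value >= 3 if the filter worked */
    select min(SRVC_BGNCount) as MinCount
    from mi.ipot2007_mi_Meas13_dedup;
quit;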

Richard

Wolverine
Quartz | Level 8

Ok, I figured out a little bit more about what's going on.  SRVC_BGNCount is counting the total number of unique service begin dates (SRVC_BGN_DT) in the file.  In other words, it's eliminating duplicates across the entire file, not just with respect to the ID variable.  The value is 365, which is greater than 3, so every record is included in the output file.  Sorry about the confusion from the syntax I pasted... the forum software kept reformatting it.  I tried pasting it several times and even tried to manually move it around, but to no avail.
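That symptom points at the subquery having no Group by, so the count is computed once over the whole file; adding one scopes the count per patient.  A corrected sketch (untested):

Proc SQL;
    Create Table mi.ipot2007_mi_Meas13_dedup as
        Select Distinct a.MSIS_ID
             , a.SRVC_BGN_DT
             , b.SRVC_BGNCount
        From mi.ipot2007_mi_Meas13_CUT a
           , (Select MSIS_ID
                   , Count (Distinct SRVC_BGN_DT) as SRVC_BGNCount
              From mi.ipot2007_mi_Meas13_CUT
              Group by MSIS_ID   /* the missing piece: one count per patient */
             ) b
        Where a.MSIS_ID = b.MSIS_ID
          and b.SRVC_BGNCount >= 3
        Order by a.MSIS_ID, a.SRVC_BGN_DT;
Quit;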

Patrick
Opal | Level 21

If your data sources are SAS tables then another approach for de-dupping is loading the data into a hash table. Depending on your actual data this could perform better.

/* de-dup source keys by loading into hash */
data _null_;
  set
    have1 (keep=id admsn_dt)
    have2 (keep=id admsn_dt)
    end=last;

  if _n_=1 then
    do;
      declare hash h0(ordered:'y',multidata:'n',hashexp:8);
      _rc=h0.defineKey('id','admsn_dt');
      _rc=h0.defineData('id','admsn_dt');
      _rc=h0.defineDone();
    end;

  /* ref() adds the key only if it is not already in the hash */
  _rc=h0.ref();

  /* write the de-dupped keys to data set "inter" */
  if last then
    h0.output(dataset:'inter');
run;

proc sql;
  create table want as
  select *,count(*) as N_Admissions
  from inter
  group by id
  having count(*) >= 3
  ;
quit;

Wolverine
Quartz | Level 8

I tried the hash table approach as well, and that seems to work.  However, it doesn't count the IDs properly.  So for example, if the file "want" looks like this:

1111    03012008
1111    04012008
1111    09012008
2222    03012008
2222    04012008
2222    06172008
3333    03012008
3333    04012008
3333    09012008
3333    09082008

Then the final count should be 3, because there are 3 different IDs that have at least 3 healthcare encounters on separate dates.  Instead, the count will be 10, because there are 10 different records.

Patrick
Opal | Level 21

So what should your final output data set look like? So far I've understood that you want one row per id and date.

If so, then it's only a question of how you're counting. So far the hash is only used for de-dupping (and thus reducing volumes as well). The SQL does all the counting, and it's only a question of how you formulate it.

proc sql;
  create table want as
  select
    *,
    count(id) as Total_Admissions,
    count(distinct(id)) as Total_Patients
  from
    (
      select *, count(*) as Admissions_Per_Patient
      from inter
      group by id
      having count(*) >= 3
    )
  ;
quit;

Or this code version, which does everything in one big SQL query -- the approach to take if your data lives in a database:

proc sql;
  create table want as
  select
    *,
    count(id) as Total_Admissions,
    count(distinct(id)) as Total_Patients
  from
    (
      select *, count(*) as Admissions_Per_Patient
      from
        (
          select id, admsn_dt
          from have1
          union corr
          select id, admsn_dt
          from have2
        )
      group by id
      having count(*) >= 3
    )
  ;
quit;

Wolverine
Quartz | Level 8

Patrick wrote:

So how should your final output data set look like? So far I've understood you want one row per id and date.

Yeah, that wasn't very clear.  Part of the problem is that the higher-ups at my work redefined (and expanded) the task after I had already started working on it.  I had working syntax that could have done the counting, but I'm not sure how to adapt it to the expanded task.  I've also never worked with hash tables before, so I'd have to adapt it to work with that as well.

The final data file should have a single record for each ID.  Ideally, I'd like it to have the ID variable and the count variable, so I can spot-check the output.  The cumulative frequency for the ID variable in the output would provide the total number of people identified.  Based on my example above, the output data file would be

1111    3
2222    3
3333    4

And the frequency table would look something like this (ID, freq, cumulative freq):

1111    1    1
2222    1    2
3333    1    3

Also, I don't really understand why there are 2 "have" tables.  I'm not sure which of my file names I should put there. 

Patrick
Opal | Level 21

You wrote: "there are other files that have to be merged in, variable names have to be changed so they match across files, etc."

As this is about performance, the one thing you want to minimize is passes through the data. So instead of merging/appending multiple files in an extra data step, you can also do it in the same data step where you use the hash table for de-dupping. That's what I've tried to express in the code I've posted by using the 2 source tables "have1" and "have2". I don't know your real data / source tables, so you might have to implement it a bit differently. Just try to minimize read/write operations from/to disk, as this is normally the bottleneck.

As for your final output table, it appears you don't need the count of admissions per patient at all, except for result checking. Again, as this is about performance, I wouldn't go for costly operations if not needed. The code below collects all the data you need for your frequency table in a hash table.

You should read in the documentation how hash tables work. In particular, setting a good value for "hashexp" determines how efficiently the hash table can operate (hashexp:k allocates 2**k hash buckets, to a maximum of 20), and the right value depends on the expected size of the hash.


/* de-dup source ids by loading into hash */
data _null_;
  set
    have1 (keep=id admsn_dt)
    have2 (keep=id admsn_dt)
    end=last;

  if _n_=1 then
    do;
      declare hash H_ID(ordered:'y',multidata:'n',hashexp:7);
      _rc=H_ID.defineKey('id');
      _rc=H_ID.defineData('id');
      _rc=H_ID.defineDone();
    end;

  /* ref() adds the id only if it is not already in the hash */
  _rc=H_ID.ref();

  /* write the de-dupped ids to data set "inter2" */
  if last then
    H_ID.output(dataset:'inter2');
run;

data want2;
  set inter2;
  freq=1;
  cum_freq=_n_;
run;

And here is the code version which also includes the count of admissions per patient. Comparing with the version above, you can see that there are many more read/write operations -- though most of them happen in-memory, so it should still be quite fast.


/* de-dup (id, date) pairs and count admissions per patient in one pass */
data _null_;
  set
    have1 (keep=id admsn_dt)
    have2 (keep=id admsn_dt)
    end=last;
  length N_Admissions 8;

  if _n_=1 then
    do;
      /* H_dedup tracks which (id, admsn_dt) pairs have been seen */
      declare hash H_dedup(multidata:'n',hashexp:7);
      _rc=H_dedup.defineKey('id','admsn_dt');
      _rc=H_dedup.defineDone();

      /* H_ID accumulates the admission count per id */
      declare hash H_ID(ordered:'y',multidata:'n',hashexp:7);
      _rc=H_ID.defineKey('id');
      _rc=H_ID.defineData('id','N_Admissions');
      _rc=H_ID.defineDone();
    end;

  if H_dedup.check() ne 0 then
    do;
      _rc=H_dedup.add();

      if H_ID.find()=0 then
        do;
          N_Admissions=sum(N_Admissions,1);
          _rc=H_ID.replace();
        end;
      else
        do;
          N_Admissions=1;
          _rc=H_ID.add();
        end;
    end;

  /* write the per-patient counts to data set "inter1" */
  if last then
    H_ID.output(dataset:'inter1');
run;

data want1;
  set inter1;
  freq=1;
  cum_freq=_n_;
run;
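Note that want1 still contains every patient. To keep only those with 3 or more distinct admission dates, per the original requirement, a where= data set option would do it -- a minimal sketch:

data want1;
  set inter1(where=(N_Admissions >= 3)); /* keep only patients with 3+ encounter dates */
  freq=1;
  cum_freq=_n_;
run;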

Wolverine
Quartz | Level 8

There are a lot of restrictions and criteria that have to be met in this syntax, and accordingly, the file merges, record counts, deduplicating, etc. have to be done in a certain order to accurately identify the correct IDs.  So I can't do the merges inside the hash table step.  I've created a flowchart for the syntax to help me visualize all the steps, and I've been careful about making it as efficient as possible while still finding the correct IDs.

Anyway, I'm using the syntax below and it works great -- it finds the correct IDs (which I verified by spot-checking) and is relatively fast.  There is just 1 more issue.  Given the size of the files I'm dealing with, I'd prefer to write the temp files to a specific directory, rather than the default work directory.  The C:\ drive on this computer is a 256GB SSD, while the D:\ drive is a 6TB RAID 10 array.  Is there any way to assign the dataset "inter" to the "mi." data library?

/* de-dup source keys by loading into hash */
data mi.ipot2007_mi_meas13_dedup;
  set
    mi.ipot2007_mi_meas13_cut (keep=MSIS_ID SRVC_BGN_DT)
    end=last;

  if _n_=1 then
    do;
      declare hash h0(ordered:'y',multidata:'n',hashexp:8);
      _rc=h0.defineKey('MSIS_ID','SRVC_BGN_DT');
      _rc=h0.defineData('MSIS_ID','SRVC_BGN_DT');
      _rc=h0.defineDone();
    end;

  _rc=h0.ref();

  /* write the de-dupped keys to data set "inter" */
  if last then
    h0.output(dataset:'inter');

  /*IS IT POSSIBLE TO ASSIGN A DATA LIBRARY TO 'INTER'?*/
run;

proc sql;
  create table mi.ipot2007_mi_Meas13Had3 as
  select MSIS_ID,
         count(distinct SRVC_BGN_DT) as numsepvisits
  from work.inter
  group by MSIS_ID
  having count(distinct SRVC_BGN_DT) >= 3;
quit;
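As a final spot-check: since the result has one row per qualifying MSIS_ID, the total number of patients identified is just the row count -- a quick sketch:

proc sql;
  /* total number of patients with 3 or more separate visit dates */
  select count(*) as Total_Patients
  from mi.ipot2007_mi_Meas13Had3;
quit;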

DBailey
Lapis Lazuli | Level 10

libname mi 'd:\whateverdirectory';

Patrick
Opal | Level 21

You can use the normal 2 level syntax to write the hash from memory to disk as you would use for any other SAS table:

h0.output(dataset:'<libref>.inter');

I hope your RAID10 is fiber attached so that you don't lose too much performance by writing "inter" to a permanent location.
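Putting the two replies together, a small self-contained demonstration -- the path is just a placeholder for a directory on your D:\ array, and sashelp.class stands in for your real source table:

/* placeholder path on the D:\ RAID array */
libname mi 'D:\medicaid';

data _null_;
  set sashelp.class end=last;
  if _n_=1 then
    do;
      declare hash h0(ordered:'y');
      _rc=h0.defineKey('Name');
      _rc=h0.defineData('Name');
      _rc=h0.defineDone();
    end;
  _rc=h0.ref();
  /* the two-level name sends the output table to the mi library on D:\ */
  if last then
    h0.output(dataset:'mi.inter');
run;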

Patrick
Opal | Level 21

If you have several source tables, then you could also use a union set operator for de-dupping.

proc sql;
  create table want as
  select *, count(*) as N_Admissions
  from
    (
    select id, admsn_dt
    from have1
    union corr
    select id, admsn_dt
    from have2
    )
  group by id
  having count(*) >= 3
  ;
quit;

DBailey
Lapis Lazuli | Level 10

Identifying the patients should be fairly straightforward:

proc sql;
  create table patients as
  select
    id,
    count(distinct admsn_dt) as Admits
  from sample
  group by id
  having count(distinct ADMSN_DT) >= 3;
quit;
