Wolverine
Quartz | Level 8

I have several very large data files with administrative medical data.  I need to identify patients who had 3 or more healthcare encounters during the year.  The problem is that the data has a separate record for each cost incurred.  For example, a 2-day inpatient hospital stay might have 20 records -- a record with a charge for medication, another record for an IV bag, another record for the room charge, etc.  But all 20 of these records have the same admission date, so those 20 records should be combined to represent 1 healthcare encounter for that patient.

Here is some sample data:

ID    ADMSN_DT  CHARGECODE
1111    03012008   2345
1111    03012008   4567
1111    03012008   5678
1111    04012008   2345
1111    04012008   9876
1111    09012008   2345
2222    03012008   2345
2222    03012008   4567
2222    03012008   5678
2222    03012008   6789
2222    03012008   7890
2222    03012008   0123
2222    03012008   1212
2222    04012008   2345
2222    04012008   9876

The resulting data file should look like this:

ID    ADMSN_DT
1111    03012008
1111    04012008
1111    09012008
2222    03012008
2222    04012008

As I stated, these are extremely large files (150GB+), so I'd really like to use Proc SQL for the sake of efficiency.  I've tried several variations on the following syntax, but no luck:

Proc SQL;
    Create Table mi.medrecords_deduped
    as Select Distinct ID,
        Count (ADMSN_DT) as ADMSNCount
    From mi.medrecords
    Group by ID;
Quit;

14 REPLIES
RichardinOz
Quartz | Level 8

Will this do? (NB untested code.)  Count the admissions in a subquery.  If you have more than one year's data, you will need to extract the year in the subquery and include it in the where conditions.

Proc SQL;
    Create Table mi.medrecords_deduped as
        Select Distinct a.ID
             , a.ADMSN_DT
        From mi.medrecords a
           , (Select ID
                   , Count (Distinct ADMSN_DT) as Admissions
              From mi.medrecords
             ) b
        Where a.ID = b.ID
          And Admissions >= 3
        Order by a.ID
             , a.ADMSN_DT
        ;
Quit;
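For the multi-year case mentioned above, a sketch (likewise untested) that restricts both query levels to a single year and groups the count per patient -- it assumes ADMSN_DT is stored as a numeric SAS date, and the year 2008 is just an example:

Proc SQL;
    Create Table mi.medrecords_deduped as
        Select Distinct a.ID
             , a.ADMSN_DT
        From mi.medrecords a
           , (Select ID
                   , Count (Distinct ADMSN_DT) as Admissions
              From mi.medrecords
              Where year(ADMSN_DT) = 2008   /* keep only the year of interest */
              Group by ID                   /* one count per patient */
             ) b
        Where a.ID = b.ID
          And year(a.ADMSN_DT) = 2008
          And b.Admissions >= 3
        Order by a.ID, a.ADMSN_DT
        ;
Quit;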

Richard

Wolverine
Quartz | Level 8

Sorry I haven't gotten back to this... unfortunately the project has become larger in scope; there are other files that have to be merged in, variable names have to be changed so they match across files, etc.  So I had to rewrite what you wrote above:

Proc SQL;
    Create Table mi.ipot2007_mi_Meas13_dedup as
        Select Distinct a.MSIS_ID
             , a.SRVC_BGN_DT
        From mi.ipot2007_mi_Meas13_CUT a
           , (Select MSIS_ID
                   , Count (Distinct SRVC_BGN_DT) as SRVC_BGNCount
              From mi.ipot2007_mi_Meas13_CUT
             ) b
        Where a.MSIS_ID = b.MSIS_ID and SRVC_BGNCount >= 3
        Order by a.MSIS_ID, a.SRVC_BGN_DT;
Quit;

title1 'ipot2007_mi_Meas13_dedup';
proc freq;
    tables MSIS_ID SRVC_BGN_DT /*SRVC_BGNCount*/;
run;

The problem is that SAS seems to ignore the subquery -- SRVC_BGNCount doesn't even exist in the resulting data file.  Likewise, it also ignores the "and SRVC_BGNCount >= 3" part of the Where statement.  It doesn't even give me an error, which it should, because that part of the Where statement refers to a variable that doesn't exist in the file.  I tried removing it and that didn't change the results at all.

Isn't there any way to get this forum to display plain text w/o reformatting it??

RichardinOz
Quartz | Level 8

Wolverine

I had to reformat your SQL before I could answer your question.  I assume that the concatenation occurred when you pasted from source code - I've never had that problem.

Anyway: ? The problem is that SAS seems to ignore the subquery -- SRVC_BGNCount doesn't even exist in the resulting data file.

@ Well, if you want the count in the output dataset you need to include it in the outer select statement.

? Likewise, it also ignores the "and SRVC_BGNCount >= 3" part of the Where statement.  It doesn't even give me an error, which it should, because that part of the Where statement refers to a variable that doesn't exist in the file.

@ It does not exist in the output file, but it is available to SQL during the query.

Proc SQL;
    Create Table mi.ipot2007_mi_Meas13_dedup as
        Select Distinct a.MSIS_ID
             , a.SRVC_BGN_DT
             , b.SRVC_BGNCount
        From mi.ipot2007_mi_Meas13_CUT a
           , (Select MSIS_ID
                   , Count (Distinct SRVC_BGN_DT) as SRVC_BGNCount
              From mi.ipot2007_mi_Meas13_CUT
             ) b
        Where a.MSIS_ID = b.MSIS_ID
          and SRVC_BGNCount >= 3
        Order by a.MSIS_ID
             , a.SRVC_BGN_DT
        ;
Quit;

With the above version of your code you should be able to test that the count is indeed >= 3 for all records output to the table.
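For instance, a one-line check (untested sketch) against the table created above:

proc sql;
    /* should return a value >= 3 if the filter worked */
    select min(SRVC_BGNCount) as MinCount
    from mi.ipot2007_mi_Meas13_dedup;
quit;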

Richard

Wolverine
Quartz | Level 8

Ok, I figured out a little bit more about what's going on.  SRVC_BGNCount is counting the total number of unique service begin dates (SRVC_BGN_DT) in the file.  In other words, it's eliminating duplicates across the entire file, not just with respect to the ID variable.  The value is 365, which is greater than 3, so every record is included in the output file.  Sorry about the confusion from the syntax I pasted... the forum software kept reformatting it.  I tried pasting it several times and even tried to manually move it around, but to no avail.
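That symptom points at the subquery having no Group by, so the count is computed once over the whole file; adding one scopes the count per patient.  A corrected sketch (untested):

Proc SQL;
    Create Table mi.ipot2007_mi_Meas13_dedup as
        Select Distinct a.MSIS_ID
             , a.SRVC_BGN_DT
             , b.SRVC_BGNCount
        From mi.ipot2007_mi_Meas13_CUT a
           , (Select MSIS_ID
                   , Count (Distinct SRVC_BGN_DT) as SRVC_BGNCount
              From mi.ipot2007_mi_Meas13_CUT
              Group by MSIS_ID   /* the missing piece: one count per patient */
             ) b
        Where a.MSIS_ID = b.MSIS_ID
          and b.SRVC_BGNCount >= 3
        Order by a.MSIS_ID, a.SRVC_BGN_DT;
Quit;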

Patrick
Opal | Level 21

If your data sources are SAS tables then another approach for de-dupping is loading the data into a hash table. Depending on your actual data this could perform better.

/* de-dup source keys by loading into hash */
data _null_;
  set
    have1 (keep=id admsn_dt)
    have2 (keep=id admsn_dt)
    end=last;

  if _n_=1 then
    do;
      declare hash h0(ordered:'y',multidata:'n',hashexp:8);
      _rc=h0.defineKey('id','admsn_dt');
      _rc=h0.defineData('id','admsn_dt');
      _rc=h0.defineDone();
    end;

  /* ref() adds the key only if it is not already in the hash */
  _rc=h0.ref();

  /* write the de-dupped keys to data set "inter" */
  if last then
    h0.output(dataset:'inter');
run;

proc sql;
  create table want as
  select *,count(*) as N_Admissions
  from inter
  group by id
  having count(*) >= 3
  ;
quit;

Wolverine
Quartz | Level 8

I tried the hash table approach as well, and that seems to work.  However, it doesn't count the IDs properly.  So for example, if the file "want" looks like this:

1111    03012008
1111    04012008
1111    09012008
2222    03012008
2222    04012008
2222    06172008
3333    03012008
3333    04012008
3333    09012008
3333    09082008

Then the final count should be 3, because there are 3 different IDs that have at least 3 healthcare encounters on separate dates.  Instead, the count will be 10, because there are 10 different records.

Patrick
Opal | Level 21

So what should your final output data set look like? So far I've understood that you want one row per id and date.

If so, then it's only a question of how you're counting. So far the hash is only used for de-dupping (and thus reducing volumes as well). The SQL does all the counting, and it's only a question of how you formulate it.

proc sql;
  create table want as
  select
    *,
    count(id) as Total_Admissions,
    count(distinct(id)) as Total_Patients
  from
    (
      select *, count(*) as Admissions_Per_Patient
      from inter
      group by id
      having count(*) >= 3
    )
  ;
quit;

Or this code version, which does everything in one big SQL query -- the approach to take if your data lives in a database:

proc sql;
  create table want as
  select
    *,
    count(id) as Total_Admissions,
    count(distinct(id)) as Total_Patients
  from
    (
      select *, count(*) as Admissions_Per_Patient
      from
        (
          select id, admsn_dt
          from have1
          union corr
          select id, admsn_dt
          from have2
        )
      group by id
      having count(*) >= 3
    )
  ;
quit;

Wolverine
Quartz | Level 8

Patrick wrote:

So how should your final output data set look like? So far I've understood you want one row per id and date.

Yeah, that wasn't very clear.  Part of the problem is that the higher-ups at my work redefined (and expanded) the task after I had already started working on it.  I had working syntax that could have done the counting, but I'm not sure how to adapt it to the expanded task.  I've also never worked with hash tables before, so I'd have to adapt it to work with that as well.

The final data file should have a single record for each ID.  Ideally, I'd like it to have the ID variable and the count variable, so I can spot-check the output.  The cumulative frequency for the ID variable in the output would provide the total number of people identified.  Based on my example above, the output data file would be

1111    3
2222    3
3333    4

And the frequency table would look something like this (ID, freq, cumulative freq):

1111    1    1
2222    1    2
3333    1    3

Also, I don't really understand why there are 2 "have" tables.  I'm not sure which of my file names I should put there. 

Patrick
Opal | Level 21

You wrote: "there are other files that have to be merged in, variable names have to be changed so they match across files, etc."

As this is about performance, the one thing you want to minimize is passes through the data. So instead of merging/appending multiple files in an extra data step, you can also do it in the same data step where you use the hash table for de-dupping. That's what I've tried to express in the code I've posted by using the 2 source tables "have1" and "have2". I don't know your real data / source tables, so you might have to implement it a bit differently. Just try to minimize read/write operations from/to disk, as this is normally the bottleneck.

As for your final output table, it appears you don't need the count of admissions per patient at all, except for result checking. Again, as this is about performance, I wouldn't go for costly operations if not needed. The code below collects all the data you need for your frequency table in a hash table.

You should read in the documentation how hash tables work. In particular, setting a good value for "hashexp" determines how efficiently the hash table can operate (hashexp:k allocates 2**k hash buckets, to a maximum of 20), and the right value depends on the expected size of the hash.


/* de-dup source ids by loading into hash */
data _null_;
  set
    have1 (keep=id admsn_dt)
    have2 (keep=id admsn_dt)
    end=last;

  if _n_=1 then
    do;
      declare hash H_ID(ordered:'y',multidata:'n',hashexp:7);
      _rc=H_ID.defineKey('id');
      _rc=H_ID.defineData('id');
      _rc=H_ID.defineDone();
    end;

  /* ref() adds the id only if it is not already in the hash */
  _rc=H_ID.ref();

  /* write the de-dupped ids to data set "inter2" */
  if last then
    H_ID.output(dataset:'inter2');
run;

data want2;
  set inter2;
  freq=1;
  cum_freq=_n_;
run;

And here is the code version which also includes the count of admissions per patient. Comparing with the version above, you can see that there are many more read/write operations -- though most of them happen in-memory, so it should still be quite fast.


/* de-dup (id, date) pairs and count admissions per patient in one pass */
data _null_;
  set
    have1 (keep=id admsn_dt)
    have2 (keep=id admsn_dt)
    end=last;
  length N_Admissions 8;

  if _n_=1 then
    do;
      /* H_dedup tracks which (id, admsn_dt) pairs have been seen */
      declare hash H_dedup(multidata:'n',hashexp:7);
      _rc=H_dedup.defineKey('id','admsn_dt');
      _rc=H_dedup.defineDone();

      /* H_ID accumulates the admission count per id */
      declare hash H_ID(ordered:'y',multidata:'n',hashexp:7);
      _rc=H_ID.defineKey('id');
      _rc=H_ID.defineData('id','N_Admissions');
      _rc=H_ID.defineDone();
    end;

  if H_dedup.check() ne 0 then
    do;
      _rc=H_dedup.add();

      if H_ID.find()=0 then
        do;
          N_Admissions=sum(N_Admissions,1);
          _rc=H_ID.replace();
        end;
      else
        do;
          N_Admissions=1;
          _rc=H_ID.add();
        end;
    end;

  /* write the per-patient counts to data set "inter1" */
  if last then
    H_ID.output(dataset:'inter1');
run;

data want1;
  set inter1;
  freq=1;
  cum_freq=_n_;
run;
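Note that want1 still contains every patient. To keep only those with 3 or more distinct admission dates, per the original requirement, a where= data set option would do it -- a minimal sketch:

data want1;
  set inter1(where=(N_Admissions >= 3)); /* keep only patients with 3+ encounter dates */
  freq=1;
  cum_freq=_n_;
run;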

Wolverine
Quartz | Level 8

There are a lot of restrictions and criteria that have to be met in this syntax, and accordingly, the file merges, record counts, deduplicating, etc. have to be done in a certain order to accurately identify the correct IDs.  So I can't do the merges inside the hash table step.  I've created a flowchart for the syntax to help me visualize all the steps, and I've been careful about making it as efficient as possible while still finding the correct IDs.

Anyway, I'm using the syntax below and it works great -- it finds the correct IDs (which I verified by spot-checking) and is relatively fast.  There is just 1 more issue.  Given the size of the files I'm dealing with, I'd prefer to write the temp files to a specific directory, rather than the default work directory.  The C:\ drive on this computer is a 256GB SSD, while the D:\ drive is a 6TB RAID 10 array.  Is there any way to assign the dataset "inter" to the "mi." data library?

/* de-dup source keys by loading into hash */
data mi.ipot2007_mi_meas13_dedup;
  set
    mi.ipot2007_mi_meas13_cut (keep=MSIS_ID SRVC_BGN_DT)
    end=last;

  if _n_=1 then
    do;
      declare hash h0(ordered:'y',multidata:'n',hashexp:8);
      _rc=h0.defineKey('MSIS_ID','SRVC_BGN_DT');
      _rc=h0.defineData('MSIS_ID','SRVC_BGN_DT');
      _rc=h0.defineDone();
    end;

  _rc=h0.ref();

  /* write the de-dupped keys to data set "inter" */
  if last then
    h0.output(dataset:'inter');

  /*IS IT POSSIBLE TO ASSIGN A DATA LIBRARY TO 'INTER'?*/
run;

proc sql;
  create table mi.ipot2007_mi_Meas13Had3 as
  select MSIS_ID,
         count(distinct SRVC_BGN_DT) as numsepvisits
  from work.inter
  group by MSIS_ID
  having count(distinct SRVC_BGN_DT) >= 3;
quit;
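As a final spot-check: since the result has one row per qualifying MSIS_ID, the total number of patients identified is just the row count -- a quick sketch:

proc sql;
  /* total number of patients with 3 or more separate visit dates */
  select count(*) as Total_Patients
  from mi.ipot2007_mi_Meas13Had3;
quit;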

DBailey
Lapis Lazuli | Level 10

libname mi 'd:\whateverdirectory';

Patrick
Opal | Level 21

You can use the normal 2 level syntax to write the hash from memory to disk as you would use for any other SAS table:

h0.output(dataset:'<libref>.inter');

I hope your RAID10 is fiber attached so that you don't lose too much performance by writing "inter" to a permanent location.
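Putting the two replies together, a small self-contained demonstration -- the path is just a placeholder for a directory on your D:\ array, and sashelp.class stands in for your real source table:

/* placeholder path on the D:\ RAID array */
libname mi 'D:\medicaid';

data _null_;
  set sashelp.class end=last;
  if _n_=1 then
    do;
      declare hash h0(ordered:'y');
      _rc=h0.defineKey('Name');
      _rc=h0.defineData('Name');
      _rc=h0.defineDone();
    end;
  _rc=h0.ref();
  /* the two-level name sends the output table to the mi library on D:\ */
  if last then
    h0.output(dataset:'mi.inter');
run;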

Patrick
Opal | Level 21

If you have several source tables, then you could also use a union set operator for de-dupping.

proc sql;
  create table want as
  select *, count(*) as N_Admissions
  from
    (
    select id, admsn_dt
    from have1
    union corr
    select id, admsn_dt
    from have2
    )
  group by id
  having count(*) >= 3
  ;
quit;

DBailey
Lapis Lazuli | Level 10

Identifying the patients should be fairly straightforward:

proc sql;
  create table patients as
  select
    id,
    count(distinct admsn_dt) as Admits
  from sample
  group by id
  having count(distinct ADMSN_DT) >= 3;
quit;
