Solved: Re: multi-label formats and moving average

Shad · Posted 05-24-2023 04:28 PM

Hi,

The goal is to create a new variable that captures the mean number of unique days of opioid use (all_opioid_sum) by hospital for the 365 days preceding each patient's hospital admission.

The variable all_opioid_sum already tabulates the number of unique days that a patient was on opioids. For instance, if a patient was admitted 1/1/2009 I want to create a new variable for the average number of days that patients at that particular hospital received from the prior year (1/1/2008 to 12/31/2008).

Not sure what the best approach for performing this operation sql, proc expand, etc. The complete dataset contains 180K observations, across 42 hospitals.

data Have:
 informat admit_date MMDDYY10.; 
input Hospital_Number$ Admit_Date all_opioid_Sum MRN$; 
format admit_date MMDDYY10.; 
datalines;
Hospital_Number Admit_Date all_opioid_Sum MRN 
1001 7/15/2008 80 8962978 
1001 11/20/2008 6 28686764 
1001 11/28/2008 8 1309558 
1001 12/23/2008 22 31605284 
1001 12/29/2008 3 353898 
1001 12/30/2008 4 13648844
1001 2/1/2009 2 9719056
1001 2/17/2009 1 15761449
1001 2/20/2009 4 17761002
1001 2/21/2009 3 31147054
1001 3/13/2009 5 25914846
1001 3/20/2009 13 3557915
1001 3/23/2009 1 13953572
1001 3/24/2009 3 1310975
1001 3/31/2009 5 5924690
1001 4/1/2009 15 4135498
1001 4/30/2009 4 3179381
1001 5/3/2009 3 13644661
1001 5/8/2009 1 24288280
1001 5/19/2009 4 5022238
1001 6/14/2009 4 7286813
1001 6/27/2009 3 15265696
1001 7/9/2009 24 12747530
1001 7/16/2009 4 10413193
1001 7/17/2009 19 17500057
1001 7/20/2009 27 1383178 
1001 7/21/2009 5 5014019 
1001 8/3/2009 10 4971055
1001 8/6/2009 3 29479789 
1001 8/18/2009 3 4512650 
1001 8/29/2009 15 8777446 
1001 9/8/2009 5 22316882 
1001 9/10/2009 4 4073376
1001 9/13/2009 29 8523048 
1001 9/23/2009 4 4412511 
1001 9/25/2009 6 21713280 
1001 10/9/2009 3 26379026
1001 10/19/2009 52 7694071 
1001 10/21/2009 8 18468515 
1001 11/16/2009 6 26551281 
;
run;

Want:

Hospital_Number Admit_Date all_opioid_Sum MRN Move_avg365
1001 7/15/2008 80 8962978 80.0
1001 11/20/2008 6 28686764 43.0
1001 11/28/2008 8 1309558 31.3
1001 12/23/2008 22 31605284 29.0
1001 12/29/2008 3 353898 23.8
1001 12/30/2008 4 13648844 20.5
1001 2/1/2009 2 9719056 17.9
1001 2/17/2009 1 15761449 15.8
1001 2/20/2009 4 17761002 14.4
1001 2/21/2009 3 31147054 13.3
1001 3/13/2009 5 25914846 12.5
1001 3/20/2009 13 3557915 12.6
1001 3/23/2009 1 13953572 11.7
1001 3/24/2009 3 1310975 11.1
1001 3/31/2009 5 5924690 10.7
1001 4/1/2009 15 4135498 10.9
1001 4/30/2009 4 3179381 10.5
1001 5/3/2009 3 13644661 10.1
1001 5/8/2009 1 24288280 9.6
1001 5/19/2009 4 5022238 9.4
1001 6/14/2009 4 7286813 9.1
1001 6/27/2009 3 15265696 8.8
1001 7/9/2009 24 12747530 9.5
1001 7/16/2009 4 10413193 9.3
1001 7/17/2009 19 17500057 9.6
1001 7/20/2009 27 1383178 10.3
1001 7/21/2009 5 5014019 10.1
1001 8/3/2009 10 4971055 10.1
1001 8/6/2009 3 29479789 9.9
1001 8/18/2009 3 4512650 9.6
1001 8/29/2009 15 8777446 9.8
1001 9/8/2009 5 22316882 9.7
1001 9/10/2009 4 4073376 9.5
1001 9/13/2009 29 8523048 10.1
1001 9/23/2009 4 4412511 9.9
1001 9/25/2009 6 21713280 9.8
1001 10/9/2009 3 26379026 9.6
1001 10/19/2009 52 7694071 10.7
1001 10/21/2009 8 18468515 10.6
1001 11/16/2009 6 26551281 8.7

Attached is is a sample (N=30,000) of my data.

Much appreciated.

mkeintz · Posted 05-24-2023 10:20 PM

Assuming

Your data is sorted by hospital_num / admit_date
No admit_date has multiple MRN's

then you can build a history array for each hospital, indexed by admit_date. This will allow you to add new admissions (i.e. add all_opioid_sum to a numerator and add 1 to a denominator), and correspondingly subtract stale admissions as you step through a given hospital:

%let data_beg=01jan2007;
%let data_end=31dec2009;


data want (drop=_:);
  array history {%sysfunc(intnx(year,%sysevalf("&data_beg"d),-1,sameday)):%sysevalf("&data_end"d)} _temporary_;
  set have;
  by hospital_number  admit_date;
  if first.hospital_number then call missing(of history{*],_numerator,_denominator);

  history{admit_date} = all_opioid_sum;   /* Restored to uploaded code*/
  _numerator + all_opioid_sum;
  _denominator + 1;
  _lag_admit_date=lag(admit_date);
  if first.hospital_number=0 then _start_stale_period = intnx('year',_lag_admit_date,-1,'sameday');
  _end_stale_period   = intnx('year',admit_date,-1,'sameday');

  if first.hospital_number=0 then do _date=_start_stale_period to _end_stale_period;
    if history{_date}^=. then do;
      _numerator = _numerator - history{_date};
      _denominator = _denominator - 1;
      history{_date}=.;
    end;
  end;
  move_avg_1year=_numerator/_denominator;
run;

Set the macro variables DATA_BEG and DATA_END to the date range of your study. This will allow the HISTORY array to be indexed by date, with a lower bound exactly 1 year prior to DATA_BEG.

For each incoming observation, the cleaning up of stale date covers all dates from 1 year prior to the preceding admission through 1 year prior to the current admission. Note this is NOT always 365 days, since the INTNX functions honors leap years.

BTW, it would not be too difficult to accept multiple admissions on a single date. But check the above first to see if it meets your needs.

You might want to adjust the calculation of _start_stale_period and _end_stale_period depending on whether you want to include exactly 1 year prior to the current (and exactly 1 year prior to the preceding) admission. You can check the stale_period range by eliminating the "drop=" parameter, and formatting them as date formats.

A line of code (history{admit_date}=all_opioid_sum;), deleted from the initial upload, has been re-inserted.

Editted note: Below, I have revised the code to accommodate multiple admissions on a given admit_date. This program is untested against sample date with duplicated admission dates:

data want_with_dupes (drop=_:);
  array history {%sysfunc(intnx(year,%sysevalf("&data_beg"d),-1,sameday)):%sysevalf("&data_end"d)} _temporary_;
  array n_mrn   {%sysfunc(intnx(year,%sysevalf("&data_beg"d),-1,sameday)):%sysevalf("&data_end"d)} _temporary_;

  set have;
  by hospital_number admit_date;

  if first.hospital_number then call missing(of history{*],of n_mrn{*},_numerator,_denominator,_n_unique_dates);
  history{admit_date} + all_opioid_sum;
  n_mrn{admit_date} + 1;

  _numerator + all_opioid_sum;
  _denominator + 1;

  if last.admit_date ;

  _n_unique_dates+1;
  _lag_admit_date=lag(admit_date);

  if _n_unique_dates>1 then _start_stale_period = intnx('year',_lag_admit_date,-1,'sameday');
  _end_stale_period   = intnx('year',admit_date,-1,'sameday');

  if _n_unique_dates>1 then do _date=_start_stale_period to _end_stale_period;
    if history{_date}^=. then do;
      _numerator = _numerator - history{_date};
      _denominator = _denominator - n_mrn{_date};
      history{_date}=.;
      n_mrn{_date}=.;
    end;
  end;
  move_avg_1year=_numerator/_denominator;

  /* Now re-read all obs for this date and output with the current moving average*/
  do until (last.admit_date);
    set have;
    by hospital_number admit_date;
    output;
  end;

run;

The revised program has an additional array (N_MRN), which provides a "history" of the number of obs on each admission date. Also, note the program now tracks the number of unique admission dates, and (after calculating the moving average) re-reads and outputs all obs for the current date.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

PaigeMiller · Posted 05-24-2023 04:39 PM

This appears to be a continuation of your question here: https://communities.sas.com/t5/SAS-Programming/Moving-average-by-site-and-date/m-p/751533#M236595

The error message is quite clear, you can't have more than 255 levels when you use a multi-level format.

So, you need a different solution. Perhaps something like this, where average replaces regression: https://pages.stern.nyu.edu/~adesouza/sasfinphd/index/node25.html

--
Paige Miller

Shad · Posted 05-24-2023 07:22 PM

Thanks I appreciate your guidance, I'm so unfamiliar with multilabel formats that I was hoping there was a simple work around for the error message. I'll explore your link, perhaps that's the solution I need.

ballardw · Posted 05-24-2023 04:57 PM

Lets parse out your description and some places that may need clarification.

" I want to capture in a new variable for each patient the mean number of opioid's prescribed "

By "new variable for each patient" you do mean one variable. Or is each patient getting their own named variable?

You aren't very clear on what you are taking a mean of here. If a patient was prescribed two opioids but received 500 doses what would the mean be? Mean needs to know what you are counting: doses, prescriptions, different named drugs. Then what is divisor to get the mean? Mean doses per patient? Mean prescribed number of opioids (not prescriptions, names of opioids) per hospital? Mean per year per hospital? Patients? Here use of VARIABLE names and description of what that variable contains may be important. Do not use verbiage that does not very clearly tie to variables.

This is a case of you know your data (hopefully) and where you are going but not describing it clearly enough that those of us without all of the experience understand the short cuts your though processes are taking.

"For instance, if a patient was admitted 1/1/2009 I want to create a new variable for the average number of pills that any patients at that particular hospital received from the prior year (1/1/2008 to 12/31/2008)." Is this saying that if Patient XYZ was adimitted but not given any "pills" (not consistent with use of Opioids in the first bit) that we look at ALL the patients in that hospital for "pills"?

Provide some dummy data with fake hospital, patient data as long as it has a enough detail to calculate what you want.

Then using that data walk us through the actual calculation of counting what it is you want.

Note: part of the reason this doesn't make much sense is your multilabel format would be applying a value from every admission to many other admissions. Which makes no sense to me.

Use a small subset of your Analytic data set, maybe 50 observations (data subset; set analytic (obs=50); run;) and use that set to make the MLF format and see if the results make any sense.

Shad · Posted 05-24-2023 07:21 PM

Apologies, I've edited original post and reframed my question. Hopefully it's more detailed enough to help facilitate pointing me in the right direction.

mkeintz · Posted 05-24-2023 10:20 PM

Assuming

Your data is sorted by hospital_num / admit_date
No admit_date has multiple MRN's

then you can build a history array for each hospital, indexed by admit_date. This will allow you to add new admissions (i.e. add all_opioid_sum to a numerator and add 1 to a denominator), and correspondingly subtract stale admissions as you step through a given hospital:

%let data_beg=01jan2007;
%let data_end=31dec2009;


data want (drop=_:);
  array history {%sysfunc(intnx(year,%sysevalf("&data_beg"d),-1,sameday)):%sysevalf("&data_end"d)} _temporary_;
  set have;
  by hospital_number  admit_date;
  if first.hospital_number then call missing(of history{*],_numerator,_denominator);

  history{admit_date} = all_opioid_sum;   /* Restored to uploaded code*/
  _numerator + all_opioid_sum;
  _denominator + 1;
  _lag_admit_date=lag(admit_date);
  if first.hospital_number=0 then _start_stale_period = intnx('year',_lag_admit_date,-1,'sameday');
  _end_stale_period   = intnx('year',admit_date,-1,'sameday');

  if first.hospital_number=0 then do _date=_start_stale_period to _end_stale_period;
    if history{_date}^=. then do;
      _numerator = _numerator - history{_date};
      _denominator = _denominator - 1;
      history{_date}=.;
    end;
  end;
  move_avg_1year=_numerator/_denominator;
run;

Set the macro variables DATA_BEG and DATA_END to the date range of your study. This will allow the HISTORY array to be indexed by date, with a lower bound exactly 1 year prior to DATA_BEG.

For each incoming observation, the cleaning up of stale date covers all dates from 1 year prior to the preceding admission through 1 year prior to the current admission. Note this is NOT always 365 days, since the INTNX functions honors leap years.

BTW, it would not be too difficult to accept multiple admissions on a single date. But check the above first to see if it meets your needs.

You might want to adjust the calculation of _start_stale_period and _end_stale_period depending on whether you want to include exactly 1 year prior to the current (and exactly 1 year prior to the preceding) admission. You can check the stale_period range by eliminating the "drop=" parameter, and formatting them as date formats.

A line of code (history{admit_date}=all_opioid_sum;), deleted from the initial upload, has been re-inserted.

Editted note: Below, I have revised the code to accommodate multiple admissions on a given admit_date. This program is untested against sample date with duplicated admission dates:

data want_with_dupes (drop=_:);
  array history {%sysfunc(intnx(year,%sysevalf("&data_beg"d),-1,sameday)):%sysevalf("&data_end"d)} _temporary_;
  array n_mrn   {%sysfunc(intnx(year,%sysevalf("&data_beg"d),-1,sameday)):%sysevalf("&data_end"d)} _temporary_;

  set have;
  by hospital_number admit_date;

  if first.hospital_number then call missing(of history{*],of n_mrn{*},_numerator,_denominator,_n_unique_dates);
  history{admit_date} + all_opioid_sum;
  n_mrn{admit_date} + 1;

  _numerator + all_opioid_sum;
  _denominator + 1;

  if last.admit_date ;

  _n_unique_dates+1;
  _lag_admit_date=lag(admit_date);

  if _n_unique_dates>1 then _start_stale_period = intnx('year',_lag_admit_date,-1,'sameday');
  _end_stale_period   = intnx('year',admit_date,-1,'sameday');

  if _n_unique_dates>1 then do _date=_start_stale_period to _end_stale_period;
    if history{_date}^=. then do;
      _numerator = _numerator - history{_date};
      _denominator = _denominator - n_mrn{_date};
      history{_date}=.;
      n_mrn{_date}=.;
    end;
  end;
  move_avg_1year=_numerator/_denominator;

  /* Now re-read all obs for this date and output with the current moving average*/
  do until (last.admit_date);
    set have;
    by hospital_number admit_date;
    output;
  end;

run;

The revised program has an additional array (N_MRN), which provides a "history" of the number of obs on each admission date. Also, note the program now tracks the number of unique admission dates, and (after calculating the moving average) re-reads and outputs all obs for the current date.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Shad · Posted 05-25-2023 03:59 PM

Amazing, thank you! There is the potential for multiple admissions on the same date (different MRNs), it's not obvious to me at a glance how this would this impact the calculation?

mkeintz · Posted 05-25-2023 11:15 PM

@Shad wrote:

Amazing, thank you! There is the potential for multiple admissions on the same date (different MRNs), it's not obvious to me at a glance how this would this impact the calculation?

Presumably you want to have the same values of moving average for ALL obs on a given date, yes?

But the program I submitted generates an updated numerator, denominator, and moving average with every incoming observation. This means that two obs on a given date will have different numerators, denominators and moving averages. It also means that each date in the history array would have only the last all_opioid_sum for the date, not the total for that date. So removing stale dates from the numerator (and denominator) would be underweighted.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Tom · Posted 05-26-2023 12:22 AM

It sounds like you want to calculate the average of the variable ALL_OPIOD_SUM for the 365 day window ending at that current value of ADMIT_DATE.

So what this SQL code would produce.

proc sql;
create table want as 
select distinct a.*
     , mean(b.all_opioid_Sum) as Move_avg365
from have a 
join have b
on a.Hospital_Number = b.Hospital_Number
and b.admit_date between a.admit_date-365 and a.admit_date
group by a.Hospital_Number,a.admit_date
;
quit;

Is the problem that the data is too large to do that self join?

365 day moving average

Re: 365 day moving average

Re: multi-label formats and moving average

Re: multi-label formats and moving average

Re: multi-label formats and moving average

Re: multi-label formats and moving average

Re: 365 day moving average

Re: 365 day moving average

Re: 365 day moving average

Re: 365 day moving average

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!