Hierarchical LOCF (Last Observation Carried Forward) and conditional SET statements

6 Likes

Almost all the LOCF discussions I have seen on SAS-L, sas-communities, and elsewhere have concerned carrying forward non-missing data values within a by group, either to fill holes in a series by creating additional observations, or to yield one record per by-group with the most recent non-missing value in each variable. This almost always focuses on a single input data set.

However, I’ve often needed to “carry forward” data items for time series of differing frequencies - not for preserving non-missing values or filling holes, but rather to propagate low frequency data fields across multiple subsequent high frequency records.

For instance, assume 3 datasets (YEAR, QTR, and MONTH) of company data, with each dataset sorted by ID DATE and having an eponymous frequency. Except for ID and DATE the datasets have unique variables. I often want a monthly file in which each record contains the monthly data, and the most recent preceding quarterly and yearly data (say last quarter sales or prior year profits).

Often I see PROC SQL solutions with joins on ID and filters such as:

(year.date<month.date<=intnx('month',year.date,12,'end'))

That’s an expensive solution, especially given there is a very efficient DATA step alternative, which works with data sets sorted by id/date:

data want (label='Monthly data with most recent PRIOR year and qtr data');

  /* Read all the monthly vars, but only ID/DATE from QTR/YEAR */
  /* This SET plus BY will read month, then qtr, the year for a given date */
  set MONTH (in=inm)
      QTR   (in=inq keep=id date)
      YEAR  (in=iny keep=id date);
  by id date;

  if iny then set YEAR (rename=(date=year_date));  /*Read ALL the year variables */
  if inq then set QTR  (rename=(date=QTR_date));   /*Read ALL the qtr variables  */

  if inm then output;
  if last.id then call missing(of _all_);  /* Don't carry data across ID boundaries*/
run;

The above uses conditional execution of SET (the “if iny …” and “if inq …” statements) to carry values forward across DATE by-groups (but not across ID by-groups - see the “if last.id then call missing(of _all_);” statement). Since any variables in the program data vector (PDV) arising from SET statements are only overwritten when a SET statement including those variables is executed, the annual and quarterly variables are retained across multiple months. Of course this also means that the first SET statement should read only ID and DATE from the low frequency series.

What about low frequency holes? If there are two consecutive missing QTRs, you can end up with a single QTR propagated over 9 months. If you want to prevent that you can:

data want (drop=_sentinel:);
   set month (in=inm)
       qtr  (in=inq keep=id date)
       year (in=iny keep=id date);
   by id date;
   retain _sentinel1 .;
   if inq then set qtr  (rename=(date=qtr_date));
   retain _sentinel2 .;
   if iny then set year (rename=(date=year_date));
   retain _sentinel3 .;
   if qtr_date^=. and intck('month',qtr_date,date)>3   then call missing(of _sentinel1--_sentinel2);
   if year_date^=. and intck('month',year_date,date)>12 then call missing(of _sentinel2--_sentinel3);
   if inm then output;
   if last.id then call missing(of _all_);
run;

The location of the retain statements places the variables _sentinel1, _sentinel2, and _sentinel3 to the (1) left of the QTR vars, (2) between QTR and YEAR vars, and (3) to the right of YEAR vars, respectively. This allows easy resetting of those vars to missing when the QTR or YEAR data get too old.

What about date ties? When MONTH and QTR (or MONTH and YEAR) have the same dates, the QTR or YEAR data are not updated until after the monthly record is output, which my researchers usually want. But if you want tied dates to instantly update the monthly record, then reverse the order of the data sets in the first SET statement, or use a MERGE statement.

The conditional SET technique is also good for LOCF with multiple non-hierarchical series. For example consider three irregular series named LEFT, CENTER, and RIGHT, sorted by the common variables ID and DATE. You might want a series with all the original dates, carrying the most recent values from each of the datasets:

data want;
  merge left   (in=inL keep=id date)
        center (in=inC keep=id date)
        right  (in=inR keep=id date);
  by id date;
  if inL then set left   (rename=(date=date_L));
  if inC then set center (rename=(date=date_C));
  if inR then set right  (rename=(date=date_R));
  if last.date then output;
  if last.id then call missing (of _all_);
run;

pchegoor · ‎10-05-2018

I just wish there was a sample data and the output explaining the above concept .Thanks.

mkeintz · ‎10-05-2018

@pchegoor

Good suggestion. I'll be presenting this topic at the Philadelphia SAS Users' Group near the end of this month. And I'll be generating sample data for that, so I hope to have sample data in this library article after the Philasug talk.

pchegoor · ‎10-05-2018

@mkeintz Thanks. Appreciate your Prompt response.

pchegoor · ‎11-03-2018

@mkeintz Hi, Just a friendly follow up and reminder on this Topic. Is it possible for you to share your paper in this section if you are done presenting it at the Philadelphia SAS Users' Group? I am hoping the paper comes with sample data.

Please do so when it's possible for you. No Rush.

Thanks,

Prashant

Hierarchical LOCF (Last Observation Carried Forward) and conditional SET statements

Ready to see what SAS Viya Copilot can do?

SAS AI and Machine Learning Courses