We’re smarter together. Learn from this collection of community knowledge and add your expertise.

Hierarchical LOCF (Last Observation Carried Forward) and conditional SET statements

by Trusted Advisor ‎01-31-2018 06:03 PM - edited ‎01-31-2018 11:50 PM (1,245 Views)

Almost all the LOCF discussions I have seen on SAS-L, sas-communities, and elsewhere have concerned carrying forward non-missing data values within a by group, either to fill holes in a series by creating additional observations, or to yield one record per by-group with the most recent non-missing value in each variable. This almost always focuses on a single input data set.

 

However, I’ve often needed to “carry forward” data items for time series of differing frequencies - not for preserving non-missing values or filling holes, but rather to propagate low frequency data fields across multiple subsequent high frequency records.

 

For instance, assume 3 datasets (YEAR, QTR, and MONTH) of company data, with each dataset sorted by ID DATE and having an eponymous frequency. Except for ID and DATE the datasets have unique variables.  I often want a monthly file in which each record contains the monthly data, and the most recent preceding quarterly and yearly data (say last quarter sales or prior year profits).

 

Often I see PROC SQL solutions with joins on ID and filters such as:

   (year.date<month.date<=intnx('month',year.date,12,'end'))

 

That’s an expensive solution, especially given there is a very efficient DATA step alternative:

 

data want;
  set MONTH (in=inm)
      QTR   (in=inq keep=id date)
      YEAR  (in=iny keep=id date);
  by id date;
  if iny then set YEAR (rename=(date=year_date));
  if inq then set QTR  (rename=(date=QTR_date));
  if inm then output;
  if last.id then call missing(of _all_);
run;

 

The above uses conditional execution of SET (the “if iny …” and “if inq …” statements) to carry values forward across DATE by-groups (but not across ID by-groups - see the “if last.id then call missing(of _all_);” statement).  Since any variables in the program data vector (PDV) arising from SET statements are only overwritten when a SET statement including those variables is executed, the annual and quarterly variables are retained across multiple months.  Of course this also means that the first SET statement should read only ID and DATE from the low frequency series.

 

What about low frequency holes? If there are two consecutive missing QTRs, you can end up with a single QTR propagated over 9 months.  If you want to prevent that you can:

 

data want (drop=_sentinel:);
   set month (in=inm)
       qtr  (in=inq keep=id date)
       year (in=iny keep=id date);
   by id date;
   retain _sentinel1 .;
   if inq then set qtr  (rename=(date=qtr_date));
   retain _sentinel2 .;
   if iny then set year (rename=(date=year_date));
   retain _sentinel3 .;
   if qtr_date^=. and intck('month',qtr_date,date)>3   then call missing(of _sentinel1--_sentinel2);
   if year_date^=. and intck('month',year_date,date)>12 then call missing(of _sentinel2--_sentinel3);
   if inm then output;
   if last.id then call missing(of _all_);
run;

 

 

The location of the retain statements places the variables _sentinel1, _sentinel2, and _sentinel3 to the (1) left of the QTR vars, (2) between QTR and YEAR vars, and (3) to the right of YEAR vars, respectively. This allows easy resetting of those vars to missing when the QTR or YEAR data get too old.

 

What about date ties? When MONTH and QTR (or MONTH and YEAR) have the same dates, the QTR or YEAR data are not updated until after the monthly record is output, which my researchers usually want.  But if you want tied dates to instantly update the monthly record, then reverse the order of the data sets in the first SET statement, or use a MERGE statement.

 

 

The conditional SET technique is also good for LOCF with multiple non-hierarchical series. For example consider three irregular series named LEFT, CENTER, and RIGHT, sorted by the common variables ID and DATE.  You might want a series with all the original dates, carrying the most recent values from each of the datasets:

 

data want;
  merge left   (in=inL keep=id date)
        center (in=inC keep=id date)
        right  (in=inR keep=id date);
  by id date;
  if inL then set left   (rename=(date=date_L));
  if inC then set center (rename=(date=date_C));
  if inR then set right  (rename=(date=date_R));
  if last.date then output;
  if last.id then call missing (of _all_);
run;

 

 

Contributors
Your turn
Sign In!

Want to write an article? Sign in with your profile.


Looking for the Ask the Expert series? Find it in its new home: communities.sas.com/askexpert.