Hello @abigel,
You know your data best and also what the next steps will be after adding that "shifted date column" to the large dataset.
That said, looking at your code and running it against simple test data (see further below), there are a couple of things I noticed suggesting that aiming at a different result (e.g., a separate rather than augmented dataset) might be more efficient than your current approach:
Contrary to what your subject line suggests, the added column does not contain values of variable DATE, but values of variable VARIABLE.
I would expect a significant portion of the values of the newly added column to be trivially missing: all observations with DATE values other than end-of-month dates. This is a consequence of your left join involving the ON condition
t1.date = intnx('month', t2.date, -1, 'E')
Given that the existing dataset is so large, it seems questionable to me that your code has the potential of adding lots of (copied) observations ("rows") to it, not just a derived variable ("column"). In a simple test dataset with four years' worth of data per DEAL_ID and one observation per day the number of observations almost doubled (see below): For each DEAL_ID and year, the left join inserted 27 (in leap years: 28) copies of the observation of January 31st in order to accommodate the 28 (or 29, resp.) values of VARIABLE from the subsequent February. Similarly, the resulting dataset contained 31 observations (per DEAL_ID and year) for February 28th (or 29th, resp.) -- all identical except for the newly added SHIFTED_VARIABLE (containing March VARIABLE values), and so on. Consider the amount of redundancy with your real 88-variables dataset!
With more than one observation per end-of-month date even more redundant information would be generated (unless there were fewer than two observations from the subsequent month), as each of those observations would be copied so many times as described above.
/* Create simplified test data */
data have;
call streaminit(27182818);
do deal_id=1 to 3;
do date='01JAN2021'd to '31DEC2024'd;
v=rand('integer',100,999);
x=rand('uniform');
output;
end;
end;
format date date9.;
run; /* 4383 obs. */
/* Perform the original left join */
proc sql;
create table result as
select t1.*, t2.v as shifted_v
from have as t1
left join (select date, deal_id, v from have) as t2
on t1.deal_id = t2.deal_id and t1.date = intnx('month', t2.date, -1, 'E')
order by t1.deal_id, t1.date;
quit; /* 8532 obs. */
... View more