Dear All,
I have a large dataset stored on a remote SAS server, approximately 400 million rows and 88 columns. My goal is to create a new column by shifting a date variable by one month. I intend to do this with a simple left join, similar to what I would do on a SQL Server.
My problem is as follows: when I tried to use the ALTER TABLE command, it ran indefinitely without producing any results. However, when I created a new table, the process finished in 30 minutes. How should I properly execute this process?
Here is the hard-copy version of my code:
rsubmit;
%let start_time = %sysfunc(datetime());

proc sql;
  drop table table_v2;
  create table table_v2 as
  select *, t2.variable as shifted_variable
  from table_v1 as t1
  left join (select date, deal_id, variable from table_v1) as t2
    on t1.deal_id = t2.deal_id
   and t1.date = intnx('month', t2.date, -1, 'E')
  order by t1.deal_id, t1.date;
quit;

%let end_time = %sysfunc(datetime());
%let duration = %sysevalf(&end_time - &start_time);
%put NOTE: The query took &duration seconds to run.;
endrsubmit;
Thank you for your assistance!
Gabor
Welcome @abigel to the SAS Support Community!
You wrote:
However, when I created a new table, the process finished in 30 minutes. How should I properly execute this process?
I would use the new table creation process. It's simply faster.
As to why this is the case, I write the below as a non-expert in PROC SQL:
"Integrity constraints are rules that you specify to guarantee the accuracy, completeness, or consistency of data in tables. All integrity constraints are enforced when you insert, delete, or alter data values in the columns of a table for which integrity constraints have been defined. Before a constraint is added to a table that contains existing data, all the data is checked to determine that it satisfies the constraints."
This suggests to me that the ALTER TABLE statement would need to spend time applying those integrity constraints while modifying every observation in the table. OTOH, I presume that creating a new table with the code you have shown would not automatically do so.
In addition to the integrity constraints mentioned by @mkeintz, indexes on the table may also require modification. So, a couple of questions:
1. Is your data in a SAS data set, or is it stored in a database table accessed via a SAS/ACCESS LIBNAME engine?
If it is in a database, the INTNX function cannot be passed to the database. This means that the entire table would have to be pulled into SAS, the data modified, then copied back into the database. Depending on bandwidth and server co-location, this could really choke your performance.
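If the table does sit in a database, one option is explicit SQL pass-through so the work never leaves the database. A rough sketch, assuming a SQL Server source reached via ODBC — the data source name and table names are placeholders, and EOMONTH(date, -1) plays the role of INTNX('month', date, -1, 'E'):

```sas
/* Hypothetical explicit pass-through: the join runs entirely in the */
/* database. Connection options and table names are placeholders.   */
proc sql;
  connect to odbc as db (datasrc=mydsn);
  execute (
    select t1.*, t2.variable as shifted_variable
    into table_v2
    from table_v1 as t1
    left join table_v1 as t2
      on t1.deal_id = t2.deal_id
     and t1.date = eomonth(t2.date, -1)  /* SQL Server counterpart of
                                            intnx('month', t2.date, -1, 'E') */
  ) by db;
  disconnect from db;
quit;
```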
2. Does the table have indexes or integrity constraints defined?
If so, consider dropping the indexes before updating and re-building them after the update is complete.
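A minimal sketch of that drop/rebuild pattern with PROC DATASETS — the library, table, and index name `deal_idx` are assumptions for illustration:

```sas
proc datasets library=mylib nolist;
  modify table_v1;
  index delete deal_idx;   /* drop the index before the heavy update */
run; quit;

/* ... perform the update here ... */

proc datasets library=mylib nolist;
  modify table_v1;
  index create deal_idx=(deal_id date);   /* rebuild it afterwards */
run; quit;
```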
If this is a SAS dataset in a SAS library, consider using DATA step processing instead. Something like this:
/* Create some data to experiment with */
data mylib.have;
  length deal_id date variable 8;
  format deal_id z4. date variable yymmddd10.;
  drop _:;
  retain deal_id 100;
  do _m=1 to 12 by 2;
    do _d=1 to 31 by 3;
      date=mdy(_m,_d,2024);
      if not _error_ then do;
        variable=date;
        output;
      end;
    end;
  end;
run;
title "MYLIB.HAVE - Before mods";
proc print data=mylib.have(obs=5);run;
/* If you can, create both data sets in the same library: */
/* Add the new column, write out to a new data set */
data mylib.want;
  set mylib.have;
  format shifted_variable yymmddd10.;
  shifted_variable=intnx('month', date, -1, 'E');
run;
proc datasets library=mylib nolist nodetails;
  /* Then, delete the original */
  delete have;
  run;
  /* And rename the new data set with the original data set's name */
  change want=have;
run; quit;
title "MYLIB.HAVE - After mods";
proc print data=mylib.have(obs=5);run;
MYLIB.HAVE - Before mods

| Obs | deal_id | date | variable |
|---|---|---|---|
| 1 | 0100 | 2024-01-01 | 2024-01-01 |
| 2 | 0100 | 2024-01-04 | 2024-01-04 |
| 3 | 0100 | 2024-01-07 | 2024-01-07 |
| 4 | 0100 | 2024-01-10 | 2024-01-10 |
| 5 | 0100 | 2024-01-13 | 2024-01-13 |
MYLIB.HAVE - After mods

| Obs | deal_id | date | variable | shifted_variable |
|---|---|---|---|---|
| 1 | 0100 | 2024-01-01 | 2024-01-01 | 2023-12-31 |
| 2 | 0100 | 2024-01-04 | 2024-01-04 | 2023-12-31 |
| 3 | 0100 | 2024-01-07 | 2024-01-07 | 2023-12-31 |
| 4 | 0100 | 2024-01-10 | 2024-01-10 | 2023-12-31 |
| 5 | 0100 | 2024-01-13 | 2024-01-13 | 2023-12-31 |
If your SAS data set had indexes or integrity constraints, they will have been deleted by the DATA step process. You can rebuild them on the new dataset using the PROC DATASETS INDEX CREATE and IC CREATE statements.
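For example — the index and constraint names here are made up for illustration:

```sas
proc datasets library=mylib nolist;
  modify have;
  index create deal_idx=(deal_id date);          /* rebuild a composite index */
  ic create not_missing_id = not null (deal_id); /* rebuild a constraint      */
run; quit;
```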
Is this a SAS dataset we are talking about?
Don't use ALTER TABLE. That will not make a NEW dataset; instead, you will have to push individual values into individual observations (think counting grains of sand, or replacing all of the orange grains of sand with yellow grains). https://en.wikipedia.org/wiki/File:Asynchronous_Syntropy.jpg
Instead just run a step to make the new dataset.
And if by "shift" you just mean remember the previous value then the step is almost trivial.
data want;
  set have;
  by deal_id date;
  shifted_variable=lag(variable);
  if first.deal_id then call missing(shifted_variable);
run;
Performance is fickle and depends on the data and how you use it.
ALTER creates a new data set, and isn't usually faster than CREATE, but it depends.
SQL does things in a different way, and can also be slow compared to DATA steps, which SAS has really tuned over the years.
Consider this:
data TEST1 TEST2 TEST3;
  do I=1 to 5e7;
    output;
  end;
run;

proc sql; %*** 4.1 seconds ***;
  alter table TEST1 add J int;
quit;

proc sql; %*** 4.7 seconds ***;
  create table TEST2 as select *, . as J from TEST2;
quit;

data TEST3; %*** 2.7 seconds ***;
  set TEST3;
  retain J .;
run;
A data step is fastest here, CREATE takes 70% more time. ALTER is in the middle.
I ran that code on SAS ODA btw.
Bottom line: You'll build a feel for what methods work best for different tasks, but when performance becomes important, nothing replaces bench-marking your processes with your data on your hardware.
Hello @abigel,
You know your data best and also what the next steps will be after adding that "shifted date column" to the large dataset.
That said, looking at your code and running it against simple test data (see further below), there are a couple of things I noticed suggesting that aiming at a different result (e.g., a separate rather than augmented dataset) might be more efficient than your current approach:
t1.date = intnx('month', t2.date, -1, 'E')

The right-hand side of this condition is always an end-of-month date, so only the month-end rows of t1 can find a match, and each of those rows matches every observation of the following month in t2. The month-end rows are therefore multiplied in the output, which is why the join below returns 8532 rows from 4383 input rows.
/* Create simplified test data */
data have;
  call streaminit(27182818);
  do deal_id=1 to 3;
    do date='01JAN2021'd to '31DEC2024'd;
      v=rand('integer',100,999);
      x=rand('uniform');
      output;
    end;
  end;
  format date date9.;
run; /* 4383 obs. */
/* Perform the original left join */
proc sql;
  create table result as
  select t1.*, t2.v as shifted_v
  from have as t1
  left join (select date, deal_id, v from have) as t2
    on t1.deal_id = t2.deal_id and t1.date = intnx('month', t2.date, -1, 'E')
  order by t1.deal_id, t1.date;
quit; /* 8532 obs. */
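A quick way to confirm the row multiplication in RESULT is to compare the total row count against the count of distinct deal_id/date keys — the gap is the number of duplicate-key rows the join condition introduced:

```sas
/* Total rows vs. distinct deal_id/date keys in the joined result.  */
/* On the test data above, expect more rows than keys because the   */
/* month-end rows of T1 match every row of the following month.     */
proc sql;
  select count(*) as n_rows,
         count(distinct catx('|', deal_id, date)) as n_keys
  from result;
quit;
```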