Dear All,
I have a large dataset stored on a remote SAS server, approximately 400 million rows and 88 columns. My goal is to create a new column by shifting a date variable by one month. I intend to do this with a simple left join, similar to what I would do on a SQL Server.
My problem is as follows: when I tried to use the ALTER TABLE command, it ran indefinitely without producing any results. However, when I created a new table, the process finished in 30 minutes. How should I properly execute this process?
Here is the hard-copy version of my code:
rsubmit;
%let start_time = %sysfunc(datetime());

proc sql;
  drop table table_v2;
  create table table_v2 as
  select *, t2.variable as shifted_variable
  from table_v1 as t1
  left join (select date, deal_id, variable from table_v1) as t2
    on t1.deal_id = t2.deal_id
   and t1.date = intnx('month', t2.date, -1, 'E')
  order by t1.deal_id, t1.date;
quit;

%let end_time = %sysfunc(datetime());
%let duration = %sysevalf(&end_time - &start_time);
%put NOTE: The query took &duration seconds to run.;
endrsubmit;
Thank you for your assistance!
Gabor
Welcome @abigel to the SAS Support Community!
You wrote:
However, when I created a new table, the process finished in 30 minutes. How should I properly execute this process?
I would use the new table creation process. It's simply faster.
As to why this is the case, I write the below as a non-expert in PROC SQL:
"Integrity constraints are rules that you specify to guarantee the accuracy, completeness, or consistency of data in tables. All integrity constraints are enforced when you insert, delete, or alter data values in the columns of a table for which integrity constraints have been defined. Before a constraint is added to a table that contains existing data, all the data is checked to determine that it satisfies the constraints."
This suggests to me that the ALTER TABLE statement would need to spend time applying those integrity constraints while modifying every observation in the table. OTOH, I presume that creating a new table with the code you have shown would not automatically do so.
In addition to the integrity constraints mentioned by @mkeintz, indexes on the table may also require modification. So, a couple of questions:
1. Is your data in a SAS data set, or is it stored in a database table accessed via a SAS/ACCESS LIBNAME engine?
If it is in a database, the INTNX function cannot be passed to the database. This means that the entire table would have to be pulled into SAS, the data modified, then copied back into the database. Depending on bandwidth and server co-location, this could really choke your performance.
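If the table does sit in a database, one option is explicit SQL pass-through so the work never leaves the database. A rough sketch, assuming a SQL Server source reached via ODBC — the data source name and table names are placeholders, and EOMONTH(date, -1) plays the role of INTNX('month', date, -1, 'E'):

```sas
/* Hypothetical explicit pass-through: the join runs entirely in the */
/* database. Connection options and table names are placeholders.   */
proc sql;
  connect to odbc as db (datasrc=mydsn);
  execute (
    select t1.*, t2.variable as shifted_variable
    into table_v2
    from table_v1 as t1
    left join table_v1 as t2
      on t1.deal_id = t2.deal_id
     and t1.date = eomonth(t2.date, -1)  /* SQL Server counterpart of
                                            intnx('month', t2.date, -1, 'E') */
  ) by db;
  disconnect from db;
quit;
```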
2. Does the table have indexes or integrity constraints defined?
If so, consider dropping the indexes before updating and re-building them after the update is complete.
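A minimal sketch of that drop/rebuild pattern with PROC DATASETS — the library, table, and index name `deal_idx` are assumptions for illustration:

```sas
proc datasets library=mylib nolist;
  modify table_v1;
  index delete deal_idx;   /* drop the index before the heavy update */
run; quit;

/* ... perform the update here ... */

proc datasets library=mylib nolist;
  modify table_v1;
  index create deal_idx=(deal_id date);   /* rebuild it afterwards */
run; quit;
```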
If this is a SAS dataset in a SAS library, consider using DATA step processing instead. Something like this:
/* Create some data to experiment with */
data mylib.have;
  length deal_id date variable 8;
  format deal_id z4. date variable yymmddd10.;
  drop _:;
  retain deal_id 100;
  do _m=1 to 12 by 2;
    do _d=1 to 31 by 3;
      date=mdy(_m,_d,2024);
      if not _error_ then do;
        variable=date;
        output;
      end;
    end;
  end;
run;
title "MYLIB.HAVE - Before mods";
proc print data=mylib.have(obs=5);run;
/* If you can, create both data sets in the same library: */
/* Add the new column, write out to a new data set */
data mylib.want;
  set mylib.have;
  format shifted_variable yymmddd10.;
  shifted_variable=intnx('month', date, -1, 'E');
run;
proc datasets library=mylib nolist nodetails;
  /* Then, delete the original */
  delete have;
  run;
  /* And rename the new data set with the original data set's name */
  change want=have;
run; quit;
title "MYLIB.HAVE - After mods";
proc print data=mylib.have(obs=5);run;
MYLIB.HAVE - Before mods

| Obs | deal_id | date | variable |
|---|---|---|---|
| 1 | 0100 | 2024-01-01 | 2024-01-01 |
| 2 | 0100 | 2024-01-04 | 2024-01-04 |
| 3 | 0100 | 2024-01-07 | 2024-01-07 |
| 4 | 0100 | 2024-01-10 | 2024-01-10 |
| 5 | 0100 | 2024-01-13 | 2024-01-13 |
MYLIB.HAVE - After mods

| Obs | deal_id | date | variable | shifted_variable |
|---|---|---|---|---|
| 1 | 0100 | 2024-01-01 | 2024-01-01 | 2023-12-31 |
| 2 | 0100 | 2024-01-04 | 2024-01-04 | 2023-12-31 |
| 3 | 0100 | 2024-01-07 | 2024-01-07 | 2023-12-31 |
| 4 | 0100 | 2024-01-10 | 2024-01-10 | 2023-12-31 |
| 5 | 0100 | 2024-01-13 | 2024-01-13 | 2023-12-31 |
If your SAS data set had indexes or integrity constraints, they will have been deleted by the DATA step process. You can rebuild them on the new dataset using the PROC DATASETS INDEX CREATE and IC CREATE statements.
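For example — the index and constraint names here are made up for illustration:

```sas
proc datasets library=mylib nolist;
  modify have;
  index create deal_idx=(deal_id date);          /* rebuild a composite index */
  ic create not_missing_id = not null (deal_id); /* rebuild a constraint      */
run; quit;
```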
Is this a SAS dataset we are talking about?
Don't use ALTER TABLE. That will not make a NEW dataset; instead, you will have to push individual values into individual observations (think counting grains of sand, or replacing all of the orange grains of sand with yellow grains). https://en.wikipedia.org/wiki/File:Asynchronous_Syntropy.jpg
Instead just run a step to make the new dataset.
And if by "shift" you just mean remember the previous value then the step is almost trivial.
data want;
  set have;
  by deal_id date;
  shifted_variable=lag(variable);
  if first.deal_id then call missing(shifted_variable);
run;
Performance is fickle and depends on the data and how you use it.
ALTER creates a new data set, and isn't usually faster than CREATE, but it depends.
SQL does things in a different way, and can also be slow compared to DATA steps, which SAS has really tuned over the years.
Consider this:
data TEST1 TEST2 TEST3;
  do I=1 to 5e7;
    output;
  end;
run;

proc sql; %*** 4.1 seconds ***;
  alter table TEST1 add J int;
quit;

proc sql; %*** 4.7 seconds ***;
  create table TEST2 as select *, . as J from TEST2;
quit;

data TEST3; %*** 2.7 seconds ***;
  set TEST3;
  retain J .;
run;
A data step is fastest here, CREATE takes 70% more time. ALTER is in the middle.
I ran that code on SAS ODA btw.
Bottom line: You'll build a feel for what methods work best for different tasks, but when performance becomes important, nothing replaces bench-marking your processes with your data on your hardware.
Hello @abigel,
You know your data best and also what the next steps will be after adding that "shifted date column" to the large dataset.
That said, looking at your code and running it against simple test data (see further below), there are a couple of things I noticed suggesting that aiming at a different result (e.g., a separate rather than augmented dataset) might be more efficient than your current approach:
t1.date = intnx('month', t2.date, -1, 'E')

The right-hand side of this condition is always an end-of-month date, so only the month-end rows of t1 can find a match, and each of those rows matches every observation of the following month in t2. The month-end rows are therefore multiplied in the output, which is why the join below returns 8532 rows from 4383 input rows.
/* Create simplified test data */
data have;
  call streaminit(27182818);
  do deal_id=1 to 3;
    do date='01JAN2021'd to '31DEC2024'd;
      v=rand('integer',100,999);
      x=rand('uniform');
      output;
    end;
  end;
  format date date9.;
run; /* 4383 obs. */
/* Perform the original left join */
proc sql;
  create table result as
  select t1.*, t2.v as shifted_v
  from have as t1
  left join (select date, deal_id, v from have) as t2
    on t1.deal_id = t2.deal_id and t1.date = intnx('month', t2.date, -1, 'E')
  order by t1.deal_id, t1.date;
quit; /* 8532 obs. */
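A quick way to confirm the row multiplication in RESULT is to compare the total row count against the count of distinct deal_id/date keys — the gap is the number of duplicate-key rows the join condition introduced:

```sas
/* Total rows vs. distinct deal_id/date keys in the joined result.  */
/* On the test data above, expect more rows than keys because the   */
/* month-end rows of T1 match every row of the following month.     */
proc sql;
  select count(*) as n_rows,
         count(distinct catx('|', deal_id, date)) as n_keys
  from result;
quit;
```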