04-15-2014 03:53 AM
I have a SAS dataset with around 4000 million records and 26 columns.
I need to add 2 new char columns containing just blanks.
Are the below statements efficient?
alter table libname.tablename add colName1 char(3) format=$3.;
alter table libname.tablename add colName2 char(3) format=$3.;
If not, is there a better way to add columns to a large dataset so that it doesn't take a huge amount of time?
04-15-2014 04:23 AM
If you know that you want to add two columns, have them added in one pass of the data; two separate alter table statements will do this in two separate passes.
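For example, a one-pass version of your statements, combining both adds into a single alter table:

proc sql;
   alter table libname.tablename
      add colName1 char(3) format=$3.,
          colName2 char(3) format=$3.;
quit;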
Do you have any indexes or constraints on the table?
Is it stored in a Base SAS library?
04-15-2014 05:45 AM
Not sure what you are trying to achieve here, but as a suggestion: if you have a large table like that and some of the data will carry, say, a flag, why not create a secondary table with just the ID variables and the flag, fill it out as necessary, and join it on when needed further in the program? That follows the RDBMS principle of keeping the minimum amount of information necessary: the additional table stays empty until some data is derived, so it costs almost nothing. If you add the two columns to the main table instead, its size increases because every row carries the additional variables.
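A minimal sketch of that idea; the key variable id and the table names are illustrative, not taken from your data:

proc sql;
   /* narrow side table: just the key and the flag */
   create table work.flags
      (id num, flag char(3));

   /* join the flag back on only when it is needed */
   create view work.combined as
      select a.*, b.flag
      from libname.tablename as a
           left join work.flags as b
           on a.id = b.id;
quit;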
04-15-2014 06:51 AM
Ok, quite important info...
Is the table a member of a cluster, or a cluster itself?
Try the alter table statement (just one!), and put it in an explicit SQL pass-through block.
I can't guarantee that this is the most efficient option, so try it out first on a smaller temporary subset of the original table (with indexes).
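A rough sketch of what that pass-through could look like; the connection options (domain, server, port, credentials) are placeholders, so check the SPDS documentation for your release:

proc sql;
   connect to sasspds
      (dbq='mydomain' server=spdshost.5400
       user='myuser' passwd='XXXXX');
   execute (alter table tablename
               add colName1 char(3),
                   colName2 char(3)) by sasspds;
   disconnect from sasspds;
quit;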
04-16-2014 02:42 AM
If you don't have the capacity to add the columns by rewriting the data, just add them in the process which reads the data. The easy way is to create a VIEW, and the best performance would be achieved with an SQL view.
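A sketch of such a view, adding the two blank columns at read time instead of rewriting the 4000M-row table (the view name tablename_v is illustrative):

proc sql;
   create view libname.tablename_v as
      select t.*,
             '   ' as colName1 length=3 format=$3.,
             '   ' as colName2 length=3 format=$3.
      from libname.tablename as t;
quit;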
04-15-2014 04:38 AM
You can do it in a data step if you just have a "simple" data set without indexes, constraints or other extras:
length colName1 colName2 $3; * format alone does not set the storage length;
format colName1 colName2 $3.;
retain colName1 " " colName2 " "; * avoids "uninitialized" notes in the log;
Any method has to rewrite the whole data set (because the structure of the records changes), so you will mostly be bound by I/O.
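For completeness, a minimal full step around those statements, assuming a plain Base SAS dataset; the output name want is illustrative (writing to a new dataset temporarily needs space for both copies, just as rewriting in place does):

data want;
   set libname.tablename;
   length colName1 colName2 $3;
   format colName1 colName2 $3.;
   retain colName1 " " colName2 " "; * avoids "uninitialized" notes in the log;
run;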
04-15-2014 06:56 AM
Check the sasspds reference: https://support.sas.com/documentation/cdl/en/engspde/64755/PDF/default/engspde.pdf
You still need to think about the physical effects and performance, though; it does not free you of that headache.
04-16-2014 02:15 AM
Of course, once you get into issues like this you have to think about the physical effects.
What I deduce:
- 4000M records with 28 variables (the 26 existing plus the 2 new ones) is about 400 GB or more of data (assuming roughly 100 bytes per row).
- I do not see which OS you are on (Unix/Windows); there are technical differences.
You need about twice that size in free space to replace the dataset, and I do not expect you have that much free space in the SASSPDS setup.
Fully reading 1 TB of data can easily take 6 hours of processing, depending on your I/O speed.
What you can do is unload to another location, clean up, and reload with all the needed additional definitions.
That needs scheduling time, backup/recovery, etc. If you can activate a new location in place of the old one, it is just a renaming switch.
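To make those back-of-the-envelope numbers concrete, a throwaway step with the assumptions spelled out (100 bytes per row and 50 MB/s sustained I/O are guesses, not measurements):

data _null_;
   rows  = 4000 * 1e6;            /* 4000M records */
   brow  = 100;                   /* assumed average bytes per row */
   gb    = rows * brow / 1e9;     /* roughly 400 GB of data */
   mbsec = 50;                    /* assumed sustained I/O rate, MB/s */
   hours = (1e6 / mbsec) / 3600;  /* reading 1 TB: about 5.6 hours */
   put gb= hours=;
run;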