When creating/replacing a data set with a data step with a REPLACE=YES option, does the old data set get deleted/truncated as soon as new data shows up, or does the new copy get created so that both the old and new exist during the data step and the old one gets deleted at the successful conclusion of the data step?
The reason I'm asking is, we have some large (relative to their available storage) data sets that occasionally get rebuilt from scratch from external data. We could delete the data set first to ensure there's enough space for the new version, and in practice that's what we end up doing manually when multiple jobs lay claim to insufficient free space and things blow up. The argument then becomes "the replace=yes option was set, so it shouldn't be a problem."
Clearly we need better data management around this constrained resource. But the question remains: Does REPLACE=YES in a data step free up and re-use the original data sets space, or does it essentially need at least as much free space as the data set it's replacing?
The replace statement in datastep works the same way as the system options replace. It just tells the system whether or not to push out an error when you use the same name coming in as the name going out. It does not affect how SAS reads/writes the data. This is controlled in datastep by the PDV (program data vector), one row of data is read from the dataset, processed by the datastep, then written out. The file is locked during this process. So yes, you will need to have more storage than just the file alone.
When you say these things are being rebuilt from scratch, then I assume you would want both sets anyways, I would assume that this would be part of the validation process, and that the rebuild would not be pushed to production without some sort of testing procedure? So you would have two environments? If no, then yes, I would highly advise look at the whole process, it may be better to use update methodology, or to have a test environment or something in place.
my experience has been that the original SAS dataset lives while the SAS process creates a locked version.
After the process has completed the locked version replaces then overwrites original dataset.
so if the dataset is not deleted from the server or server parturition you will need an estimated 2 times the space to preform the procedure.
As soon as a large dataset is replaced from another source (so you basically do not have
data x;
set x;
.....
run;), you can (and should) delete it beforehand (make sure there is a backup and that processing up to this point was successful).
In my batch jobs, I always do that by using the operating system's tools, not only because I might unnecessarily run out of space, but because I avoid any problems caused by datasets in use.
If you (re-) create a SAS data set via a SAS data step then SAS always writes to a new generated table name. Only when the whole data step finishes successfully this generated table gets renamed to the name you've used in your DATA statement. If a same named table already exists in this location then the new table will replace the existing one.
This process is very robust and I can't remember that I've ever ended up with a corrupted table - let's say due to an out of disk space condition (which would occur when the temporary table gets written; the rename/replace doesn't require additional disk space).
It's a different story of course if you're using code like the MODIFY statement or SQL which changed data in-place (same table).
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.
