Hello. I have a broad question about working with large datasets in SAS.
In the past, my work has used datasets that were relatively small. My preferred way of working with that data is to create a temporary dataset in the WORK library, work through a session/program with multiple temporary datasets, and then save a new permanent dataset from the last temporary one at the end of the session/program.
For example:
data tempwork; set permlib.originaldata; run;
... SAS program code...
data permlib.originaldata_v1_032919; set tempwork20; run;
Each time a new work session is started, I make a temporary dataset from the most recent permanent one and begin the process over (e.g. data tempwork; set permlib.originaldata_v1_032919..... data permlib.originaldata_v2_04202019.... run;). This ends up producing multiple permanent datasets. Maybe this is not "correct" or efficient, but I like working with multiple temp datasets while building toward my final dataset.
I am starting a project with a dataset that totals about 600 GB. My former process is not possible, because keeping multiple copies of such big files is not an option. My idea was to copy over the same permanent dataset repeatedly: for example, instead of making a _v2, just write back over _v1. This is not working because it replaces the historical dataset, so I cannot rerun code without going all the way back to the first program and the original dataset. I'm not sure how to proceed with this data using my preferred process.
Any suggestions for how to keep (or revert to) historical versions of datasets without taking up too much additional hard drive space? Could PROC DATASETS be used here? Any advice, with code, would be much appreciated.
Hello @eabc0351,
You've already received some excellent advice. My first idea when I read your post was: You could draw a suitable "representative" sample from this dataset and use that for developing code (e.g. for initial data checks, data cleaning, restructuring, creating new variables, aggregated datasets, summary reports or whatever your tasks are). Then you would apply those programs to the full dataset only after they have been thoroughly tested and debugged, after final report templates have been agreed upon, etc. This should save you a lot of time and facilitate program development if it makes sense for your project and expected workflow.
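If it helps, here is a minimal sketch of drawing such a development sample with PROC SURVEYSELECT (requires SAS/STAT; the dataset names, the 1% rate and the seed are just placeholders to adapt):

proc surveyselect data=permlib.originaldata
                  out=work.devsample
                  method=srs        /* simple random sampling  */
                  samprate=0.01     /* keep about 1% of rows   */
                  seed=12345;       /* reproducible sample     */
run;

Without SAS/STAT, a simple subsetting IF achieves much the same: data work.devsample; set permlib.originaldata; if ranuni(12345) < 0.01; run;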
The question sort of becomes: why do you need to make all of those datasets?
The initial " temporary dataset in the work folder" is likely unneeded. Just make sure that the output from the first step that uses, and references, the permanent dataset does not overwrite anything you need to keep.
I am afraid that proper recommendations would require providing a lot more detail about the entire process.
Right, this is likely an issue with my process. But I think SAS generation datasets might be the answer for keeping historical versions.
I have a suspicion that multiple 600GB sets are not going to make the IT people happy with you.
How many generations do you think you may need? Five generations are going to be about 3 TB of storage.
They already aren't happy with me. We have 2 TB of storage for this. I can probably get rid of some of the unneeded observations before starting. If I needed 3 generations, @ballardw, can you provide code for how to do this? Or is that more detail than is possible in a post?
Hi @eabc0351
You should in principle be able to hold 3 generations of your 600 GB dataset in your allocated 2 TB. But depending on your process, space for an extra copy (SAS writes the replacement to a temporary .lck file before swapping it in) may be necessary, and in that case there is only room for 2 generations.
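Something like this should work for generation datasets, though I haven't tested it on data this size (permlib.bigdata and tempwork20 are placeholder names):

/* One-time setup: keep up to 3 generations of the dataset */
proc datasets library=permlib nolist;
    modify bigdata(genmax=3);
quit;

/* Every replacement now rolls the previous version into a
   generation (bigdata#001, #002, ...) instead of deleting it */
data permlib.bigdata;
    set work.tempwork20;
run;

/* Read or revert to an older version with GENNUM=:
   0 (default) = current, -1 = previous, -2 = one before that */
data work.check;
    set permlib.bigdata(gennum=-1);
run;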
Depending on the content of the big dataset, compression can work wonders. In my work a dataset is often reduced to anything from 20 to 60% of its uncompressed size, and the extra computing time is not substantial. Try both compression algorithms, COMPRESS=YES and COMPRESS=BINARY, and see what happens.
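As a sketch (placeholder names again), compression is just a dataset option, and the log note tells you how much was saved:

data permlib.bigdata(compress=binary);   /* or compress=yes for RLE */
    set work.tempwork20;
run;
/* Log: NOTE: Compressing data set PERMLIB.BIGDATA decreased size by xx percent. */

proc contents data=permlib.bigdata;      /* output shows Compressed: BINARY */
run;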
How much space is allocated to the SAS WORK/UTILLOC libraries in your installation? This might be a bottleneck with such large datasets.
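You can check where they point with PROC OPTIONS and then look at the free space on those paths at the operating-system level:

proc options option=(work utilloc);
run;   /* the paths are written to the log */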
Hi @ErikLund_Jensen. I have not tried compressing data before. This sounds like a good option. To your last question: I don't know offhand, but I have never run out of space in the WORK library.
The information you provided is very helpful. Thank you!