Hello. I have a broad question about working with large datasets in SAS.
In the past, my work has used datasets that were relatively small. My preferred way of working with that data is to create a temporary dataset in the WORK library, work through a session/program with multiple temporary datasets, and then save a new permanent dataset from the last temporary one at the end of the session/program.
For example:
data tempwork; set permlib.originaldata; run;
... SAS program code...
data permlib.originaldata_v1_032919; set tempwork20; run;
Each time a new work session is started, I make a temporary dataset from the most recent permanent one and begin the process over (e.g. data tempwork; set permlib.originaldata_v1_032919..... data permlib.originaldata_v2_04202019.... run;). This ends up producing multiple permanent datasets. Maybe this is not "correct" or efficient, but I like working with multiple temp datasets while building toward my final dataset.
I am starting a project with a dataset that totals about 600 GB. My former process is not possible, because keeping multiple copies of such big files is not an option. My idea was to copy over the same permanent dataset repeatedly: for example, instead of making a _v2, just write back over _v1. This is not working because it replaces the historical dataset, so I cannot rerun code without going all the way back to the first program and the original dataset. I'm not sure how to proceed with this data using my preferred process.
Any suggestions for how to keep (or revert to) historical versions of datasets without taking up too much additional hard drive space? Could PROC DATASETS be used here? Any advice, with code, would be much appreciated.
Hello @eabc0351,
You've already received some excellent advice. My first idea when I read your post was: You could draw a suitable "representative" sample from this dataset and use that for developing code (e.g. for initial data checks, data cleaning, restructuring, creating new variables, aggregated datasets, summary reports or whatever your tasks are). Then you would apply those programs to the full dataset only after they have been thoroughly tested and debugged, after final report templates have been agreed upon, etc. This should save you a lot of time and facilitate program development if it makes sense for your project and expected workflow.
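If it helps, here is a minimal sketch of drawing such a development sample with PROC SURVEYSELECT (requires SAS/STAT; the dataset names, the 1% rate and the seed are just placeholders to adapt):

proc surveyselect data=permlib.originaldata
                  out=work.devsample
                  method=srs        /* simple random sampling  */
                  samprate=0.01     /* keep about 1% of rows   */
                  seed=12345;       /* reproducible sample     */
run;

Without SAS/STAT, a simple subsetting IF achieves much the same: data work.devsample; set permlib.originaldata; if ranuni(12345) < 0.01; run;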
The question sort of becomes: why do you need to make all of those datasets?
The initial " temporary dataset in the work folder" is likely unneeded. Just make sure that the output from the first step that uses, and references, the permanent dataset does not overwrite anything you need to keep.
I am afraid that proper recommendations would require providing a lot more detail about the entire process.
Right, this is likely an issue with my process. But I think SAS generation datasets might be the answer for keeping historical versions.
I have a suspicion that multiple 600GB sets are not going to make the IT people happy with you.
How many generations do you think you may need? Five generations are going to be about 3 TB of storage.
They already aren't happy with me. We have 2 TB of storage for this. I can probably get rid of some of the unneeded observations before starting. If I needed 3 generations, @ballardw, can you provide code for how to do this? Or is that more detail than is possible in a post?
Hi @eabc0351
You should in principle be able to hold 3 generations of your 600 GB dataset in your allocated 2 TB. But depending on your process, space for an extra copy (SAS writes the replacement to a temporary .lck file before swapping it in) may be necessary, and in that case there is only room for 2 generations.
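Something like this should work for generation datasets, though I haven't tested it on data this size (permlib.bigdata and tempwork20 are placeholder names):

/* One-time setup: keep up to 3 generations of the dataset */
proc datasets library=permlib nolist;
    modify bigdata(genmax=3);
quit;

/* Every replacement now rolls the previous version into a
   generation (bigdata#001, #002, ...) instead of deleting it */
data permlib.bigdata;
    set work.tempwork20;
run;

/* Read or revert to an older version with GENNUM=:
   0 (default) = current, -1 = previous, -2 = one before that */
data work.check;
    set permlib.bigdata(gennum=-1);
run;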
Depending on the content of the big dataset, compression can work wonders. In my work a dataset is often reduced to anything from 20 to 60% of its uncompressed size, and the extra computing time is not substantial. Try both compression algorithms, COMPRESS=YES and COMPRESS=BINARY, and see what happens.
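As a sketch (placeholder names again), compression is just a dataset option, and the log note tells you how much was saved:

data permlib.bigdata(compress=binary);   /* or compress=yes for RLE */
    set work.tempwork20;
run;
/* Log: NOTE: Compressing data set PERMLIB.BIGDATA decreased size by xx percent. */

proc contents data=permlib.bigdata;      /* output shows Compressed: BINARY */
run;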
How much space is allocated to the SAS WORK/UTILLOC libraries in your installation? This might be a bottleneck with such large datasets.
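You can check where they point with PROC OPTIONS and then look at the free space on those paths at the operating-system level:

proc options option=(work utilloc);
run;   /* the paths are written to the log */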
Hi @ErikLund_Jensen. I have not tried compressing data before. This sounds like a good option. To your last question: I don't know offhand, but I have never run out of space in the WORK library.
The information you provided is very helpful. Thank you!