eabc0351
Quartz | Level 8

Hello. I have a broad question about working with large datasets in SAS.

 

In the past, my work has used datasets that were relatively small in size. My preferred way of working with that data is to create a temporary dataset in the work folder, work through a session/program with multiple temporary datasets, and then save a new permanent dataset from the last temporary dataset at the end of the session/program.

For example:

    data tempwork; set permlib.originaldata; run;

    ... SAS program code ...

    data permlib.originaldata_v1_032919; set tempwork20; run;

Each time a new work session is started, I make a temporary dataset from the most recent permanent one and begin the process over (e.g. data tempwork; set permlib.originaldata_v1_032919; ... data permlib.originaldata_v2_04202019; ... run;). This ends up creating multiple permanent datasets. Maybe this is not "correct" or efficient, but I like working with multiple temp datasets when building toward my final dataset.

 

I am starting a project with a dataset that totals about 600 GB. My former process is not possible, because keeping multiple copies of such big files is not an option. My idea was to copy over the same permanent dataset repeatedly. For example, instead of making a _v2, just write back over _v1. This is not working because it replaces the historical dataset, so I cannot rerun code without going all the way back to the beginning program and the original dataset. I'm not sure how to proceed with this data using my preferred process.

 

Any suggestions for how to keep (or revert to) historical versions of datasets without taking up too much additional hard drive space? Could PROC DATASETS be used here? Any advice, with code, would be much appreciated.

 

 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
FreelanceReinh
Jade | Level 19

Hello @eabc0351,

 

You've already received some excellent advice. My first idea when I read your post was: You could draw a suitable "representative" sample from this dataset and use that for developing code (e.g. for initial data checks, data cleaning, restructuring, creating new variables, aggregated datasets, summary reports or whatever your tasks are). Then you would apply those programs to the full dataset only after they have been thoroughly tested and debugged, after final report templates have been agreed upon, etc. This should save you a lot of time and facilitate program development if it makes sense for your project and expected workflow.
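A minimal sketch of drawing such a development sample with PROC SURVEYSELECT (the sampling rate and seed here are illustrative, not recommendations):

```sas
/* Draw a 1% simple random sample of the big dataset for code development */
proc surveyselect data=permlib.originaldata out=work.devsample
                  method=srs        /* simple random sampling */
                  samprate=0.01     /* 1% of observations     */
                  seed=20190329;    /* fixed seed for a reproducible sample */
run;
```

Once the programs behave correctly on work.devsample, pointing them back at permlib.originaldata is a one-line change.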


9 REPLIES
ballardw
Super User

The question then becomes: why do you need to make all of those datasets?

 

The initial " temporary dataset in the work folder" is likely unneeded. Just make sure that the output from the first step that uses, and references, the permanent dataset does not overwrite anything you need to keep.

 

 

I am afraid that proper recommendations would require a lot more detail about the entire process.

 

eabc0351
Quartz | Level 8

Right, this is likely an issue with my process. But I think SAS generation datasets might be the answer for keeping historical versions.
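Generation datasets are a built-in SAS feature for exactly this; a minimal sketch using the library and dataset names from the original post (the GENMAX value of 3 is illustrative):

```sas
/* Enable generations on the permanent dataset, keeping up to 3 versions */
proc datasets library=permlib nolist;
   modify originaldata (genmax=3);
quit;

/* From now on, each replacement automatically archives the prior version */
data permlib.originaldata;
   set permlib.originaldata;
   /* ... updates ... */
run;

/* Historical versions are addressed by relative generation number */
data work.check;
   set permlib.originaldata(gennum=-1);  /* the previous generation */
run;
```

Note that SAS stores each generation as a full physical copy, so disk usage scales with GENMAX.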

 

ballardw
Super User

I have a suspicion that multiple 600GB sets are not going to make the IT people happy with you.

How many generations are you thinking you may need? Five generations would take about 3 TB of storage.

eabc0351
Quartz | Level 8

They already aren't happy with me. We have 2 TB of storage for this. I can probably get rid of some of the unneeded observations before starting. If I needed 3 generations, @ballardw, can you provide code for how to do this? Or is that more detailed than is possible in a post?

 

 

ErikLund_Jensen
Rhodochrosite | Level 12

Hi @eabc0351 

 

You should in principle be able to hold 3 generations of your 600 GB dataset in your allocated 2 TB. But depending on your process, space may be needed for an extra copy (the temporary .lck file SAS writes while replacing a dataset), in which case there is only room for 2 generations.

 

Depending on the content of the big dataset, compression can work wonders. In my work, datasets are often reduced to between 20% and 60% of their uncompressed size, and the extra computing time is not substantial. Try both compression algorithms, COMPRESS=YES and COMPRESS=BINARY, and see what happens.
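One way to compare the two algorithms before committing the full 600 GB is to compress a slice of the data with each setting and check the "decreased size by ... percent" notes in the log (the OBS= limit here is illustrative):

```sas
/* Write the same 1,000,000-row slice with each compression algorithm;
   the log reports the size reduction achieved by each */
data work.test_char (compress=yes)      /* RLE: tends to suit character-heavy data */
     work.test_bin  (compress=binary);  /* RDC: tends to suit numeric/binary-heavy data */
   set permlib.originaldata(obs=1000000);
run;
```

Whichever wins can then be applied to the permanent dataset, or set session-wide with OPTIONS COMPRESS=.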

 

How much space is allocated to the sas work/utilloc libraries in your installation? This might be a bottleneck with such large data sets.

eabc0351
Quartz | Level 8

Hi @ErikLund_Jensen. I have not tried compressing data before. This sounds like a good option. To your last question, I don't know offhand, but have never run out of space in the work library.

 

The information you provided is very helpful. Thank you!

tomrvincent
Rhodochrosite | Level 12
I'd start by normalizing the data, moving redundant fields into dimension tables. You could then apply SCD (slowly changing dimension) logic to those tables.
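A minimal sketch of splitting a repeated field out into a dimension table; the variable names (customer_name, customer_address) are hypothetical stand-ins for whatever fields repeat in the real data:

```sas
/* Build a dimension table of distinct values for repeated fields.
   The fact table would then carry only a key instead of the full text,
   which is where the space savings come from. */
proc sort data=permlib.originaldata(keep=customer_name customer_address)
          out=permlib.dim_customer
          nodupkey;               /* keep one row per distinct combination */
   by customer_name customer_address;
run;
```

From there, a surrogate key can be assigned to each dimension row and merged back into the main table, and SCD logic applied when dimension values change over time.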
eabc0351
Quartz | Level 8

Thank you for another good idea @FreelanceReinh. Very helpful!

 

 

