makset
Obsidian | Level 7

Hi guys
I have a little problem.
In my program I read a file of about 50 GB, so to speed up the calculations I split it into smaller files according to one of the variables. This shortened the calculations significantly, but another problem arose: the split files take up much more than 100 GB of disk space (many small files of 128 kB each). I use SAS 9.4. Can something be done about it?

 

Thank you for your help
Best regards

6 REPLIES
Kurt_Bremser
Super User

We can't do anything about it, that's up to you. We do not have access to your SAS server 😉

But we can give you hints.

My first suspicion is that your original dataset is compressed, and your subset datasets are not.

Make sure to use the COMPRESS=YES dataset option when creating the subsets.

Run a PROC CONTENTS on your original dataset to see if and how it is compressed.
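
As a minimal sketch of both hints, assuming the original dataset is stored as BIG.SOURCE and is split on a variable REGION (these names are hypothetical and not taken from the original post):

```sas
/* Hypothetical names: library BIG, dataset SOURCE, variable REGION */

/* Check whether and how the original dataset is compressed:
   look at the "Compressed" attribute in the output */
proc contents data=big.source;
run;

/* Create one subset with compression switched on for the output dataset */
data work.subset_a (compress=yes);
  set big.source;
  where region = 'A';
run;
```

The COMPRESS= dataset option applies only to the dataset being created, so it has to be set on every subset (or once for the session through the COMPRESS= system option).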

ChrisNZ
Tourmaline | Level 20

1. Going from 50 GB to 128 kB pieces seems like far too many small files. Can you make larger chunks? They will compress better. If you use a binary-compressed SPDE library, the compression will be much higher still, but not on such small files.

2. Another way is to store the files in a compressed folder. Larger files are also better here.

3. Another way is to process the original large file in chunks by using a BY statement, or by using successive where clauses.

4. 50 GB in 128 kB files is about 400,000 files. Are you sure you want this?

5. With 128 kB files, you waste a good part of the disk space, depending on the file system's cluster size. 
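
Point 3 above (processing the large file in chunks with a BY statement) could be sketched roughly as follows. The dataset BIG.SOURCE, the key variable REGION and the numeric variable AMOUNT are hypothetical, and PROC MEANS only stands in for whatever calculation is actually needed:

```sas
/* Hypothetical names: BIG.SOURCE, key variable REGION, numeric variable AMOUNT */

/* BY-group processing needs the data sorted (or indexed) by the key */
proc sort data=big.source out=work.sorted (compress=yes);
  by region;
run;

/* One pass over the whole file, one result row per BY group */
proc means data=work.sorted noprint;
  by region;
  var amount;
  output out=work.stats mean=mean_amount;
run;
```

This keeps a single dataset on disk instead of one file per group. For a summary procedure such as PROC MEANS, a CLASS statement in place of BY would avoid the sort altogether.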

In summary: the method you describe seems sub-optimal.

 

makset
Obsidian | Level 7

I split the entire dataset by the values of the three variables.
The small files are not my choice; they result from the distribution of the variables.

This is not optimal in terms of disk space, but in terms of computing speed, yes

Kurt_Bremser
Super User

If you need to process all data anyway, the overall time will increase by splitting. And some analysis will only be valid if run on all data at once.

Splitting makes sense if only a subset is needed repeatedly (otherwise a WHERE condition in the first step will be sufficient), resulting in LESS disk space, not MORE as in your case, or if you just need an arbitrary subset for testing your code before running it on the whole dataset.
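
A WHERE condition in the first step might look like the sketch below, which reads only the needed rows from the full dataset without writing a separate subset file. The names BIG.SOURCE, REGION and PRODUCT are hypothetical:

```sas
/* Hypothetical names: BIG.SOURCE, variables REGION and PRODUCT */

/* The WHERE= dataset option filters rows as they are read,
   so no intermediate subset dataset is stored on disk */
proc freq data=big.source (where=(region = 'A'));
  tables product;
run;
```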

ChrisNZ
Tourmaline | Level 20

> This is not optimal in terms of disk space, but in terms of computing speed, yes

I doubt it, and as you can see I am not the only one.

And we haven't mentioned the time needed to create and delete these files.

 

Anyway, it seems you have two solutions:

- Implement our suggestions and have larger files. Maybe use 2 variables instead of 3? Or even 1.

- Keep your method. In this case, you create 200,000 files, process them, and then do the same for the other half.

 

 

Kurt_Bremser
Super User

I totally missed that:

> many small files of 128kb each

 

That is WAY too small. That's basically a single SAS dataset page for each, so you create LOTS of overhead, no matter which options you use.

Try your calculation on a subset of about 5 GB in size (a tenth of the original dataset).
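
One way to draw such a test subset is to keep every tenth observation, as in this sketch. The dataset name BIG.SOURCE is hypothetical, and a systematic sample is only one possible choice; it yields roughly a tenth of the rows, not exactly 5 GB:

```sas
/* Hypothetical name: BIG.SOURCE */

/* Keep observations 1, 11, 21, ... which is about a tenth of the rows */
data work.sample (compress=yes);
  set big.source;
  if mod(_n_, 10) = 1;
run;
```

If the calculation depends on complete groups of the split variables, selecting whole groups with a WHERE condition would be a better fit than sampling rows.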

And also identify which part of your calculations takes up too much time when using the whole dataset. There may be more efficient methods that allow you to process the whole dataset at once.
