The divided file takes up too much space

makset · Posted 10-01-2020 03:19 AM

Hi guys
I have a little problem.
In my program I read a file of about 50 GB. To speed up the calculations, I split it into smaller files according to one of the variables, which shortened the calculations significantly. But another problem arose: the split files take up much more than 100 GB of disk space (many small files of 128 kB each). I use SAS 9.4. Can anything be done about it?

Thank you for your help
Best regards

Kurt_Bremser · Posted 10-01-2020 03:34 AM

We can't do anything about it, that's up to you. We do not have access to your SAS server 😉

But we can give you hints.

My first suspicion is that your original dataset is compressed, and your subset datasets are not.

Make sure to use the COMPRESS=YES dataset option when creating the subsets.

Run a PROC CONTENTS on your original dataset to see if and how it is compressed.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

ChrisNZ · Posted 10-01-2020 05:01 AM

1. Going from 50 GB to 128 kB files seems like way too many small files. Can you make larger chunks? They will compress better. If you use a binary-compressed SPDE library, the compression will be much higher still, but not on such small files.

2. Another way is to store the files in a compressed folder. Larger files are also better here.

3. Another way is to process the original large file in chunks by using a BY statement, or by using successive where clauses.

4. 50 GB split into 128 kB files is about 400,000 files. Are you sure you want this?

5. With 128 kB files, you waste a good part of the disk space, depending on the file system's cluster size.
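For anyone who wants to check the numbers in points 4 and 5, here is a rough back-of-the-envelope sketch. It is plain Python, not SAS, and the cluster sizes and the 130 kB payload are hypothetical values chosen only for illustration; your file system may differ.

```python
KB = 1024
GB = 1024 ** 3

def ceil_div(a, b):
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)

def file_count(total_bytes, file_bytes):
    """Number of files needed if the data is split into equal-sized pieces."""
    return ceil_div(total_bytes, file_bytes)

def bytes_on_disk(payload_bytes, cluster_bytes):
    """Space a single file occupies when rounded up to whole clusters."""
    return ceil_div(payload_bytes, cluster_bytes) * cluster_bytes

# 50 GB split into 128 kB files (binary units assumed)
n_files = file_count(50 * GB, 128 * KB)
print(n_files)

# Hypothetical case: a 130 kB file on a file system with 64 kB clusters
occupied = bytes_on_disk(130 * KB, 64 * KB)
wasted = occupied - 130 * KB
print(occupied // KB, wasted // KB)
```

With binary units the count comes out slightly above 400,000. The second calculation shows how a file just over a cluster boundary is rounded up to the next whole cluster, and that rounding is paid once per file.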

In summary: the method you describe seems sub-optimal.

High-Performance SAS Coding - Third Edition

makset · Posted 10-01-2020 06:16 AM

I split the entire dataset by the values of the three variables.
The small files are not my choice; they result from the distribution of the variables.

This is not optimal in terms of disk space, but in terms of computing speed it is.

Kurt_Bremser · Posted 10-01-2020 06:42 AM

If you need to process all data anyway, the overall time will increase by splitting. And some analysis will only be valid if run on all data at once.

Splitting makes sense if only a subset is needed repeatedly (otherwise a WHERE condition in the first step will be sufficient), resulting in LESS disk space, not MORE as in your case, or if you just need an arbitrary subset for testing your code before running it on the whole dataset.


ChrisNZ · Posted 10-01-2020 05:05 PM

> This is not optimal in terms of disk space, but in terms of computing speed, yes

I doubt it, and as you can see I am not the only one.

And we haven't mentioned the time needed to create and delete these files.

Anyway, it seems you have two solutions:

- Implement our suggestions and have larger files. Maybe use 2 variables instead of 3? Or even 1.

- Keep your method. In this case, you create 200,000 files, process them, delete them, and then do the same for the other half.


Kurt_Bremser · Posted 10-01-2020 05:17 AM

I totally missed that:

> many small files of 128kb each

That is WAY too small. That's basically a single SAS dataset page for each, so you create LOTS of overhead, no matter which options you use.

Try your calculation on a subset of about 5 GB in size (a tenth of the original dataset).

And also identify which part of your calculations takes up too much time when using the whole dataset. There may be more efficient methods that allow you to process the whole dataset at once.
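To illustrate what processing everything in one pass can look like, here is a minimal sketch of group-wise processing over data that is already sorted by the splitting variables. It is Python rather than SAS, and the column names, sample rows, and the `total_amount` summary are all hypothetical; in SAS the comparable idea would be a BY statement on sorted data.

```python
from itertools import groupby
from operator import itemgetter

def process_by_group(rows, key_fields, summarize):
    """Yield (key, summary) for each run of consecutive rows sharing a key.

    Assumes rows are already sorted by key_fields; only one group is
    consumed at a time, so no per-group files are written to disk.
    """
    key = itemgetter(*key_fields)
    for group_key, group_rows in groupby(rows, key=key):
        yield group_key, summarize(group_rows)

def total_amount(group_rows):
    """Hypothetical per-group calculation: sum of the 'amount' column."""
    return sum(row["amount"] for row in group_rows)

# Hypothetical sample data, sorted by (region, product)
rows = [
    {"region": "A", "product": "x", "amount": 10},
    {"region": "A", "product": "x", "amount": 5},
    {"region": "A", "product": "y", "amount": 7},
    {"region": "B", "product": "x", "amount": 1},
]

results = list(process_by_group(rows, ["region", "product"], total_amount))
for group_key, total in results:
    print(group_key, total)
```

The point of the design is that each group is handled as it streams past, so the per-group speed-up does not require creating hundreds of thousands of separate files first.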

