Disk space

river1 · Posted 12-18-2023 03:40 AM

Hello, I'm having problems with disk space, I am running a lot of code that is linking three datasets so I need to keep many of the datasets to refer back to. I delete wherever possible but they are huge files. I have used (compress=yes) wherever possible and I've dropped any variables I am not using.

At the moment I am running this code over and over again to retrieve data using different variables in the two databases but I'll need to work in a third next. In total it will be over 300 datasets for this segment but I can only store about 10 and then my 100GB space is depleted. I added the proc data step block to compress the output but the file size is exactly the same file size.

I've tried writing a programme to reduce the effort of changing the variables but I just ran out of space so I gave up. Are there any techniques that I can refer to or do people just buy external storage?

I am using SAS Enterprise Guide 8.3 Update 8 (8.3.8.206) (64-bit)z. Thank you for the advice

proc sql;
  create table MYLIB3.DATA3 as
  select *
  from MYLIB1.DATA1
  where ID in (select ID from MYLIB2.DATA2 where VAR_2="XXXX");
quit;


proc datasets library=MYLIB3;
  modify DATA3 (compress=yes);
run;


proc sql;
    select count(distinct ID) as DATA3
    from MYLIB3.DATA3;
quit;

LinusH · Posted 12-18-2023 04:55 AM

Verify that your compress is actually compressing your data (see the log).

If your data contains mostly numerical variables you could try COMPRESS=BINARY.

Other than that, re-factoring your code or adding more disk space are your options I believe.

Data never sleeps

river1 · Posted 12-24-2023 10:30 PM

Thanks for your help. The compress is not working with the datasets created via proc sql. I tried the COMPRESS=BINARY and the file size remained the same with the same warning messages. No problem, I’ll just run fewer blocks of code each time and delete. I have several datasets that I am using and then the rest can be deleted. I have used the compress=yes to create the earlier datasets with success so this must be because I can only use it to create the initial SAS datasets.

Error message:
WARNING: The option COMPRESS is not valid in this context. Option ignored.
WARNING: Some options for file mylib.data1 were not processed because of errors or warnings noted above.

Kurt_Bremser · Posted 12-25-2023 02:04 AM

Keep in mind that all utility files (as used by PROC SORT and PROC SQL) are uncompressed; the higher the compression rate of your source dataset, the more severe the impact when you run one of these.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Patrick · Posted 12-18-2023 05:03 AM

If you set options compress=yes at the beginning of your program then all SAS tables will get stored compress and you don't need to define it anymore as data set option.

You say your have 3 source tables. How big are they?

What makes me a bit suspicious in regards of your design is your statement: "over 300 datasets" How do you get "suddenly" from 3 to over 300 tables.

Besides of dropping variable what also will reduce storage requirements is defining lengths for your variables that aren't bigger than required. For example some analytical tables may have many many indicator variable that only store 0 and 1. Ideally that's stored a CHAR(1) but if it needs to be a numerical variable then one can save a lot of space by defining these variables with a length of 3 (instead of default 8). 3 bytes are still more than enough to store the values 0 and 1.

Kurt_Bremser · Posted 12-18-2023 07:06 AM

Why do you keep all variables when you only want a count distinct of id?

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

river1 · Posted 12-24-2023 11:06 PM

I’m retrieving data between a few datasets for multivariate data analysis. I’ve created a few variables which I am using to then compare age/gender/ethnicity and other variables etc. There isn’t a linking variable in all three datasets so I have to created intermediary variables and these have taken up a lot of storage. But for the counts I am running a few blocks and then deleting.

AhmedAl_Attar · Posted 12-18-2023 07:52 AM

Hi @river1

Try using the attached macro which I developed to reduce the disk storage consumed by SAS data set(s) without compression! compressing SAS data sets, would increase CPU time, as the data would have to be decompressed when read and recompressed when written out.

You can try it out with a sample/fraction of your large data set(s) to see what impact it yield on the final disk consumption.

Have a look at this formally known as "SAS Global Forum" paper Twenty Ways to Run Your SAS® Program Faster and Use Less Space By Stephen Sloan, Accenture

He provides several ways to preserve space when dealing with large SAS data sets.

Hope this helps,

Ahmed

river1 · Posted 12-24-2023 11:07 PM

thanks these were very helpful

Kurt_Bremser · Posted 12-18-2023 08:00 AM

Replace your SQL with a DATA step and a hash object:

data mylib.data3 (compress=yes);
set mylib.data1;
if _n_ = 1
then do;
  declare hash d2 (dataset:"mylib.data2 (where=(var_2='XXX'))");
  d2.definekey("id");
  d2.definedone();
end;
if d2.check() = 0;
run;

This will sort DATA2 purely in-memory, so no utility file is created (which often causes SQL to blow up your disk space).

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

river1 · Posted 12-24-2023 11:09 PM

thanks for this, I used the hash object to create my intermediary datasets. I now see the problem was use use the sql. This makes sense

andreas_lds · Posted 12-18-2023 08:08 AM

Please explain what the you are doing and what you expect as result.

Others have already pointed out that creating 300 datasets does not sound reasonable at all.

Tom · Posted 12-18-2023 01:42 PM

There is no need to copy the data twice to get it compressed. You can collapse the first two steps into one.

proc sql;
  create table MYLIB3.DATA3(compress=yes) as
  select *
  from MYLIB1.DATA1
  where ID in (select ID from MYLIB2.DATA2 where VAR_2="XXXX");
quit;

The last step just prints a count. If that is all you need there is no need to store the dataset at all.

proc sql;
  select distinct(id) 
  from MYLIB1.DATA1
  where ID in (select ID from MYLIB2.DATA2 where VAR_2="XXXX");
quit;

You mentioned using different "variables". But it looks like instead you are using different subsets of the OBSERVATIONS. That is you are taking the subset of observations from DATA3 for just the ID's that have a specific value of VAR_2 in DATA2.

SASKiwi · Posted 12-18-2023 07:40 PM

Where are your SAS sessions running? On a remote SAS server or on your PC? Are you using SAS On-Demand?

For remote SAS servers, your SAS administrator should be able to ensure you have enough disk space to do your work. By all means code your SAS programs so you use disk space economically, but at the end of the day spending all your time recoding to save just a bit more disk space is counterproductive.

If your sessions run on your PC, then you are in control of managing your own disk space. Make sure you delete any unwanted files and check that your SAS WORK directory only contains current session folders - remove any old ones that may have been kept because EG didn't close cleanly.

Disk space

Re: Disk space

Re: Disk space

Re: Disk space

Re: Disk space

Re: Disk space

Re: Disk space

Re: Disk space

Re: Disk space

Re: Disk space

Re: Disk space

Re: Disk space

Re: Disk space

Re: Disk space