Hello, I'm having problems with disk space, I am running a lot of code that is linking three datasets so I need to keep many of the datasets to refer back to. I delete wherever possible but they are huge files. I have used (compress=yes) wherever possible and I've dropped any variables I am not using.
At the moment I am running this code over and over again to retrieve data using different variables in the two databases but I'll need to work in a third next. In total it will be over 300 datasets for this segment but I can only store about 10 and then my 100GB space is depleted. I added the proc data step block to compress the output but the file size is exactly the same file size.
I've tried writing a programme to reduce the effort of changing the variables but I just ran out of space so I gave up. Are there any techniques that I can refer to or do people just buy external storage?
I am using SAS Enterprise Guide 8.3 Update 8 (8.3.8.206) (64-bit)z. Thank you for the advice
proc sql; create table MYLIB3.DATA3 as select * from MYLIB1.DATA1 where ID in (select ID from MYLIB2.DATA2 where VAR_2="XXXX"); quit; proc datasets library=MYLIB3; modify DATA3 (compress=yes); run; proc sql; select count(distinct ID) as DATA3 from MYLIB3.DATA3; quit;
Verify that your compress is actually compressing your data (see the log).
If your data contains mostly numerical variables you could try COMPRESS=BINARY.
Other than that, re-factoring your code or adding more disk space are your options I believe.
Keep in mind that all utility files (as used by PROC SORT and PROC SQL) are uncompressed; the higher the compression rate of your source dataset, the more severe the impact when you run one of these.
If you set options compress=yes at the beginning of your program then all SAS tables will get stored compress and you don't need to define it anymore as data set option.
You say your have 3 source tables. How big are they?
What makes me a bit suspicious in regards of your design is your statement: "over 300 datasets" How do you get "suddenly" from 3 to over 300 tables.
Besides of dropping variable what also will reduce storage requirements is defining lengths for your variables that aren't bigger than required. For example some analytical tables may have many many indicator variable that only store 0 and 1. Ideally that's stored a CHAR(1) but if it needs to be a numerical variable then one can save a lot of space by defining these variables with a length of 3 (instead of default 8). 3 bytes are still more than enough to store the values 0 and 1.
Hi @river1
Try using the attached macro which I developed to reduce the disk storage consumed by SAS data set(s) without compression! compressing SAS data sets, would increase CPU time, as the data would have to be decompressed when read and recompressed when written out.
You can try it out with a sample/fraction of your large data set(s) to see what impact it yield on the final disk consumption.
Have a look at this formally known as "SAS Global Forum" paper Twenty Ways to Run Your SAS® Program Faster and Use Less Space By Stephen Sloan, Accenture
He provides several ways to preserve space when dealing with large SAS data sets.
Hope this helps,
Ahmed
Replace your SQL with a DATA step and a hash object:
data mylib.data3 (compress=yes);
set mylib.data1;
if _n_ = 1
then do;
declare hash d2 (dataset:"mylib.data2 (where=(var_2='XXX'))");
d2.definekey("id");
d2.definedone();
end;
if d2.check() = 0;
run;
This will sort DATA2 purely in-memory, so no utility file is created (which often causes SQL to blow up your disk space).
Please explain what the you are doing and what you expect as result.
Others have already pointed out that creating 300 datasets does not sound reasonable at all.
There is no need to copy the data twice to get it compressed. You can collapse the first two steps into one.
proc sql;
create table MYLIB3.DATA3(compress=yes) as
select *
from MYLIB1.DATA1
where ID in (select ID from MYLIB2.DATA2 where VAR_2="XXXX");
quit;
The last step just prints a count. If that is all you need there is no need to store the dataset at all.
proc sql;
select distinct(id)
from MYLIB1.DATA1
where ID in (select ID from MYLIB2.DATA2 where VAR_2="XXXX");
quit;
You mentioned using different "variables". But it looks like instead you are using different subsets of the OBSERVATIONS. That is you are taking the subset of observations from DATA3 for just the ID's that have a specific value of VAR_2 in DATA2.
Where are your SAS sessions running? On a remote SAS server or on your PC? Are you using SAS On-Demand?
For remote SAS servers, your SAS administrator should be able to ensure you have enough disk space to do your work. By all means code your SAS programs so you use disk space economically, but at the end of the day spending all your time recoding to save just a bit more disk space is counterproductive.
If your sessions run on your PC, then you are in control of managing your own disk space. Make sure you delete any unwanted files and check that your SAS WORK directory only contains current session folders - remove any old ones that may have been kept because EG didn't close cleanly.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.