BookmarkSubscribeRSS Feed
river1
Obsidian | Level 7

Hello, I'm having problems with disk space, I am running a lot of code that is linking three datasets so I need to keep many of the datasets to refer back to. I delete wherever possible but they are huge files. I have used (compress=yes) wherever possible and I've dropped any variables I am not using.

 

At the moment I am running this code over and over again to retrieve data using different variables in the two databases but I'll need to work in a third next. In total it will be over 300 datasets for this segment but I can only store about 10 and then my 100GB space is depleted. I added the proc data step block to compress the output but the file size is exactly the same file size.

 

I've tried writing a programme to reduce the effort of changing the variables but I just ran out of space so I gave up. Are there any techniques that I can refer to or do people just buy external storage? 

I am using SAS Enterprise Guide 8.3 Update 8 (8.3.8.206) (64-bit)z. Thank you for the advice

proc sql;
  create table MYLIB3.DATA3 as
  select *
  from MYLIB1.DATA1
  where ID in (select ID from MYLIB2.DATA2 where VAR_2="XXXX");
quit;


proc datasets library=MYLIB3;
  modify DATA3 (compress=yes);
run;


proc sql;
    select count(distinct ID) as DATA3
    from MYLIB3.DATA3;
quit;

  

13 REPLIES 13
LinusH
Tourmaline | Level 20

Verify that your compress is actually compressing your data (see the log).

If your data contains mostly numerical variables you could try COMPRESS=BINARY.

Other than that, re-factoring your code or adding more disk space are your options I believe.

Data never sleeps
river1
Obsidian | Level 7
Thanks for your help. The compress is not working with the datasets created via proc sql. I tried the COMPRESS=BINARY and the file size remained the same with the same warning messages. No problem, I’ll just run fewer blocks of code each time and delete. I have several datasets that I am using and then the rest can be deleted. I have used the compress=yes to create the earlier datasets with success so this must be because I can only use it to create the initial SAS datasets.

Error message:
WARNING: The option COMPRESS is not valid in this context. Option ignored.
WARNING: Some options for file mylib.data1 were not processed because of errors or warnings noted above.
Kurt_Bremser
Super User

Keep in mind that all utility files (as used by PROC SORT and PROC SQL) are uncompressed; the higher the compression rate of your source dataset, the more severe the impact when you run one of these.

Patrick
Opal | Level 21

If you set options compress=yes at the beginning of your program then all SAS tables will get stored compress and you don't need to define it anymore as data set option.

 

You say your have 3 source tables. How big are they? 

 

What makes me a bit suspicious in regards of your design is your statement: "over 300 datasets" How do you get "suddenly" from 3 to over 300 tables.

 

Besides of dropping variable what also will reduce storage requirements is defining lengths for your variables that aren't bigger than required. For example some analytical tables may have many many indicator variable that only store 0 and 1. Ideally that's stored a CHAR(1) but if it needs to be a numerical variable then one can save a lot of space by defining these variables with a length of 3 (instead of default 8). 3 bytes are still more than enough to store the values 0 and 1.

 

river1
Obsidian | Level 7
I’m retrieving data between a few datasets for multivariate data analysis. I’ve created a few variables which I am using to then compare age/gender/ethnicity and other variables etc. There isn’t a linking variable in all three datasets so I have to created intermediary variables and these have taken up a lot of storage. But for the counts I am running a few blocks and then deleting.
AhmedAl_Attar
Rhodochrosite | Level 12

Hi @river1 

 

Try using the attached macro which I developed to reduce the disk storage consumed by SAS data set(s) without compression! compressing SAS data sets, would increase CPU time, as the data would have to be decompressed when read and recompressed  when written out.

 

You can try it out with a sample/fraction of your large data set(s) to see what impact it yield on the final disk consumption.

 

Have a look at this  formally known as "SAS Global Forum" paper Twenty Ways to Run Your SAS® Program Faster and Use Less Space By Stephen Sloan, Accenture

He provides several ways to preserve space when dealing with large SAS data sets.

 

Hope this helps,

Ahmed

river1
Obsidian | Level 7
thanks these were very helpful
Kurt_Bremser
Super User

Replace your SQL with a DATA step and a hash object:

data mylib.data3 (compress=yes);
set mylib.data1;
if _n_ = 1
then do;
  declare hash d2 (dataset:"mylib.data2 (where=(var_2='XXX'))");
  d2.definekey("id");
  d2.definedone();
end;
if d2.check() = 0;
run;

This will sort DATA2 purely in-memory, so no utility file is created (which often causes SQL to blow up your disk space).

river1
Obsidian | Level 7
thanks for this, I used the hash object to create my intermediary datasets. I now see the problem was use use the sql. This makes sense
andreas_lds
Jade | Level 19

Please explain what the you are doing and what you expect as result.

Others have already pointed out that creating 300 datasets does not sound reasonable at all.

Tom
Super User Tom
Super User

There is no need to copy the data twice to get it compressed.  You can collapse the first two steps into one.

proc sql;
  create table MYLIB3.DATA3(compress=yes) as
  select *
  from MYLIB1.DATA1
  where ID in (select ID from MYLIB2.DATA2 where VAR_2="XXXX");
quit;

The last step just prints a count.  If that is all you need there is no need to store the dataset at all.

proc sql;
  select distinct(id) 
  from MYLIB1.DATA1
  where ID in (select ID from MYLIB2.DATA2 where VAR_2="XXXX");
quit;

 

You mentioned using different "variables".  But it looks like instead you are using different subsets of the OBSERVATIONS.  That is you are taking the subset of observations from DATA3 for just the ID's that have a specific value of VAR_2 in DATA2.

SASKiwi
PROC Star

Where are your SAS sessions running? On a remote SAS server or on your PC? Are you using SAS On-Demand?

 

For remote SAS servers, your SAS administrator should be able to ensure you have enough disk space to do your work. By all means code your SAS programs so you use disk space economically, but at the end of the day spending all your time recoding to save just a bit more disk space is counterproductive. 

 

If your sessions run on your PC, then you are in control of managing your own disk space. Make sure you delete any unwanted files and check that your SAS WORK directory only contains current session folders - remove any old ones that may have been kept because EG didn't close cleanly.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 13 replies
  • 4254 views
  • 6 likes
  • 8 in conversation