BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
somebody
Lapis Lazuli | Level 10

I have a large dataset (about 500 Gb more than 1 bil obs). Some of the columns can be reformatted to make them smaller. For example, some have the format $128. but the values in there only have 20 characters at most. so I am thinking about reformatting them to to $20. to save space. The issue is my hard drive is only 500gb, and with this dataset, I have only 7Gb free space. Normally, I would use a DATA step to create new columns that take the values of the ones I want to reformat. However, SAS would create a temporary dataset, and clearly, this does not work. Is there another way? Otherwise, I would have to get a hard drive for this. 

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

Yes, you need the disk space to make the new copy of the dataset.  Your first step is fine, although if you use the same name for input and output then SAS will do the delete and rename for you automatically for you.

 

Yes, if you are working with large datasets with many long character variables and limited disk space you should probably always run with the system option COMPRESS set to YES.  You will use a few more CPU cycles to compress/uncompress the data, but you will use much less disk space and much less time waiting for disk reads and writes.

 

Changing the FORMAT attached to the variable does NOT change the number of bytes used to store the variable, just how many are displayed.  You need to actually change how the variable is defined.  So you need to change the LENGTH of the variable. If you want to change the LENGTH of the variable you need to do it before the variable is seen by the data step compiler. So before the SET statement.  

 

If your character variables actually have formats like $128. attached to them and you change the length from 128 to 30 then do not leave the $128. format attached to the variable.  It will confuse you.  It is easier to just remove the formats from the character variables as they are not required.

 

So to change the length of four variables you could use code like this:

data new; 
  length varA $30 varB $3 varC $70 varD $30;
  set old;
  format _character_;
run;

View solution in original post

9 REPLIES 9
SASKiwi
PROC Star

IMHO trying to manage storing a 493GB dataset on a 500GB drive is highly dangerous. You need a larger local drive or better still a network drive where your data is backed up.

somebody
Lapis Lazuli | Level 10

yes totally agree!! I just downloaded this, and think I can make it smaller after formatting those columns

ballardw
Super User

@somebody wrote:

yes totally agree!! I just downloaded this, and think I can make it smaller after formatting those columns


Is there a chance that you can get a text version of the data such as in CSV layout? The extra "blanks" may not be in that form an you write your data step to read them with the lengths you want (and maybe even don't keep some variables you don't need).

 

Tom
Super User Tom
Super User

Note that in SAS a FORMAT is instructions on how to convert the stored values into text. You can easily change the format attached to a variable without re-writing the dataset by using PROC DATASETS.

 

But it sounds like you meant that you want to consider changing the actual defined length of the variable instead of just the display format.

 

In general there is not much need to change the length of character variables to save space, you can just the COMPRESS=YES option with writing the dataset to disk and it will take less space.

 

To see if it makes much difference for your dataset try it on a subset.   For example this code will make two copies of the first 10,000 observations in the dataset.

data large(compress=no)
        small(compress=yes)
;
  set have (obs=10000);
run;

The NOTES in the SAS log will let you know whether it looks like it will make much difference.

 

somebody
Lapis Lazuli | Level 10

Thank you so much. Your code helps reduce the size of my data by 80% (based on the 10,000 obs). However, to run this, I need some space to create a new compressed dataset, and then delete the original dataset, such as:

data NEW(compress=yes);set OLD;run;
*and then delete the OLD dataset;

Previously, I wanted to do something like this:

data new(drop=varA varB varC varD); 
	set old;

	format varA2 $30.; *currently, these variables have length of 128, but their values are not that long;
	format varB2 $3.;
	format varC2 $70.;
	format varD2 $30.;

	varA2 = varA;
	varB2 = varB;
	varC2 = varC;
	varD2 = varD;
run;

By the way, would the new compressed dataset behave normally as a normal dataset? or do I have to declare every time I use it such as"

options compress=yes;
data DOSOMETHING; set NEW;run;

?

Tom
Super User Tom
Super User

Yes, you need the disk space to make the new copy of the dataset.  Your first step is fine, although if you use the same name for input and output then SAS will do the delete and rename for you automatically for you.

 

Yes, if you are working with large datasets with many long character variables and limited disk space you should probably always run with the system option COMPRESS set to YES.  You will use a few more CPU cycles to compress/uncompress the data, but you will use much less disk space and much less time waiting for disk reads and writes.

 

Changing the FORMAT attached to the variable does NOT change the number of bytes used to store the variable, just how many are displayed.  You need to actually change how the variable is defined.  So you need to change the LENGTH of the variable. If you want to change the LENGTH of the variable you need to do it before the variable is seen by the data step compiler. So before the SET statement.  

 

If your character variables actually have formats like $128. attached to them and you change the length from 128 to 30 then do not leave the $128. format attached to the variable.  It will confuse you.  It is easier to just remove the formats from the character variables as they are not required.

 

So to change the length of four variables you could use code like this:

data new; 
  length varA $30 varB $3 varC $70 varD $30;
  set old;
  format _character_;
run;
somebody
Lapis Lazuli | Level 10

thanks, does the code 

format _character_;

affect the format of other variables in the dataset? 

Ksharp
Super User
If you want remove all the format of character variables ,try proc datasets. Tom's code is not efficient,since your table is big.





data class;
set sashelp.class;
run;

proc datasets library=work nolist nodetails;
modify class;
attrib _character_ format=;
quit;
Ksharp
Super User

Try proc sql which could change  stored length and format of variables .

 

data class;
set sashelp.class;
run;

proc sql;
alter table class
modify name char(200) format=$200.  ,
       sex  char(20)  format=$20.  ;
quit;

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 1728 views
  • 1 like
  • 5 in conversation