12-12-2014 04:31 AM
I want to compress the SAS data while importing .txt file into SAS. I have ~4gb for data in .txt file which contains character and numeric variables. Is there any way to compress instead of:
options compress = yes;
12-12-2014 09:46 AM
Linus compressing in SAS is using confusing words. SAS and confusing words not something new.
"CHAR to use the RLE (Run Length Encoding) compression algorithm." That is based on repeating bytes.
It has nothing to do with the type of sas-variables. I have made that association also and than found being w rong. SAS(R) 9.4 Language Reference: Concepts, Fourth Edition
Going back to the compress / data storage saving / question.
-> the best compression is achieved by having no data stored.
Call it data-virtualization federation or whatever as some sales buzzword.
Hoe can you do this technically?
- use a sas-datastep view on the text-file. The result, the view, will process the data text-file when necessary. Logically you will see a sas-table dataset.
To extend this,
- you can defines proc sql views in sas. that are views on tables.
- you can define views withins RDBMS systems. This can hide complexity and implement some type of security.
The disadvantage can be all processing overhead when running. (view on view etc). You can design intermediate materalized tables as some checkpoint in your analyses proces
12-12-2014 10:06 AM
One common cause of large txt files is extra columns that contain no data. Example: A name field that was defined to be 50 characters but the longest name in the actual file is 15 characters resulting in 35 unneeded blanks per data line in this field.
When you read the data into SAS specify the length of the variable as 15 and it will take less storage space.
Something else that may reduce your SAS dataset size is to use an INVALUE custom format to turn a variable into either a numeric code or a shorter text code. This adds a small amount of complexity to maintenance if the values change occasionally but lets look at fairly common variable: Sex. Suppose the data set is sending you the values of sex as Male and Female. That would normally require 6 characters to store.
invalue $sex Upcase default=1)
'MALE' = 'M'
'FEMALE' = 'F'
and associate the informat with the variable in the reading program:
informat sex $sex.;
This approach very well with things that don't change very often such as State and County names, any data collected with a restricted selection field such a checkbox that only allows single responses.