I have a SAS dataset that is about 4GB in size (56 million rows by 6 variables). I need to transfer this file to another system that I've been told has a 2GB file size limit.
Other than IT solutions to this issue (we're working on these solutions), I'm wondering if anyone has any good SAS-based solutions.
I was thinking of creating 4 1GB subsets of the data and then creating a SAS view to access the subsets, but I'm not sure if SAS views end up creating a combined temporary dataset (that could be up to 4GB in this case) when they're accessed.
I was also thinking of replacing the 2 character variables with numeric codes to reduce the file size. On the plus side, the numeric values stored in the dataset would take less space than the character strings. On the minus side, I would need to create at least one format with 55,000 values.
What solutions have people used to deal with a SAS dataset that is larger than the hardware will allow?
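To make the splitting idea concrete, here is a sketch of what I had in mind (dataset names `master`, `part1`–`part4`, and `allparts` are placeholders, and the 14-million-row split points assume roughly even quarters):

```sas
/* Split the 56M-row dataset into four roughly 1GB pieces. */
data part1 part2 part3 part4;
   set master;
   if      _n_ <= 14e6 then output part1;
   else if _n_ <= 28e6 then output part2;
   else if _n_ <= 42e6 then output part3;
   else                     output part4;
run;

/* A DATA step view that concatenates the pieces on the fly;
   the view stores only the program, not a combined copy of the data. */
data allparts / view=allparts;
   set part1 part2 part3 part4;
run;
```

My understanding is that a view streams observations through as they are requested rather than writing a combined dataset, though some procedures may still spool the data to utility files — which is exactly the behavior I'm unsure about.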
I'm surprised that 56 million rows and 6 variables result in a 4GB dataset. That seems like way too much. But OK, you have it and have to deal with it.
You could use the COMPRESS= dataset option, which will probably reduce the 4GB substantially. Note, though, that this option only shrinks the stored dataset: SAS decompresses the observations on the fly whenever you read it.
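A minimal sketch of the compression approach (dataset names are placeholders; COMPRESS=CHAR uses run-length encoding, which tends to work well on character data like yours):

```sas
/* Write a compressed copy of the dataset for transfer. */
data transfer (compress=char);
   set master;
run;

/* PROC CONTENTS reports the compression ratio achieved. */
proc contents data=transfer;
run;
```

Whether this gets you under 2GB depends entirely on how repetitive the values are, so check the ratio PROC CONTENTS reports before relying on it.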
Another idea, which you already suggested, would be using numeric counterparts for the character variables. You could use PROC FREQ to determine the 55,000 distinct values and store them in a dataset at the same time. From that dataset you can build your format and store it.
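A sketch of that PROC FREQ route, feeding the result into PROC FORMAT via CNTLIN= (variable and format names `charvar` and `charf` are my inventions):

```sas
/* Get the distinct values of the character variable. */
proc freq data=master noprint;
   tables charvar / out=values (keep=charvar);
run;

/* Assign each value a numeric code and shape the dataset
   into the FMTNAME/TYPE/START/LABEL layout CNTLIN= expects. */
data cntlin;
   set values;
   retain fmtname 'charf' type 'N';
   start = _n_;          /* the numeric code to store        */
   label = charvar;      /* the original character string    */
run;

proc format cntlin=cntlin;
run;
```

You would then store `start` in the big dataset in place of `charvar` and apply the CHARF. format to display the original strings.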
Another trick is to check whether the length of the character variables is larger than it needs to be. Does any value actually use the full declared width? If not, shorten the length to the longest actual value. I believe that about a year ago SAS published a macro that checks the maximum length of the values of a character variable. Maybe you can dig it up.
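If you can't find that macro, a plain DATA step does the same check (dataset and variable names are placeholders):

```sas
/* Find the longest actual value of a character variable. */
data _null_;
   set master end=last;
   retain maxlen 0;
   maxlen = max(maxlen, length(charvar));
   if last then put 'Longest value uses ' maxlen 'bytes';
run;

/* Re-create the dataset with the tighter length.
   The LENGTH statement must come before SET to take effect;
   SAS will warn about possible truncation, which is expected here. */
data smaller;
   length charvar $ 20;   /* substitute the reported maximum */
   set master;
run;
```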
Similarly, you could try to reduce the number of bytes used for each numeric variable, but this is tricky.
I did a few calculations to check whether a 4GB file size seems appropriate. My six variables are $32, $15, 8, $1, 8, and 8 bytes long, for a total of 72 bytes per row. 56×10^6 rows × 72 bytes per row ÷ 1.074×10^9 bytes per GB comes to about 3.8GB. The character lengths are as small as they can be, but you're right that I could reduce two of them to just 8 bytes by using numeric values with a format.
If I replace the two character variables with numeric values (and create a format mapping each unique character string), my six variables would still come to 41 bytes per row, which works out to about 2.1GB. That's too close to the limit for my comfort.
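The arithmetic behind that 2.1GB figure, as a quick DATA step check:

```sas
/* After recoding: 8 + 8 + 8 + 1 + 8 + 8 = 41 bytes per row. */
data _null_;
   rows  = 56e6;
   bytes = rows * 41;
   gb    = bytes / 1024**3;
   put 'Estimated size: ' gb 5.2 ' GB';
run;
```

This is raw observation bytes only; the actual file will be slightly larger once page overhead is included.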
In your online documentation, or in the version on the SAS website, look up the chapter on numeric precision for your target Operating System.
You will note that for Windows and Unix (which use IEEE floating-point storage for numerics), you can precisely store integer values between 0 and 8,192 in a 3-byte number. A 4-byte number will store integer values up to 2,097,152, which equates to dates up to 23Oct7701.
If your new host is MVS (which seems unlikely) then the floating point storage allows for even larger storage capacity.
The caveats: you must ensure the values will never exceed the upper limit; they must be integers (real numbers have lower limits); and the numbers are expanded back to 8 bytes as the data are read, which may affect memory usage and the disk storage of any target tables you create from them.
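Putting the numeric-length idea into a sketch (variable names are examples — this assumes `code1` and `code2` are the integer codes replacing the two character variables, and `datevar` is a SAS date):

```sas
/* Store the integer codes and the date in 4 bytes instead of 8.
   Safe only because each value is an integer <= 2,097,152
   on an IEEE platform (Windows/Unix). */
data packed;
   length code1 code2 datevar 4;
   set recoded;
run;
```

At 4 bytes each for the three integer variables, the 41-byte row above drops to 29 bytes, which would bring the estimate comfortably under the 2GB limit.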