08-05-2015 11:46 PM
I am creating a sas dataset with around 85-90 columns in it and some of columns are of big size (2048) so the dataset is coming of very large size i.e. 1331622 observations are coming as of 14.9 GB size. I have a limitation that file size should be max of 4 GB so that it can pass through some transferring channel. Please let me know if there is any way to reduce the size of dataset without using COMPRESS option and the column length should be same as what available in database(means the column length and datatype should be same as that of database through which I am fetching data). .
08-10-2015 04:59 AM
If you are not allowed to compress (why?) then splitting your dataset is the only option I can think of.
data part1 part2 part3;
if _N_ <= 400000 then output part1;
else if _N_ <= 800000 then output part2;
else output part3;
* and glue them together at the other end of the line;
set part1 part2 part3;
08-10-2015 05:33 AM
Well, this really depends on various things. You haven't shown an example of the data so am just guessing here but some things to consider:
Apply coding to you dataset. For instance, those columns which have long text strings, are they similar strings? If so then pull a distinct list out and then apply as a format. That alone will drastically reduce your dataset. Code any other possible variables, use formats, or related datasets, for instance, consider:
ID VAR1 BP HR COMMENTS AGE SOMETHINGELSE ANOTHERCOMMENT
Now the above could all go in one dataset, but is that the most efficient use of space? Perhaps have a comments dataset separate from this and assign a linking variable. Maybe BP and HR could be moved to a separate vitals signs dataset using ID? You can also split the dataset as EricHoogenboom has said.
There are other methods as well, why for instance, does the transfer only allow 4gb? Doesn't seem to be a valid reason. Are you sending by email, that in itself is a bad idea. Use a controlled portal to transfer data. Use a controlled web portal and stream CSV data across is another method (pro http will download a stream of CSV data for instance).
Also, have you tried Zipping the file, I.e. use a proper ZIP program (not windows compressed folder) and choose maximum compression. Or try XPT from SAS.
As you can see there are a variety of methods, both coding, and utility programs for such a task depending on your situation.
08-10-2015 06:47 AM
What keeps you from using the COMPRESS option? It is the big helper for exactly the situation you're in.
08-11-2015 01:42 PM
I am not using compress because compress option increases the number of cpu cycle for accessing the data and it can impact the performance of an application that is why I am looking for some alternative.
08-11-2015 02:37 PM
Pallavi_Varsh did you try the compressing and prove the results wuiht your expectations ? With a spinning disk a compress=binary you could effectively win a lot time just by all that IO overhead. I have seen an increase with a total record-size of 80bytes and having a SSD. that number of 1,3M record is not that big you must have some 10Kbyte recordsize. I am Expecting just performance gains.
08-12-2015 02:35 AM
As the others said, your additional CPU cycles will be offset by the increased I/O performance. Rule of thumb: with the exception of the STAT procedures, SAS is always I/O bound.
Here, all datasets that are decreased by at least 20% in size are stored with the compress option. Otherwise, we wouldn't be able to work at all with the current setup (storage and server).
08-12-2015 03:19 AM
Let me guess: you never ran a comparison test, did you?
08-11-2015 05:06 AM
The COMPRESS option is the best way of dealing with your space issue. Many SAS sites now have it switched on by default, including us (COMPRESS = BINARY).