Help using Base SAS procedures

Dataset size issue

Reply
New Contributor
Posts: 2

Dataset size issue

I am creating a sas dataset with around 85-90 columns in it and some of columns are of big size (2048) so the dataset is coming of very large size i.e. 1331622 observations are coming as of 14.9 GB size. I have a limitation that file size should be max of 4 GB so that it can pass through some transferring channel. Please let me know if there is any way to reduce the size of dataset without using COMPRESS option and the column length should be same as what available in database(means the column length and datatype should be same as that of database through which I am fetching data). .

Occasional Contributor
Posts: 11

Re: Dataset size issue

If you are not allowed to compress (why?) then splitting your dataset is the only option I can think of.

* untested;

data part1 part2 part3;

   set have;

   if _N_ <= 400000 then output part1;

   else if _N_ <= 800000 then output part2;

   else output part3;

run;

* and glue them together at the other end of the line;

data want;

   set part1 part2 part3;

run;

Hth,

Eric

Esteemed Advisor
Esteemed Advisor
Posts: 7,017

Re: Dataset size issue

Well, this really depends on various things.  You haven't shown an example of the data so am just guessing here but some things to consider:

Apply coding to you dataset.  For instance, those columns which have long text strings, are they similar strings?  If so then pull a distinct list out and then apply as a format.  That alone will drastically reduce your dataset.  Code any other possible variables, use formats, or related datasets, for instance, consider:

ID          VAR1          BP          HR          COMMENTS     AGE     SOMETHINGELSE     ANOTHERCOMMENT

...

Now the above could all go in one dataset, but is that the most efficient use of space?  Perhaps have a comments dataset separate from this and assign a linking variable.  Maybe BP and HR could be moved to a separate vitals signs dataset using ID?  You can also split the dataset as EricHoogenboom has said. 

There are other methods as well, why for instance, does the transfer only allow 4gb?  Doesn't seem to be a valid reason.  Are you sending by email, that in itself is a bad idea.  Use a controlled portal to transfer data.  Use a controlled web portal and stream CSV data across is another method (pro http will download a stream of CSV data for instance).

Also, have you tried Zipping the file, I.e. use a proper ZIP program (not windows compressed folder) and choose maximum compression.  Or try XPT from SAS. 

As you can see there are a variety of methods, both coding, and utility programs for such a task depending on your situation.

Esteemed Advisor
Posts: 6,350

Re: Dataset size issue

What keeps you from using the COMPRESS option? It is the big helper for exactly the situation you're in.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
New Contributor
Posts: 2

Re: Dataset size issue

Hi,

I am not using compress because compress option increases the number of cpu cycle for accessing the data and it can impact the performance of an application that is why I am looking for some alternative.

Thanks,

Pallavi

Valued Guide
Posts: 3,206

Re: Dataset size issue

Newbie did you try the compressing and prove the results wuiht your expectations ? With a spinning disk a compress=binary you could effectively win a lot time just by all that IO overhead. I have seen an increase with a total record-size of 80bytes and having a SSD. that number of 1,3M record is not that big  you must have some 10Kbyte recordsize.   I am Expecting just performance gains.      

---->-- ja karman --<-----
Respected Advisor
Posts: 3,001

Re: Dataset size issue

In our experience the slightly increased overhead in CPU is more than compensated by the big savings in IO, resulting in much faster run times.

Esteemed Advisor
Posts: 6,350

Re: Dataset size issue

As the others said, your additional CPU cycles will be offset by the increased I/O performance. Rule of thumb: with the exception of the STAT procedures, SAS is always I/O bound.

Here, all datasets that are decreased by at least 20% in size are stored with the compress option. Otherwise, we wouldn't be able to work at all with the current setup (storage and server).

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Esteemed Advisor
Posts: 6,350

Re: Dataset size issue

Let me guess: you never ran a comparison test, did you?

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Respected Advisor
Posts: 3,001

Re: Dataset size issue

The COMPRESS option is the best way of dealing with your space issue. Many SAS sites now have it switched on by default, including us (COMPRESS = BINARY).

Regular Contributor
Posts: 202

Re: Dataset size issue

Hi,

Try this Sample code from support.sas.com Sample 24804: %SQUEEZE-ing Before Compressing Data, Redux

Thanks,

Ahmed

Ask a Question
Discussion stats
  • 10 replies
  • 533 views
  • 2 likes
  • 7 in conversation