An Idea Exchange for SAS software and services

Comments
by Super User
on 09-05-2014 04:54 AM

There would need to be a lot of testing on that before implementing it.  The SAS procedures are compiled to be as efficient as possible with the structure of the dataset.  Including a ZIP compression would then require either:

- re-implementation of all SAS software to utilize a new format

- a higher I/O cost and lower operating speed, as the data would effectively need to be zipped/unzipped each time.

Now, with the cost per TB of disk storage being negligible, I don't see reducing storage as being much of a saving; it would be cheaper just to get a new storage server.

For the I/O, I am not sure you would really see any improvement.  SQLite is excellent, but it (and the other databases you mention) is an RDBMS: they are designed to save space and be relatively fast at processing relational data, and you will note that they don't run statistical procedures on the data.  SAS is geared towards tabular data and statistics on that data.  Hence they are very different software for different purposes.

by Contributor AndrewZ
on 09-05-2014 11:00 AM

Regarding how radical you suppose this is: SAS 9.4 introduced the FILENAME Statement, ZIP Access Method.  This proposal would be similar, except it would work on SAS data sets and would be more transparent to the user, e.g. just setting "options compress=name-of-algorithm".
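For readers who have not seen these, a minimal sketch of the existing pieces versus the proposal (the archive path and CSV content are made up; the commented line at the end is the proposed syntax, which does not exist today):

filename myzip zip "/data/archive.zip" member="report.csv";  /* 9.4 ZIP access method for external files */

data _null_;
    file myzip;
    put "id,value";
    put "1,42";
run;

options compress=binary;   /* today's data set compression choices: NO | YES/CHAR | BINARY */
/* proposed, not valid today:  options compress=zip;  */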

You say you are "not sure you would really see any improvement," but please consider this case: we store a lot of files on the network.  If it takes 1 minute to write a file, then cutting the storage by 90% would cut the I/O time to 6 seconds.  That's a big improvement.  In this scenario it doesn't matter whether the network is dial-up or gigabit.

Better compression doesn't necessarily mean higher CPU usage: LZO (as used by the lzop tool) is an example of an algorithm that uses little CPU while still compressing well.

Even if the CPU usage is higher, it's up to the user to determine what is best for their scenario (based on disk availability, bandwidth, CPU, how often the data is read, etc.).  SAS has other performance tuning options like this already, such as READBUFF= and CPUCOUNT=.
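As a rough sketch of how such a per-situation choice already works today (data set names are made up), compression can be set session-wide or overridden per data set:

options compress=no cpucount=actual;     /* session-wide defaults                */

data work.big (compress=binary);         /* override for one large data set      */
    set work.source;                     /* hypothetical source data set         */
run;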

by Trusted Advisor
on 09-05-2014 01:21 PM

I am not on the up or down side (yet). AndrewZ, it is your proposal, so you may vote it up yourself (your vote seems to be missing there).


The ZIP compression method would not be very practical: it needs one or more additional headers of information.  I agree that the compression method could be improved, for example by compressing records/observations in blocks. I believe it is currently just an RLE method. The requirement for that additional header is a nasty one.
With the introduction of V8 the physical order of variables changed: all numeric variables (on 4-byte boundaries) and all character variables got grouped together. There must be a translation table in the header of the SAS data set for the logical order, and there are no tools to change it. This ordering was done to optimize performance (for numerics), and it could also be used for better compression. So far I can agree with the proposal.

In a server-based environment where the data is on a SAN, the SAN can already do compression. In that case, why should we add additional complexity?
With in-database processing (DS2), where data is not moved at all, or with in-memory processing, the idea of local processing of SAS data becomes less relevant.
Then we have the SPD libname engine and SPD Server with Hadoop as other techniques coming along. On what part should we focus the effort?

by Super Contributor
on 09-05-2014 04:27 PM

I do understand some of the arguments against this proposal. For instance, it is true that storage has become cheaper, but remember that many people just run SAS on their workstation, so a storage server is not an option for them.

I find it frustrating that a numeric variable needs at least 3 bytes even if I know it will contain only small integers; this costs me both storage and I/O. An alternative proposal would therefore be to allow a data set format that can contain other data types, for instance "smallint", "bit", etc.

by Contributor AndrewZ
on 09-05-2014 04:35 PM

You should post a separate ballot for various smaller numerics. It would be helpful to pack eight 1-bit variables into a byte.  Right now, for a single bit I use either 3 bytes (numeric) or 1 character (T/F).
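The closest workaround today is hand-packing flags with the bitwise functions, and even then the packed value still costs the 3-byte minimum; a minimal sketch with made-up flag names:

data packed (keep=flags);
    length flags 3;                                         /* smallest allowed numeric length      */
    f1 = 1; f2 = 0; f3 = 1;                                 /* three hypothetical 0/1 flags         */
    flags = bor(bor(f1, blshift(f2, 1)), blshift(f3, 2));   /* pack bits 0-2 into one value         */
run;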

by Super User
on 09-05-2014 06:53 PM

I almost suspect that any time saved WRITING to the network drive would be about the same as the time lost uncompressing it to work with. Most of my data is relatively static, with updates at monthly or quarterly intervals, but I READ a data set way more often than I write it, like orders of magnitude more. Some data sets were written in 2007 but have been read frequently every year since then.

by Trusted Advisor
on 09-06-2014 01:58 AM

@AndrewZ Working with a 3-byte length for numerics is fooling yourself. There are no gains; on the contrary.  See the notes/links below.


@Jacob The frustration on my side is all the misunderstanding of what numerics (8-byte IEEE floating point) and characters (a-z, A-Z and 0-9) are. I see an improvement in using the words measurement and category: sex is a category, cohort is an ordinal, length (cm) is a measurement.  Doing mathematical calculations with a Roman-numeral approach looks nice but is not very efficient. We need a similar step to get into a digital age that is now understood by 10 types of persons.
** I can understand that for you a lot of usage on the workstation is a good approach. Working in a server-based way is also good, with a different order of capacity available. There are also other reasons (regulations) for having a server-based approach.
** Working with smallint, bit, etc. is possible with PROC DS2. It is more focused on in-database processing, which needs support for those types.
SAS(R) 9.4 DS2 Language Reference, Third Edition: see all the data types allowed after a DECLARE.
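A minimal sketch of such a declaration (the data set name is made up); note that when the result is stored through the Base engine the integer types are converted back to doubles anyway, which is why this mainly pays off in-database:

proc ds2;
data work.small_types / overwrite=yes;
    dcl tinyint  is_active;       /* 1-byte integer type  */
    dcl smallint n_visits;        /* 2-byte integer type  */
    dcl varchar(10) status;
    method init();
        is_active = 1;
        n_visits = 3;
        status = 'OK';
        output;
    end;
enddata;
run;
quit;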

     
@ballard On reading data that is kept mostly static: the whole hype of analyzing data with Hadoop is based on that. Cheap hardware placed in mass and duplicated in mass (with no goal of compressing) for optimal read response with parallel processing.

....

I opened my mouth, so I should give links with additional information.
2676 - Determining the order of variables within a SAS data set  (numerics are placed at the front of the data set, aligned on 4- or 8-byte boundaries)
Base SAS(R) 9.3 Procedures Guide, Second Edition (Observation Length, Alignment, and Padding for a SAS Data Set)

"Observations within a SAS dataset are aligned on double-byte boundaries whenever possible. As a result, 8-byte and 4-byte numeric variables are positioned at 8-byte boundaries at the front of the data set and followed by character variables in the order in which they are encountered. If the data set only contains 4-byte numeric data, the alignment is based on 4-byte boundaries. Since numeric doubles can be operated upon directly rather than being moved and aligned before doing comparisons or increments, the boundaries cause better performance."

SAS(R) 9.4 Language Reference: Concepts, Third Edition (Techniques for Optimizing CPU Performance, Specifying Variable Lengths)

SAS(R) 9.4 Data Set Options: Reference, Second Edition (OUTREP= Data Set Option)
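The "Observation Length" described in the quote above is visible directly in PROC CONTENTS output, so the padding can be checked on any data set, e.g.:

proc contents data=sashelp.class;   /* the header portion shows the padded Observation Length */
run;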

With SPD you will find the same notes. The reason for this comes from low-level hardware concepts. Numeric precision is also an often forgotten pitfall. It would be better to have worked with a slider in the learning path to understand that.
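A small sketch of that precision pitfall (the value is arbitrary); the truncation only appears after the value has been written to disk and read back, because LENGTH affects storage, not the program data vector:

data short;
    length x 3;
    x = 123456;      /* too many significant bits to survive truncation to 3 bytes */
run;

data _null_;
    set short;
    put x=;          /* no longer exactly 123456 after the round trip */
run;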

Using indexes or POINT= on a data set requires a counter/recognition of which data block is needed. This was not reliable in the first V8 releases when combined with COMPRESS=. This paper describes better what was done around that: http://www2.sas.com/proceedings/sugi28/003-28.pdf
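In current releases that bookkeeping is exposed as the POINTOBS= data set option; a minimal sketch (the source data set is chosen arbitrarily):

data work.comp (compress=yes pointobs=yes);   /* pointobs=yes keeps the observation map */
    set sashelp.class;
run;

data pick;
    obsno = 7;
    set work.comp point=obsno;   /* direct access by observation number   */
    output;
    stop;                        /* required with point= to avoid looping */
run;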

by Contributor AndrewZ
on 09-08-2014 01:16 PM

Decompression is typically much faster than compression: see for example this benchmark. So for your case you would pay the compression cost only once and could reap faster reads many times.  And like compress=yes, the proposed idea here is for an option that people could enable based on a careful examination of their individual situation (CPU speed, I/O speed, storage availability, read and write patterns, trade-offs, etc.).

by Contributor AndrewZ
on 09-08-2014 01:56 PM

Using LENGTH does shrink the data set: the following data sets are 384 KB, 512 KB, and 896 KB respectively.

data test3;
    length i 3;
    do i = 1 to 100000; /* yes, this causes rounding errors */
        output;
    end;
run;

data test4;
    length i 4;
    do i = 1 to 100000;
        output;
    end;
run;

data test8;
    length i 8;
    do i = 1 to 100000;
        output;
    end;
run;

by Trusted Advisor
on 09-08-2014 03:27 PM

I checked your test; indeed different sizes. But your test has just 1 variable, which is not really a good test of the conditions of real data sets. I added 1 numeric of length 4 and a 2-byte character variable to each data set.
A new test defines numerics of length 3 and 5 together with a 2-byte character. See the test code below; I ran this using UE.

AndrewZ, this is a different, more enhanced test, proving what is documented. From the end of the 90's until 2002 I was confronted with this as a side effect in a conversion.

In that environment it was killing, as sizes were hard-limited (MVS SB37 abends in operational jobs). I had to do research into this, as it was unexplained and not understood.


libname test "/folders/myfolders/test";

data test.test3;
    length i 3 m 4  z $2 ;
    do i = 1 to 1000000;   output;  end;
run;

data test.test4;
    length i 4 m 4  z $2 ;
    do i = 1 to 1000000;   output;   end;
run;

data test.test8;
    length i 3 m 4 j 3 n 8 z $2 ;
    do i = 1 to 1000000;  output;    end;
run;

data test.test9;
    length i 8  j 4  n 8  z $2 ;
    do i = 1 to 1000000;   output;   end;
run;

results (1,000,000 observations each):

    ad3:  11,968 KB   0.34 s   (length 3 for i makes no difference compared to length 4 in ad4)
    ad4:  11,968 KB   0.32 s
    ad8:  23,680 KB   0.58 s   (changing j from length 3 to 5 makes no difference in size; the total declared length is 2 bytes less than ad9)
    ad9:  23,680 KB   0.58 s

Reading carefully about the physical structure, this was to be expected: there is some kind of grouping, and then padding is added.
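A rough back-of-the-envelope, assuming the alignment rules quoted earlier: ad3/ad4 contain only 3- and 4-byte numerics, so their declared 9 and 10 bytes per observation are both rounded up to 12 (4-byte alignment), and 12 bytes times 1,000,000 observations plus page overhead is roughly the observed 11,968 KB; ad8/ad9 contain an 8-byte numeric, so their declared 20 and 22 bytes are both rounded up to 24, matching the observed 23,680 KB.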

The effect of overhead is not seen here. For that, other options such as BUFSIZE= and ALIGNSASIOFILES are more important.
Going into more technical details of SAN hardware, Unix mount points, or 3390 VTOC effects, there are more difficulties to solve with higher priority.
