I have a data set, the first 10 rows of which look like the following.
year,R_F,R_MKT,R_ME,R_IA,R_ROE,R_EG
1967,4.1474,24.4192,40.5479,-11.4478,20.6095,-3.2998
1968,5.2942,8.8747,24.9021,14.7436,-2.4844,11.4650
1969,6.5912,-17.4274,-11.7458,0.4645,15.4144,12.9056
1970,6.3829,-6.3099,-7.9029,22.7755,-0.4696,17.1111
1971,4.3172,11.8817,5.3575,0.9003,11.4332,6.1606
1972,3.8912,13.4494,-9.0317,5.1487,5.6877,15.0977
1973,7.0586,-25.8075,-17.1956,7.7738,0.9487,17.2879
1974,8.0781,-36.0193,4.5604,18.5330,11.4993,20.1174
1975,5.8210,31.5368,16.9765,7.4978,-6.2017,11.5153
So each value has at most four digits after the decimal point. There is no explicit bound, but the values will effectively be between -999 and 999. I ran the following code.
data want;
  infile "http://global-q.org/uploads/1/2/2/6/122679606/q5_factors_annual_2019a.csv"
    url firstobs=2 dsd;
  length year 3 R_F R_MKT R_ME R_IA R_ROE R_EG 6;
  input year R_F R_MKT R_ME R_IA R_ROE R_EG;
run;
I used 6 for the variable length based on this document, but some values seem to be read unusually; for example, 5.2942 comes out as 5.2941999999.
I wonder whether (1) unusual values such as the 5.2941999999 above are acceptable, and (2) the default length of 8, rather than the 6 above, must be used for these four-decimal values.
Accepted Solutions
The values are not unusually read, but they are "unusually" stored. They were read in by the DATA step as 8-byte floating-point values, but they are stored on disk with length 6 instead of the default 8 bytes. And since they are stored in floating-point binary (not decimal), the last two bytes (16 bits) may not be all trailing zeroes, even though in decimal notation everything after the (say) 8th significant digit would be zero. Then, when the 6-byte values are brought back into memory from disk for viewing or for any PROC, they are expanded to 8 bytes by appending 16 binary zeroes. The result will NOT be exactly what it would have been if the last two bytes had been kept.
Unless your data always contains exact integers (or exact halves, quarters, eighths, etc.), you will inevitably introduce these approximations. Don't use the LENGTH attribute to shorten storage of non-integers without a very powerful and well-understood reason.
BTW, this is not a SAS problem - it is a consequence of truncating the default double-word (8-byte) floating-point numeric structure used by almost all computing hardware.
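The truncation described above can be illustrated outside SAS, since the effect is a property of IEEE 754 doubles rather than of SAS itself. The sketch below is a rough model, not SAS's exact on-disk format: the helper name truncate_double is made up for illustration, and it simply zeroes the dropped low-order bytes of an 8-byte double, the way storing a SAS numeric with a LENGTH less than 8 drops low-order mantissa bytes.

```python
import struct

def truncate_double(x, length):
    """Rough model of SAS numeric truncation: keep the first `length`
    bytes of the 8-byte IEEE 754 representation, zero-fill the rest."""
    b = struct.pack('>d', x)                    # big-endian 8-byte double
    kept = b[:length] + b'\x00' * (8 - length)  # drop low-order mantissa bytes
    return struct.unpack('>d', kept)[0]

print(truncate_double(5.2942, 8) == 5.2942)  # full 8 bytes: exact
print(truncate_double(5.2942, 6) == 5.2942)  # 6 bytes: last 16 mantissa bits lost
print(truncate_double(1967.0, 3))            # small integers survive length 3
```

Note that the error introduced at length 6 is tiny (on the order of 2^-34 for values near 5), which is why the stored value prints as 5.2941999999... rather than something wildly wrong.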
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
I have added INFORMAT and FORMAT statements to your code. Try it:
data test;
  length year 3 R_F R_MKT R_ME R_IA R_ROE R_EG 6;
  informat R_F R_MKT R_ME R_IA R_ROE R_EG best8.4;
  infile cards dlm=',' dsd truncover;
  input year R_F R_MKT R_ME R_IA R_ROE R_EG;
  format year 4.
         R_F R_MKT R_ME R_IA R_ROE R_EG best8.4;
cards;
1967,4.1474,24.4192,40.5479,-11.4478,20.6095,-3.2998
1968,5.2942,8.8747,24.9021,14.7436,-2.4844,11.4650
1969,6.5912,-17.4274,-11.7458,0.4645,15.4144,12.9056
1970,6.3829,-6.3099,-7.9029,22.7755,-0.4696,17.1111
1971,4.3172,11.8817,5.3575,0.9003,11.4332,6.1606
1972,3.8912,13.4494,-9.0317,5.1487,5.6877,15.0977
1973,7.0586,-25.8075,-17.1956,7.7738,0.9487,17.2879
1974,8.0781,-36.0193,4.5604,18.5330,11.4993,20.1174
1975,5.8210,31.5368,16.9765,7.4978,-6.2017,11.5153
;
run;
Thanks. Just in case: my code above imports the full data via URL.
@Junyong wrote:
Thanks. Just in case: my code above imports the full data via URL.
The data origin is not relevant here. What makes the difference is the LENGTH, INFORMAT, and FORMAT statements.
I had absolutely no idea about this issue, so I visited the IEEE 754 page - it seems deviating from the default 8 is unreasonable except for integers like dates, years, dummies, etc. Much appreciated - thanks for all these details on a novice question.
Don't EVER store decimals in SAS in less than the default 8 bytes - it really is asking for trouble with precision. Disk storage is cheap these days so you aren't saving any money doing so, but you are wasting time writing unnecessary LENGTH statements. Integers can be stored accurately in less than 8 bytes and the most common candidates for this are SAS dates. But again I personally think it is a waste of time adding LENGTH statements for all SAS dates.
If you want to save space in SAS datasets, use the COMPRESS option in an OPTIONS statement then all following datasets created will be compressed. You'll save a lot more space that way than tinkering with the storage length of numeric variables and it is way quicker to implement.
Thanks for this clarification again - it seems changing the default is ill-suited for non-integers. However, it still pays to cut the data size, because IT services such as Dropbox and Spectrum are not frictionless. I think COMPRESS works well for characters rather than numbers, though.
@Junyong - COMPRESS = BINARY also compresses some numerics.
For display, if the range of values is -999 to 999 with up to 4 decimals, then BEST9. should work.
-998.1234 occupies 9 print positions: 1 for the -, 3 for the integer portion, 1 for the decimal point and 4 for the decimal digits. 1+3+1+4=9.
There are a whole lot of references for computer precision and representation of decimals in binary storage. Do a search.
Moral of the story: don't use a length less than the default 8 for decimals.
The following reads the same text value into multiple variables that are defined with different lengths.
data example;
  length x1 8 x2 7 x3 6 x4 5 x5 4 x6 3;
  array x (*) x1 - x6;
  do i=1 to dim(x);
    input @1 x[i] @;
  end;
  input;
  format x: best16.;
datalines;
123456789.
12345678.9
1234567.89
123456.789
12345.6789
1234.56789
123.456789
12.3456789
1.23456789
0.123456789
0.0123456789
0.00123456789
;
With a wider-than-default format, you can easily see issues with decimals occurring based purely on the assigned length.
Thanks for this intuitive example. I had absolutely no idea about this issue, so I read the IEEE 754 page above. I will stick to the default length unless the values are integers.
You don't have to restrict length shortening to just integers. If all your values are halves and quarters for instance, then you could shorten with no loss of accuracy. Consider the possibilities of length 6 below for this subset of integers divided by powers of 2:
data _null_;
  /* Get the largest consecutive integer storable as a 6-byte floating point */
  L6=constant('exactint',6);
  put L6=comma22.0;

  /* Now show the maximum consecutive value for power-of-2 fractions: */
  X=L6;
  length txt $22;
  do denom = 2 by 0 while (denom<=256*256);
    X=x/2;
    TXT=right(put(x,comma22.0));
    put '1/' denom @15 txt $char22.;
    denom=2*denom;
  end;
run;
L6=137,438,953,472
1/2 68,719,476,736
1/4 34,359,738,368
1/8 17,179,869,184
1/16 8,589,934,592
1/32 4,294,967,296
1/64 2,147,483,648
1/128 1,073,741,824
1/256 536,870,912
1/512 268,435,456
1/1024 134,217,728
1/2048 67,108,864
1/4096 33,554,432
1/8192 16,777,216
1/16384 8,388,608
1/32768 4,194,304
1/65536 2,097,152
I.e., every quarter up to 34,359,738,368 will suffer no loss of accuracy in a storage length of 6.
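The power-of-2 point generalizes: any value whose binary mantissa fits in the bytes that are kept survives truncation exactly. A quick check outside SAS (Python, with a made-up truncate_double helper that zeroes the dropped low-order bytes of an IEEE 754 double as a rough model of LENGTH-n storage):

```python
import struct

def truncate_double(x, nbytes):
    """Keep the first `nbytes` of an 8-byte IEEE 754 double, zero-filling
    the rest - a rough model of SAS LENGTH-n numeric storage."""
    b = struct.pack('>d', x)
    return struct.unpack('>d', b[:nbytes] + b'\x00' * (8 - nbytes))[0]

# Quarters well below 34,359,738,368 need few mantissa bits: exact at length 6
print(truncate_double(123456.25, 6) == 123456.25)
# 0.1 has a non-terminating binary expansion: not exact at length 6
print(truncate_double(0.1, 6) == 0.1)
```

The same check explains the stock-price example: prices quoted in halves, quarters, and eighths are dyadic rationals with short mantissas, so they truncate losslessly, while decimalized penny prices generally do not.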
BTW, this was a real-world problem in the stock market. Markets used to report prices in dollars, half-dollars, quarters, eighths (and I think down to sixteenths) up until 2001, when they switched to decimalization. Not only did decimalization cut down traders' commissions (the granularity of prices went down to single pennies), but until that time we could accurately store daily closing prices with a length of less than 8 (I believe our supplier - CRSP - used a length of 6). Well, that little space savings disappeared quickly. More importantly, we get hundreds of millions of stock trades and quotes daily from US stock exchanges, so at 2 extra bytes per trade/quote, space needs went up dramatically.