SAS Programming

DATA Step, Macro, Functions and more
BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Junyong
Pyrite | Level 9

I have a data set, the first 10 rows of which look like the following.

year,R_F,R_MKT,R_ME,R_IA,R_ROE,R_EG
1967,4.1474,24.4192,40.5479,-11.4478,20.6095,-3.2998
1968,5.2942,8.8747,24.9021,14.7436,-2.4844,11.4650
1969,6.5912,-17.4274,-11.7458,0.4645,15.4144,12.9056
1970,6.3829,-6.3099,-7.9029,22.7755,-0.4696,17.1111
1971,4.3172,11.8817,5.3575,0.9003,11.4332,6.1606
1972,3.8912,13.4494,-9.0317,5.1487,5.6877,15.0977
1973,7.0586,-25.8075,-17.1956,7.7738,0.9487,17.2879
1974,8.0781,-36.0193,4.5604,18.5330,11.4993,20.1174
1975,5.8210,31.5368,16.9765,7.4978,-6.2017,11.5153

So, each value has at most four numbers after the decimal point. There is no explicit bound, but the values will effectively be in between -999 and 999. I ran the following code.

data want;
	infile "http://global-q.org/uploads/1/2/2/6/122679606/
q5_factors_annual_2019a.csv" url firstobs=2 dsd;
	length year 3 R_F R_MKT R_ME R_IA R_ROE R_EG 6;
	input year R_F R_MKT R_ME R_IA R_ROE R_EG;
run;

I used 6 for the variable length based on this document, but it seems some values are unusually read as follows.

a.png

I wonder whether (1) the unusual values such as the 5.2941999999 above are just OK, and (2) the default length 8 rather than the 6 above must be used for these four-digit values.

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

@Junyong wrote:

I have a data set, the first 10 rows of which look like the following.

year,R_F,R_MKT,R_ME,R_IA,R_ROE,R_EG
1967,4.1474,24.4192,40.5479,-11.4478,20.6095,-3.2998
1968,5.2942,8.8747,24.9021,14.7436,-2.4844,11.4650
1969,6.5912,-17.4274,-11.7458,0.4645,15.4144,12.9056
1970,6.3829,-6.3099,-7.9029,22.7755,-0.4696,17.1111
1971,4.3172,11.8817,5.3575,0.9003,11.4332,6.1606
1972,3.8912,13.4494,-9.0317,5.1487,5.6877,15.0977
1973,7.0586,-25.8075,-17.1956,7.7738,0.9487,17.2879
1974,8.0781,-36.0193,4.5604,18.5330,11.4993,20.1174
1975,5.8210,31.5368,16.9765,7.4978,-6.2017,11.5153

So, each value has at most four numbers after the decimal point. There is no explicit bound, but the values will effectively be in between -999 and 999. I ran the following code.

data want;
	infile "http://global-q.org/uploads/1/2/2/6/122679606/
q5_factors_annual_2019a.csv" url firstobs=2 dsd;
	length year 3 R_F R_MKT R_ME R_IA R_ROE R_EG 6;
	input year R_F R_MKT R_ME R_IA R_ROE R_EG;
run;

I used 6 for the variable length based on this document, but it seems some values are unusually read as follows.

a.png

I wonder whether (1) the unusual values such as the 5.2941999999 above are just OK, and (2) the default length 8 rather than the 6 above must be used for these four-digit values.


The values are not unusually read, but they are "unusually" stored.  They were read in by the data step as 8-byte floating point values, but they are stored on disk with length 6 instead of (the default) 8 bytes.  And since they are stored in floating point binary (not decimal) the last two bytes (16 bits) may not be all trailing zeroes even though in decimal notation, everything after the (say) 8th significant digit would be zero.   Then when the 6-byte values are brought back into memory from the disk for viewing or any PROC, they will be expanded to 8-bytes in memory by appending 16 binary zeroes.  The result will NOT be exactly what it would have been if the last two bytes had been kept.

 

Unless your data always has exact integers, (or exact halves, quarters, eight's, etc.) you will inevitably introduce these approximations.  Don't use the LENGTH attribute to shorten storage of non-integers without a very powerful and well-understood reason.

 

BTW, this is not a SAS problem - it is an attribute of truncating the default double-word (8 bytes) floating point numeric structure for almost all computing hardware.  

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

11 REPLIES 11
mkeintz
PROC Star

@Junyong wrote:

I have a data set, the first 10 rows of which look like the following.

year,R_F,R_MKT,R_ME,R_IA,R_ROE,R_EG
1967,4.1474,24.4192,40.5479,-11.4478,20.6095,-3.2998
1968,5.2942,8.8747,24.9021,14.7436,-2.4844,11.4650
1969,6.5912,-17.4274,-11.7458,0.4645,15.4144,12.9056
1970,6.3829,-6.3099,-7.9029,22.7755,-0.4696,17.1111
1971,4.3172,11.8817,5.3575,0.9003,11.4332,6.1606
1972,3.8912,13.4494,-9.0317,5.1487,5.6877,15.0977
1973,7.0586,-25.8075,-17.1956,7.7738,0.9487,17.2879
1974,8.0781,-36.0193,4.5604,18.5330,11.4993,20.1174
1975,5.8210,31.5368,16.9765,7.4978,-6.2017,11.5153

So, each value has at most four numbers after the decimal point. There is no explicit bound, but the values will effectively be in between -999 and 999. I ran the following code.

data want;
	infile "http://global-q.org/uploads/1/2/2/6/122679606/
q5_factors_annual_2019a.csv" url firstobs=2 dsd;
	length year 3 R_F R_MKT R_ME R_IA R_ROE R_EG 6;
	input year R_F R_MKT R_ME R_IA R_ROE R_EG;
run;

I used 6 for the variable length based on this document, but it seems some values are unusually read as follows.

a.png

I wonder whether (1) the unusual values such as the 5.2941999999 above are just OK, and (2) the default length 8 rather than the 6 above must be used for these four-digit values.


The values are not unusually read, but they are "unusually" stored.  They were read in by the data step as 8-byte floating point values, but they are stored on disk with length 6 instead of (the default) 8 bytes.  And since they are stored in floating point binary (not decimal) the last two bytes (16 bits) may not be all trailing zeroes even though in decimal notation, everything after the (say) 8th significant digit would be zero.   Then when the 6-byte values are brought back into memory from the disk for viewing or any PROC, they will be expanded to 8-bytes in memory by appending 16 binary zeroes.  The result will NOT be exactly what it would have been if the last two bytes had been kept.

 

Unless your data always has exact integers, (or exact halves, quarters, eight's, etc.) you will inevitably introduce these approximations.  Don't use the LENGTH attribute to shorten storage of non-integers without a very powerful and well-understood reason.

 

BTW, this is not a SAS problem - it is an attribute of truncating the default double-word (8 bytes) floating point numeric structure for almost all computing hardware.  

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Shmuel
Garnet | Level 18

I have added INFORMAT and FORMAT statements to your code. Try it:

data test;
  length year 3 R_F R_MKT R_ME R_IA R_ROE R_EG 6;
  informat R_F R_MKT R_ME R_IA R_ROE R_EG best8.4;
  infile cards dlm=',' dsd truncover;
  input year R_F R_MKT R_ME R_IA R_ROE R_EG;
  format year 4.
        R_F R_MKT R_ME R_IA R_ROE R_EG best8.4;
cards;  
1967,4.1474,24.4192,40.5479,-11.4478,20.6095,-3.2998
1968,5.2942,8.8747,24.9021,14.7436,-2.4844,11.4650
1969,6.5912,-17.4274,-11.7458,0.4645,15.4144,12.9056
1970,6.3829,-6.3099,-7.9029,22.7755,-0.4696,17.1111
1971,4.3172,11.8817,5.3575,0.9003,11.4332,6.1606
1972,3.8912,13.4494,-9.0317,5.1487,5.6877,15.0977
1973,7.0586,-25.8075,-17.1956,7.7738,0.9487,17.2879
1974,8.0781,-36.0193,4.5604,18.5330,11.4993,20.1174
1975,5.8210,31.5368,16.9765,7.4978,-6.2017,11.5153
; run;

Junyong
Pyrite | Level 9

Thanks—in case, my code above imports the full data via url.

Shmuel
Garnet | Level 18

@Junyong wrote:

Thanks—in case, my code above imports the full data via url.


The data origin is not relevant here.

What makes the difference are: length, informat and format statements.

Junyong
Pyrite | Level 9

I had absolutely no idea on this issue so visited IEEE 754—it seems deviating from the default 8 is unreasonable unless integers like dates, years, dummies, etc. Much appreciate all these details for that novice question.

SASKiwi
PROC Star

Don't EVER store decimals in SAS in less than the default 8 bytes - it really is asking for trouble with precision. Disk storage is cheap these days so you aren't saving any money doing so, but you are wasting time writing unnecessary LENGTH statements. Integers can be stored accurately in less than 8 bytes and the most common candidates for this are SAS dates. But again I personally think it is a waste of time adding LENGTH statements for all SAS dates.

 

If you want to save space in SAS datasets, use the COMPRESS option in an OPTIONS statement then all following datasets created will be compressed. You'll save a lot more space that way than tinkering with the storage length of numeric variables and it is way quicker to implement.

Junyong
Pyrite | Level 9

Thanks for this clarification again—it seems changing the default is ill-suited if non-integers. However, it seems it still pays to cut the data size because IT services such as Dropbox and Spectrum are not frictionless. I think compress works well for characters rather than numbers.

SASKiwi
PROC Star

@Junyong  - COMPRESS = BINARY also compresses some numerics.

ballardw
Super User

For display if the range of values is -999 to 999 with up to 4 decimals then a BEST9. should work.

-998.1234 occupies 9 print positions: 1 for the -, 3 for the integer portion, 1 for the decimal point and 4 for the decimal digits. 1+3+1+4=9.

 

There are a whole lot of references for computer precision and representation of decimals in binary storage. Do a search.

 

Moral of the story: don't us length less than the default 8 for decimals.

The following reads the same text value into multiple variables that are defined with different lengths.

data example;
   length x1 8 x2 7 x3 6 x4 5 x5 4 x6 3;
   array x (*) x1 - x6;
   do i=1 to dim(x);
      input @1 x[i] @;
   end;
   input;
format x: best16. ; datalines; 123456789. 12345678.9 1234567.89 123456.789 12345.6789 1234.56789 123.456789 12.3456789 1.23456789 0.123456789 0.0123456789 0.00123456789 ;

With a wider than default format you can see issues with decimals occurring quite easily just based on the length assigned.

Junyong
Pyrite | Level 9

Thanks for this intuitive example. I had absolutely no idea on this issue so read the IEEE 754 page above. I will stick to it unless integers.

mkeintz
PROC Star

You don't have to restrict length shortening to just integers.  If all your values are halves and quarters for instance, then you could shorten with no loss of accuracy.  Consider the possibilities of length 6 below for this subset of integers divided by powers of 2: 

 

213  data _null_;
214    /* Get the largest consecutive integer storable as floating point 6 bytes */
215    L6=constant('exactint',6);
216    put L6=comma22.0;
217
218    /* Now show the maximum consecutive value for power of 2 fractions: */
219    X=L6;
220    Length txt $22;
221    do denom = 2 by 0 while (denom<=256*256);
222      X=x/2;
223      TXT=right(put(x,comma22.0));
224      put '1/' denom  @15 txt $char22.;
225      denom=2*denom;
226    end;
227  run;

L6=137,438,953,472
1/2                   68,719,476,736
1/4                   34,359,738,368
1/8                   17,179,869,184
1/16                   8,589,934,592
1/32                   4,294,967,296
1/64                   2,147,483,648
1/128                  1,073,741,824
1/256                    536,870,912
1/512                    268,435,456
1/1024                   134,217,728
1/2048                    67,108,864
1/4096                    33,554,432
1/8192                    16,777,216
1/16384                    8,388,608
1/32768                    4,194,304
1/65536                    2,097,152

I.e. every 1/4th up to 34,359,738,368 will suffer no loss of accuracy in a storage length of 6.

 

 

BTW, this was a real-world problem in the stock market.  Markets used to report prices in dollars, half-dollar, quarters, eights (and I think down to sixteenths) up until 2001, when they switch to decimalization. Not only did the decimalization cut down of trader's commissions (granularity of prices went down to single pennies), but until that time we could accurately store daily closing prices with a length of less the 8 (I believe our supplier - CRSP - used a length of 6).  Well, that little space savings disappeared quickly.   More importantly, we get hundreds of millions of stock trades and quotes daily from us stock exchanges. 

At 2 bytes per trade/quote space needs went up dramatically.

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

sas-innovate-white.png

Our biggest data and AI event of the year.

Don’t miss the livestream kicking off May 7. It’s free. It’s easy. And it’s the best seat in the house.

Join us virtually with our complimentary SAS Innovate Digital Pass. Watch live or on-demand in multiple languages, with translations available to help you get the most out of every session.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 11 replies
  • 2978 views
  • 1 like
  • 5 in conversation