I have a data set, the first 10 rows of which look like the following.
year,R_F,R_MKT,R_ME,R_IA,R_ROE,R_EG
1967,4.1474,24.4192,40.5479,-11.4478,20.6095,-3.2998
1968,5.2942,8.8747,24.9021,14.7436,-2.4844,11.4650
1969,6.5912,-17.4274,-11.7458,0.4645,15.4144,12.9056
1970,6.3829,-6.3099,-7.9029,22.7755,-0.4696,17.1111
1971,4.3172,11.8817,5.3575,0.9003,11.4332,6.1606
1972,3.8912,13.4494,-9.0317,5.1487,5.6877,15.0977
1973,7.0586,-25.8075,-17.1956,7.7738,0.9487,17.2879
1974,8.0781,-36.0193,4.5604,18.5330,11.4993,20.1174
1975,5.8210,31.5368,16.9765,7.4978,-6.2017,11.5153
So each value has at most four digits after the decimal point. There is no explicit bound, but the values will effectively be between -999 and 999. I ran the following code.
data want;
  infile "http://global-q.org/uploads/1/2/2/6/122679606/q5_factors_annual_2019a.csv"
    url firstobs=2 dsd;
  length year 3 R_F R_MKT R_ME R_IA R_ROE R_EG 6;
  input year R_F R_MKT R_ME R_IA R_ROE R_EG;
run;
I used 6 for the variable length based on this document, but some values seem to be read unusually; for example, 5.2942 comes out as 5.2941999999.
I wonder whether (1) unusual values such as the 5.2941999999 above are acceptable, and (2) the default length of 8, rather than the 6 above, must be used for these four-decimal values.
Accepted Solutions
The values are not unusually read, but they are "unusually" stored. They were read in by the DATA step as 8-byte floating-point values, but they are stored on disk with length 6 instead of the default 8 bytes. And since they are stored in floating-point binary (not decimal), the last two bytes (16 bits) may not be all trailing zeroes, even though in decimal notation everything after the (say) 8th significant digit would be zero. Then, when the 6-byte values are brought back into memory from disk for viewing or for any PROC, they are expanded to 8 bytes by appending 16 binary zeroes. The result will NOT be exactly what it would have been if the last two bytes had been kept.
Unless your data always contains exact integers (or exact halves, quarters, eighths, etc.), you will inevitably introduce these approximations. Don't use the LENGTH attribute to shorten storage of non-integers without a very powerful and well-understood reason.
BTW, this is not a SAS problem - it is a consequence of truncating the default double-word (8-byte) floating-point numeric structure used by almost all computing hardware.
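The truncation described above can be illustrated outside SAS, since the effect is a property of IEEE 754 doubles rather than of SAS itself. The sketch below is a rough model, not SAS's exact on-disk format: the helper name truncate_double is made up for illustration, and it simply zeroes the dropped low-order bytes of an 8-byte double, the way storing a SAS numeric with a LENGTH less than 8 drops low-order mantissa bytes.

```python
import struct

def truncate_double(x, length):
    """Rough model of SAS numeric truncation: keep the first `length`
    bytes of the 8-byte IEEE 754 representation, zero-fill the rest."""
    b = struct.pack('>d', x)                    # big-endian 8-byte double
    kept = b[:length] + b'\x00' * (8 - length)  # drop low-order mantissa bytes
    return struct.unpack('>d', kept)[0]

print(truncate_double(5.2942, 8) == 5.2942)  # full 8 bytes: exact
print(truncate_double(5.2942, 6) == 5.2942)  # 6 bytes: last 16 mantissa bits lost
print(truncate_double(1967.0, 3))            # small integers survive length 3
```

Note that the error introduced at length 6 is tiny (on the order of 2^-34 for values near 5), which is why the stored value prints as 5.2941999999... rather than something wildly wrong.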
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
I have added INFORMAT and FORMAT statements to your code. Try it:
data test;
  length year 3 R_F R_MKT R_ME R_IA R_ROE R_EG 6;
  informat R_F R_MKT R_ME R_IA R_ROE R_EG best8.4;
  infile cards dlm=',' dsd truncover;
  input year R_F R_MKT R_ME R_IA R_ROE R_EG;
  format year 4.
         R_F R_MKT R_ME R_IA R_ROE R_EG best8.4;
cards;
1967,4.1474,24.4192,40.5479,-11.4478,20.6095,-3.2998
1968,5.2942,8.8747,24.9021,14.7436,-2.4844,11.4650
1969,6.5912,-17.4274,-11.7458,0.4645,15.4144,12.9056
1970,6.3829,-6.3099,-7.9029,22.7755,-0.4696,17.1111
1971,4.3172,11.8817,5.3575,0.9003,11.4332,6.1606
1972,3.8912,13.4494,-9.0317,5.1487,5.6877,15.0977
1973,7.0586,-25.8075,-17.1956,7.7738,0.9487,17.2879
1974,8.0781,-36.0193,4.5604,18.5330,11.4993,20.1174
1975,5.8210,31.5368,16.9765,7.4978,-6.2017,11.5153
;
run;
Thanks. Just in case: my code above imports the full data via URL.
@Junyong wrote:
Thanks. Just in case: my code above imports the full data via URL.
The data origin is not relevant here. What makes the difference is the LENGTH, INFORMAT, and FORMAT statements.
I had absolutely no idea about this issue, so I visited the IEEE 754 page - it seems deviating from the default 8 is unreasonable except for integers like dates, years, dummies, etc. Much appreciated - thanks for all these details on a novice question.
Don't EVER store decimals in SAS in less than the default 8 bytes - it really is asking for trouble with precision. Disk storage is cheap these days so you aren't saving any money doing so, but you are wasting time writing unnecessary LENGTH statements. Integers can be stored accurately in less than 8 bytes and the most common candidates for this are SAS dates. But again I personally think it is a waste of time adding LENGTH statements for all SAS dates.
If you want to save space in SAS datasets, use the COMPRESS option in an OPTIONS statement then all following datasets created will be compressed. You'll save a lot more space that way than tinkering with the storage length of numeric variables and it is way quicker to implement.
Thanks for this clarification again - it seems changing the default is ill-suited for non-integers. However, it still pays to cut the data size, because IT services such as Dropbox and Spectrum are not frictionless. I think COMPRESS works well for characters rather than numbers, though.
@Junyong - COMPRESS = BINARY also compresses some numerics.
For display, if the range of values is -999 to 999 with up to 4 decimals, then BEST9. should work.
-998.1234 occupies 9 print positions: 1 for the -, 3 for the integer portion, 1 for the decimal point and 4 for the decimal digits. 1+3+1+4=9.
There are a whole lot of references for computer precision and representation of decimals in binary storage. Do a search.
Moral of the story: don't use a length less than the default 8 for decimals.
The following reads the same text value into multiple variables that are defined with different lengths.
data example;
  length x1 8 x2 7 x3 6 x4 5 x5 4 x6 3;
  array x (*) x1 - x6;
  do i=1 to dim(x);
    input @1 x[i] @;
  end;
  input;
  format x: best16.;
datalines;
123456789.
12345678.9
1234567.89
123456.789
12345.6789
1234.56789
123.456789
12.3456789
1.23456789
0.123456789
0.0123456789
0.00123456789
;
With a wider-than-default format, you can easily see issues with decimals occurring based purely on the assigned length.
Thanks for this intuitive example. I had absolutely no idea about this issue, so I read the IEEE 754 page above. I will stick to the default length unless the values are integers.
You don't have to restrict length shortening to just integers. If all your values are halves and quarters for instance, then you could shorten with no loss of accuracy. Consider the possibilities of length 6 below for this subset of integers divided by powers of 2:
data _null_;
  /* Get the largest consecutive integer storable as a 6-byte floating point */
  L6=constant('exactint',6);
  put L6=comma22.0;

  /* Now show the maximum consecutive value for power-of-2 fractions: */
  X=L6;
  length txt $22;
  do denom = 2 by 0 while (denom<=256*256);
    X=x/2;
    TXT=right(put(x,comma22.0));
    put '1/' denom @15 txt $char22.;
    denom=2*denom;
  end;
run;
L6=137,438,953,472
1/2 68,719,476,736
1/4 34,359,738,368
1/8 17,179,869,184
1/16 8,589,934,592
1/32 4,294,967,296
1/64 2,147,483,648
1/128 1,073,741,824
1/256 536,870,912
1/512 268,435,456
1/1024 134,217,728
1/2048 67,108,864
1/4096 33,554,432
1/8192 16,777,216
1/16384 8,388,608
1/32768 4,194,304
1/65536 2,097,152
I.e., every quarter up to 34,359,738,368 will suffer no loss of accuracy in a storage length of 6.
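The power-of-2 point generalizes: any value whose binary mantissa fits in the bytes that are kept survives truncation exactly. A quick check outside SAS (Python, with a made-up truncate_double helper that zeroes the dropped low-order bytes of an IEEE 754 double as a rough model of LENGTH-n storage):

```python
import struct

def truncate_double(x, nbytes):
    """Keep the first `nbytes` of an 8-byte IEEE 754 double, zero-filling
    the rest - a rough model of SAS LENGTH-n numeric storage."""
    b = struct.pack('>d', x)
    return struct.unpack('>d', b[:nbytes] + b'\x00' * (8 - nbytes))[0]

# Quarters well below 34,359,738,368 need few mantissa bits: exact at length 6
print(truncate_double(123456.25, 6) == 123456.25)
# 0.1 has a non-terminating binary expansion: not exact at length 6
print(truncate_double(0.1, 6) == 0.1)
```

The same check explains the stock-price example: prices quoted in halves, quarters, and eighths are dyadic rationals with short mantissas, so they truncate losslessly, while decimalized penny prices generally do not.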
BTW, this was a real-world problem in the stock market. Markets used to report prices in dollars, half-dollars, quarters, eighths (and I think down to sixteenths) up until 2001, when they switched to decimalization. Not only did decimalization cut down traders' commissions (the granularity of prices went down to single pennies), but until that time we could accurately store daily closing prices with a length of less than 8 (I believe our supplier - CRSP - used a length of 6). Well, that little space savings disappeared quickly. More importantly, we get hundreds of millions of stock trades and quotes daily from US stock exchanges, so at 2 extra bytes per trade/quote, space needs went up dramatically.