Solved: Re: "Largest Integer Represented Exactly?" - NOT Exactly

mkeintz · Posted 01-24-2020 12:40 AM

A lot of people know this, but I think it's worth putting in a message:

Have you ever seen this table presenting information on the range and precision of numeric variables for varying storage lengths? This table has been in the SAS documentation for decades.

Significant Digits and Largest Integer by Length for SAS Variables under Windows

Length in Bytes	Largest Integer Represented Exactly	Exponential Notation	Significant Digits Retained
3	8,192	2¹³	3
4	2,097,152	2²¹	6
5	536,870,912	2²⁹	8
6	137,438,953,472	2³⁷	11
7	35,184,372,088,832	2⁴⁵	13
8	9,007,199,254,740,992	2⁵³	15

~~Some variation in operating systems but the negative values require more storage than the simple integers~~. (Struck out as per @FreelanceReinh's comment.)
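The table values can be cross-checked from inside SAS. The sketch below is not from the original post. It assumes the 'EXACTINT' option of the CONSTANT function with a byte-length argument, and compares it with 2**(8*len-11). That formula is an assumption derived from the bit layout (8*len bits, minus 1 sign bit, minus 11 exponent bits, plus the implied leading bit), not something the documentation table states.

```sas
data _null_;
  do len = 3 to 8;
    /* limit reported by SAS for a numeric variable of this length */
    from_constant = constant('exactint', len);
    /* assumed formula: 8*len bits - 1 sign - 11 exponent + 1 implied bit */
    from_formula  = 2**(8*len - 11);
    put len= from_constant= comma25. from_formula= comma25.;
  end;
run;
```

If the two columns agree on your platform, the table is simply listing 2 raised to the number of significant bits available at each length.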

The column "Largest Integer Represented Exactly" is a bit misleading. The values displayed are actually the "Largest Consecutive Integer Represented Exactly". For instance, for a 3-byte variable you can precisely store many integers over the posted value of 8,192. They are:

All even numbers >8,192 and <=16,384, i.e. all even numbers in the range (1*8192, 2*8192].
    1. In the range notation above, "(" means an open end, i.e. down to but not including 8,192.
    2. The "]" (closed) means the upper end-point is included.
    3. This amounts to another 4,096 integers.
Then all integers exactly divisible by 4 that are >16,384 and <=32,768, i.e. in the range (2*8192, 4*8192].
    That is another 4,096 ~~2,048~~ integers.
All integers divisible by 8 ("0mod8") in (4*8192, 8*8192].
    4,096 ~~1,024~~ more integers.
Generally, all 0modX integers (where X is a power of 2) in the range ((X/2)*8192, X*8192].
    4,096 ~~8192/X~~ more integers.
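The counts in the list above can be checked by brute force. This is a minimal sketch, not part of the original post: it uses the TRUNC function to emulate storing a value in 3 bytes, and the variable names (lower, upper, n_exact) are illustrative only.

```sas
data _null_;
  do j = 1 to 3;
    lower = 8192 * 2**(j-1);      /* open lower end of the range  */
    upper = 8192 * 2**j;          /* closed upper end of the range */
    n_exact = 0;
    do i = lower + 1 to upper;
      /* TRUNC emulates 3-byte storage; equality means no precision was lost */
      if trunc(i, 3) = i then n_exact = n_exact + 1;
    end;
    put lower= comma7. upper= comma7. n_exact= comma7.;
  end;
run;
```

By the reasoning above, each of the three ranges should report 4,096 surviving integers.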

~~So the number of integers above 8,192 approaches 8,192.~~

The same principle holds for all the other storage lengths.

However, all the progressions of large integer values above stop at 1.797693E308, the largest double-precision (8-byte) number representable on the (Windows) system. That number is produced by the CONSTANT('BIG') function.

Of course, you can't print these large integers if they require more than 32 decimal digits.

2nd edit: In the other direction, for length 3 values:

There are only integers in (4096,8192]
There are only integers and halves in (2048,4096]
Integers, halves, quarters in (1024,2048]
etc.
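The same pattern below 8,192 can be probed with a few hand-picked values. Again this is only a sketch using TRUNC to emulate 3-byte storage; the test values are chosen for illustration and do not come from the original post.

```sas
data _null_;
  do x = 5000.5, 3000.5, 3000.25, 1500.25, 1500.125;
    stored   = trunc(x, 3);     /* value as it would be held in 3 bytes */
    survives = (stored = x);    /* 1 if exact, 0 if precision was lost  */
    put x= stored= survives=;
  end;
run;
```

If the rule above holds, 3000.5 (a half in (2048,4096]) and 1500.25 (a quarter in (1024,2048]) should survive, while 5000.5, 3000.25 and 1500.125 should not.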

Edited Additional Note. Also remember that the numeric integer limits above apply only after the variable has been stored in a SAS data set. That's when the double-precision (8-byte) value generated and used in the DATA step is actually truncated to the user-specified length. For example, if you run this program, notice how 8,193 gets truncated to 8,192 only after being stored in, and later retrieved from, a data set:

data t;
  length x3 3  x8 8;
  do x3=8190 to 8194;
    x8=x3;
    put (x:) (=comma5.0);
    output;
  end;
run;
data _null_;
  set t;
  put (x:) (=comma5.0);
run;

which produces this log:

9    data t;
10     length x3 3  x8 8;
11     do x3=8190 to 8194;
12       x8=x3;
13       put (x:) (=comma5.0);
14       output;
15     end;
16   run;

x3=8,190 x8=8,190
x3=8,191 x8=8,191
x3=8,192 x8=8,192
x3=8,193 x8=8,193
x3=8,194 x8=8,194


17   data _null_;
18     set t;
19     put (x:) (=comma5.0);
20   run;

x3=8,190 x8=8,190
x3=8,191 x8=8,191
x3=8,192 x8=8,192
x3=8,192 x8=8,193
x3=8,194 x8=8,194

Notice X8 keeps the original values, but odd values of X3 (length 3) above 8,192 are truncated to the next lower even number.
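The truncation can also be previewed without a round trip through a data set. The sketch below is an addition, not part of the original program: it assumes TRUNC(value, 3) gives the same result as storing the value in a length-3 variable, and x3_emulated is a hypothetical name.

```sas
data _null_;
  do x8 = 8190 to 8198;
    /* emulate what a length-3 variable would hold after being stored */
    x3_emulated = trunc(x8, 3);
    put x8= comma5. x3_emulated= comma5.;
  end;
run;
```

This keeps everything in one DATA step, which is handy for checking a value before deciding on a shorter LENGTH.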


FreelanceReinh · Posted 01-24-2020 09:47 AM

@mkeintz: Yes, I agree that the wording ("largest integer") which is often used in this context can be misleading. (However, the statement that "negative values require more storage ..." contradicts the fact that the "sign bit" is always included in the internal representation. I guess you copied this sentence from this recent post, not from official SAS documentation.)

Just to add the explanation (for readers who are curious) why numeric variables of any length can store certain larger integers than those supposedly "largest" ones without losing precision, in particular powers of 2 (up to 2**1023=8.988...E307 -- displaying such values is a different issue): The internal floating-point representation of numeric values uses separate sets of bits for the exponent (containing a number's "order of magnitude") and the mantissa (relevant for the "precision"). For both parts of the number it uses the binary system.

This means: If a number can be stored exactly, such as an integer with an absolute value below the upper bounds 8192 etc. mentioned in @mkeintz's post, then one can multiply this number by any power of 2, i.e. 2**k with a positive or negative integer k, and the result, be it an integer or not, can still be stored exactly -- except in the extreme cases where its absolute value exceeds CONSTANT('BIG') or falls below CONSTANT('SMALL'). This is because that multiplication will only change the exponent, but not the mantissa. It's analogous to scientific notation in the decimal system, e.g. 6.022E23, where a multiplication by any power of the base 10 affects only the exponent (23), but leaves the mantissa digits (6022) unchanged and in particular doesn't require more of them.

For example, a numeric variable of length 3 (bytes) on a Windows or Unix system will happily store exact numbers like 4321*2**44=76015835897921536 or 123*2**-17=0.00093841552734375, but will miserably lose precision when being assigned values such as 8765 or 0.1 (!), for which 12 mantissa bits (3*8 bits due to length 3, minus 1 bit for the sign minus 11 bits for the exponent) are insufficient:

data test;
length a1-a4 3;  /* mathematical binary representation of the assigned value:   */
a1=4321*2**44;   /* 100001110000100000000000000000000000000000000000000000000   */
a2=123*2**-17;   /* 0.00000000001111011000000                                   */
a3=8765;         /* 10001000111101                                              */
a4=0.1;          /* 0.000110011001100110011001... (infinitely repeating "0011") */
run;

proc print data=test;
format a1 best32. a2 e20.;
run;

Results (under Windows):

               a1                      a2     a3        a4

76015835897921536     9.3841552734375E-04    8764    0.099991
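The scaling argument (multiplying by a power of 2 changes only the exponent) can be probed over a range of exponents. This is a hedged sketch: it relies on TRUNC to emulate 3-byte storage, and the names scaled_ok, scaled_bad, keeps_ok and keeps_bad are illustrative.

```sas
data _null_;
  do k = -20 to 40 by 10;
    scaled_ok  = 4321 * 2**k;   /* 4321 needs 13 significant bits */
    scaled_bad = 8765 * 2**k;   /* 8765 needs 14 significant bits */
    /* 1 if the value survives emulated 3-byte storage, else 0 */
    keeps_ok  = (trunc(scaled_ok, 3)  = scaled_ok);
    keeps_bad = (trunc(scaled_bad, 3) = scaled_bad);
    put k= keeps_ok= keeps_bad=;
  end;
run;
```

By the argument above, keeps_ok should be 1 and keeps_bad should be 0 for every k, since scaling by 2**k never changes how many mantissa bits a value needs.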

More details can be found in Numerical Accuracy in SAS Software.

[Edit: Corrected minor typo in the text and improved wording in one place.]

Edit 2: Just noticed: I think your calculations underestimate the number of integers with an exact representation in a 3-byte variable: Each of the ranges [8192, 16384), [16384, 32768), ..., [2**1023, 2**1024) -- these are 1023-13+1=1011 ranges -- contributes 2**12=4096 integers. (In fact, there are 4096, not 2048, integers divisible by 4 in the range (16384, 32768] or [16384, 32768) for that matter, again 4096 divisible by 8 in [32768, 65536) and so on. Both the length of the range and the divisor are doubled in each step.) These correspond to the 2**12 different combinations that can be formed with the 12 mantissa bits and each range corresponds to one combination of the exponent bits. The pertinent values of the exponent (which is shifted by 1023 ["bias"]) are 1036, 1037, ..., 2046. So, together with the 8191 integers 1, 2, 3, ..., 8191 (whose exponents range from 1023 to 1035 while only 1+2+4+8+...+4096=8191 mantissas yield integers) I end up with n=1011*4096+8191=4,149,247 positive integers and hence a total of 2*n+1=8,298,495 integers including zero and negative integers.
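The arithmetic in Edit 2 can be replayed in a short DATA step. The variable names are illustrative; the constants (11 exponent bits, ranges from 2**13 up to 2**1023) are taken from the text above.

```sas
data _null_;
  mantissa_bits = 3*8 - 1 - 11;        /* 24 bits - sign - exponent = 12   */
  n_ranges      = 1023 - 13 + 1;       /* [2**13,2**14) ... [2**1023,2**1024) */
  per_range     = 2**mantissa_bits;    /* integers contributed by each range  */
  below         = 2**13 - 1;           /* the integers 1, 2, ..., 8191        */
  n_pos         = n_ranges*per_range + below;
  total         = 2*n_pos + 1;         /* add negatives and zero */
  put n_pos= comma12. total= comma12.;
run;
```

This reproduces n=4,149,247 and a total of 8,298,495 from the figures given in the post.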


yabwon · Posted 01-24-2020 03:22 AM

@mkeintz

Mark,

thanks a 1e6 for this very useful analysis! I will definitely refer to your post during my classes.

All the best

Bart


mkeintz · Posted 01-24-2020 02:30 PM

@FreelanceReinh Thanks for providing a deeper dive into numeric precision in SAS. I was hoping for someone to discuss the WHY that explains my description of the WHAT.


FreelanceReinh · Posted 01-24-2020 05:02 PM

You're welcome. Numeric representation issues have been one of my "favorite" topics in SAS for many years. Yet I still "discover" new surprising oddities about them from time to time.

Sorry, I hadn't checked your calculations regarding the number of exactly representable integers in a 3-byte variable earlier. I have added my own calculations (resulting in larger values) in "Edit 2" of my post.
