Solved: I meet with a problem of data values in do loops

Hong-tian · Posted 10-11-2020 11:45 AM

Hi All, I generate a dataset named tmp1 using the following do loops, based on which I wanted to generate another dataset named tmp2. Although it is obvious that tmp2 would have one record, it did not.

Could anyone tell me what's wrong? Thank you!

data tmp1;

do i=0.025 to 0.250 by 0.025;
output;
end;
run;

data tmp2;
set tmp1;
if i eq 0.2;
run;

PaigeMiller · Posted 10-11-2020 12:03 PM

This is a problem of machine precision, where digital computers cannot represent some numbers exactly, and so the value in the data set is the number you expect within plus or minus some very small value epsilon.

So, you can handle it this way:

data tmp2;
    set tmp1;
    if fuzz(i-0.2)=0;
run;

or this way

data tmp2;
set tmp1;
i=round(i,0.001);
if i eq 0.2;
run;

--
Paige Miller

View solution in original post

PaigeMiller · Posted 10-11-2020 12:03 PM

This is a problem of machine precision, where digital computers cannot represent some numbers exactly, and so the value in the data set is the number you expect within plus or minus some very small value epsilon.

So, you can handle it this way:

data tmp2;
    set tmp1;
    if fuzz(i-0.2)=0;
run;

or this way

data tmp2;
set tmp1;
i=round(i,0.001);
if i eq 0.2;
run;

--
Paige Miller

vellad · Posted 10-11-2020 02:02 PM

@PaigeMiller is right. Looping around floating points is a headache in many other programming languages as well. An easy fix would be to loop around integers, maintaining the same logic:

data tmp1;
    do i=1 to 10 by 1; /*just multiple all numbers by 40 to keep it nice and simple*/
        i_=i/40; /*preserve value of i you originally wanted*/
        output;
    end;
run;

data tmp2;
    set tmp1;
    if i eq 0.2*40;
run;

mkeintz · Posted 10-11-2020 07:11 PM

@vellad wrote:

@PaigeMiller is right. Looping around floating points is a headache in many other programming languages as well. An easy fix would be to loop around integers, maintaining the same logic:
data tmp1;
    do i=1 to 10 by 1; /*just multiple all numbers by 40 to keep it nice and simple*/
        i_=i/40; /*preserve value of i you originally wanted*/
        output;
    end;
run;

data tmp2;
    set tmp1;
    if i eq 0.2*40;
run;

Just to gild the lily: Converting to integer works, but more generally you just need to convert such that both the starting value and the increment are representable as powers of 2 (including non-positive powers representing 1, 1/2, 1/4, 1/8, etc.).

So converting as below would work as well. Any further change to integer would be no more than floating the radix point - i.e. modifying the binary exponent. Of course, it's not necessarily any more intuitive.

data tmp1;
    do i=0.25 to 2.5 by 0.25; 
        i_=i/10; 
        output;
    end;
run;

data tmp2;
    set tmp1;
    if i eq 0.2*10;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Hong-tian · Posted 10-11-2020 08:32 PM

Thank you for your further explanations. That's very helpful, in particular to the understanding of this problem.

vellad · Posted 10-11-2020 08:45 PM

Its neat. For Neo, definitely more intuitive 😉

FreelanceReinh · Posted 10-12-2020 06:13 AM

Hi @vellad,

I like the approach of using an integer index variable and use it regularly in similar situations. However, I would not use a selection criterion like i eq 0.2*40 where a calculation involves an operand (0.2) which is known to contain numeric representation error (although it works in this particular example). I would be much more confident that a calculation involving only integer operands of moderate size (like i_=i/40 in your case) yields a result which can be safely used in comparisons (such as i_ eq 0.2). Of course, the safest option is to avoid decimal fractions altogether (as in i eq 8).

Warning example:

data test;
do i=1 to 99;
  x=i/100;
  a=(x=input(cats('0.',put(i,z2.)),8.)); /* testing e.g. x=0.07 */
  b=(i=x*100); /* testing e.g. i=0.07*100 */
  if ~a | ~b then output;
end;
run;

proc print data=test;
run;

Result (Windows SAS 9.4):

Obs     i      x     a    b

 1      7    0.07    1    0
 2     14    0.14    1    0
 3     28    0.28    1    0
 4     29    0.29    1    0
 5     55    0.55    1    0
 6     56    0.56    1    0
 7     57    0.57    1    0
 8     58    0.58    1    0

That is, e.g., the Boolean expression 7/100=0.07 & 0.07*100~=7 is true in SAS under Windows.

vellad · Posted 10-12-2020 01:00 PM

Hmm do you know why i and y don't match in below code? I feel I'm missing something fundamental here.

FreelanceReinh · Posted 10-12-2020 01:26 PM

My assertion was:

@FreelanceReinh wrote:

(...) the Boolean expression 7/100=0.07 & 0.07*100~=7 is true in SAS under Windows.

Your code first computes 7/100 and rounds the result to two decimals, which doesn't change anything (as mentioned in my previous post), so x=0.07, and then the result of x*100 -- a number close to, but not equal to 7 in SAS under Windows -- is stored in variable y. This, in turn, is compared to variable i, which contains the exact value 7. The IF condition is met and the OUTPUT statement produces one observation in dataset test, thus confirming my assertion. The reason for the inequality is a rounding error occurring in the multiplication of 0.07 -- which has no exact internal binary representation (i.e., variable x already contains a rounding error, called numeric representation error) -- with 100.

Edit:

The decimal fraction 0.07, converted to the binary system (and with some line breaks inserted), looks like this:

0.0001000111101011100001
    01000111101011100001
    01000111101011100001
    ...

The 20-binary-digit pattern "01000111101011100001" is repeated infinitely many times ("periodic fraction").

The internal 64-bit floating-point representation that SAS under Windows uses to store 0.07 in an 8-byte numeric variable is based on the above binary representation. However, only 52 of the 64 bits (rather than infinitely many) are available for the binary digits following the first "1". The remaining binary digits are rounded (in this particular example: rounded up), as can be seen with the BINARY64. format:

data _null_;
x=0.07;
put x binary64.;
run;

Result:

0011111110110001111010111000010100011110101110000101000111101100

Thanks to the periodicity, we know that the last four bits, 1100, resulted from rounding 1011100001... to 1100000000... Apart from this, the 52 "mantissa bits" (in bold face above) are copied from the periodic binary representation and we recognize the repeating 20-binary-digit pattern (highlighted in blue).

Unlike most decimal fractions, integers (of moderate size) have exact 64-bit floating-point binary representations, ending in many zeros without rounding. Let's have a look at these representations of 100 and 7:

data _null_;
x=100; y=7;
put (x y)(=binary64./);
run;

Result:

x=0100000001011001000000000000000000000000000000000000000000000000
y=0100000000011100000000000000000000000000000000000000000000000000

The 52 mantissa bits reflect the digits after the first "1" of the binary representations of 100 and 7, which are 1100100 and 111, respectively, padded with trailing zeros.

Now, the multiplication 0.07*100 (using the binary floating-point representations) amounts to multiplying the two numbers below:

1.0001111010111000010100011110101110000101000111101100 * 2**-4
1.100100 * 2**6

It's not difficult to do this multiplication by hand. Up to the factor 2**-4 * 2**6 = 2**2 = 4, the result can be obtained as the sum (of shifted binary numbers)

  1.0001111010111000010100011110101110000101000111101100
+ 0.10001111010111000010100011110101110000101000111101100
+ 0.00010001111010111000010100011110101110000101000111101100
  1.11000000000000000000000000000000000000000000000000001100

The latter number has a mantissa of 56 bits, hence it must be rounded to 52 bits in order to fit into an 8-byte numeric variable. As a consequence, the sequence of the least significant bits, ...01100 is rounded up to ...1(0000). So, for SAS under Windows the result of the multiplication is

1.1100000000000000000000000000000000000000000000000001 * 2**2

written in floating-point representation:

0100000000011100000000000000000000000000000000000000000000000001

Indeed, this matches the result obtained with the BINARY64. format:

data _null_;
y=0.07*100;
put y binary64.;
run;

So, the difference between the result stored in variable y and the exact value 7 (see internal representation further above) is 1 * 2**-50 = 8.881784...E-16, as is confirmed by SAS:

455  data _null_;
456  d=0.07*100-7;
457  put d;
458  run;

8.881784E-16

vellad · Posted 10-12-2020 02:19 PM

I can use (apart from a simple ROUND on 0.07*100 ) the TRUNC function to limit the size of 0.07*100 to <= 7 bytes and the result remains 7. But if I use 8 bytes (default), there's the classic precision error. I'm still trying to wrap my head around this because TRUNC(0.7*10 ) and TRUNC(0.007*1000 ) both yield exactly 7. Its just TRUNC(0.07*100 ) that has an issue.

FreelanceReinh · Posted 10-12-2020 04:29 PM

I've just extended my previous post to demonstrate how the result of 0.07*100 (deviating from 7) comes about. If you perform similar DATA steps and hand calculations for 0.7*10 and 0.007*1000, you'll see why there are no issues with these two cases. (I've checked only the first case, which is relatively easy.) Note that the similarities between the three pairs of decimal factors considered here does not translate to similarities in the binary system, i.e., the patterns of 0s and 1s are very different. For example, the lengths of the periods in the periodic binary representations of 0.7 and 0.007 are 4 and 100, respectively, not 20 as in the case of 0.07.

Using the TRUNC function with a second argument 7 (or 6, 5, ...) you truncate the 8 (or 16, 24, ...) least significant bits of the internal representation. In the example of 0.07*100 these bits, of course, include the rounding error 1 * 2**-50 (the "red 1" in the 52nd mantissa bit), so you get rid of it. If the results do not contain rounding errors in the internal representation (which happens to be true for 0.7*10 and 0.007*1000), the TRUNC function doesn't change that. This explains your observations.

Hong-tian · Posted 10-14-2020 09:06 PM

Thank you for your step-by-step illustration. It's really helpful. I had never thought it in this way.

vellad · Posted 10-13-2020 01:35 AM

Wow, thanks so much for the detailed response! I got it now and it is indeed not that counter intuitive how 0.07 is different from 0.7 and 0.007, once you go down to the bit level.

Hong-tian · Posted 10-11-2020 08:27 PM

Really helpful solutions! Thank you!

Hong-tian · Posted 10-11-2020 08:28 PM

I agree and will try to use intergers in do loops in my future works. Thank you!

I meet with a problem of data values in do loops

Re: I meet with a problem of data values in do loops

Re: I meet with a problem of data values in do loops

Re: I meet with a problem of data values in do loops

Re: I meet with a problem of data values in do loops

Re: I meet with a problem of data values in do loops

Re: I meet with a problem of data values in do loops

Re: I meet with a problem of data values in do loops

Re: I meet with a problem of data values in do loops

Re: I meet with a problem of data values in do loops

Re: I meet with a problem of data values in do loops

Re: I meet with a problem of data values in do loops

Re: I meet with a problem of data values in do loops

Re: I meet with a problem of data values in do loops

Re: I meet with a problem of data values in do loops

Re: I meet with a problem of data values in do loops

Classroom Training Available!