BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Hong-tian
Fluorite | Level 6

Hi All, I generate a dataset named tmp1 using the following do loops, based on which I wanted to generate another dataset named tmp2. Although it is obvious that tmp2 would have one record, it did not.

Could anyone tell me what's wrong? Thank you!

 

data tmp1;

do i=0.025 to 0.250 by 0.025;
output;
end;
run;

 

data tmp2;
set tmp1;
if i eq 0.2;
run;

1 ACCEPTED SOLUTION

Accepted Solutions
PaigeMiller
Diamond | Level 26

This is a problem of machine precision, where digital computers cannot represent some numbers exactly, and so the value in the data set is the number you expect within plus or minus some very small value epsilon.

 

So, you can handle it this way:

 

data tmp2;
    set tmp1;
    if fuzz(i-0.2)=0;
run;

or this way

 

data tmp2;
set tmp1;
i=round(i,0.001);
if i eq 0.2;
run;

 

--
Paige Miller

View solution in original post

14 REPLIES 14
PaigeMiller
Diamond | Level 26

This is a problem of machine precision, where digital computers cannot represent some numbers exactly, and so the value in the data set is the number you expect within plus or minus some very small value epsilon.

 

So, you can handle it this way:

 

data tmp2;
    set tmp1;
    if fuzz(i-0.2)=0;
run;

or this way

 

data tmp2;
set tmp1;
i=round(i,0.001);
if i eq 0.2;
run;

 

--
Paige Miller
vellad
Obsidian | Level 7

@PaigeMiller is right. Looping around floating points is a headache in many other programming languages as well. An easy fix would be to loop around integers, maintaining the same logic:

data tmp1;
    do i=1 to 10 by 1; /*just multiple all numbers by 40 to keep it nice and simple*/
        i_=i/40; /*preserve value of i you originally wanted*/
        output;
    end;
run;

data tmp2;
    set tmp1;
    if i eq 0.2*40;
run;
mkeintz
PROC Star

@vellad wrote:

@PaigeMiller is right. Looping around floating points is a headache in many other programming languages as well. An easy fix would be to loop around integers, maintaining the same logic:

data tmp1;
    do i=1 to 10 by 1; /*just multiple all numbers by 40 to keep it nice and simple*/
        i_=i/40; /*preserve value of i you originally wanted*/
        output;
    end;
run;

data tmp2;
    set tmp1;
    if i eq 0.2*40;
run;

Just to gild the lily: Converting to integer works, but more generally you just need to convert such that both the starting value and the increment are representable as powers of 2 (including non-positive powers representing 1, 1/2, 1/4, 1/8, etc.).

 

So converting as below would work as well.  Any further change to integer would be no more than floating the radix point - i.e. modifying the binary exponent.  Of course, it's not necessarily any more intuitive.

data tmp1;
    do i=0.25 to 2.5 by 0.25; 
        i_=i/10; 
        output;
    end;
run;

data tmp2;
    set tmp1;
    if i eq 0.2*10;
run;

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Hong-tian
Fluorite | Level 6

Thank you for your further explanations. That's very helpful, in particular to the understanding of this problem.

vellad
Obsidian | Level 7
Its neat. For Neo, definitely more intuitive 😉
FreelanceReinh
Jade | Level 19

Hi @vellad,

 

I like the approach of using an integer index variable and use it regularly in similar situations. However, I would not use a selection criterion like i eq 0.2*40 where a calculation involves an operand (0.2) which is known to contain numeric representation error (although it works in this particular example). I would be much more confident that a calculation involving only integer operands of moderate size (like i_=i/40 in your case) yields a result which can be safely used in comparisons (such as i_ eq 0.2). Of course, the safest option is to avoid decimal fractions altogether (as in i eq 8).

 

Warning example:

data test;
do i=1 to 99;
  x=i/100;
  a=(x=input(cats('0.',put(i,z2.)),8.)); /* testing e.g. x=0.07 */
  b=(i=x*100); /* testing e.g. i=0.07*100 */
  if ~a | ~b then output;
end;
run;

proc print data=test;
run;

Result (Windows SAS 9.4):

Obs     i      x     a    b

 1      7    0.07    1    0
 2     14    0.14    1    0
 3     28    0.28    1    0
 4     29    0.29    1    0
 5     55    0.55    1    0
 6     56    0.56    1    0
 7     57    0.57    1    0
 8     58    0.58    1    0

That is, e.g., the Boolean expression 7/100=0.07 & 0.07*100~=7 is true in SAS under Windows.

vellad
Obsidian | Level 7

Hmm do you know why i and y don't match in below code? I feel I'm missing something fundamental here.

vellad_0-1602521411559.png

 

FreelanceReinh
Jade | Level 19

My assertion was:


@FreelanceReinh wrote:

(...) the Boolean expression 7/100=0.07 & 0.07*100~=7 is true in SAS under Windows.


Your code first computes 7/100 and rounds the result to two decimals, which doesn't change anything (as mentioned in my previous post), so x=0.07, and then the result of x*100 -- a number close to, but not equal to 7 in SAS under Windows -- is stored in variable y. This, in turn, is compared to variable i, which contains the exact value 7. The IF condition is met and the OUTPUT statement produces one observation in dataset test, thus confirming my assertion. The reason for the inequality is a rounding error occurring in the multiplication of 0.07 -- which has no exact internal binary representation (i.e., variable x already contains a rounding error, called numeric representation error) -- with 100.

 

Edit:

The decimal fraction 0.07, converted to the binary system (and with some line breaks inserted), looks like this:

0.0001000111101011100001
    01000111101011100001
    01000111101011100001
    ...

The 20-binary-digit pattern "01000111101011100001" is repeated infinitely many times ("periodic fraction").

 

The internal 64-bit floating-point representation that SAS under Windows uses to store 0.07 in an 8-byte numeric variable is based on the above binary representation. However, only 52 of the 64 bits (rather than infinitely many) are available for the binary digits following the first "1". The remaining binary digits are rounded (in this particular example: rounded up), as can be seen with the BINARY64. format:

data _null_;
x=0.07;
put x binary64.;
run;

Result:

0011111110110001111010111000010100011110101110000101000111101100

Thanks to the periodicity, we know that the last four bits, 1100, resulted from rounding 1011100001... to 1100000000... Apart from this, the 52 "mantissa bits" (in bold face above) are copied from the periodic binary representation and we recognize the repeating 20-binary-digit pattern (highlighted in blue).

 

Unlike most decimal fractions, integers (of moderate size) have exact 64-bit floating-point binary representations, ending in many zeros without rounding. Let's have a look at these representations of 100 and 7:

data _null_;
x=100; y=7;
put (x y)(=binary64./);
run;

Result:

x=0100000001011001000000000000000000000000000000000000000000000000
y=0100000000011100000000000000000000000000000000000000000000000000

The 52 mantissa bits reflect the digits after the first "1" of the binary representations of 100 and 7, which are 1100100 and 111, respectively, padded with trailing zeros.

 

Now, the multiplication 0.07*100 (using the binary floating-point representations) amounts to multiplying the two numbers below:

1.0001111010111000010100011110101110000101000111101100 * 2**-4
1.100100 * 2**6

It's not difficult to do this multiplication by hand. Up to the factor 2**-4 * 2**6 = 2**2 = 4, the result can be obtained as the sum (of shifted binary numbers)

  1.0001111010111000010100011110101110000101000111101100
+ 0.10001111010111000010100011110101110000101000111101100
+ 0.00010001111010111000010100011110101110000101000111101100
  1.11000000000000000000000000000000000000000000000000001100

The latter number has a mantissa of 56 bits, hence it must be rounded to 52 bits in order to fit into an 8-byte numeric variable. As a consequence, the sequence of the least significant bits, ...01100 is rounded up to ...1(0000). So, for SAS under Windows the result of the multiplication is

1.1100000000000000000000000000000000000000000000000001 * 2**2

written in floating-point representation:

0100000000011100000000000000000000000000000000000000000000000001

Indeed, this matches the result obtained with the BINARY64. format:

data _null_;
y=0.07*100;
put y binary64.;
run;

So, the difference between the result stored in variable y and the exact value 7 (see internal representation further above) is 1 * 2**-50 = 8.881784...E-16, as is confirmed by SAS:

455  data _null_;
456  d=0.07*100-7;
457  put d;
458  run;

8.881784E-16
vellad
Obsidian | Level 7

I can use (apart from a simple ROUND on 0.07*100 ) the TRUNC function to limit the size of 0.07*100 to <= 7 bytes and the result remains 7. But if I use 8 bytes (default), there's the classic precision error. I'm still trying to wrap my head around this because TRUNC(0.7*10 ) and TRUNC(0.007*1000 ) both yield exactly 7. Its just TRUNC(0.07*100 ) that has an issue.

FreelanceReinh
Jade | Level 19

I've just extended my previous post to demonstrate how the result of 0.07*100 (deviating from 7) comes about. If you perform similar DATA steps and hand calculations for 0.7*10 and 0.007*1000, you'll see why there are no issues with these two cases. (I've checked only the first case, which is relatively easy.) Note that the similarities between the three pairs of decimal factors considered here does not translate to similarities in the binary system, i.e., the patterns of 0s and 1s are very different. For example, the lengths of the periods in the periodic binary representations of 0.7 and 0.007 are 4 and 100, respectively, not 20 as in the case of 0.07.

 

Using the TRUNC function with a second argument 7 (or 6, 5, ...) you truncate the 8 (or 16, 24, ...) least significant bits of the internal representation. In the example of 0.07*100 these bits, of course, include the rounding error 1 * 2**-50 (the "red 1" in the 52nd mantissa bit), so you get rid of it. If the results do not contain rounding errors in the internal representation (which happens to be true for 0.7*10 and 0.007*1000), the TRUNC function doesn't change that. This explains your observations.

Hong-tian
Fluorite | Level 6
Thank you for your step-by-step illustration. It's really helpful. I had never thought it in this way.
vellad
Obsidian | Level 7
Wow, thanks so much for the detailed response! I got it now and it is indeed not that counter intuitive how 0.07 is different from 0.7 and 0.007, once you go down to the bit level.
Hong-tian
Fluorite | Level 6
Really helpful solutions! Thank you!
Hong-tian
Fluorite | Level 6

I agree and will try to use intergers in do loops in my future works. Thank you!

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 14 replies
  • 1275 views
  • 6 likes
  • 5 in conversation