Solved: Re: PROC SQL Greater Than (>) in WHERE clause returns wrong rows, i.e....

ikoyuncu · Posted 09-09-2021 09:00 PM

Neither of the following codes filters the rows properly. The discount column was initially formatted as a percentage, I changed it to Comma32.28 to see all decimals but there is nothing but zeros in the decimals. So, if I write WHERE Discount > 0.3 it shouldn't return columns where the discount value is 0.3, am I wrong?

What is wrong with this program? Why it keeps including equality when I specifically say > or GT. How is 0.300000000000000000000 greater than 0.3? Am I missing something?

Has anyone else encountered a similar problem? Is it because of lack of ORDER BY or something? Couldn't think of anything else...

FreelanceReinh · Posted 09-10-2021 07:50 AM

Hello @ikoyuncu,

It's true that SAS cannot store 0.3 exactly in its 64-bit binary floating-point representation -- indeed, in the binary system 0.3 is a periodic fraction: 0.0100110011001..., so the internal representation is rounded and under Windows it really corresponds to 0.299999999999999988897769753748434595763683319091796875 (= 0.3 - 2**-54 / 5 = 0.3 - 1.1102...E-17) -- but I wouldn't say that the numeric representation error (-1.11...E-17) of this particular number is the main reason for the rounding error in your variable Discount (whose value in the cases in question must differ from 0.3 by more than the unavoidable numeric representation error).

Virtually all numbers, including (what should be) integers, can be "contaminated" with similar rounding errors.

Examples (log from SAS 9.4M5 under Windows):

409  data _null_;
410  if 1.4-0.4 < 1 < 2.2-1.2
411   & 1.005*1000 < 1005 < 10.05*100
412   & 0.6/0.2 < 3 < 9.9/3.3
413  then put 'Surprised?';
414  run;

Surprised?
NOTE: DATA statement used (Total process time):

Calculations with numbers (like 0.3 and many others) affected by numeric representation error can easily produce rounding errors making the results different from what they should be mathematically, as shown in the examples above and below.

So the question is rather how those Discount values only looking like 0.3 or 0.3000... (in standard formats) were created. I suspect that they were calculated from amounts of money with their typical (up to) two decimal places. (Note that, in a sense, 96 percent of the numbers with up to 2 decimal places are affected by numeric representation error.)

Examples:

data test;
input x y;
Discount=x/y;
length c $12;
h=put(Discount, hex16.);
diff=Discount-0.3;
     if Discount<0.3 then c='Discount<0.3';
else if Discount=0.3 then c='Discount=0.3';
else if Discount>0.3 then c='Discount>0.3';
cards;
   2.7     9
 240.03  800.1
2100.99 7003.3
;

Result:

Obs          x         y    Discount         c                 h                diff

 1        2.70       9.0       0.3      Discount>0.3    3FD3333333333334    5.55112E-17
 2      240.03     800.1       0.3      Discount=0.3    3FD3333333333333              0
 3     2100.99    7003.3       0.3      Discount<0.3    3FD3333333333332    -5.5511E-17

Mathematically, all three Discount values should be 0.3, but only the result in obs. 2 happens to match the internal representation of 0.3. The other two results contain rounding errors in the least significant bit (place value: 2**-54 = 5.551...E-17) which make them different from 0.3 in the IF conditions. This is obvious in formats such as HEX16. revealing the internal representation, but often not in common formats like the standard w.d format or the COMMAw.d format because they are rounding. For instance, format 32.30 reveals a rounding error in obs. 3, but not in obs. 1, unless you look at the difference diff.

A definition of Discount using the ROUND function such as

Discount=round(x/y, 1e-9);

would have avoided the issues in the examples and in similar cases. Alternatively, the ROUND function can be applied in the comparison (as Tom has suggested) or in a preliminary "data cleaning" step, e.g., if the Discount values were imported from another database.

View solution in original post

Tom · Posted 09-10-2021 12:01 AM

SAS (and pretty much all computers) store numbers using BINARY representation. There is no way to represent 3/10ths exactly using binary numbers. The value that your query is saying is larger then 0.3 is larger than 0.3 by such a small amount that it does not display when printed as a decimal fraction.

Round the value and then test.

How many digits of precision do those numbers really have? Let's say it is to the thousandths place. So round to the thousandths place or the ten-thousands place and then compare.

WHERE round(Discount,0.0001) > 0.3

ikoyuncu · Posted 09-10-2021 01:53 PM

Thanks, this definitely works. I didn't create the Discount column. There has to be a formula involved when it is first created, the values look like 0.3 but they're not precise.

ChrisNZ · Posted 09-10-2021 02:54 AM

@Tom answered.

0.3 cannot be represented exactly as a power of 2, just like 1/3 cannot be represented in the decimal base. Look up numerical precision if you want to know more.

> How is 0.300000000000000000000 greater than 0.3?

You can add as many zeros as you want in the code, it makes no difference as the computer cannot use them when storing the decimal number in 8 bytes.

.

High-Performance SAS Coding - Third Edition

ikoyuncu · Posted 09-10-2021 01:21 PM

Thanks, I actually didn't add all those zeros to use. I only formatted the column to see if there are any decimals >0 that I am missing. As others suggested, a preliminary 'data cleaning' solved my problem.

ChrisHemedinger · Posted 09-10-2021 07:38 AM

For more about the nuances of floating point math and the apparent inequality of certain numbers that "look" the same, see Numerical precision in SAS. But as others said, this isn't unique to SAS -- it's an IEEE standard for how floating point numbers are represented in systems/programming languages in general.

Check out SAS Innovate on-demand content! Watch the main stage sessions, keynotes, and over 20 technical breakout sessions!

FreelanceReinh · Posted 09-10-2021 07:50 AM

Hello @ikoyuncu,

It's true that SAS cannot store 0.3 exactly in its 64-bit binary floating-point representation -- indeed, in the binary system 0.3 is a periodic fraction: 0.0100110011001..., so the internal representation is rounded and under Windows it really corresponds to 0.299999999999999988897769753748434595763683319091796875 (= 0.3 - 2**-54 / 5 = 0.3 - 1.1102...E-17) -- but I wouldn't say that the numeric representation error (-1.11...E-17) of this particular number is the main reason for the rounding error in your variable Discount (whose value in the cases in question must differ from 0.3 by more than the unavoidable numeric representation error).

Virtually all numbers, including (what should be) integers, can be "contaminated" with similar rounding errors.

Examples (log from SAS 9.4M5 under Windows):

409  data _null_;
410  if 1.4-0.4 < 1 < 2.2-1.2
411   & 1.005*1000 < 1005 < 10.05*100
412   & 0.6/0.2 < 3 < 9.9/3.3
413  then put 'Surprised?';
414  run;

Surprised?
NOTE: DATA statement used (Total process time):

Calculations with numbers (like 0.3 and many others) affected by numeric representation error can easily produce rounding errors making the results different from what they should be mathematically, as shown in the examples above and below.

So the question is rather how those Discount values only looking like 0.3 or 0.3000... (in standard formats) were created. I suspect that they were calculated from amounts of money with their typical (up to) two decimal places. (Note that, in a sense, 96 percent of the numbers with up to 2 decimal places are affected by numeric representation error.)

Examples:

data test;
input x y;
Discount=x/y;
length c $12;
h=put(Discount, hex16.);
diff=Discount-0.3;
     if Discount<0.3 then c='Discount<0.3';
else if Discount=0.3 then c='Discount=0.3';
else if Discount>0.3 then c='Discount>0.3';
cards;
   2.7     9
 240.03  800.1
2100.99 7003.3
;

Result:

Obs          x         y    Discount         c                 h                diff

 1        2.70       9.0       0.3      Discount>0.3    3FD3333333333334    5.55112E-17
 2      240.03     800.1       0.3      Discount=0.3    3FD3333333333333              0
 3     2100.99    7003.3       0.3      Discount<0.3    3FD3333333333332    -5.5511E-17

Mathematically, all three Discount values should be 0.3, but only the result in obs. 2 happens to match the internal representation of 0.3. The other two results contain rounding errors in the least significant bit (place value: 2**-54 = 5.551...E-17) which make them different from 0.3 in the IF conditions. This is obvious in formats such as HEX16. revealing the internal representation, but often not in common formats like the standard w.d format or the COMMAw.d format because they are rounding. For instance, format 32.30 reveals a rounding error in obs. 3, but not in obs. 1, unless you look at the difference diff.

A definition of Discount using the ROUND function such as

Discount=round(x/y, 1e-9);

would have avoided the issues in the examples and in similar cases. Alternatively, the ROUND function can be applied in the comparison (as Tom has suggested) or in a preliminary "data cleaning" step, e.g., if the Discount values were imported from another database.

PROC SQL Greater Than (>) in WHERE clause returns wrong rows, i.e., it includes equal values

Re: PROC SQL Greater Than (>) in WHERE clause returns wrong rows, i.e., it includes equal values

Re: PROC SQL Greater Than (>) in WHERE clause returns wrong rows, i.e., it includes equal values

Re: PROC SQL Greater Than (>) in WHERE clause returns wrong rows, i.e., it includes equal values

Re: PROC SQL Greater Than (>) in WHERE clause returns wrong rows, i.e., it includes equal values

Re: PROC SQL Greater Than (>) in WHERE clause returns wrong rows, i.e., it includes equal values

Re: PROC SQL Greater Than (>) in WHERE clause returns wrong rows, i.e., it includes equal values

Re: PROC SQL Greater Than (>) in WHERE clause returns wrong rows, i.e., it includes equal values

Classroom Training Available!