Solved: Re: Weird output when combining two data sets!

expertyejin · Posted 03-08-2016 11:00 PM

Hi all,

I'm testing a simple program by merging two datasets in SAS Studio. Here is my code:

data one(drop=i);
do i=1 to 100;
x=log10(i);
output;
end;
run;

data two;
input x y;
cards;
0 3
1 4
2 5
;
run;

data together;
merge one(in=o) two(in=t);
by x;
if o and t;
run;

proc print;
run;

If you run this code, you get the expected output, which is basically the CARDS/DATALINES section of data two. However, if I change the increment of the DO loop in data one from 1 to 0.1, i.e.

data one(drop=i);
do i=1 to 100 by 0.1;
x=log10(i);
output;
end;
run;

data two;
input x y;
cards;
0 3
1 4
2 5
;
run;

data together;
merge one(in=o) two(in=t);
by x;
if o and t;
run;

proc print;
run;

then I get an unexpected/unexplainable output which is only "0 3". Could anyone tell me why I get this strange output and how to fix this issue? Thanks a lot!

FreelanceReinh · Posted 03-09-2016 07:19 AM

Hi @expertyejin,

The primary issue with your code is the DO loop with step size 0.1. This is one of the "classic" examples of producing numeric representation issues and it is described in Numerical Accuracy in SAS Software.

I've written several detailed posts on this topic during the past couple of months. Here is a link to the search results for "numeric representation", restricted to my posts:
https://communities.sas.com/t5/forums/searchpage/tab/message?filter=labels%2CauthorId&q=%22numeric+r...

Among these, this posting describes in particular what's "wrong" with 0.1. (I've just seen that @Kurt_Bremser has posted a link to a similar explanation on an external website.)

As @Kurt_Bremser mentioned, not only 0.1, but many decimal fractions introduce numeric representation errors into calculations. Please see the bulleted list in this post for a quantification of this statement.

Moreover, for similar reasons it is risky to merge datasets using BY variables whose values are not necessarily integers (such as logarithms). In your example you rely (successfully) on mathematical facts such as log10(100)=2. However, there are many mathematical identities (e.g. 0.1+0.7=0.8 or 1.00E-5=1.000E-5) which are not true in SAS (platform dependent, though) due to numeric representation issues. Further surprising examples can be found in this post.

In many cases the ROUND function can be used to cope with these issues (e.g. round(0.1+0.7, 1e-9)=0.8).

A DO loop should use an integer increment. The decimal fractions can be calculated in the body of the loop:

data one(drop=i);
do i=10 to 1000;
  x=log10(i/10);
  output;
end;
run;

View solution in original post

Norman21 · Posted 03-09-2016 01:02 AM

This looks like a problem with the precision of the BY variables. LOG10(10) doesn't exactly equal 1.

You might like to try the ROUND function on the log transformed variable.

Norman.

Norman.
SAS 9.4 (TS1M6) X64_10PRO WIN 10.0.17763 Workstation

Kurt_Bremser · Posted 03-09-2016 04:11 AM

To verify your precision problem, just add

format x best32.;

to your first data step and take a look at observation 91 of dataset one.

As soon as you have fractions with numbers represented in real format, you get small errors when doing calculations, unless you use fractions that represent easily (and finitely) in binary.

IOW, adding 0.1 10 times is different from adding 1, as the binary representation of 0.1 (base 10) looks kind of funny, while 1 (base 10) is simply 1 (base 2).

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Kurt_Bremser · Posted 03-09-2016 04:19 AM

Even more revealing is this:

data one;
format x i best32.;
do i=1 to 100 by 0.1;
x=log10(i);
output;
end;
run;

Just look at the seemingly crazy development of i, especially after obs 747

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Norman21 · Posted 03-09-2016 04:34 AM

This is very interesting - and worrying!

Why doesn't SAS throw up an ERROR (or at least a NOTE) when i is not an integer?

Norman.

Norman.
SAS 9.4 (TS1M6) X64_10PRO WIN 10.0.17763 Workstation

Kurt_Bremser · Posted 03-09-2016 04:47 AM

Because SAS does not know "integer". A number in SAS is always stored in 8 byte real, and even if you enter a number that is considered an integer, it will be stored as mantissa an exponent. As soon as you need more precision than the mantissa can hold, you get rounding errors.

Decimal fractions need so many digits in the (binary) mantissa that you very quickly get those rounding errors.

And why should SAS expect integers when you already introduced fractions by adding

by 0.1;

?

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Norman21 · Posted 03-09-2016 04:51 AM

OK, but this doesn't explain why the error described in the OP only occurs in the first example.

Norman.

Norman.
SAS 9.4 (TS1M6) X64_10PRO WIN 10.0.17763 Workstation

Kurt_Bremser · Posted 03-09-2016 05:04 AM

@Norman21 wrote:

OK, but this doesn't explain why the error described in the OP only occurs in the first example.

Norman.

Look at this:

why-0-point-1-does-not-exist-in-floating-point/

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Kurt_Bremser · Posted 03-09-2016 04:48 AM

I have another beauty for you.

Why does this work as expected?

data one;
format x i best32.;
do i=1 to 100 by 0.25;
x=log10(i);
output;
end;
run;

data two;
input x y;
cards;
0 3
1 4
2 5
;
run;

data together;
merge one(in=o) two(in=t);
by x;
if o and t;
run;

proc print;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Norman21 · Posted 03-09-2016 04:53 AM

!?!

Yet another reason for SAS to provide a NOTE (at least) if the increment is not an "integer".

Norman.

Norman.
SAS 9.4 (TS1M6) X64_10PRO WIN 10.0.17763 Workstation

Kurt_Bremser · Posted 03-09-2016 05:01 AM

SAS assumes that you know what you are doing when you specifiy a non-integer increment. The "by 0.1" didn't appear there by accident.

I am very happy that SAS does not treat me like somebody who left her/his brain at home. 😉

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

FreelanceReinh · Posted 03-09-2016 07:19 AM

Hi @expertyejin,

The primary issue with your code is the DO loop with step size 0.1. This is one of the "classic" examples of producing numeric representation issues and it is described in Numerical Accuracy in SAS Software.

I've written several detailed posts on this topic during the past couple of months. Here is a link to the search results for "numeric representation", restricted to my posts:
https://communities.sas.com/t5/forums/searchpage/tab/message?filter=labels%2CauthorId&q=%22numeric+r...

Among these, this posting describes in particular what's "wrong" with 0.1. (I've just seen that @Kurt_Bremser has posted a link to a similar explanation on an external website.)

As @Kurt_Bremser mentioned, not only 0.1, but many decimal fractions introduce numeric representation errors into calculations. Please see the bulleted list in this post for a quantification of this statement.

Moreover, for similar reasons it is risky to merge datasets using BY variables whose values are not necessarily integers (such as logarithms). In your example you rely (successfully) on mathematical facts such as log10(100)=2. However, there are many mathematical identities (e.g. 0.1+0.7=0.8 or 1.00E-5=1.000E-5) which are not true in SAS (platform dependent, though) due to numeric representation issues. Further surprising examples can be found in this post.

In many cases the ROUND function can be used to cope with these issues (e.g. round(0.1+0.7, 1e-9)=0.8).

A DO loop should use an integer increment. The decimal fractions can be calculated in the body of the loop:

data one(drop=i);
do i=10 to 1000;
  x=log10(i/10);
  output;
end;
run;

expertyejin · Posted 03-10-2016 03:44 PM

Thank you all for your great help and advice!! And I've considered myself an advanced learner when I'm even stumbled on such a basic concept (sigh...)

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away