Solved: Re: value precision issue in data extracted from snowflake

leehsin · Posted 02-15-2022 11:21 AM

Hi, I had two data sets and encountered an issue that the value in a data set extracted from Snowflake are slightly different to the original value stored in SAS data set. See the example:

DATA_SF DATA_Ori
COST_AMT COST_AMT Diff. %Diff
44.48 44.48 7.11E-15 1.60E-14
5.1 5.1 8.88E-16 1.74E-14

My questions: What could be the reason for this different? Suggestion?

Thank you very much!

FreelanceReinh · Posted 02-15-2022 03:21 PM

Hi @leehsin,

I would make sure that the values are correct if they are processed later on. Otherwise these small rounding errors (7.11E-15 etc.) can cause unwanted issues in all sorts of comparisons, including IF or WHERE conditions, assignment of user-defined format categories, merging and sorting. Similar errors can arise from calculations, even within SAS.

Example:

data data_ori;
input cost_amt;
cards;
44.48
5.1
;

data data_sf;
set data_ori;
cost_amt=cost_amt*449/449;
run;

proc compare data=data_sf c=data_ori;
run;

Result (using Windows SAS 9.4M5):

           ||       Base    Compare
       Obs ||   cost_amt   cost_amt      Diff.     % Diff
 ________  ||  _________  _________  _________  _________
           ||
        1  ||    44.4800    44.4800  7.105E-15  1.597E-14
        2  ||     5.1000     5.1000  8.882E-16  1.742E-14

In both cases the calculation (seemingly a multiplication by 1) has reduced the result by one unit of the least significant bit in the internal binary floating-point representation, namely 2**-47=7.105...E-15 in the case of 44.48 and 2**-50=8.8817...E-16 for 5.1. You can notice the errors in the HEX16. format:

proc print data=data_sf;
format cost_amt hex16.;
run;

Result:

Obs            cost_amt

 1     40463D70A3D70A3C
 2     4014666666666665

Even without seeing the corresponding representations of the original values (in data_ori) it's fairly obvious that the correct last hex digits should be D and 6, respectively (in view of the repeating digit patterns). So, look at a few of your values in HEX16. format to find out where the errors have occurred.

Besides calculations, "moving [data] across application and/or architecture borders" is a common way of introducing this type of rounding (or numeric representation) errors, as mentioned by Kurt_Bremser. I would follow ballardw's advice and correct the errors (in SAS) by applying the ROUND function with an appropriate rounding unit (which depends a bit on your data).

Example:

data want;
set data_sf;
cost_amt=round(cost_amt,1e-7);
run;

View solution in original post

Reeza · Posted 02-15-2022 11:29 AM

Numeric precision - basically computers aren't perfect and can't track decimals exactly after a certain point.

https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/lepg/p0dv87zb3bnse6n1mqo360be70qr.htm

Is a difference of 0.0000000000000711 of significance to you?

@leehsin wrote:

Hi, I had two data sets and encountered an issue that the value in a data set extracted from Snowflake are slightly different to the original value stored in SAS data set. See the example:

DATA_SF DATA_Ori
COST_AMT COST_AMT Diff. %Diff
44.48 44.48 7.11E-15 1.60E-14
5.1 5.1 8.88E-16 1.74E-14

My questions: What could be the reason for this different? Suggestion?

Thank you very much!

leehsin · Posted 02-15-2022 11:41 AM

Thanks, Reeze!

Here any difference is my concern to find a solution, regardless the significance.

Kurt_Bremser · Posted 02-15-2022 12:00 PM

You will often find such differences when moving across application and/or architecture borders. We found differences between SAS on the mainframe and SAS on UNIX/Windows simply because the MF uses more bits in the mantissa.

So, once you verified how the differences originate and that they are statistically insignificant (which you have done), live with them.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Kurt_Bremser · Posted 02-15-2022 11:31 AM

SAS stores all numbers in 64 bits (8 bytes) floating point. This leads to precision limitations (see the linked article), and if your other data source uses another storage format, there will be differences.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

leehsin · Posted 02-15-2022 11:59 AM

Thanks, Kurt!

ballardw · Posted 02-15-2022 11:31 AM

E-15 range of differences likely points to precision of different machines as very likely issue.

If you are actually only using two decimal places then round or truncate the data to the required precision you use.

leehsin · Posted 02-15-2022 11:57 AM

Thanks Ballardw!

The values start with 44.48, 5.1 in a SAS dataset. The dataset was uploaded to Snowflake. Then a request from a user asks to extract data from the Snowflake for another task, and the values change in the new dataset. Though it is acceptable in this case, will this kind of change accumulates by time and becomes an issue for a small value? I am seeking a way that I can get the unchanged value in the extraction data, if this way technically exists.

leehsin · Posted 02-15-2022 11:57 AM

Thanks Ballardw !

The values start with 44.48, 5.1 in a SAS dataset. The dataset was uploaded to Snowflake. Then a request from a user asks to extract data from the Snowflake for another task, and the values change in the new dataset. Though it is acceptable in this case, will this kind of change accumulates by time and becomes an issue for a small value? I am seeking a way that I can get the unchanged value in the extraction data, if this way technically exists.

Reeza · Posted 02-15-2022 12:00 PM

There is no change in the data. The small error is negligible and something you'll find with all computers. If you check the snowflake data against the original data source you'll likely find the same differences. Computers typically store data with a set precision of up to 16 decimal points and you're differences are after that point. What is the precision requirement for your data?

FreelanceReinh · Posted 02-15-2022 03:21 PM

Hi @leehsin,

I would make sure that the values are correct if they are processed later on. Otherwise these small rounding errors (7.11E-15 etc.) can cause unwanted issues in all sorts of comparisons, including IF or WHERE conditions, assignment of user-defined format categories, merging and sorting. Similar errors can arise from calculations, even within SAS.

Example:

data data_ori;
input cost_amt;
cards;
44.48
5.1
;

data data_sf;
set data_ori;
cost_amt=cost_amt*449/449;
run;

proc compare data=data_sf c=data_ori;
run;

Result (using Windows SAS 9.4M5):

           ||       Base    Compare
       Obs ||   cost_amt   cost_amt      Diff.     % Diff
 ________  ||  _________  _________  _________  _________
           ||
        1  ||    44.4800    44.4800  7.105E-15  1.597E-14
        2  ||     5.1000     5.1000  8.882E-16  1.742E-14

In both cases the calculation (seemingly a multiplication by 1) has reduced the result by one unit of the least significant bit in the internal binary floating-point representation, namely 2**-47=7.105...E-15 in the case of 44.48 and 2**-50=8.8817...E-16 for 5.1. You can notice the errors in the HEX16. format:

proc print data=data_sf;
format cost_amt hex16.;
run;

Result:

Obs            cost_amt

 1     40463D70A3D70A3C
 2     4014666666666665

Even without seeing the corresponding representations of the original values (in data_ori) it's fairly obvious that the correct last hex digits should be D and 6, respectively (in view of the repeating digit patterns). So, look at a few of your values in HEX16. format to find out where the errors have occurred.

Besides calculations, "moving [data] across application and/or architecture borders" is a common way of introducing this type of rounding (or numeric representation) errors, as mentioned by Kurt_Bremser. I would follow ballardw's advice and correct the errors (in SAS) by applying the ROUND function with an appropriate rounding unit (which depends a bit on your data).

Example:

data want;
set data_sf;
cost_amt=round(cost_amt,1e-7);
run;

leehsin · Posted 02-23-2022 12:09 PM

Thank you, FreelanceReinhard! Great explanation!
Actually I did use round function to the concerned variables in both data when making comparisons, but this is the aftermath solution when the issues appear. If I could prevent this kind of issue from happening at the beginning when I extract a data from Snowflake, I would be happy not going back to repeat the process by applying the solution because this issue is not a real issue actually but may give you a surprise for having a overwhelming long list of data inconsistency, and repeating a process sometime is a very time consuming step in a project having many steps.

Ksharp · Posted 02-16-2022 07:38 AM

Try options METHOD= and FUZZ= :

data data_ori;
input cost_amt;
cards;
44.48
5.1
;

data data_sf;
set data_ori;
cost_amt=cost_amt*449/449;
run;

proc compare data=data_sf c=data_ori method=relative fuzz=1E-10;
run;

Registration is open

SAS Training: Just a Click Away