- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi ALL,
I have a proc sql step which looks like below.
proc sql;
create table test as
select compress(key) as key, total_amt , a_cd, b_cd
from table1
except
select compress(key) as key, total_amt , a_cd, b_cd
from table2;
quit;
I have two tables table1 and table2. I am getting weird result.
table1
key total_amt a_cd b_cd
9528553591287 209.20
9707589975114 0.00
table2
key total_amt a_cd b_cd
9528553591287 209.20
9707589975114 0.00
test
key total_amt a_cd b_cd
9528553591287 209.20
I believe I should have got 0 records with the above query but I am getting some rows although they have same values in table 1 and 2. The formats/lengths of the columns are different in two tables but I don't believe that is the cause of the issue.
table1:
total_amt format: 20.2
informat 20.2
numeric
length 8
a_cd format: $1
informat $1
b_cd format: $4
informat $4
table2:
total_amt format: 21.2
informat 21.2
numeric
length 8
a_cd format: $1
informat $1
b_cd format: $10
informat $10
Any idea why ,although the data values are same in both tables, I am getting results with the except operator?
Thanks in advance
Nick
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi @nickspencer,
I support @ballardw's ROUND function approach. Note that there are 35 (!) different numeric values (i.e. internal binary representations) whose formatted value even in BEST32. format looks like 209.2. Deviations from the standard representation (which in this example is 406A266666666666 in HEX16. format) can occur easily when calculations are performed with non-integer values or data from different platforms or databases are imported.
Here are four examples of the 35 values mentioned above:
data numrep;
x0=209.20;
x1=208.80+0.40;
x2=337.28-128.08;
x3=273.22-64.02;
put (x:) (=hex16./);
run;
Result (under Windows):
x0=406A266666666666 x1=406A266666666667 x2=406A266666666665 x3=406A266666666668
Needless to say, all four numbers should be exactly 209.20 from a mathematical point of view. But the tiny rounding errors due to numeric representation issues make them different for PROC SQL's EXCEPT operator (and any other SAS operator relying on exact internal equality, e.g., the condition 208.8+0.4 ne 209.2 is true!). The fact that common formats like 20.12 or BEST32. do not reveal the relevant decimal digits adds to the confusion. Use the HEX16. format to see the true internal values of your numeric variables and how they differ.
The ROUND function could also be used to clean the data so that numeric values have their expected standard internal representation. This would reduce the risk of getting such "weird results" in the future.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I tested your code and it works as expected with zero records in results
data one;
infile cards missover;
input key : $30. total_amt a_cd b_cd;
cards;
9528553591287 209.20
9707589975114 0.00
;
data two;
infile cards missover;
input key : $30. total_amt a_cd b_cd;
cards;
9528553591287 209.20
9707589975114 0.00
;
proc sql;
create table test as
select compress(key) as key, total_amt , a_cd, b_cd
from one
except
select compress(key) as key, total_amt , a_cd, b_cd
from two;
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Print both of your table Total_amt variables with a wider format like 20.12.
The display formats, both set to 2 decimal places could well be rounding off values that appear the same but have actual decimal differences.
Instead of relying on the format you might consider:
proc sql; create table test as select compress(key) as key, Round(total_amt,.01) as total_amt , a_cd, b_cd from table1 except select compress(key) as key, Round(total_amt,.01) as total_amt, a_cd, b_cd from table2; quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi @nickspencer,
I support @ballardw's ROUND function approach. Note that there are 35 (!) different numeric values (i.e. internal binary representations) whose formatted value even in BEST32. format looks like 209.2. Deviations from the standard representation (which in this example is 406A266666666666 in HEX16. format) can occur easily when calculations are performed with non-integer values or data from different platforms or databases are imported.
Here are four examples of the 35 values mentioned above:
data numrep;
x0=209.20;
x1=208.80+0.40;
x2=337.28-128.08;
x3=273.22-64.02;
put (x:) (=hex16./);
run;
Result (under Windows):
x0=406A266666666666 x1=406A266666666667 x2=406A266666666665 x3=406A266666666668
Needless to say, all four numbers should be exactly 209.20 from a mathematical point of view. But the tiny rounding errors due to numeric representation issues make them different for PROC SQL's EXCEPT operator (and any other SAS operator relying on exact internal equality, e.g., the condition 208.8+0.4 ne 209.2 is true!). The fact that common formats like 20.12 or BEST32. do not reveal the relevant decimal digits adds to the confusion. Use the HEX16. format to see the true internal values of your numeric variables and how they differ.
The ROUND function could also be used to clean the data so that numeric values have their expected standard internal representation. This would reduce the risk of getting such "weird results" in the future.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
When doing comparisons with numbers in SAS, using the round() function is almost imperative, as you might find that 200.40 is smaller or greater than 200.40 😉
There is no danger in using this function, especially not if you use a cutoff value according to the significant digits (in your case .01)
.01 will take care of the artifacts while keeping everything important intact.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
For numeric values which are potentially used in calculations (and hence are "truly" numeric, unlike numeric IDs etc.) I prefer rounding to conversion to character. When the goal is just "data cleaning," i.e., getting rid of artifacts originating in numeric representation issues, I often use small rounding units such as 1E-8 as the second argument of the ROUND function. If the data are known to have at most two decimal places, rounding unit .01 will have the same effect. The larger rounding unit (.01) is also preferable if very large values like 100 millions and beyond are to be rounded because numeric representation error grows with the order of (absolute) magnitude of the numeric values.
Example:
data _null_;
x=123456789.01;
y=123456789.02;
if round(x+0.01, 1e-7)=y then put 'OK';
d=x+0.01-y;
put d best12.;
run;
Result (under Windows):
OK 1.4901161E-8
With rounding unit 1E-8 the "OK" would not occur because the rounding error to be removed is greater than that (see variable d), but anything between 1E-7 and 1E-2 works.