Solved: Re: Issue with except operator in proc sql

nickspencer · Posted 11-18-2019 05:48 PM

Hi ALL,

I have a proc sql step which looks like below.

proc sql;
create table test as
select compress(key) as key, total_amt , a_cd, b_cd
from table1
except
select compress(key) as key, total_amt , a_cd, b_cd
from table2;
quit;

I have two tables table1 and table2. I am getting weird result.

table1
key                       total_amt    a_cd   b_cd
9528553591287      209.20
9707589975114         0.00

table2
key                      total_amt    a_cd   b_cd
9528553591287     209.20
9707589975114         0.00

test
key total_amt a_cd b_cd
9528553591287 209.20

I believe I should have got 0 records with the above query but I am getting some rows although they have same values in table 1 and 2. The formats/lengths of the columns are different in two tables but I don't believe that is the cause of the issue.

table1:

total_amt format: 20.2

informat 20.2

numeric

length 8

a_cd format: $1

informat $1

b_cd format: $4

informat $4

table2:

total_amt format: 21.2

informat 21.2

numeric

length 8

a_cd format: $1

informat $1

b_cd format: $10

informat $10

Any idea why ,although the data values are same in both tables, I am getting results with the except operator?

Thanks in advance

Nick

FreelanceReinh · Posted 11-18-2019 07:58 PM

Hi @nickspencer,

I support @ballardw's ROUND function approach. Note that there are 35 (!) different numeric values (i.e. internal binary representations) whose formatted value even in BEST32. format looks like 209.2. Deviations from the standard representation (which in this example is 406A266666666666 in HEX16. format) can occur easily when calculations are performed with non-integer values or data from different platforms or databases are imported.

Here are four examples of the 35 values mentioned above:

data numrep;
x0=209.20;
x1=208.80+0.40;
x2=337.28-128.08;
x3=273.22-64.02;
put (x:) (=hex16./);
run;

Result (under Windows):

x0=406A266666666666
x1=406A266666666667
x2=406A266666666665
x3=406A266666666668

Needless to say, all four numbers should be exactly 209.20 from a mathematical point of view. But the tiny rounding errors due to numeric representation issues make them different for PROC SQL's EXCEPT operator (and any other SAS operator relying on exact internal equality, e.g., the condition 208.8+0.4 ne 209.2 is true!). The fact that common formats like 20.12 or BEST32. do not reveal the relevant decimal digits adds to the confusion. Use the HEX16. format to see the true internal values of your numeric variables and how they differ.

The ROUND function could also be used to clean the data so that numeric values have their expected standard internal representation. This would reduce the risk of getting such "weird results" in the future.

View solution in original post

novinosrin · Posted 11-18-2019 05:55 PM

I tested your code and it works as expected with zero records in results


data one;
infile cards missover;
input key : $30.                       total_amt    a_cd   b_cd;
cards;
9528553591287      209.20
9707589975114         0.00  
;

data two;
infile cards missover;
input key : $30.                       total_amt    a_cd   b_cd;
cards;
9528553591287     209.20
9707589975114         0.00  
;
proc sql;
create table test as
select compress(key) as key, total_amt , a_cd, b_cd
  from one
except
select compress(key) as key, total_amt , a_cd, b_cd
  from two;
quit;

ballardw · Posted 11-18-2019 05:57 PM

Print both of your table Total_amt variables with a wider format like 20.12.

The display formats, both set to 2 decimal places could well be rounding off values that appear the same but have actual decimal differences.

Instead of relying on the format you might consider:

proc sql;
   create table test as
   select compress(key) as key, Round(total_amt,.01) as total_amt , a_cd, b_cd
     from table1
   except
   select compress(key) as key, Round(total_amt,.01) as total_amt, a_cd, b_cd
     from table2;
quit;

FreelanceReinh · Posted 11-18-2019 07:58 PM

Hi @nickspencer,

I support @ballardw's ROUND function approach. Note that there are 35 (!) different numeric values (i.e. internal binary representations) whose formatted value even in BEST32. format looks like 209.2. Deviations from the standard representation (which in this example is 406A266666666666 in HEX16. format) can occur easily when calculations are performed with non-integer values or data from different platforms or databases are imported.

Here are four examples of the 35 values mentioned above:

data numrep;
x0=209.20;
x1=208.80+0.40;
x2=337.28-128.08;
x3=273.22-64.02;
put (x:) (=hex16./);
run;

Result (under Windows):

x0=406A266666666666
x1=406A266666666667
x2=406A266666666665
x3=406A266666666668

Needless to say, all four numbers should be exactly 209.20 from a mathematical point of view. But the tiny rounding errors due to numeric representation issues make them different for PROC SQL's EXCEPT operator (and any other SAS operator relying on exact internal equality, e.g., the condition 208.8+0.4 ne 209.2 is true!). The fact that common formats like 20.12 or BEST32. do not reveal the relevant decimal digits adds to the confusion. Use the HEX16. format to see the true internal values of your numeric variables and how they differ.

The ROUND function could also be used to clean the data so that numeric values have their expected standard internal representation. This would reduce the risk of getting such "weird results" in the future.

nickspencer · Posted 11-18-2019 08:39 PM

@FreelanceReinhard & @ballardw This makes sense. One of the value is coming from a teradata table and has decimal(18,2) format. I like your idea of rounding off. Does it possess any risk with except operator and comparing with existing dataset value? Or converting it to character using cats() function while using the except operator should work ? Which solution will be more efficient ?

Kurt_Bremser · Posted 11-19-2019 02:00 AM

When doing comparisons with numbers in SAS, using the round() function is almost imperative, as you might find that 200.40 is smaller or greater than 200.40 😉

There is no danger in using this function, especially not if you use a cutoff value according to the significant digits (in your case .01)

.01 will take care of the artifacts while keeping everything important intact.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

FreelanceReinh · Posted 11-19-2019 06:04 AM

For numeric values which are potentially used in calculations (and hence are "truly" numeric, unlike numeric IDs etc.) I prefer rounding to conversion to character. When the goal is just "data cleaning," i.e., getting rid of artifacts originating in numeric representation issues, I often use small rounding units such as 1E-8 as the second argument of the ROUND function. If the data are known to have at most two decimal places, rounding unit .01 will have the same effect. The larger rounding unit (.01) is also preferable if very large values like 100 millions and beyond are to be rounded because numeric representation error grows with the order of (absolute) magnitude of the numeric values.

Example:

data _null_;
x=123456789.01;
y=123456789.02;
if round(x+0.01, 1e-7)=y then put 'OK';
d=x+0.01-y;
put d best12.;
run;

Result (under Windows):

OK
1.4901161E-8

With rounding unit 1E-8 the "OK" would not occur because the rounding error to be removed is greater than that (see variable d), but anything between 1E-7 and 1E-2 works.

Classroom Training Available!