Hello,
When I do two times the same left join including a sum, I get slightly different output datasets. Differences of smnt of max E-8 for numbers in thousands.
proc sql;
create table tab3_time1 as
select
t1.num_dman_finm,
sum(t2.mnt) as smnt
from local.tab1 as t1
left join local.tab2 as t2
on t1.num_dman_finm = t2.num_dman_finm
and t1.num_pers_dfe = t2.num_pers_dfe
group by t1.num_dman_finm
order by t1.num_dman_finm
;
quit;
proc sql;
create table tab3_time2 as
select
t1.num_dman_finm,
sum(t2.mnt) as smnt
from local.tab1 as t1
left join local.tab2 as t2
on t1.num_dman_finm = t2.num_dman_finm
and t1.num_pers_dfe = t2.num_pers_dfe
group by t1.num_dman_finm
order by t1.num_dman_finm
;
quit;
proc compare data=tab3_time1 compare=tab3_time2;
run;
The above proc sql are identical, except for the output name.
tab1 contains only num_dman_finm
and num_pers_dfe
. Without duplicates.
tab2 contains num_dman_finm
, num_pers_dfe
and mnt
. With key duplicates.
num_dman_finm
and num_pers_dfe
are integer keys. mnt
are moneys amounts, with 2 decimals.
Shouldn't the same request produce the same output?
There are ~5000 differences in the output datasets of ~900 000.
Working on version 9.04.
Unlike a data step, SQL does not have to process data in order.; it can also split a query and join the results at the end. Because the results may not be joined in the same order, the calculation (the sum here) performed is slightly different. And because computers have issues with decimal numerical precision, the order matters.
I'd expect the difference to be lower though, around 1e-15 rather than 1e-8. That's many orders of magnitude higher, and that's where my explanation falls shorter than I'd like.
This is odd. Are you saying that the results vary every time you run this query? In a random or systematic way?
Unlike a data step, SQL does not have to process data in order.; it can also split a query and join the results at the end. Because the results may not be joined in the same order, the calculation (the sum here) performed is slightly different. And because computers have issues with decimal numerical precision, the order matters.
I'd expect the difference to be lower though, around 1e-15 rather than 1e-8. That's many orders of magnitude higher, and that's where my explanation falls shorter than I'd like.
1e-8 is the maximum reported by proc compare. There were many differences around 1e-14. And there were many amounts in the 100k or greater, not just 1k. I'll verify at work tomorrow that the greatest differences are for the greatest amounts, and if so, mark your response as accepted answer.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.