08-08-2014 01:52 PM
Joining on floating point numbers is a bad idea, yet SAS does not offer an integer type. What precision datetime (milliseconds? microseconds?) can I safely do an equi-join on (or test for equality)? I realize that with a time field, I should get higher precision.
I'm thinking that because the FUZZ functions define fuzz as 1e-12, I should be safe going down to nanoseconds (1e-9) with a time field, and milliseconds (1e-3, maybe even microseconds?) with a datetime field.
Can anyone point me to a definitive reference? Google hasn't turned up anything for me.
08-08-2014 02:12 PM
Depends on your data and how it's produced.
First off, SAS's integer precision is 2^53. (SAS stores numbers as 64-bit IEEE 754 floating point: 1 sign bit, 11 exponent bits, and 52 bits of mantissa, which can represent integers exactly up to 2^53.) However, a DATETIME value is usually a number in the billions (the number of seconds since 1/1/1960 is about 1.7 billion), which is about 2^31. So you really have only about 2^22 of precision left for the fractional part; that works out to roughly microseconds in a datetime. This all assumes you're on an IEEE system, i.e., Intel, Windows/Unix, etc. If you're on a mainframe it's a bit different.
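That spacing argument can be checked directly. SAS code isn't runnable here, but since SAS numerics are standard IEEE 754 doubles, Python floats behave identically; the 1.7e9 value below is just an assumed 2014-scale datetime, not from any real data:

```python
import math

# An assumed 2014-ish SAS datetime: ~1.7 billion seconds since 1/1/1960.
dtval = 1.7e9

# math.ulp() gives the gap to the next representable double at this magnitude.
# Around 2^31, that gap is 2^-22 of a second -- roughly a quarter microsecond,
# which is why microseconds are about the limit for datetimes.
print(math.ulp(dtval))  # about 2.4e-07

# Integers are exact up to 2^53; at that point even whole numbers collide.
print(float(2**53) == float(2**53 + 1))  # True: precision exhausted
```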
This example shows about how the precision works:
data _null_;
   dtval = datetime();   /* any current datetime value to test against */
   format dtval dtval_test DATETIME23.6;
   do _pow = -1 to -15 by -1;
      dtval_test = dtval - 10**_pow;
      diff = dtval - dtval_test;
      put "Difference:" dtval= dtval_test= diff= 18.16 _pow=;
   end;
run;
It stops working (stops showing a difference) at 10^-7, i.e., one ten-millionth of a second. That's where you completely lose precision: SAS won't even consider a difference of 1x10^-7 to be a difference.
Before that point, the difference is good to about that many places: for 10^-1, diff is 0.0999999046, for example (that's for the datetime I ran this with; your actual difference may vary some). Further out the pattern is approximately consistent: around the 7th or 8th digit it diverges from where it should be. The first six digits after the decimal are 'right'; the seventh is often, but not always, right.
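The same experiment replicates outside SAS, since both use IEEE 754 doubles. A sketch with an assumed datetime-scale value (not the exact value from the run above):

```python
# Replicating the SAS loop above in Python floats (same IEEE 754 doubles).
dtval = 1.7e9  # assumed ~2014 SAS datetime value
for _pow in range(-1, -16, -1):
    dtval_test = dtval - 10.0 ** _pow
    diff = dtval - dtval_test
    print(f"_pow={_pow:4d} diff={diff:.16f}")
```

At this magnitude the diff column collapses to zero starting at _pow = -7, matching the observation that SAS won't register a 10^-7 difference in a datetime.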
This is all contingent on using datetimes of roughly this year's magnitude. If you are using time values or datetime intervals, i.e., values that are pretty small on the left side of the decimal, you can be much more precise. Basically, count on about 15 significant digits total across both sides of the decimal and you're okay. Nanoseconds in times should be fine, for example (the left side has at most 5 digits, so the right side can safely have 10). If you're solely looking at intervals with nothing on the left side of the decimal, the DATETIME format will in theory let you display up to 39 digits after the decimal (though I don't highly recommend doing that).
08-08-2014 03:50 PM
I used your technique for generating differences and tested SQL joins. With datetimes I get exact matches down to 1e-6 (microseconds); once the differences get smaller than that, SQL "mistakenly" joins values that it should not. (That is, it agrees with your statement that SAS won't even consider a difference of 10^-7 to be a difference.)
With times (no date part), I get accurate joining down to 1e-11. To play it safe, I won't go beyond nanoseconds.
08-08-2014 02:42 PM
Good question and a very good answer.
My question: are you sure you need an SQL join/merge?
Perhaps another coding approach is a better fit for your (not described) problem.
I am thinking of a SAS data step merge approach, using observations in an ordered way.
08-08-2014 02:47 PM
An SQL join allows you to do an approximate merge (like PGStats shows). A SAS data step merge would require rounding or something similar if your datetimes are very precise (i.e., have microseconds or beyond in them).
08-08-2014 02:54 PM
Snoopy, you do not need an exact merge-by approach. With an ordered-set check you can use the same logic as in a DOW loop. 3GL thinking and forcing SQL are different worlds of algorithms. NoSQL is also a good one.
08-08-2014 03:08 PM
I think the 'how do you join' is pretty irrelevant, basically. SQL join, datastep set/merge, whatever, you have to either:
* Round your data to microseconds or more precise
* Do your test on a range
All of this aside, if you're doing math on your datetimes, don't forget that you still can have slight differences unless you round immediately before you do the join (since 0.1 etc. are not perfectly representable).
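The round-immediately-before-joining point can be shown in any IEEE 754 language (Python here, as a stand-in for the same double arithmetic SAS does); the round_us helper is a made-up name for illustration:

```python
# Classic representation noise: 0.1 + 0.2 is not exactly 0.3 in doubles,
# so two 'equal' time intervals computed differently will not join.
a = 0.1 + 0.2   # an interval built by adding two offsets
b = 0.3         # the same interval stored directly
print(a == b)   # False

# Hypothetical helper: round to microseconds immediately before the join.
def round_us(seconds):
    return round(seconds * 1e6) / 1e6

print(round_us(a) == round_us(b))   # True
```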
08-08-2014 04:03 PM
Snoopy, The basics of SQL are limited to some algorithms with a coding approach.
Theses are failing in some areas or causing a lot of performance overhead. That is why other approaches exist of which some are hying (hadoop nosql) at the moment. Depending on the problem not a one-trick-pony approach. Is it sufficient to solve with what you know well, do not bother. jdmarino argument holds.
08-08-2014 02:44 PM
Joining on floating point numbers is never a good idea. Joining on rounded floating point numbers will work better but is still imperfect. There are always pairs of numbers close together that will not round to the same value. That leaves you with join conditions like
on a.dt between b.dt-0.0001 and b.dt+0.0001
which may or may not be optimized in your particular situation (DBMS, indexes).
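That BETWEEN condition can be sketched in Python terms (the tolerance and names are illustrative, not from the thread; a real DBMS may or may not turn the predicate into an index range scan, while this naive nested loop is O(n*m) regardless):

```python
# Fuzzy equi-join on floats: match when values fall within +/- tol,
# mirroring "on a.dt between b.dt-0.0001 and b.dt+0.0001".
TOL = 0.0001

def fuzzy_join(a_rows, b_rows, tol=TOL):
    return [(a, b) for a in a_rows for b in b_rows
            if b - tol <= a <= b + tol]

base = 1.7e9  # an assumed datetime-scale value
pairs = fuzzy_join([base + 5e-5, base + 1.0], [base])
print(pairs)  # only the value within tolerance matches
```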