I have to update dataset 2 with the variable userid. The key to map both datasets are datestamp,amt and reference. The tricky part is, not necessarily there would be a exact match for milli seconds and seconds. However, I have to first find if every thing matches if found update that record of dataset2 with user id. else check if there is atleast a match till seconds if found update that record of dataset2 with user id. else check if there is atleast a match till minutes if found update that record of dataset2 with user id. If a record match is not found then write it to error dataset.
one might question why can't we just compare up to minutes level since we are updating even if the minute matches but the requirement is that they want to be absolutly certain atleast for most of the records.
Can any one let me know a efficient way of acheiving this. dataset1 contains atleast 2 million records and dataset 2 contains atleast 5 million records.
that sounds not too big for an sql join in memory
2011-01-11.12:32:00:655 55000.36 09XYZABCD0885 tom
= 8 + 8 + 13 + 10
* 2M (dataset1)
+ 30 * 3M (dataset2)
around 1.7GB of memory to hold both datasets
even modest desktops might offer that.
The join can be exact on reference but must round amount before comparing, and can convert the timestamp to text or use intnx() to "move to" the minute.
The problem with amount is that is stored with decimal fractions and comparison of these can fail when they appear (rounded to two decimal places) to be the same. The other ways to make exact comparison of AMT is to convert to and integer after multiplying by 100, or convert to text with format z19.2
Instead of matching milliseconds, then seconds, etc, I suggest you calculate the difference between the time stamps, limiting your matches to those below 60,000 milliseconds if you want (otherwise a record @ 2010dec31:23:59:59.999 will match any record that day instead of matching the 2011jan01:00:00:00.000 record). You can then keep the match with the lowest time difference.
SQL can do this obvioulsy using
group by a.DATESTAMP, a.REFERENCE
having calculated DIF_MILLISECOND= min(calculated DIF_MILLISECOND)
but SQL is a resource hog and you might have to spit the data into small bits (by reference?) depending on your platform.
To limit the size of the fuzzy match data, you can do the exact matches first and work with the rest from then onward:
data T2_MATCH T2_NOMATCH;
by DATESTAMP REFERENCE MILLISECOND;
if B then output T2_MATCH;
else output T2_NOMATCH;
Instead of SQL, I would try to use a hash table with REFERENCE as the key and the multidata option. You can then calculate the time differences and keep the best match.
Work on a subset first to see if this is faster than sql in your case.