02-25-2014 02:20 AM
We are working on a deduplication process in which we processed the full bulk data set and created a cluster table that assigns a unique cluster ID to each group of matching records. We now receive incremental data, which may contain existing records as well as new ones. If we run the same process on it, it will create a set of cluster IDs similar to the earlier ones. How do we join these clusters? Do we need to process the whole data set (including the incremental data) every time, or is there a way to process only the incremental data? We are also facing a performance issue: we fetch 10 million records, run match code clustering, and insert the results into a database table, which takes more than 4 days. Please suggest how to handle this.
Thanks in advance.
02-25-2014 05:56 AM
One piece that I can make a suggestion on is the insert to database part. If it is running slowly, investigate the bulk load options for whatever database you're using.
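If a native bulk loader is not available, batching the inserts usually helps as well. Here is a minimal sketch in Python, using the standard-library sqlite3 module only as a stand-in for your actual database. The table name `clusters`, its columns, the helper names, and the batch size are all assumptions for illustration, not anything from your job. The idea is to send many rows per statement and commit once per batch instead of once per row.

```python
import sqlite3
from itertools import islice


def batched(rows, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def load_clusters(conn, rows, batch_size=10_000):
    """Insert (record_id, cluster_id) pairs in batches; return rows inserted."""
    # Hypothetical target table; replace with your real cluster table.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS clusters ("
        "record_id INTEGER PRIMARY KEY, cluster_id INTEGER NOT NULL)"
    )
    total = 0
    for chunk in batched(rows, batch_size):
        # `with conn` commits once per batch (or rolls back on error),
        # which avoids the cost of a commit per row.
        with conn:
            conn.executemany(
                "INSERT INTO clusters (record_id, cluster_id) VALUES (?, ?)",
                chunk,
            )
        total += len(chunk)
    return total


# Example run against an in-memory database with made-up data.
conn = sqlite3.connect(":memory:")
rows = ((i, i // 3) for i in range(25_000))
inserted = load_clusters(conn, rows)
count = conn.execute("SELECT COUNT(*) FROM clusters").fetchone()[0]
```

Most database drivers offer something comparable to `executemany`, and the database's own bulk load utility, if it has one, is typically faster still. The right batch size depends on your database and row width, so treat 10,000 as a starting point to tune.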