Delete duplicate obs with HASH based on three or more variables and on some condition

Nikos · Posted 11-15-2015 12:53 AM

Dear all,

I hope you are well.

I have a really huge dataset which is NOT sorted (sorting would take ages, if it did not fail outright), and I need to find and remove duplicates based on three or even more variables.

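| ID | Var1 | Var2 | IND1 | QUANT1 | QUANT2 |
|---|---|---|---|---|---|
| 100 | 20 | 15 | D | 234 | 5678 |
| 200 | 14 | 12 | C | 689 | 1567 |
| 100 | 20 | 15 | C | 567 | 489 |
| 300 | 12 | 11 | M | 7865 | 9890 |
| 200 | 14 | 12 | D | 6476 | 5763 |
| 200 | 55 | 10 | M | 545 | 3434 |
| 200 | 14 | 12 | S | 1687 | 3323 |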

1.-- In fact, I need NODUPKEY-style behaviour, i.e. keep one of the duplicate records based on the three or more variables.

2.-- Just before doing that, I need to apply a condition: if one of the duplicate records has IND1 = C and another duplicate record has IND1 = D, replace the QUANT1 and QUANT2 values of the IND1 = D record with the corresponding ones of the IND1 = C record, then keep that updated IND1 = D record and delete the IND1 = C record and any other duplicate records with the same combination of the three variables.

My WANT data file would look like this

| ID | Var1 | Var2 | IND1 | QUANT1 | QUANT2 |
|---|---|---|---|---|---|
| 100 | 20 | 15 | D | ~~234~~ 567 | ~~5678~~ 489 |
| 300 | 12 | 11 | M | 7865 | 9890 |
| 200 | 14 | 12 | D | ~~6476~~ 689 | ~~5763~~ 1567 |
| 200 | 55 | 10 | M | 545 | 3434 |

Thank you in advance

Best regards

Nik

Nikos · Posted 11-15-2015 12:55 AM

And most importantly, I would like to use HASH, since my dataset is huge (i.e. millions of records).

pearsoninst · Posted 11-15-2015 05:34 AM

OK, you can try this. First remove the duplicates using ID and drop IND1, QUANT1 and QUANT2.
Then create a second dataset that removes all the IND1 = D records and keeps the IND1 = C ones, and merge the two datasets.

Reg
KD

Nikos · Posted 11-17-2015 12:06 PM

Thank you, but the datasets are really large and sorting them would require a lot of resources.

I am looking for a HASH-based alternative because of its speed.

BR
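
For readers following the thread, here is a minimal sketch of how the two requirements above could be handled in one unsorted pass with a hash object. It is not a tested production solution. The dataset names `have` and `want` and the helper variables `k_ind`, `k_q1`, `k_q2`, `c_q1`, `c_q2`, `has_c` and `has_d` are hypothetical. The sketch also assumes that, when a key has several IND1 = C records, the first one supplies the values, and that a key without both a C and a D record simply keeps its first-seen record.

```sas
/* Sample data from the question */
data have;
  input ID Var1 Var2 IND1 $ QUANT1 QUANT2;
  datalines;
100 20 15 D 234 5678
200 14 12 C 689 1567
100 20 15 C 567 489
300 12 11 M 7865 9890
200 14 12 D 6476 5763
200 55 10 M 545 3434
200 14 12 S 1687 3323
;
run;

data want(keep=ID Var1 Var2 IND1 QUANT1 QUANT2);
  /* Helper variables stored per key (names are illustrative) */
  length k_ind $8 k_q1 k_q2 c_q1 c_q2 has_c has_d 8;

  /* One hash entry per distinct ID/Var1/Var2 combination */
  declare hash h();
  h.defineKey('ID', 'Var1', 'Var2');
  h.defineData('ID', 'Var1', 'Var2',
               'k_ind', 'k_q1', 'k_q2', 'c_q1', 'c_q2', 'has_c', 'has_d');
  h.defineDone();

  /* Single pass over the unsorted input */
  do until (eof);
    set have end=eof;
    if h.find() ne 0 then do;
      /* First time this key is seen: remember the record */
      k_ind = IND1; k_q1 = QUANT1; k_q2 = QUANT2;
      c_q1 = .; c_q2 = .; has_c = 0; has_d = 0;
    end;
    if IND1 = 'C' and not has_c then do;
      /* Remember the first C record's quantities */
      has_c = 1; c_q1 = QUANT1; c_q2 = QUANT2;
    end;
    else if IND1 = 'D' then has_d = 1;
    h.replace();
  end;

  /* Write one record per key, applying the C-over-D rule */
  declare hiter hi('h');
  rc = hi.first();
  do while (rc = 0);
    if has_c and has_d then do;
      IND1 = 'D'; QUANT1 = c_q1; QUANT2 = c_q2;
    end;
    else do;
      IND1 = k_ind; QUANT1 = k_q1; QUANT2 = k_q2;
    end;
    output;
    rc = hi.next();
  end;
  stop;
run;
```

Two caveats apply. The hash holds one entry per distinct key, so it has to fit in memory, and the order of the output records is not guaranteed to match the input order. If more than three key variables are needed, they would be added to both the `defineKey` and `defineData` calls.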
