Dear all,
I hope you are well.
I have a really huge dataset which is NOT sorted (sorting would take ages, if it did not fail outright), and I need to find and remove duplicates based on three or even more variables.
ID Var1 Var2 IND1 QUANT1 QUANT2
100 20 15 D 234 5678
200 14 12 C 689 1567
100 20 15 C 567 489
300 12 11 M 7865 9890
200 14 12 D 6476 5763
200 55 10 M 545 3434
200 14 12 S 1687 3323
1.-- In effect I need a "NODUPKEY" result, i.e. keep only one record from each set of duplicates defined by the three (or more) variables.
2.-- Just before doing that I need to apply a condition: if one of the duplicate records has IND1 = C and another duplicate record has IND1 = D, then replace the QUANT1 and QUANT2 values of the record with IND1 = D with the corresponding values from the record with IND1 = C. Then keep that updated IND1 = D record and delete the IND1 = C record and any other duplicate records with the same combination of the three variables.
My WANT data file would look like this
ID Var1 Var2 IND1 QUANT1 QUANT2
100 20 15 D 234 567 5678 489
300 12 11 M 7865 9890
200 14 12 D 6476 689 5763 1567
200 55 10 M 545 3434
Thank you in advance
Best regards
Nik
Most importantly, I would like to use HASH, since my dataset is huge (i.e. millions of records).
Thank you, but the datasets are really large and sorting them would require a lot of resources.
I am looking for a HASH-based alternative because of its speed.
BR
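For what it is worth, the logic described above can be sketched as a single pass over the unsorted rows with an in-memory hash table keyed on ID, Var1 and Var2. The sketch below is in Python rather than SAS, purely to illustrate the idea; the same approach should carry over to a DATA step hash object with the same three keys. It rests on a few assumptions that are not stated in the question: the extra numbers in the WANT rows are read as "old value, new value" pairs, so each output row has six columns; when there is no C/D pair the first record seen for the key is kept; and if a key has several C or D records, the first of each is used. The function name `dedupe` and the tuple layout are made up for this example.

```python
from typing import Dict, Iterable, List, Tuple

# Assumed layout: (ID, Var1, Var2, IND1, QUANT1, QUANT2)
Record = Tuple[int, int, int, str, int, int]
Key = Tuple[int, int, int]


def dedupe(records: Iterable[Record]) -> List[Record]:
    """One pass over unsorted input; one surviving record per (ID, Var1, Var2)."""
    first: Dict[Key, Record] = {}  # first record seen per key (NODUPKEY-style default)
    c_rec: Dict[Key, Record] = {}  # first IND1 = 'C' record per key
    d_rec: Dict[Key, Record] = {}  # first IND1 = 'D' record per key

    for rec in records:
        key = rec[:3]
        first.setdefault(key, rec)      # setdefault keeps the earliest record
        if rec[3] == "C":
            c_rec.setdefault(key, rec)
        elif rec[3] == "D":
            d_rec.setdefault(key, rec)

    result: List[Record] = []
    for key, rec in first.items():      # dicts preserve first-appearance order
        if key in c_rec and key in d_rec:
            # Keep the D record, but with QUANT1/QUANT2 taken from the C record.
            rec = d_rec[key][:4] + c_rec[key][4:]
        result.append(rec)
    return result


# Sample rows from the question.
have = [
    (100, 20, 15, "D", 234, 5678),
    (200, 14, 12, "C", 689, 1567),
    (100, 20, 15, "C", 567, 489),
    (300, 12, 11, "M", 7865, 9890),
    (200, 14, 12, "D", 6476, 5763),
    (200, 55, 10, "M", 545, 3434),
    (200, 14, 12, "S", 1687, 3323),
]

want = dedupe(have)
for row in want:
    print(*row)
```

On the sample rows this yields four records: 100/20/15 as D with 567 and 489, 200/14/12 as D with 689 and 1567, and the two M records unchanged. They come out in order of first appearance, which differs from the order in the WANT listing. Memory use grows with the number of distinct key combinations rather than with the total row count, so whether this fits depends on how many distinct keys the data has.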