Dear all,
I hope you are well.
I have a really huge dataset which is NOT sorted (sorting would take ages, if it did not fail outright), and I need to find and remove duplicates based on three or more variables.
ID Var1 Var2 IND1 QUANT1 QUANT2
100 20 15 D 234 5678
200 14 12 C 689 1567
100 20 15 C 567 489
300 12 11 M 7865 9890
200 14 12 D 6476 5763
200 55 10 M 545 3434
200 14 12 S 1687 3323
1.-- In effect I need NODUPKEY behaviour, i.e. keep one of the duplicate records based on the three or more variables.
2.-- Just before doing that I need to apply a condition: if one of the duplicate records has IND1 = C and another duplicate record has IND1 = D, then replace the QUANT1 & QUANT2 values of the record with IND1 = D with the corresponding values from the record with IND1 = C, then delete the record with IND1 = C and any other duplicate records with the same combination of the three variables.
My WANT data file would look like this
ID Var1 Var2 IND1 QUANT1 QUANT2
100 20 15 D 567 489
300 12 11 M 7865 9890
200 14 12 D 689 1567
200 55 10 M 545 3434
Thank you in advance
Best regards
Nik
Most importantly, I would like to use a HASH approach, since my dataset is huge (millions of records).
Thank you, but the datasets are really large and sorting them would require a lot of resources.
I am looking for a HASH-based alternative because of its speed.
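For reference, here is a minimal sketch of how the rules described above could be done in a single unsorted pass with a hash object. It is one possible approach under stated assumptions, not a tested production solution. The key is assumed to be ID, Var1, Var2. The names stage, k_ind, k_q1, k_q2, c_q1, c_q2, has_c and has_d are made up for illustration. It also assumes one hash entry per distinct key combination fits in memory, and that when a group contains several C records the first one encountered supplies the values.

```sas
/* Sample data from the post */
data have;
  input ID Var1 Var2 IND1 $ QUANT1 QUANT2;
  datalines;
100 20 15 D 234 5678
200 14 12 C 689 1567
100 20 15 C 567 489
300 12 11 M 7865 9890
200 14 12 D 6476 5763
200 55 10 M 545 3434
200 14 12 S 1687 3323
;
run;

/* Pass 1: one hash entry per key, no sorting of HAVE required */
data _null_;
  length k_ind $ 1 k_q1 k_q2 c_q1 c_q2 has_c has_d 8;
  if _n_ = 1 then do;
    /* ordered:'a' is optional; it only makes the output order predictable */
    declare hash h(ordered: 'a');
    h.defineKey('ID', 'Var1', 'Var2');
    h.defineData('ID', 'Var1', 'Var2', 'k_ind', 'k_q1', 'k_q2',
                 'c_q1', 'c_q2', 'has_c', 'has_d');
    h.defineDone();
  end;
  set have end=last;

  /* First time this key is seen: remember the record (NODUPKEY default) */
  if h.find() ne 0 then do;
    k_ind = IND1; k_q1 = QUANT1; k_q2 = QUANT2;
    c_q1 = .; c_q2 = .; has_c = 0; has_d = 0;
  end;

  /* Track whether the group holds a C and/or a D record */
  if IND1 = 'C' and not has_c then do;
    has_c = 1; c_q1 = QUANT1; c_q2 = QUANT2;
  end;
  if IND1 = 'D' then has_d = 1;

  h.replace();                              /* add or update the entry */
  if last then h.output(dataset: 'stage');  /* one row per key */
run;

/* Pass 2: apply the C/D rule to the much smaller staged table */
data want(keep=ID Var1 Var2 IND1 QUANT1 QUANT2);
  set stage;
  length IND1 $ 1;
  if has_c and has_d then do;   /* keep D, with the values taken from C */
    IND1 = 'D'; QUANT1 = c_q1; QUANT2 = c_q2;
  end;
  else do;                      /* otherwise keep the first record seen */
    IND1 = k_ind; QUANT1 = k_q1; QUANT2 = k_q2;
  end;
run;
```

On the sample data this should give the four rows of the WANT table, sorted by key because of ordered:'a'. If the number of distinct key combinations is itself too large for memory, the hash approach would need to be split, for example by processing ranges of ID separately.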
BR