Nikos
Fluorite | Level 6

Dear all,

 

I hope you are well.

 

I have a really huge dataset which is NOT sorted (sorting would take ages, if it did not fail outright), and I need to find and remove duplicates based on three or more variables.

 

ID   Var1  Var2  IND1  QUANT1  QUANT2
100  20    15    D     234     5678
200  14    12    C     689     1567
100  20    15    C     567     489
300  12    11    M     7865    9890
200  14    12    D     6476    5763
200  55    10    M     545     3434
200  14    12    S     1687    3323

 

1.-- In effect I need NODUPKEY behaviour, i.e. keep one record from each group of duplicates, where duplicates are defined by the three (or more) variables.

 

2.-- Just before doing that I need to apply a condition: if one of the duplicate records has IND1 = C and another duplicate record has IND1 = D, then replace the QUANT1 and QUANT2 values of the record with IND1 = D with the corresponding values from the record with IND1 = C. After that, keep the record with IND1 = D (carrying the replaced values) and delete all other duplicate records with the same combination of the three variables.

 

My WANT data file would look like this:

 

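ID   Var1  Var2  IND1  QUANT1  QUANT2
100  20    15    D     567     489
300  12    11    M     7865    9890
200  14    12    D     689     1567
200  55    10    M     545     3434

In the two D rows, QUANT1 and QUANT2 have been replaced by the values from the matching C row (234 -> 567 and 5678 -> 489; 6476 -> 689 and 5763 -> 1567).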
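The two rules above can be sketched as a single pass over the unsorted data, using an in-memory hash table keyed on (ID, Var1, Var2). The sketch is written in Python rather than SAS, purely to illustrate the logic. The function name `dedupe_with_rule` is hypothetical, and two points not stated in the post are assumed: when a group has no C/D pair, the first record encountered is the one kept, and when a group has several C records, the first one supplies the quantities.

```python
# Rows copied from the sample data: (ID, Var1, Var2, IND1, QUANT1, QUANT2)
have = [
    (100, 20, 15, "D", 234, 5678),
    (200, 14, 12, "C", 689, 1567),
    (100, 20, 15, "C", 567, 489),
    (300, 12, 11, "M", 7865, 9890),
    (200, 14, 12, "D", 6476, 5763),
    (200, 55, 10, "M", 545, 3434),
    (200, 14, 12, "S", 1687, 3323),
]


def dedupe_with_rule(rows):
    """One pass, no sorting: one hash-table entry per distinct key."""
    groups = {}  # key -> first row seen, first C row, first D row
    for row in rows:
        key = row[:3]
        group = groups.setdefault(key, {"first": row, "C": None, "D": None})
        ind = row[3]
        if ind in ("C", "D") and group[ind] is None:
            group[ind] = row

    want = []
    for key, group in groups.items():
        if group["C"] is not None and group["D"] is not None:
            # Keep the D record, but with the quantities of the C record.
            want.append(key + ("D",) + group["C"][4:])
        else:
            # Assumption: without a C/D pair, keep the first record seen.
            want.append(group["first"])
    return want


want = dedupe_with_rule(have)
for record in want:
    print(record)
```

On the sample data this yields the four records of the WANT table, listed in order of first appearance of each key rather than in the order shown in the post. Memory use grows with the number of distinct key combinations, not with the number of rows.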

 

 

 

Thank you in advance

 

Best regards

 

Nik

 

 

 

 

 

 

3 REPLIES
Nikos
Fluorite | Level 6

Most importantly, I would like to use a hash object, since my dataset is huge (millions of records).
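As an illustration of why a hash table avoids the sort, the plain NODUPKEY step on its own can be sketched as a streaming filter that remembers only the keys it has already seen. This is Python, not SAS, and only shows the idea; the generator name `nodupkey` and its `key_len` parameter are hypothetical, and keeping the first record per key is an assumption.

```python
def nodupkey(rows, key_len=3):
    """Yield the first row for each distinct key; input order is irrelevant."""
    seen = set()  # hash set holding one entry per distinct key
    for row in rows:
        key = row[:key_len]
        if key not in seen:
            seen.add(key)
            yield row


# Sample rows from the question: (ID, Var1, Var2, IND1, QUANT1, QUANT2)
sample = [
    (100, 20, 15, "D", 234, 5678),
    (200, 14, 12, "C", 689, 1567),
    (100, 20, 15, "C", 567, 489),
    (300, 12, 11, "M", 7865, 9890),
    (200, 14, 12, "D", 6476, 5763),
    (200, 55, 10, "M", 545, 3434),
    (200, 14, 12, "S", 1687, 3323),
]

unique = list(nodupkey(sample))
```

Each row is read once and written at most once, so the cost is a single pass plus one set entry per distinct key combination. This step alone does not apply the C/D replacement rule.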

pearsoninst
Pyrite | Level 9
OK, you can try this. First, remove the duplicates by ID and drop IND1, QUANT1 and QUANT2.
Then create a second dataset in which the rows with IND1 = D are removed and the rows with IND1 = C are kept, and merge the two datasets.

Reg
KD
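The split-and-merge idea above can also be carried out without sorting, by replacing the merge with a hash lookup over two sequential reads of the data. The sketch below is Python, not SAS, and departs from the suggestion in two labeled ways: it keys on (ID, Var1, Var2) rather than ID alone, and it only substitutes the C quantities when the group also contains a D record, as the original question asks. The function name `two_pass` and the `read_rows` callable are hypothetical.

```python
# Sample rows from the question: (ID, Var1, Var2, IND1, QUANT1, QUANT2)
have = [
    (100, 20, 15, "D", 234, 5678),
    (200, 14, 12, "C", 689, 1567),
    (100, 20, 15, "C", 567, 489),
    (300, 12, 11, "M", 7865, 9890),
    (200, 14, 12, "D", 6476, 5763),
    (200, 55, 10, "M", 545, 3434),
    (200, 14, 12, "S", 1687, 3323),
]


def two_pass(read_rows):
    """read_rows() must return a fresh iterator over the data each time."""
    # Pass 1: the "second dataset" becomes a lookup of C quantities per key,
    # plus the set of keys that have a D record.
    c_quant = {}
    d_keys = set()
    for id_, v1, v2, ind, q1, q2 in read_rows():
        key = (id_, v1, v2)
        if ind == "C":
            c_quant.setdefault(key, (q1, q2))  # assumption: first C wins
        elif ind == "D":
            d_keys.add(key)

    # Pass 2: the "merge" becomes a lookup while streaming the rows again.
    emitted = set()
    for id_, v1, v2, ind, q1, q2 in read_rows():
        key = (id_, v1, v2)
        if key in emitted:
            continue  # a record for this key was already written
        if key in c_quant and key in d_keys:
            if ind != "D":
                continue  # wait for the D record of this group
            q1, q2 = c_quant[key]  # take the quantities from the C record
        emitted.add(key)
        yield (id_, v1, v2, ind, q1, q2)


result = list(two_pass(lambda: iter(have)))
```

On the sample data the output follows the position of the kept record in the input, which happens to match the row order of the WANT table in the question. The trade-off against a single pass is a second read of the data in exchange for holding only keys and C quantities in memory.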
Nikos
Fluorite | Level 6

Thank you, but the datasets are really large and sorting them would require a lot of resources.

I am looking for a hash-based alternative because of its speed.

 

BR
