Delete duplicate obs with HASH based on three or more variables and on some condition

Contributor
Posts: 68

Dear all,

 

I hope you are well.

 

I have a very large dataset which is NOT sorted (sorting would take ages, if it did not fail outright), and I need to find and remove duplicates based on three or more variables.

 

ID   Var1  Var2  IND1  QUANT1  QUANT2
100  20    15    D     234     5678
200  14    12    C     689     1567
100  20    15    C     567     489
300  12    11    M     7865    9890
200  14    12    D     6476    5763
200  55    10    M     545     3434
200  14    12    S     1687    3323

 

1. In effect I need NODUPKEY behaviour, i.e. keep one record from each group of duplicates defined by the three (or more) key variables.

 

2. Just before doing that I need to apply a condition: if one of the duplicate records has IND1 = C and another duplicate record has IND1 = D, replace the QUANT1 and QUANT2 values of the IND1 = D record with the corresponding values from the IND1 = C record. Then keep that IND1 = D record and delete every other duplicate (including the IND1 = C record) with the same combination of the three variables.

 

My WANT data file would look like this

 

ID   Var1  Var2  IND1  QUANT1    QUANT2
100  20    15    D     234 567   5678 489
300  12    11    M     7865      9890
200  14    12    D     6476 689  5763 1567
200  55    10    M     545       3434

 

 

 

Thank you in advance

 

Best regards

 

Nik

 

 

 

 

 

 

Contributor
Posts: 68

Re: Delete duplicate obs with HASH based on three or more variables and on some condition

And most importantly, I would like to use HASH, since my dataset is huge (millions of records).

Frequent Contributor
Posts: 108

Re: Delete duplicate obs with HASH based on three or more variables and on some condition

OK, you can try this. First remove the duplicates by ID, dropping IND1, QUANT1 and QUANT2.
Then create a second data set that removes the D records and keeps the C records, and merge the two data sets.
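
One reading of the steps above, as an untested sketch. The data set names HAVE, KEYS, C_ROWS and WANT_KD are placeholders, and the BY list uses ID, Var1, Var2 as the duplicate key rather than ID alone, which is an assumption taken from the original question. Note that both PROC SORT steps sort the data, and that key groups without a C record come out with missing IND1, QUANT1 and QUANT2.

```sas
/* Step 1: one row per key, without IND1, QUANT1, QUANT2 */
proc sort data=have(drop=IND1 QUANT1 QUANT2) out=keys nodupkey;
  by ID Var1 Var2;
run;

/* Step 2: keep only the C records, one per key */
proc sort data=have(where=(IND1 = 'C')) out=c_rows nodupkey;
  by ID Var1 Var2;
run;

/* Step 3: merge the two data sets on the key */
data want_kd;
  merge keys c_rows;
  by ID Var1 Var2;
run;
```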

Reg
KD
Contributor
Posts: 68

Re: Delete duplicate obs with HASH based on three or more variables and on some condition

Thank you, but the datasets are really large and sorting them would require a lot of resources.

I am looking for a HASH alternative because of its speed.

 

BR
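
A minimal, untested sketch of the hash approach asked for in this thread. It assumes the input is a data set named HAVE and the result goes to WANT (both names are placeholders), that the key is ID, Var1, Var2, that IND1 is a character variable holding exactly 'C' or 'D' for those records, and that one hash entry per distinct key fits in memory. It also assumes that when a group contains both a C and a D record, the kept record is the D record carrying the C record's QUANT1 and QUANT2; if a group has several C records, the first one read is used. The helper variables in_ind, in_q1, in_q2, has_c, has_d, c_q1, c_q2 and changed are made up for this illustration. No sort is needed: HAVE is read once.

```sas
data _null_;
  /* Put HAVE's variables into the PDV at compile time; no row is read here */
  if 0 then set have;

  /* One entry per distinct key; keys are repeated as data so OUTPUT writes them */
  declare hash h(hashexp: 20);
  h.defineKey('ID', 'Var1', 'Var2');
  h.defineData('ID', 'Var1', 'Var2', 'IND1', 'QUANT1', 'QUANT2',
               'has_c', 'has_d', 'c_q1', 'c_q2');
  h.defineDone();

  do until (eof);
    set have end=eof;

    /* Save the incoming values, because a successful FIND overwrites them */
    in_ind = IND1;
    in_q1  = QUANT1;
    in_q2  = QUANT2;

    if h.find() ne 0 then do;
      /* First record for this key: store it as the record to keep */
      has_c = (in_ind = 'C');
      has_d = (in_ind = 'D');
      if has_c then do;
        c_q1 = in_q1;
        c_q2 = in_q2;
      end;
      else call missing(c_q1, c_q2);
      h.add();
    end;
    else do;
      /* Duplicate key: FIND has loaded the stored record and its flags */
      changed = 0;
      if in_ind = 'C' and not has_c then do;
        has_c = 1;
        c_q1  = in_q1;
        c_q2  = in_q2;
        changed = 1;
      end;
      else if in_ind = 'D' and not has_d then do;
        has_d = 1;
        changed = 1;
      end;
      /* Both C and D seen: keep a D record with the C record's quantities */
      if changed and has_c and has_d then do;
        IND1   = 'D';
        QUANT1 = c_q1;
        QUANT2 = c_q2;
      end;
      if changed then h.replace();
    end;
  end;

  /* Write one record per key, dropping the bookkeeping items */
  h.output(dataset: 'want(drop=has_c has_d c_q1 c_q2)');
  stop;
run;
```

The order of rows in WANT is not guaranteed to follow the input order; adding ordered: 'a' to the DECLARE statement would return them in key order instead. With the sample rows from the question, group 100/20/15 comes out as IND1 = D, QUANT1 = 567, QUANT2 = 489, and group 200/14/12 as IND1 = D, QUANT1 = 689, QUANT2 = 1567.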
