Solved: Finding Duplicates using summation

Ins_Analytic · Posted 08-30-2018 04:01 PM

I need to find duplicate rows based on 10 out of 30 fields. I have been using proc sql and group by to accomplish this. The only issue is that I need to also need to find these rows unique to a summed field (allowed_amount). Here is an example of my current output, simplified to only 2 fields grouped:

Field 1	Field 2	Allowed_Amount
A1	B2	-$151.00
A1	B2	$151.00
A1	B2	$220.00
C1	D2	$84.00
C1	D2	$111.00
C1	D2	$25.00
C1	D2	-$84.00

Since I am grouped by field1 and field 2, the 3 rows with A1 and B2 are returned. The way I need to find duplicates, rows with -151 and 151 should cancel, leaving only the row with 220. In the final table, this should not count as a duplicate.

For C1 and D2 the -84 and 84 would also cancel, but since there are unique rows with this grouping (111 and 25), I need only the 111 and 25 returned as duplicates. I am having issues finding a way to return the 111 and 25 for C1/D2 while ignoring the rest.

I can solve the first issue by summing the alloweds in an intermediate table then regrouping, but this will not return the duplicates involved in the second example. Please let me know if you need anymore clarity.

SuryaKiran · Posted 09-05-2018 09:45 AM

I understand your problem, proc transpose with logical approach might work for you.

Test the below code and see if that works for your data

data have;
input Field1 :$8.    Field2:$8. Allowed_Amount:dollar8.;
format Allowed_Amount dollar8.;
datalines;
A1 B2 $151.00
A1 B2 -$151.00
A1 B2 $151.00
A1 B2 $220.00
A1 B2 -$151.00
A1 B2 $85.00
A1 B2 $151.00
A1 B2 $151.00
C1 D2 $84.00
C1 D2 $111.00
C1 D2 $25.00
C1 D2 -$84.00
;
run;
proc sort data=have;
by Field1 Field2  Allowed_Amount;
run;

data have;
set have;
by Field1 Field2  Allowed_Amount;
if first.Allowed_Amount then dummy=1;
else dummy+1;
Abs_Amount=ABS(Allowed_Amount);
run;

proc sort data=have;
by  Field1	Field2	abs_Amount 	dummy;
run;

proc transpose data=test out=test_trans(where=(col2 is null) drop=abs_Amount	dummy	_NAME_);
by Field1	Field2	abs_Amount 	dummy;
var Allowed_Amount;
run;
data want;
set test_trans(rename=(col1=Allowed_Amount) drop=col2);
run;

Thanks,
Suryakiran

View solution in original post

SuryaKiran · Posted 08-30-2018 04:37 PM

Try this:


data have;
input Field1 :$8.	Field2:$8.	Allowed_Amount:dollar8.;
format Allowed_Amount dollar8.;
datalines;
A1 B2 -$151.00
A1 B2 $151.00
A1 B2 $220.00
C1 D2 $84.00
C1 D2 $111.00
C1 D2 $25.00
C1 D2 -$84.00
;
run;

proc sql;
create table test as 
select * from have
order by 1,2,abs(Allowed_Amount);
quit;

data want(keep=Field1 Field2 Allowed_Amount);
merge test test(firstobs=2 rename=(Allowed_Amount=Allowed_Amount_) keep=Allowed_Amount);
lag_Allowed_Amount=lag(Allowed_Amount);
Lag_Field1=lag(field1);
if Lag_Field1^=field1 then call missing(lag_Allowed_Amount);
if sum(Allowed_Amount,Allowed_Amount_)=0 or sum(Allowed_Amount,lag_Allowed_Amount)=0 then delete;
run;

Thanks,
Suryakiran

Ins_Analytic · Posted 09-04-2018 11:39 AM

I was previously unfamiliar with the lag function so thank you for bringing that to my attention. I am running into one issue, though.

Let's say that field1/field2 A1/B2 also has another row with 151. Then there are three total rows for A1/B2 and the final data set should still include one row with 151 since the first 2 rows cancel out. With the above given code, all rows are removed. I cannot figure out how to adjust this properly.

SuryaKiran · Posted 09-04-2018 11:48 AM

Remove that duplicate record using PROC SORT NODUPKEY.

proc sort data=have nodupkey;
by Field1 Field2 Allowed_Amount;
run;

Thanks,
Suryakiran

Ins_Analytic · Posted 09-04-2018 11:55 AM

I need the duplicate record in the final data set. Please let me clarify. I have added a row here:

Field 1	Field 2	Allowed_Amount
A1	B2	$151.00
A1	B2	($151.00)
A1	B2	$151.00
A1	B2	$220.00
C1	D2	$84.00
C1	D2	$111.00
C1	D2	$25.00
C1	D2	($84.00)

I need the output to have:

Field 1	Field 2	Allowed_Amount
A1	B2	$151.00
A1	B2	$220.00
C1	D2	$111.00
C1	D2	$25.00

SuryaKiran · Posted 09-04-2018 12:03 PM

Please check your code, my previous solution works fine with that extra $150. I guess your sort order is the reason why all are canceled. If the order is $150, ($150), $150 then my logic will cancel all 3 records. If the order is ($150), $150, $150 then only first 2 records will be canceled.

data have;
input Field1 :$8.	Field2:$8.	Allowed_Amount:dollar8.;
format Allowed_Amount dollar8.;
datalines;
A1 B2 -$151.00
A1 B2 $151.00
A1 B2 $151.00
A1 B2 $220.00
C1 D2 $84.00
C1 D2 $111.00
C1 D2 $25.00
C1 D2 -$84.00
;
run;

proc sql;
create table test as 
select * from have
order by 1,2,abs(Allowed_Amount);
quit;

data want(keep=Field1 Field2 Allowed_Amount);
merge test test(firstobs=2 rename=(Allowed_Amount=Allowed_Amount_) keep=Allowed_Amount);
lag_Allowed_Amount=lag(Allowed_Amount);
Lag_Field1=lag(field1);
if Lag_Field1^=field1 then call missing(lag_Allowed_Amount);
if sum(Allowed_Amount,Allowed_Amount_)=0 or sum(Allowed_Amount,lag_Allowed_Amount)=0 then delete;
run;

Thanks,
Suryakiran

Ins_Analytic · Posted 09-04-2018 01:37 PM

You are correct, if the rows alternate positive, negative, positive then all are deleted, which I am trying to avoid. This is also an issue in my original dataset, as the allowed amounts can be in any order. I attempted to control for this by first ordering by 1,2,allowed_amount and then ordering by 1,2,abs(allowed_amount) but it appears the rows revert back to their original order? I am unsure why this is happening:

proc sql;

create table test1 as

select * from have

order by 1,2,Allowed_Amount;

quit;

proc sql;

create table test2 as

select * from test1

order by 1,2,abs(Allowed_Amount);

quit;

 
/*
test2 Results:
Field1     Field2     Allowed_Amount

A1           B2           $151

A1           B2           $-151

A1           B2           $151

A1           B2           $220

C1           D2           $25

C1           D2           $-84

C1           D2           $84

C1           D2           $111
*/

When this happens, then all rows are cancelled. I have the same issue with my actual data set. When the rows are manually sorted to positive, positive, negative then the code gives the desired results. My issue is that there can be any number of rows that need to be cancelled. So this logic works with two positives and a negative, but fails to cancel all rows with two positives and two negatives.

For example:

data have;

input Field1 :$8.    Field2:$8. Allowed_Amount:dollar8.;

format Allowed_Amount dollar8.;

datalines;

A1 B2 $151.00

A1 B2 -$151.00

A1 B2 $151.00

A1 B2 $220.00

A1 B2 -$151.00

C1 D2 $84.00

C1 D2 $111.00

C1 D2 $25.00

C1 D2 -$84.00

;

run;

 

proc sql;

create table test1 as

select * from have

order by 1,2,Allowed_Amount;

quit;

proc sql;

create table test2 as

select * from test1

order by 1,2,abs(Allowed_Amount);

quit;

/*
test2 results:
Field1     Field2     Allowed_Amount

A1           B2           $151

A1           B2           $151

A1           B2           $-151

A1           B2           $-151

A1           B2           $220

C1           D2           $25

C1           D2           $-84

C1           D2           $84

C1           D2           $111
*/

So for three rows, the modified order works with your code. With four rows, you original code that orders by abs(allowed_amount) gives me what I'm looking for. I am having issues adjusting the code to fit all combinations of canceling rows, in my original data set. For example:

data have;

input Field1 :$8.    Field2:$8. Allowed_Amount:dollar8.;

format Allowed_Amount dollar8.;

datalines;

A1 B2 $151.00

A1 B2 -$151.00

A1 B2 $151.00

A1 B2 $220.00

A1 B2 -$151.00

A1 B2 $85.00

A1 B2 $151.00

C1 D2 $84.00

C1 D2 $111.00

C1 D2 $25.00

C1 D2 -$84.00

;

run;

 

proc sql;

create table test1 as

select * from have

order by 1,2,abs(Allowed_Amount);

quit;

 

test1 results:

A1           B2           $85

A1           B2           $151

A1           B2           $-151

A1           B2           $151

A1           B2           $151

A1           B2           $-151

A1           B2           $220

C1           D2           $25

C1           D2           $-84

C1           D2           $84

C1           D2           $111

 

Want results:

A1           B2           $85

A1           B2           $220

C1           D2           $25

C1           D2           $111

SuryaKiran · Posted 09-04-2018 01:56 PM

Proper sorting of the records will resolve the issue, I would create a dummy variable to create the order.

proc sort data=have;
by Field1 Field2  Allowed_Amount;
run;
data have;
set have;
by Field1 Field2  Allowed_Amount;
if first.Allowed_Amount then dummy=1;
else dummy+1;
run;

proc sql;
create table test as 
select * from have
order by 1,2,abs(Allowed_Amount),dummy;
quit;

data want(keep=Field1 Field2 Allowed_Amount);
merge test test(firstobs=2 rename=(Allowed_Amount=Allowed_Amount_) keep=Allowed_Amount);
lag_Allowed_Amount=lag(Allowed_Amount);
Lag_Field1=lag(field1);
if Lag_Field1^=field1 then call missing(lag_Allowed_Amount);
if sum(Allowed_Amount,Allowed_Amount_)=0 or sum(Allowed_Amount,lag_Allowed_Amount)=0 then delete;
run;

Thanks,
Suryakiran

Ins_Analytic · Posted 09-04-2018 04:58 PM

Using the modifications to your code above, with one extra dataline to reflect a problem:

data have;
input Field1 :$8.    Field2:$8. Allowed_Amount:dollar8.;
format Allowed_Amount dollar8.;

datalines;
A1 B2 $151.00
A1 B2 -$151.00
A1 B2 $151.00
A1 B2 $220.00
A1 B2 -$151.00
A1 B2 $85.00
A1 B2 $151.00
A1 B2 $151.00
C1 D2 $84.00
C1 D2 $111.00
C1 D2 $25.00
C1 D2 -$84.00
;
run;

I am still missing a line with $151. Looking at how the logic is applied, it looks like the error occurs when the logic finds sum(allowed_amount,lag_allowed_amount)=0 on the row with allowed_amount=$151 and dummy=3 in the test table. The row above this with -$151 allowed_amount has already been "used" to cancel out 2 rows above. I wanted to try and incorporate some logic with the dummy variable to help prevent this, but for some reason I cannot rename dummy=dummy_ in a similar way allowed_amount=allowed_amount_:

data want;

     merge test
             test(firstobs=2 rename=(Allowed_Amount=Allowed_Amount_ dummy=dummy_)
                   keep=Allowed_Amount);
     lag_Allowed_Amount=lag(Allowed_Amount);
     lag_dummy=lag(dummy);
     Lag_Field1=lag(field1);
     if Lag_Field1^=field1 then call missing(lag_Allowed_Amount);
     if lag_dummy^=dummy then call missing (lag_dummy);
     if sum(Allowed_Amount,lag_Allowed_Amount)=0 then delete;
     if dummy^=dummy_ and sum(Allowed_Amount,Allowed_Amount_)=0 then delete /*I haven't been able to experiment with this since I can't rename dummy*/;

run;

I would like to add logic so that the line is only deleted a row when sum(allowed_amount,allowed_amount_)=0 when the 2 dummy variables are equal. I think this will account for all issues. Does this make sense, based on the problem above? How can I rename the dummy variable?

SuryaKiran · Posted 09-05-2018 09:45 AM

I understand your problem, proc transpose with logical approach might work for you.

Test the below code and see if that works for your data

data have;
input Field1 :$8.    Field2:$8. Allowed_Amount:dollar8.;
format Allowed_Amount dollar8.;
datalines;
A1 B2 $151.00
A1 B2 -$151.00
A1 B2 $151.00
A1 B2 $220.00
A1 B2 -$151.00
A1 B2 $85.00
A1 B2 $151.00
A1 B2 $151.00
C1 D2 $84.00
C1 D2 $111.00
C1 D2 $25.00
C1 D2 -$84.00
;
run;
proc sort data=have;
by Field1 Field2  Allowed_Amount;
run;

data have;
set have;
by Field1 Field2  Allowed_Amount;
if first.Allowed_Amount then dummy=1;
else dummy+1;
Abs_Amount=ABS(Allowed_Amount);
run;

proc sort data=have;
by  Field1	Field2	abs_Amount 	dummy;
run;

proc transpose data=test out=test_trans(where=(col2 is null) drop=abs_Amount	dummy	_NAME_);
by Field1	Field2	abs_Amount 	dummy;
var Allowed_Amount;
run;
data want;
set test_trans(rename=(col1=Allowed_Amount) drop=col2);
run;

Thanks,
Suryakiran

Ins_Analytic · Posted 09-05-2018 01:00 PM

That proc transpose step is very slick. The only issue I am having applying it to the actual data is that I have 50+ columns that the data is NOT grouped by. The proc transpose step eliminates these columns. I can join these columns back in, but the logic may be a bit challenging to get exactly what is needed. I tried modifying your original solution using merge, but I added a forward/backward dummy variable, similar to what you created for allowed:

data have;
input Field1 :$8.    Field2:$8. Allowed_Amount:dollar8.;
format Allowed_Amount dollar8.;
datalines;
A1 B2 $151.00
A1 B2 -$151.00
A1 B2 $151.00
A1 B2 $151.00
A1 B2 $220.00
A1 B2 -$151.00
A1 B2 $85.00
A1 B2 $151.00
C1 D2 $84.00
C1 D2 $111.00
C1 D2 $25.00
C1 D2 -$84.00
;
run;

proc sort data=have;
by Field1 Field2  Allowed_Amount;
run;

data have;
set have;
by Field1 Field2  Allowed_Amount;
if first.Allowed_Amount then dummy=1;
else dummy+1;
run;

proc sql;
create table test as
select * from have
order by 1,2,abs(Allowed_Amount),dummy;
quit;

data test_mod (rename=(Allowed_Amount=Allowed_Amount_ dummy=dummy_));
     set test; run;
data want (keep=Field1 Field2 Allowed_Amount);
     merge test (in=t)
             test_mod (firstobs=2);
     lag_Allowed_Amount=lag(Allowed_Amount);
     lag_dummy=lag(dummy);
     Lag_Field1=lag(field1);
     if t;
     if Lag_Field1^=field1 then call missing(lag_Allowed_Amount);
     if lag_dummy^=dummy then call missing (lag_dummy);
     if dummy=dummy_ and sum(Allowed_Amount,Allowed_Amount_)=0 then delete;
     if dummy=lag_dummy and sum(Allowed_Amount,lag_Allowed_Amount)=0 then delete;
run;

This provides me with the appropriate remaining allowed amounts, but field1 and field2 have shifted down from merging with the "forward a row" table. The last $151 row should still have A1/B2. Do you know a fix to this issue, using this approach? I'm worried the problem will compound and simple shift will not be enough when applied to the large data set.

Finding Duplicates using summation

Re: Finding Duplicates using summation

Re: Finding Duplicates using summation

Re: Finding Duplicates using summation

Re: Finding Duplicates using summation

Re: Finding Duplicates using summation

Re: Finding Duplicates using summation

Re: Finding Duplicates using summation

Re: Finding Duplicates using summation

Re: Finding Duplicates using summation

Re: Finding Duplicates using summation

Re: Finding Duplicates using summation

Register Today!

SAS Training: Just a Click Away