I need to find duplicate rows based on 10 out of 30 fields. I have been using proc sql and group by to accomplish this. The only issue is that I need to also need to find these rows unique to a summed field (allowed_amount). Here is an example of my current output, simplified to only 2 fields grouped:
Field 1 | Field 2 | Allowed_Amount |
A1 | B2 | -$151.00 |
A1 | B2 | $151.00 |
A1 | B2 | $220.00 |
C1 | D2 | $84.00 |
C1 | D2 | $111.00 |
C1 | D2 | $25.00 |
C1 | D2 | -$84.00 |
Since I am grouped by field1 and field 2, the 3 rows with A1 and B2 are returned. The way I need to find duplicates, rows with -151 and 151 should cancel, leaving only the row with 220. In the final table, this should not count as a duplicate.
For C1 and D2 the -84 and 84 would also cancel, but since there are unique rows with this grouping (111 and 25), I need only the 111 and 25 returned as duplicates. I am having issues finding a way to return the 111 and 25 for C1/D2 while ignoring the rest.
I can solve the first issue by summing the alloweds in an intermediate table then regrouping, but this will not return the duplicates involved in the second example. Please let me know if you need anymore clarity.
I understand your problem, proc transpose with logical approach might work for you.
Test the below code and see if that works for your data
data have;
input Field1 :$8. Field2:$8. Allowed_Amount:dollar8.;
format Allowed_Amount dollar8.;
datalines;
A1 B2 $151.00
A1 B2 -$151.00
A1 B2 $151.00
A1 B2 $220.00
A1 B2 -$151.00
A1 B2 $85.00
A1 B2 $151.00
A1 B2 $151.00
C1 D2 $84.00
C1 D2 $111.00
C1 D2 $25.00
C1 D2 -$84.00
;
run;
proc sort data=have;
by Field1 Field2 Allowed_Amount;
run;
data have;
set have;
by Field1 Field2 Allowed_Amount;
if first.Allowed_Amount then dummy=1;
else dummy+1;
Abs_Amount=ABS(Allowed_Amount);
run;
proc sort data=have;
by Field1 Field2 abs_Amount dummy;
run;
proc transpose data=test out=test_trans(where=(col2 is null) drop=abs_Amount dummy _NAME_);
by Field1 Field2 abs_Amount dummy;
var Allowed_Amount;
run;
data want;
set test_trans(rename=(col1=Allowed_Amount) drop=col2);
run;
Try this:
data have;
input Field1 :$8. Field2:$8. Allowed_Amount:dollar8.;
format Allowed_Amount dollar8.;
datalines;
A1 B2 -$151.00
A1 B2 $151.00
A1 B2 $220.00
C1 D2 $84.00
C1 D2 $111.00
C1 D2 $25.00
C1 D2 -$84.00
;
run;
proc sql;
create table test as
select * from have
order by 1,2,abs(Allowed_Amount);
quit;
data want(keep=Field1 Field2 Allowed_Amount);
merge test test(firstobs=2 rename=(Allowed_Amount=Allowed_Amount_) keep=Allowed_Amount);
lag_Allowed_Amount=lag(Allowed_Amount);
Lag_Field1=lag(field1);
if Lag_Field1^=field1 then call missing(lag_Allowed_Amount);
if sum(Allowed_Amount,Allowed_Amount_)=0 or sum(Allowed_Amount,lag_Allowed_Amount)=0 then delete;
run;
I was previously unfamiliar with the lag function so thank you for bringing that to my attention. I am running into one issue, though.
Let's say that field1/field2 A1/B2 also has another row with 151. Then there are three total rows for A1/B2 and the final data set should still include one row with 151 since the first 2 rows cancel out. With the above given code, all rows are removed. I cannot figure out how to adjust this properly.
Remove that duplicate record using PROC SORT NODUPKEY.
proc sort data=have nodupkey;
by Field1 Field2 Allowed_Amount;
run;
I need the duplicate record in the final data set. Please let me clarify. I have added a row here:
Field 1 | Field 2 | Allowed_Amount |
A1 | B2 | $151.00 |
A1 | B2 | ($151.00) |
A1 | B2 | $151.00 |
A1 | B2 | $220.00 |
C1 | D2 | $84.00 |
C1 | D2 | $111.00 |
C1 | D2 | $25.00 |
C1 | D2 | ($84.00) |
I need the output to have:
Field 1 | Field 2 | Allowed_Amount |
A1 | B2 | $151.00 |
A1 | B2 | $220.00 |
C1 | D2 | $111.00 |
C1 | D2 | $25.00 |
Please check your code, my previous solution works fine with that extra $150. I guess your sort order is the reason why all are canceled. If the order is $150, ($150), $150 then my logic will cancel all 3 records. If the order is ($150), $150, $150 then only first 2 records will be canceled.
data have;
input Field1 :$8. Field2:$8. Allowed_Amount:dollar8.;
format Allowed_Amount dollar8.;
datalines;
A1 B2 -$151.00
A1 B2 $151.00
A1 B2 $151.00
A1 B2 $220.00
C1 D2 $84.00
C1 D2 $111.00
C1 D2 $25.00
C1 D2 -$84.00
;
run;
proc sql;
create table test as
select * from have
order by 1,2,abs(Allowed_Amount);
quit;
data want(keep=Field1 Field2 Allowed_Amount);
merge test test(firstobs=2 rename=(Allowed_Amount=Allowed_Amount_) keep=Allowed_Amount);
lag_Allowed_Amount=lag(Allowed_Amount);
Lag_Field1=lag(field1);
if Lag_Field1^=field1 then call missing(lag_Allowed_Amount);
if sum(Allowed_Amount,Allowed_Amount_)=0 or sum(Allowed_Amount,lag_Allowed_Amount)=0 then delete;
run;
You are correct, if the rows alternate positive, negative, positive then all are deleted, which I am trying to avoid. This is also an issue in my original dataset, as the allowed amounts can be in any order. I attempted to control for this by first ordering by 1,2,allowed_amount and then ordering by 1,2,abs(allowed_amount) but it appears the rows revert back to their original order? I am unsure why this is happening:
proc sql;
create table test1 as
select * from have
order by 1,2,Allowed_Amount;
quit;
proc sql;
create table test2 as
select * from test1
order by 1,2,abs(Allowed_Amount);
quit;
/*
test2 Results:
Field1 Field2 Allowed_Amount
A1 B2 $151
A1 B2 $-151
A1 B2 $151
A1 B2 $220
C1 D2 $25
C1 D2 $-84
C1 D2 $84
C1 D2 $111
*/
When this happens, then all rows are cancelled. I have the same issue with my actual data set. When the rows are manually sorted to positive, positive, negative then the code gives the desired results. My issue is that there can be any number of rows that need to be cancelled. So this logic works with two positives and a negative, but fails to cancel all rows with two positives and two negatives.
For example:
data have;
input Field1 :$8. Field2:$8. Allowed_Amount:dollar8.;
format Allowed_Amount dollar8.;
datalines;
A1 B2 $151.00
A1 B2 -$151.00
A1 B2 $151.00
A1 B2 $220.00
A1 B2 -$151.00
C1 D2 $84.00
C1 D2 $111.00
C1 D2 $25.00
C1 D2 -$84.00
;
run;
proc sql;
create table test1 as
select * from have
order by 1,2,Allowed_Amount;
quit;
proc sql;
create table test2 as
select * from test1
order by 1,2,abs(Allowed_Amount);
quit;
/*
test2 results:
Field1 Field2 Allowed_Amount
A1 B2 $151
A1 B2 $151
A1 B2 $-151
A1 B2 $-151
A1 B2 $220
C1 D2 $25
C1 D2 $-84
C1 D2 $84
C1 D2 $111
*/
So for three rows, the modified order works with your code. With four rows, you original code that orders by abs(allowed_amount) gives me what I'm looking for. I am having issues adjusting the code to fit all combinations of canceling rows, in my original data set. For example:
data have;
input Field1 :$8. Field2:$8. Allowed_Amount:dollar8.;
format Allowed_Amount dollar8.;
datalines;
A1 B2 $151.00
A1 B2 -$151.00
A1 B2 $151.00
A1 B2 $220.00
A1 B2 -$151.00
A1 B2 $85.00
A1 B2 $151.00
C1 D2 $84.00
C1 D2 $111.00
C1 D2 $25.00
C1 D2 -$84.00
;
run;
proc sql;
create table test1 as
select * from have
order by 1,2,abs(Allowed_Amount);
quit;
test1 results:
A1 B2 $85
A1 B2 $151
A1 B2 $-151
A1 B2 $151
A1 B2 $151
A1 B2 $-151
A1 B2 $220
C1 D2 $25
C1 D2 $-84
C1 D2 $84
C1 D2 $111
Want results:
A1 B2 $85
A1 B2 $220
C1 D2 $25
C1 D2 $111
Proper sorting of the records will resolve the issue, I would create a dummy variable to create the order.
proc sort data=have;
by Field1 Field2 Allowed_Amount;
run;
data have;
set have;
by Field1 Field2 Allowed_Amount;
if first.Allowed_Amount then dummy=1;
else dummy+1;
run;
proc sql;
create table test as
select * from have
order by 1,2,abs(Allowed_Amount),dummy;
quit;
data want(keep=Field1 Field2 Allowed_Amount);
merge test test(firstobs=2 rename=(Allowed_Amount=Allowed_Amount_) keep=Allowed_Amount);
lag_Allowed_Amount=lag(Allowed_Amount);
Lag_Field1=lag(field1);
if Lag_Field1^=field1 then call missing(lag_Allowed_Amount);
if sum(Allowed_Amount,Allowed_Amount_)=0 or sum(Allowed_Amount,lag_Allowed_Amount)=0 then delete;
run;
Using the modifications to your code above, with one extra dataline to reflect a problem:
data have;
input Field1 :$8. Field2:$8. Allowed_Amount:dollar8.;
format Allowed_Amount dollar8.;
datalines;
A1 B2 $151.00
A1 B2 -$151.00
A1 B2 $151.00
A1 B2 $220.00
A1 B2 -$151.00
A1 B2 $85.00
A1 B2 $151.00
A1 B2 $151.00
C1 D2 $84.00
C1 D2 $111.00
C1 D2 $25.00
C1 D2 -$84.00
;
run;
I am still missing a line with $151. Looking at how the logic is applied, it looks like the error occurs when the logic finds sum(allowed_amount,lag_allowed_amount)=0 on the row with allowed_amount=$151 and dummy=3 in the test table. The row above this with -$151 allowed_amount has already been "used" to cancel out 2 rows above. I wanted to try and incorporate some logic with the dummy variable to help prevent this, but for some reason I cannot rename dummy=dummy_ in a similar way allowed_amount=allowed_amount_:
data want;
merge test
test(firstobs=2 rename=(Allowed_Amount=Allowed_Amount_ dummy=dummy_)
keep=Allowed_Amount);
lag_Allowed_Amount=lag(Allowed_Amount);
lag_dummy=lag(dummy);
Lag_Field1=lag(field1);
if Lag_Field1^=field1 then call missing(lag_Allowed_Amount);
if lag_dummy^=dummy then call missing (lag_dummy);
if sum(Allowed_Amount,lag_Allowed_Amount)=0 then delete;
if dummy^=dummy_ and sum(Allowed_Amount,Allowed_Amount_)=0 then delete /*I haven't been able to experiment with this since I can't rename dummy*/;
run;
I would like to add logic so that the line is only deleted a row when sum(allowed_amount,allowed_amount_)=0 when the 2 dummy variables are equal. I think this will account for all issues. Does this make sense, based on the problem above? How can I rename the dummy variable?
I understand your problem, proc transpose with logical approach might work for you.
Test the below code and see if that works for your data
data have;
input Field1 :$8. Field2:$8. Allowed_Amount:dollar8.;
format Allowed_Amount dollar8.;
datalines;
A1 B2 $151.00
A1 B2 -$151.00
A1 B2 $151.00
A1 B2 $220.00
A1 B2 -$151.00
A1 B2 $85.00
A1 B2 $151.00
A1 B2 $151.00
C1 D2 $84.00
C1 D2 $111.00
C1 D2 $25.00
C1 D2 -$84.00
;
run;
proc sort data=have;
by Field1 Field2 Allowed_Amount;
run;
data have;
set have;
by Field1 Field2 Allowed_Amount;
if first.Allowed_Amount then dummy=1;
else dummy+1;
Abs_Amount=ABS(Allowed_Amount);
run;
proc sort data=have;
by Field1 Field2 abs_Amount dummy;
run;
proc transpose data=test out=test_trans(where=(col2 is null) drop=abs_Amount dummy _NAME_);
by Field1 Field2 abs_Amount dummy;
var Allowed_Amount;
run;
data want;
set test_trans(rename=(col1=Allowed_Amount) drop=col2);
run;
That proc transpose step is very slick. The only issue I am having applying it to the actual data is that I have 50+ columns that the data is NOT grouped by. The proc transpose step eliminates these columns. I can join these columns back in, but the logic may be a bit challenging to get exactly what is needed. I tried modifying your original solution using merge, but I added a forward/backward dummy variable, similar to what you created for allowed:
data have;
input Field1 :$8. Field2:$8. Allowed_Amount:dollar8.;
format Allowed_Amount dollar8.;
datalines;
A1 B2 $151.00
A1 B2 -$151.00
A1 B2 $151.00
A1 B2 $151.00
A1 B2 $220.00
A1 B2 -$151.00
A1 B2 $85.00
A1 B2 $151.00
C1 D2 $84.00
C1 D2 $111.00
C1 D2 $25.00
C1 D2 -$84.00
;
run;
proc sort data=have;
by Field1 Field2 Allowed_Amount;
run;
data have;
set have;
by Field1 Field2 Allowed_Amount;
if first.Allowed_Amount then dummy=1;
else dummy+1;
run;
proc sql;
create table test as
select * from have
order by 1,2,abs(Allowed_Amount),dummy;
quit;
data test_mod (rename=(Allowed_Amount=Allowed_Amount_ dummy=dummy_));
set test; run;
data want (keep=Field1 Field2 Allowed_Amount);
merge test (in=t)
test_mod (firstobs=2);
lag_Allowed_Amount=lag(Allowed_Amount);
lag_dummy=lag(dummy);
Lag_Field1=lag(field1);
if t;
if Lag_Field1^=field1 then call missing(lag_Allowed_Amount);
if lag_dummy^=dummy then call missing (lag_dummy);
if dummy=dummy_ and sum(Allowed_Amount,Allowed_Amount_)=0 then delete;
if dummy=lag_dummy and sum(Allowed_Amount,lag_Allowed_Amount)=0 then delete;
run;
This provides me with the appropriate remaining allowed amounts, but field1 and field2 have shifted down from merging with the "forward a row" table. The last $151 row should still have A1/B2. Do you know a fix to this issue, using this approach? I'm worried the problem will compound and simple shift will not be enough when applied to the large data set.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.