I am attempting to dedupe some data, but how each group is deduped requires some extra logic than just comparing a few columns. Below is an example of my starting data.
data test;
input order line product $5. product_descriptor $5. dollars 4.;
datalines;
1 10 line1 shoe 200
1 10 line2 shoe 200
1 10 line3 shoe 200
2 10 line1 shoe 200
2 10 line2 shoe 200
2 10 line3 shoe 200
3 10 line4 tops 300
3 10 line5 tape 200
4 10 line6 tots 200
4 10 line7 limo 800
5 10 line8 hats 500
5 10 line9 pens 500
;
There are essentially three rules I'm trying to build into the deduping process.
1. If order and line match, keep the product and product descriptor with the most dollars
2. If order, line, dollars and product descriptor match, choose any 1 data line to keep (doesn't matter which one).
3. If order, line and dollars match but product descriptor is different, keep none of the lines.
If these rules are followed, it should produce a dataset like the below
data test2;
input order line product $5. product_descriptor $5. dollars 4.;
datalines;
1 10 line1 shoe 200
2 10 line1 shoe 200
3 10 line4 tops 300
4 10 line7 limo 800
;
I am guessing I need to use some sort of macro to accomplish this? I've been playing around with a few ideas but so far I'm kind of stuck. Any help is appreciated!
Edit: I have added some more description to the three scenarios to illustrate how they differ.
Sorry, I used the wrong data in my data step. How about now?
proc sort data=test out=temp;
by order dollars;
run;
data want;
set temp;
by order;
if first.order then do;
_line=line;
_pd=product_descriptor;
_dollars=dollars;
end;
if last.order then do;
if _line=line & _dollars=dollars & product_descriptor ne _pd then return;
else output;
end;
retain _:;
drop _:;
run;
If it doesn't matter which one is chosen in (2), could the logic then be the same as in (1)?
Ok. Does this meet your needs?
proc sort data=test out=temp;
by order line product_descriptor dollars;
run;
data want;
set test;
by order;
if first.order then do;
_line=line;
_pd=product_descriptor;
_dollars=dollars;
end;
if last.order then do;
if _line=line & _dollars=dollars & product_descriptor ne _pd then return;
else output;
end;
retain _:;
drop _:;
run;
So it definitely looks like it is getting close, I think maybe my descriptions for each scenario were not precise enough so I apologize if my edit is what messed you up. The difference is it looks like for order 3, your logic is keeping product line5 and product_descriptor "tape" when it should be retaining product line4 and "tops" (due to that line having a higher dollar amount).
Sorry, I used the wrong data in my data step. How about now?
proc sort data=test out=temp;
by order dollars;
run;
data want;
set temp;
by order;
if first.order then do;
_line=line;
_pd=product_descriptor;
_dollars=dollars;
end;
if last.order then do;
if _line=line & _dollars=dollars & product_descriptor ne _pd then return;
else output;
end;
retain _:;
drop _:;
run;
Anytime, glad to help 🙂
@A_SAS_Man wrote:
I am attempting to dedupe some data, but how each group is deduped requires some extra logic than just comparing a few columns. Below is an example of my starting data.
data test; input order line product $5. product_descriptor $5. dollars 4.; datalines; 1 10 line1 shoe 200 1 10 line2 shoe 200 1 10 line3 shoe 200 2 10 line1 shoe 200 2 10 line2 shoe 200 2 10 line3 shoe 200 3 10 line4 tops 300 3 10 line5 tape 200 4 10 line6 tots 200 4 10 line7 limo 800 5 10 line8 hats 500 5 10 line9 pens 500 ;
There are essentially three rules I'm trying to build into the deduping process.
1. If order and line match, keep the product and product descriptor with the most dollars
2. If order, line and product descriptor match, choose any 1 data line to keep (doesn't matter which one).
3. If order, line and dollars match, keep none of the lines.
If these rules are followed, it should produce a dataset like the below
data test2; input order line product $5. product_descriptor $5. dollars 4.; datalines; 1 10 line1 shoe 200 2 10 line1 shoe 200 3 10 line4 tops 300 4 10 line7 limo 800 ;
I am guessing I need to use some sort of macro to accomplish this? I've been playing around with a few ideas but so far I'm kind of stuck. Any help is appreciated!
I think that you need to indicate which is the hierarchy, ie. which rule goes first. Since every one of the records that match rule 2 and 3 would also satisfy rule 1 you might get solutions that in effect only implement rule 1 as that is the one listed first. It actually appears that you might want 1 as the last rule to check after the other two. And you have a similar issue of which of rule 2 and 3 should actually be applied first.
Not macro should be needed but since you need to look at multiple rows simultaneously this is likely NOT going to be a single data step solution either.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.