topic Re: Conditional Deduping in SAS Programming

Conditional Deduping

A_SAS_Man — Thu, 18 Jul 2019 15:59:23 GMT

I am attempting to dedupe some data, but how each group is deduped requires some extra logic than just comparing a few columns. Below is an example of my starting data.

data test;
 input order line product $5. product_descriptor $5. dollars 4.;
 datalines;
1 10 line1 shoe 200
1 10 line2 shoe 200
1 10 line3 shoe 200
2 10 line1 shoe 200
2 10 line2 shoe 200
2 10 line3 shoe 200
3 10 line4 tops 300
3 10 line5 tape 200
4 10 line6 tots 200
4 10 line7 limo 800
5 10 line8 hats 500
5 10 line9 pens 500
 ;

There are essentially three rules I'm trying to build into the deduping process.

1. If order and line match, keep the product and product descriptor with the most dollars

2. If order, line, dollars and product descriptor match, choose any 1 data line to keep (doesn't matter which one).

3. If order, line and dollars match but product descriptor is different, keep none of the lines.

If these rules are followed, it should produce a dataset like the below

data test2;
 input order line product $5. product_descriptor $5. dollars 4.;
 datalines;
1 10 line1 shoe 200
2 10 line1 shoe 200
3 10 line4 tops 300
4 10 line7 limo 800
 ;

I am guessing I need to use some sort of macro to accomplish this? I've been playing around with a few ideas but so far I'm kind of stuck. Any help is appreciated!

Edit: I have added some more description to the three scenarios to illustrate how they differ.

Re: Conditional Deduping

PeterClemmensen — Thu, 18 Jul 2019 15:49:36 GMT

If it doesn't matter which one is chosen in (2), could the logic then be the same as in (1)?

Re: Conditional Deduping

A_SAS_Man — Thu, 18 Jul 2019 15:51:29 GMT

Sorry, I had a typo on what two is. It is a slightly different scenario than one, please see updated post.

Re: Conditional Deduping

ballardw — Thu, 18 Jul 2019 15:52:53 GMT

@A_SAS_Man wrote:

I am attempting to dedupe some data, but how each group is deduped requires some extra logic than just comparing a few columns. Below is an example of my starting data.
data test;
 input order line product $5. product_descriptor $5. dollars 4.;
 datalines;
1 10 line1 shoe 200
1 10 line2 shoe 200
1 10 line3 shoe 200
2 10 line1 shoe 200
2 10 line2 shoe 200
2 10 line3 shoe 200
3 10 line4 tops 300
3 10 line5 tape 200
4 10 line6 tots 200
4 10 line7 limo 800
5 10 line8 hats 500
5 10 line9 pens 500
 ;
There are essentially three rules I'm trying to build into the deduping process.

1. If order and line match, keep the product and product descriptor with the most dollars

2. If order, line and product descriptor match, choose any 1 data line to keep (doesn't matter which one).

3. If order, line and dollars match, keep none of the lines.

If these rules are followed, it should produce a dataset like the below
data test2;
 input order line product $5. product_descriptor $5. dollars 4.;
 datalines;
1 10 line1 shoe 200
2 10 line1 shoe 200
3 10 line4 tops 300
4 10 line7 limo 800
 ;
I am guessing I need to use some sort of macro to accomplish this? I've been playing around with a few ideas but so far I'm kind of stuck. Any help is appreciated!

I think that you need to indicate which is the hierarchy, ie. which rule goes first. Since every one of the records that match rule 2 and 3 would also satisfy rule 1 you might get solutions that in effect only implement rule 1 as that is the one listed first. It actually appears that you might want 1 as the last rule to check after the other two. And you have a similar issue of which of rule 2 and 3 should actually be applied first.

Not macro should be needed but since you need to look at multiple rows simultaneously this is likely NOT going to be a single data step solution either.

Re: Conditional Deduping

A_SAS_Man — Thu, 18 Jul 2019 15:55:37 GMT

See updated post, I think the three scenarios should be distinct here. But if a hierarchy was needed I would want 1 to be the one used whenever possible, then 2, then 3.

Re: Conditional Deduping

PeterClemmensen — Thu, 18 Jul 2019 16:13:02 GMT

Ok. Does this meet your needs?

proc sort data=test out=temp;
   by order line product_descriptor dollars;
run;

data want;
   set test;
   by order;
   if first.order then do;
      _line=line;
      _pd=product_descriptor;
      _dollars=dollars;
   end;

   if last.order then do;
      if _line=line & _dollars=dollars & product_descriptor ne _pd then return;
      else output;
   end;
   retain _:;
   drop _:;
run;

Re: Conditional Deduping

A_SAS_Man — Thu, 18 Jul 2019 16:24:01 GMT

So it definitely looks like it is getting close, I think maybe my descriptions for each scenario were not precise enough so I apologize if my edit is what messed you up. The difference is it looks like for order 3, your logic is keeping product line5 and product_descriptor "tape" when it should be retaining product line4 and "tops" (due to that line having a higher dollar amount).

Re: Conditional Deduping

PeterClemmensen — Thu, 18 Jul 2019 16:30:08 GMT

Sorry, I used the wrong data in my data step. How about now?

proc sort data=test out=temp;
   by order dollars;
run;

data want;
   set temp;
   by order;
   if first.order then do;
      _line=line;
      _pd=product_descriptor;
      _dollars=dollars;
   end;

   if last.order then do;
      if _line=line & _dollars=dollars & product_descriptor ne _pd then return;
      else output;
   end;
   retain _:;
   drop _:;
run;

Re: Conditional Deduping

A_SAS_Man — Thu, 18 Jul 2019 16:58:27 GMT

That does the trick, thank you!

Re: Conditional Deduping

PeterClemmensen — Thu, 18 Jul 2019 19:15:07 GMT

Anytime, glad to help 🙂