Solved: Re: Conditional Deduping

A_SAS_Man · Posted 07-18-2019 11:45 AM

I am attempting to dedupe some data, but how each group is deduped requires some extra logic than just comparing a few columns. Below is an example of my starting data.

data test;
 input order line product $5. product_descriptor $5. dollars 4.;
 datalines;
1 10 line1 shoe 200
1 10 line2 shoe 200
1 10 line3 shoe 200
2 10 line1 shoe 200
2 10 line2 shoe 200
2 10 line3 shoe 200
3 10 line4 tops 300
3 10 line5 tape 200
4 10 line6 tots 200
4 10 line7 limo 800
5 10 line8 hats 500
5 10 line9 pens 500
 ;

There are essentially three rules I'm trying to build into the deduping process.

1. If order and line match, keep the product and product descriptor with the most dollars

2. If order, line, dollars and product descriptor match, choose any 1 data line to keep (doesn't matter which one).

3. If order, line and dollars match but product descriptor is different, keep none of the lines.

If these rules are followed, it should produce a dataset like the below

data test2;
 input order line product $5. product_descriptor $5. dollars 4.;
 datalines;
1 10 line1 shoe 200
2 10 line1 shoe 200
3 10 line4 tops 300
4 10 line7 limo 800
 ;

I am guessing I need to use some sort of macro to accomplish this? I've been playing around with a few ideas but so far I'm kind of stuck. Any help is appreciated!

Edit: I have added some more description to the three scenarios to illustrate how they differ.

PeterClemmensen · Posted 07-18-2019 12:29 PM

Sorry, I used the wrong data in my data step. How about now?

proc sort data=test out=temp;
   by order dollars;
run;

data want;
   set temp;
   by order;
   if first.order then do;
      _line=line;
      _pd=product_descriptor;
      _dollars=dollars;
   end;

   if last.order then do;
      if _line=line & _dollars=dollars & product_descriptor ne _pd then return;
      else output;
   end;
   retain _:;
   drop _:;
run;

The DATA to DATA Step Macro
Blog: SASnrd

View solution in original post

PeterClemmensen · Posted 07-18-2019 11:49 AM

If it doesn't matter which one is chosen in (2), could the logic then be the same as in (1)?

The DATA to DATA Step Macro
Blog: SASnrd

A_SAS_Man · Posted 07-18-2019 11:51 AM

Sorry, I had a typo on what two is. It is a slightly different scenario than one, please see updated post.

PeterClemmensen · Posted 07-18-2019 12:13 PM

Ok. Does this meet your needs?

proc sort data=test out=temp;
   by order line product_descriptor dollars;
run;

data want;
   set test;
   by order;
   if first.order then do;
      _line=line;
      _pd=product_descriptor;
      _dollars=dollars;
   end;

   if last.order then do;
      if _line=line & _dollars=dollars & product_descriptor ne _pd then return;
      else output;
   end;
   retain _:;
   drop _:;
run;

The DATA to DATA Step Macro
Blog: SASnrd

A_SAS_Man · Posted 07-18-2019 12:24 PM

So it definitely looks like it is getting close, I think maybe my descriptions for each scenario were not precise enough so I apologize if my edit is what messed you up. The difference is it looks like for order 3, your logic is keeping product line5 and product_descriptor "tape" when it should be retaining product line4 and "tops" (due to that line having a higher dollar amount).

PeterClemmensen · Posted 07-18-2019 12:29 PM

Sorry, I used the wrong data in my data step. How about now?

proc sort data=test out=temp;
   by order dollars;
run;

data want;
   set temp;
   by order;
   if first.order then do;
      _line=line;
      _pd=product_descriptor;
      _dollars=dollars;
   end;

   if last.order then do;
      if _line=line & _dollars=dollars & product_descriptor ne _pd then return;
      else output;
   end;
   retain _:;
   drop _:;
run;

The DATA to DATA Step Macro
Blog: SASnrd

A_SAS_Man · Posted 07-18-2019 12:58 PM

That does the trick, thank you!

PeterClemmensen · Posted 07-18-2019 03:15 PM

Anytime, glad to help 🙂

The DATA to DATA Step Macro
Blog: SASnrd

ballardw · Posted 07-18-2019 11:52 AM

@A_SAS_Man wrote:

I am attempting to dedupe some data, but how each group is deduped requires some extra logic than just comparing a few columns. Below is an example of my starting data.
data test;
 input order line product $5. product_descriptor $5. dollars 4.;
 datalines;
1 10 line1 shoe 200
1 10 line2 shoe 200
1 10 line3 shoe 200
2 10 line1 shoe 200
2 10 line2 shoe 200
2 10 line3 shoe 200
3 10 line4 tops 300
3 10 line5 tape 200
4 10 line6 tots 200
4 10 line7 limo 800
5 10 line8 hats 500
5 10 line9 pens 500
 ;
There are essentially three rules I'm trying to build into the deduping process.

1. If order and line match, keep the product and product descriptor with the most dollars

2. If order, line and product descriptor match, choose any 1 data line to keep (doesn't matter which one).

3. If order, line and dollars match, keep none of the lines.

If these rules are followed, it should produce a dataset like the below
data test2;
 input order line product $5. product_descriptor $5. dollars 4.;
 datalines;
1 10 line1 shoe 200
2 10 line1 shoe 200
3 10 line4 tops 300
4 10 line7 limo 800
 ;
I am guessing I need to use some sort of macro to accomplish this? I've been playing around with a few ideas but so far I'm kind of stuck. Any help is appreciated!

I think that you need to indicate which is the hierarchy, ie. which rule goes first. Since every one of the records that match rule 2 and 3 would also satisfy rule 1 you might get solutions that in effect only implement rule 1 as that is the one listed first. It actually appears that you might want 1 as the last rule to check after the other two. And you have a similar issue of which of rule 2 and 3 should actually be applied first.

Not macro should be needed but since you need to look at multiple rows simultaneously this is likely NOT going to be a single data step solution either.

A_SAS_Man · Posted 07-18-2019 11:55 AM

See updated post, I think the three scenarios should be distinct here. But if a hierarchy was needed I would want 1 to be the one used whenever possible, then 2, then 3.

SAS Innovate 2025: Register Now

SAS Training: Just a Click Away