BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
A_SAS_Man
Pyrite | Level 9

I am attempting to dedupe some data, but how each group is deduped requires some extra logic than just comparing a few columns. Below is an example of my starting data.

 

data test;
 input order line product $5. product_descriptor $5. dollars 4.;
 datalines;
1 10 line1 shoe 200
1 10 line2 shoe 200
1 10 line3 shoe 200
2 10 line1 shoe 200
2 10 line2 shoe 200
2 10 line3 shoe 200
3 10 line4 tops 300
3 10 line5 tape 200
4 10 line6 tots 200
4 10 line7 limo 800
5 10 line8 hats 500
5 10 line9 pens 500
 ;

There are essentially three rules I'm trying to build into the deduping process.

1. If order and line match, keep the product and product descriptor with the most dollars

2. If order, line, dollars and product descriptor match, choose any 1 data line to keep (doesn't matter which one).

3. If order, line and dollars match but product descriptor is different, keep none of the lines.

 

If these rules are followed, it should produce a dataset like the below

 

data test2;
 input order line product $5. product_descriptor $5. dollars 4.;
 datalines;
1 10 line1 shoe 200
2 10 line1 shoe 200
3 10 line4 tops 300
4 10 line7 limo 800
 ;

I am guessing I need to use some sort of macro to accomplish this? I've been playing around with a few ideas but so far I'm kind of stuck. Any help is appreciated!

 

Edit: I have added some more description to the three scenarios to illustrate how they differ.

1 ACCEPTED SOLUTION

Accepted Solutions
PeterClemmensen
Tourmaline | Level 20

Sorry, I used the wrong data in my data step. How about now?

 

proc sort data=test out=temp;
   by order dollars;
run;

data want;
   set temp;
   by order;
   if first.order then do;
      _line=line;
      _pd=product_descriptor;
      _dollars=dollars;
   end;

   if last.order then do;
      if _line=line & _dollars=dollars & product_descriptor ne _pd then return;
      else output;
   end;
   retain _:;
   drop _:;
run;

View solution in original post

9 REPLIES 9
PeterClemmensen
Tourmaline | Level 20

If it doesn't matter which one is chosen in (2), could the logic then be the same as in (1)?

A_SAS_Man
Pyrite | Level 9
Sorry, I had a typo on what two is. It is a slightly different scenario than one, please see updated post.
PeterClemmensen
Tourmaline | Level 20

Ok. Does this meet your needs?

 

proc sort data=test out=temp;
   by order line product_descriptor dollars;
run;

data want;
   set test;
   by order;
   if first.order then do;
      _line=line;
      _pd=product_descriptor;
      _dollars=dollars;
   end;

   if last.order then do;
      if _line=line & _dollars=dollars & product_descriptor ne _pd then return;
      else output;
   end;
   retain _:;
   drop _:;
run;
A_SAS_Man
Pyrite | Level 9

So it definitely looks like it is getting close, I think maybe my descriptions for each scenario were not precise enough so I apologize if my edit is what messed you up. The difference is it looks like for order 3, your logic is keeping product line5 and product_descriptor "tape" when it should be retaining product line4 and "tops" (due to that line having a higher dollar amount).

PeterClemmensen
Tourmaline | Level 20

Sorry, I used the wrong data in my data step. How about now?

 

proc sort data=test out=temp;
   by order dollars;
run;

data want;
   set temp;
   by order;
   if first.order then do;
      _line=line;
      _pd=product_descriptor;
      _dollars=dollars;
   end;

   if last.order then do;
      if _line=line & _dollars=dollars & product_descriptor ne _pd then return;
      else output;
   end;
   retain _:;
   drop _:;
run;
A_SAS_Man
Pyrite | Level 9
That does the trick, thank you!
PeterClemmensen
Tourmaline | Level 20

Anytime, glad to help 🙂

ballardw
Super User

@A_SAS_Man wrote:

I am attempting to dedupe some data, but how each group is deduped requires some extra logic than just comparing a few columns. Below is an example of my starting data.

 

data test;
 input order line product $5. product_descriptor $5. dollars 4.;
 datalines;
1 10 line1 shoe 200
1 10 line2 shoe 200
1 10 line3 shoe 200
2 10 line1 shoe 200
2 10 line2 shoe 200
2 10 line3 shoe 200
3 10 line4 tops 300
3 10 line5 tape 200
4 10 line6 tots 200
4 10 line7 limo 800
5 10 line8 hats 500
5 10 line9 pens 500
 ;

There are essentially three rules I'm trying to build into the deduping process.

1. If order and line match, keep the product and product descriptor with the most dollars

2. If order, line and product descriptor match, choose any 1 data line to keep (doesn't matter which one).

3. If order, line and dollars match, keep none of the lines.

 

If these rules are followed, it should produce a dataset like the below

 

data test2;
 input order line product $5. product_descriptor $5. dollars 4.;
 datalines;
1 10 line1 shoe 200
2 10 line1 shoe 200
3 10 line4 tops 300
4 10 line7 limo 800
 ;

I am guessing I need to use some sort of macro to accomplish this? I've been playing around with a few ideas but so far I'm kind of stuck. Any help is appreciated!


I think that you need to indicate which is the hierarchy, ie. which rule goes first. Since every one of the records that match rule 2 and 3 would also satisfy rule 1 you might get solutions that in effect only implement rule 1 as that is the one listed first. It actually appears that you might want 1 as the last rule to check after the other two. And you have a similar issue of which of rule 2 and 3 should actually be applied first.

 

Not macro should be needed but since you need to look at multiple rows simultaneously this is likely NOT going to be a single data step solution either. 

A_SAS_Man
Pyrite | Level 9
See updated post, I think the three scenarios should be distinct here. But if a hierarchy was needed I would want 1 to be the one used whenever possible, then 2, then 3.

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 9 replies
  • 1446 views
  • 0 likes
  • 3 in conversation