BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
A_SAS_Man
Pyrite | Level 9

I am attempting to dedupe some data, but how each group is deduped requires some extra logic than just comparing a few columns. Below is an example of my starting data.

 

data test;
 input order line product $5. product_descriptor $5. dollars 4.;
 datalines;
1 10 line1 shoe 200
1 10 line2 shoe 200
1 10 line3 shoe 200
2 10 line1 shoe 200
2 10 line2 shoe 200
2 10 line3 shoe 200
3 10 line4 tops 300
3 10 line5 tape 200
4 10 line6 tots 200
4 10 line7 limo 800
5 10 line8 hats 500
5 10 line9 pens 500
 ;

There are essentially three rules I'm trying to build into the deduping process.

1. If order and line match, keep the product and product descriptor with the most dollars

2. If order, line, dollars and product descriptor match, choose any 1 data line to keep (doesn't matter which one).

3. If order, line and dollars match but product descriptor is different, keep none of the lines.

 

If these rules are followed, it should produce a dataset like the below

 

data test2;
 input order line product $5. product_descriptor $5. dollars 4.;
 datalines;
1 10 line1 shoe 200
2 10 line1 shoe 200
3 10 line4 tops 300
4 10 line7 limo 800
 ;

I am guessing I need to use some sort of macro to accomplish this? I've been playing around with a few ideas but so far I'm kind of stuck. Any help is appreciated!

 

Edit: I have added some more description to the three scenarios to illustrate how they differ.

1 ACCEPTED SOLUTION

Accepted Solutions
PeterClemmensen
Tourmaline | Level 20

Sorry, I used the wrong data in my data step. How about now?

 

proc sort data=test out=temp;
   by order dollars;
run;

data want;
   set temp;
   by order;
   if first.order then do;
      _line=line;
      _pd=product_descriptor;
      _dollars=dollars;
   end;

   if last.order then do;
      if _line=line & _dollars=dollars & product_descriptor ne _pd then return;
      else output;
   end;
   retain _:;
   drop _:;
run;

View solution in original post

9 REPLIES 9
PeterClemmensen
Tourmaline | Level 20

If it doesn't matter which one is chosen in (2), could the logic then be the same as in (1)?

A_SAS_Man
Pyrite | Level 9
Sorry, I had a typo on what two is. It is a slightly different scenario than one, please see updated post.
PeterClemmensen
Tourmaline | Level 20

Ok. Does this meet your needs?

 

proc sort data=test out=temp;
   by order line product_descriptor dollars;
run;

data want;
   set test;
   by order;
   if first.order then do;
      _line=line;
      _pd=product_descriptor;
      _dollars=dollars;
   end;

   if last.order then do;
      if _line=line & _dollars=dollars & product_descriptor ne _pd then return;
      else output;
   end;
   retain _:;
   drop _:;
run;
A_SAS_Man
Pyrite | Level 9

So it definitely looks like it is getting close, I think maybe my descriptions for each scenario were not precise enough so I apologize if my edit is what messed you up. The difference is it looks like for order 3, your logic is keeping product line5 and product_descriptor "tape" when it should be retaining product line4 and "tops" (due to that line having a higher dollar amount).

PeterClemmensen
Tourmaline | Level 20

Sorry, I used the wrong data in my data step. How about now?

 

proc sort data=test out=temp;
   by order dollars;
run;

data want;
   set temp;
   by order;
   if first.order then do;
      _line=line;
      _pd=product_descriptor;
      _dollars=dollars;
   end;

   if last.order then do;
      if _line=line & _dollars=dollars & product_descriptor ne _pd then return;
      else output;
   end;
   retain _:;
   drop _:;
run;
A_SAS_Man
Pyrite | Level 9
That does the trick, thank you!
PeterClemmensen
Tourmaline | Level 20

Anytime, glad to help 🙂

ballardw
Super User

@A_SAS_Man wrote:

I am attempting to dedupe some data, but how each group is deduped requires some extra logic than just comparing a few columns. Below is an example of my starting data.

 

data test;
 input order line product $5. product_descriptor $5. dollars 4.;
 datalines;
1 10 line1 shoe 200
1 10 line2 shoe 200
1 10 line3 shoe 200
2 10 line1 shoe 200
2 10 line2 shoe 200
2 10 line3 shoe 200
3 10 line4 tops 300
3 10 line5 tape 200
4 10 line6 tots 200
4 10 line7 limo 800
5 10 line8 hats 500
5 10 line9 pens 500
 ;

There are essentially three rules I'm trying to build into the deduping process.

1. If order and line match, keep the product and product descriptor with the most dollars

2. If order, line and product descriptor match, choose any 1 data line to keep (doesn't matter which one).

3. If order, line and dollars match, keep none of the lines.

 

If these rules are followed, it should produce a dataset like the below

 

data test2;
 input order line product $5. product_descriptor $5. dollars 4.;
 datalines;
1 10 line1 shoe 200
2 10 line1 shoe 200
3 10 line4 tops 300
4 10 line7 limo 800
 ;

I am guessing I need to use some sort of macro to accomplish this? I've been playing around with a few ideas but so far I'm kind of stuck. Any help is appreciated!


I think that you need to indicate which is the hierarchy, ie. which rule goes first. Since every one of the records that match rule 2 and 3 would also satisfy rule 1 you might get solutions that in effect only implement rule 1 as that is the one listed first. It actually appears that you might want 1 as the last rule to check after the other two. And you have a similar issue of which of rule 2 and 3 should actually be applied first.

 

Not macro should be needed but since you need to look at multiple rows simultaneously this is likely NOT going to be a single data step solution either. 

A_SAS_Man
Pyrite | Level 9
See updated post, I think the three scenarios should be distinct here. But if a hierarchy was needed I would want 1 to be the one used whenever possible, then 2, then 3.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 916 views
  • 0 likes
  • 3 in conversation