Solved: Removing Duplicate rows with one based on multiple columns

Sujithpeta · Posted 04-04-2017 02:55 PM

Hello,

I've a dataset which has customer purchases and returns information. When a customer returns an item the purchased record is duplicated with negative cost and quantitity value (Please see the image below). Even though it's a duplicate it's given an unique sequence number and record number. I want to remove the records which have the purchase and it's returns by looking at the records which have similar entrier across all variables (except for Sequence and Record No).

Dataset (have):

Dataset (want):

Thank you!

Ksharp · Posted 04-05-2017 10:19 AM

You can do this like the following, 
I have no time to make dataset.

data positive negative;
 set have;
 if cost <0 then output negative;
 else output positive;
run;

data negative;
 set negative;
keep custome_id quality cost;
quality=-quality;
cost=-cost;
run;

data want;
 merge positive(in=ina) negative(in=inb);
 by customer_id quality cost;
 if not (ina and inb);
run;

View solution in original post

PaigeMiller · Posted 04-04-2017 03:09 PM

I have thought about this, and I don't think there's a unique solution the way you have described it. Perhaps it doesn't matter to you that the solution is not unique, but I don't know if it matters or not.

If there is a second purchase for $200 on 5/4/2012, which of the $200 purchases do you delete? Or doesn't it matter?

There also seems to be no logical way to pair sequence #6 with sequence #11, unless additional assumptions are made. Please clarify.

--
Paige Miller

Sujithpeta · Posted 04-04-2017 03:41 PM

Hello,

It doesn't matter if $200 purchase is of the same service and all other variables (Category1, Category2, Date and Quantity) are same.

Every purchase and it's returns are in pairs, If I drop "Sequence No" and "Record No" then you can see every variable is similar, only cost and quantity are in compliment values (Purchase +ve and Return -ve).

I'm new to the SAS tool, don't know if it needs an unique anwser or not. It would be great help if you come up with one.

Thanks

rogerjdeangelis · Posted 04-04-2017 03:51 PM

Removing Duplicate rows with one based on multiple columns

this post
https://goo.gl/0x7a4v
https://communities.sas.com/t5/General-SAS-Programming/Removing-Duplicate-rows-with-one-based-on-multiple-columns/m-p/347150

Keep the first record where num1-num7 are in a dup group, even though seq and cid are unique

HAVE
====

Up to 40 obs WORK.HAVE total obs=271

Obs    SEQ    CID     NUM1    NUM2    NUM3    NUM4    NUM5    NUM6    NUM7

  1    000    0000     578     393     613     704     928     813     125 DUPS
  2    X00    X000     578     393     613     704     928     813     125
  3    Z00    Z000     578     393     613     704     928     813     125

  4    731    5731     514     301     802      34     703     218     338

  5    462    1462     839     114     283     549      73      89     748

  6    193    7193     457     379     774     663     240     344     785

  7    924    2924     343     384      64     491     362     314     361

  8    655    8655     362     431     598     698     946      56     362 DUPS
  9    X55    X655     362     431     598     698     946      56     362

 10    386    4386     321      15     788     932     639     391     194 DUPS
 11    X86    X386     321      15     788     932     639     391     194
 12    Z86    Z386     321      15     788     932     639     391     194

 13    117    0117     772     826     164     560     317     986     897
 14    848    5848     856     911      99     742      75     125     528

 15    579    1579      29     995     879     992     716     961     108 DUPS
 16    X79    X579      29     995     879     992     716     961     108
 17    Z79    Z579      29     995     879     992     716     961     108

WANT
===

Obs    SEQ    CID     NUM1    NUM2    NUM3    NUM4    NUM5    NUM6    NUM7

  1    000    0000     578     393     613     704     928     813     125

  4    731    5731     514     301     802      34     703     218     338

  5    462    1462     839     114     283     549      73      89     748

  6    193    7193     457     379     774     663     240     344     785

  7    924    2924     343     384      64     491     362     314     361

  8    655    8655     362     431     598     698     946      56     362

 10    386    4386     321      15     788     932     639     391     194

 13    117    0117     772     826     164     560     317     986     897

 14    848    5848     856     911      99     742      75     125     528

 15    579    1579      29     995     879     992     716     961     108

* create tha data;
data have(where=(num7 ne 0));
  retain seq cid;
  retain num1-num7 0;
  array therest num1-num7;
  do dig=1000000 to 2000000 by 5731;
    seq=substr(put(dig,7.),5);
    cid=substr(put(dig,7.),4);
    if uniform(5731)<.33 then do;
       do over therest;
         therest=int(1000*uniform(5764));
       end;
       output;
       substr(cid,1,1)='X';
       substr(seq,1,1)='X';
       output;
       if uniform(6754)<.66 then do;
          substr(cid,1,1)='Z';
          substr(seq,1,1)='Z';
          output;
       end;
    end;
    else do;
       do over therest;
         therest=int(1000*uniform(5764));
       end;
       output;
    end;
  end;
  drop dig;
run;quit;

* SOLUTION;

data want;
  set have;
  by num1--num7 notsorted; * note in your case record_num--cost;
  if first.num1;
run;quit;

Sujithpeta · Posted 04-04-2017 05:25 PM

Hello, Thanks for trying to help me. Your solution doesn't remove the purchase and return pairs (Which I want) but keeps the purchase and remove the returns.

Thanks

rogerjdeangelis · Posted 04-04-2017 07:02 PM

My mistake. I think this may be what you want.

data have(where=(num7 ne 0));
  retain seq cid;
  retain num1-num7 0;
  array therest num1-num7;
  do dig=1000000 to 2000000 by 5731;
    seq=substr(put(dig,7.),5);
    cid=substr(put(dig,7.),4);
    if uniform(5731)<.33 then do;
       do over therest;
         therest=int(1000*uniform(5764));
       end;
       output;
       substr(cid,1,1)='X';
       substr(seq,1,1)='X';
       output;
       if uniform(6754)<.66 then do;
          substr(cid,1,1)='Z';
          substr(seq,1,1)='Z';
          output;
       end;
    end;
    else do;
       do over therest;
         therest=int(1000*uniform(5764));
       end;
       output;
    end;
  end;
  drop dig;
run;quit;

* SOLUTION;

data want;
  set have;
  by num1--num7 notsorted; * note in your case record_num--cost;
  if first.num7 and last.num7;
run;quit;


Up to 40 obs WORK.WANT total obs=114

Obs    SEQ    CID     NUM1    NUM2    NUM3    NUM4    NUM5    NUM6    NUM7

  1    731    5731     514     301     802      34     703     218     338
  2    462    1462     839     114     283     549      73      89     748
  3    193    7193     457     379     774     663     240     344     785
  4    924    2924     343     384      64     491     362     314     361
  5    117    0117     772     826     164     560     317     986     897
  6    848    5848     856     911      99     742      75     125     528
  7    310    7310     599      37     836      77     101     829     245
  8    041    3041     395      33     557     885     650     990     343
  9    503    4503     744     490     117     202     182     869     767
 10    965    5965     132      17     475     692     988     401     334

Sujithpeta · Posted 04-05-2017 10:18 AM

Hello Roger, I ran the code, there is no change in the dataset.

Ksharp · Posted 04-05-2017 10:19 AM

You can do this like the following, 
I have no time to make dataset.

data positive negative;
 set have;
 if cost <0 then output negative;
 else output positive;
run;

data negative;
 set negative;
keep custome_id quality cost;
quality=-quality;
cost=-cost;
run;

data want;
 merge positive(in=ina) negative(in=inb);
 by customer_id quality cost;
 if not (ina and inb);
run;

Sujithpeta · Posted 04-05-2017 11:19 AM

Hello Ksharp,

It worked, didn't imagine it can be as simple as this. I can see why you're a grand advisor.

Thanks a ton!

--Sujith

Ksharp · Posted 04-06-2017 08:39 AM

If you have multiple duplicated obs like

custome_id quality cost;
1 1 200
1 1 200
1 -1 -200


My code couldn't work.and you could try:

data want;
ina=0;inb=0;
 merge positive(in=ina) negative(in=inb);
 by customer_id quality cost;
 if not (ina and inb);
run;

Removing Duplicate rows with one based on multiple columns

Re: Removing Duplicate rows with one based on multiple columns

Re: Removing Duplicate rows with one based on multiple columns

Re: Removing Duplicate rows with one based on multiple columns

Re: Removing Duplicate rows with one based on multiple columns

Re: Removing Duplicate rows with one based on multiple columns

Re: Removing Duplicate rows with one based on multiple columns

Re: Removing Duplicate rows with one based on multiple columns

Re: Removing Duplicate rows with one based on multiple columns

Re: Removing Duplicate rows with one based on multiple columns

Re: Removing Duplicate rows with one based on multiple columns

Catch up on SAS Innovate 2026

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away