BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
akberali67
Calcite | Level 5

Hi,

I want to remove two instances of a complete duplicate observation. I mean using nodup I can remove one observation, how can I remove both of them, can I write a condition?

Example of data:

ABCD  1234  1  1202  671  020       1,234.56   2011-01-31  Fees Paid

ABCD  1234  1  1202  671  020       1,234.56   2011-01-31  Fees Paid

WXYZ  2323  1  3212  672  020       6,543.89   2011-02-28  Allowance

I want to remove both of the first two lines which duplicate each other. I am just not able to figure out the logic. I have 9 variables in the data. What I am trying to do is remove equal Amount of debit and a credit of a transaction. So it should not remove all the duplicates but any two perfect duplicates throughout the data.Please help.

Thanks in advance.:smileyplain:

--Akber.

1 ACCEPTED SOLUTION

Accepted Solutions
art297
Opal | Level 21

A more efficient way, I think, would be to remove the duplicates in a datastep.  e.g.:

options datestyle=ymd;

data have;

  informat var1 $10.;

  informat var2-var6 12.;

  informat var7 comma12.;

  informat var8 anydtdte10.;

  informat var9 $15.;

  input var1-var8 var9 &;

  cards;

ABCD  1234  1  1202  671  020       1,234.56   2011-01-31  Fees Paid

ABCD  1234  1  1202  671  020       1,234.56   2011-01-31  Fees Paid

WXYZ  2323  1  3212  672  020       6,543.89   2011-02-28  Allowance

;

proc sort data=have;

  by var1-var9;

run;

data want;

  set have;

  by var1-var9;

  if first.var9 and last.var9;

run;

View solution in original post

21 REPLIES 21
FriedEgg
SAS Employee

There is probably a more efficient way to do this but here is one way...  Use proc sort with dupout option and nodupkey, merge the dupout file back to the original data to remove the blanks.

data foo;

  *this contains my data;

run;

proc sort data=foo out=bar dupout=drops nodupkey; by id trans_id; run;

data bar;

  merge drops(in=b) bar(in=a);

  by id trans_id;

  if a and not b;

run;

art297
Opal | Level 21

A more efficient way, I think, would be to remove the duplicates in a datastep.  e.g.:

options datestyle=ymd;

data have;

  informat var1 $10.;

  informat var2-var6 12.;

  informat var7 comma12.;

  informat var8 anydtdte10.;

  informat var9 $15.;

  input var1-var8 var9 &;

  cards;

ABCD  1234  1  1202  671  020       1,234.56   2011-01-31  Fees Paid

ABCD  1234  1  1202  671  020       1,234.56   2011-01-31  Fees Paid

WXYZ  2323  1  3212  672  020       6,543.89   2011-02-28  Allowance

;

proc sort data=have;

  by var1-var9;

run;

data want;

  set have;

  by var1-var9;

  if first.var9 and last.var9;

run;

MikeZdeb
Rhodochrosite | Level 12

Hi ... another idea using Art's data ...

proc sql;

create table want as

select *

from have

group by var1,var2,var3,var4,var5,var6,var7,var8,var9

having count(*) eq 1;

quit;

Ksharp
Super User

Another way is SQL. If you like:

options datestyle ymd;
data have;
  informat var1 $10.;
  informat var2-var6 12.;
  informat var7 comma12.;
  informat var8 anydtdte10.;
  informat var9 $15.;
  input var1-var8 var9 &;
  cards;
ABCD  1234  1  1202  671  020       1,234.56   2011-01-31  Fees Paid
ABCD  1234  1  1202  671  020       1,234.56   2011-01-31  Fees Paid
WXYZ  2323  1  3212  672  020       6,543.89   2011-02-28  Allowance
;
run;
proc sql;
 create table want as
  select * 
   from have
    group by var1,var2,var3,var4,var5,var6,var7,var8,var9
     having count(*) eq 1;
quit;

Ksharp

art297
Opal | Level 21

If we're going to go a proc sql route I'd use:

proc sql noprint;

  select name into :vars separated by ","

    from dictionary.columns

       where libname eq "WORK"

         and memname eq "HAVE"

  ;

create table want as

  select *

    from have

      group by &vars.

        having count(*) eq 1

  ;

quit;

Ksharp
Super User

En. It is a better way to save typing.

MikeZdeb
Rhodochrosite | Level 12

hi ... depends, in this case ...

94 characters

select name into :vars separated by ","

    from dictionary.columns

       where libname eq "WORK"

         and memname eq "HAVE"

  ;

(plus &vars = 99)

44 characters

var1,var2,var3,var4,var5,var6,var7,var8,var9

ps  the "when to hold 'em versus when to fold 'em" rule


art297
Opal | Level 21

and/or the "when to make use of the SAS Macro/Abbreviations rule that advises to assign a hotkey to quickly insert frequently used code snippets"

akberali67
Calcite | Level 5

Hi,

How can I simply remove TWO observations with equal Amount with a different sign, based just on amount and no other field?? Any two observations, no more no less.

ABCD  1234  1  1202  671  020       1,234.56   2011-01-31  Fees Paid

ABCD  1234  1  1202  671  020       -1,234.56   2011-01-31  Fees Paid

ABCD  2345  1  1203  672  030       1,234.56   2011-01-31  Fees Paid

ABCD  1234  1  1202  671  020       1,234.56   2011-01-31  Fees Paid

WXYZ  2323  1  3212  672  020       6,543.89   2011-02-28  Allowance

Like line 1 & 2 or line 2 & 3 or line 2 & 4 but not more than 2. I mean any two random lines which

Thanks,

Akber.

art297
Opal | Level 21

I'm not sure exactly what you are trying to do.  You can always use the abs() function to make all values positive in your comparisons.

akberali67
Calcite | Level 5

yes but that would then remove any two transactions. I only want a positive and negative with the same value removed

say 1234 and 2345 is the reference# then how do I remove the first and second observation (both 1234), both of which will add to net out to zero, how can I delete them from my dataset

ABCD  1234  1  1202  671  020       1,234.56   2011-01-31  Fees Paid

ABCD  1234  2  1203  671  020       -1,234.56   2011-01-31  Fees Paid

ABCD  2345  1  1203  672  030       1,234.56   2011-01-31  Fees Paid

art297
Opal | Level 21

You sort (order) the file, say with proc sql, by reference # and the abs() of the variable you want to use.  Then, you can use the look ahead-look back methodology to see if any adjacent values add to zero (see: http://www.sascommunity.org/wiki/Look-Ahead_and_Look-Back )

OS2Rules
Obsidian | Level 7

Art:

If you are going with the Proc SQL then you may be limited buy the size of th :into var.

A large dataset could blow the variable size limit whereas the first. / last. on the sort won't.

Just sayin'

art297
Opal | Level 21

My original suggestion didn't use either proc sql or macro variables.  However, you are definitely correct, although that limit is now 65,534.  If one has a collection of variable names that exceed that number of characters, I definitely wouldn't suggest a method that requires manually entering them.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 21 replies
  • 1194 views
  • 5 likes
  • 6 in conversation