Solved: Re: Removing both observations in a duplicate

akberali67 · Posted 11-07-2011 11:16 AM

Hi,

I want to remove two instances of a complete duplicate observation. I mean using nodup I can remove one observation, how can I remove both of them, can I write a condition?

Example of data:

ABCD 1234 1 1202 671 020 1,234.56 2011-01-31 Fees Paid

WXYZ 2323 1 3212 672 020 6,543.89 2011-02-28 Allowance

I want to remove both of the first two lines which duplicate each other. I am just not able to figure out the logic. I have 9 variables in the data. What I am trying to do is remove equal Amount of debit and a credit of a transaction. So it should not remove all the duplicates but any two perfect duplicates throughout the data.Please help.

Thanks in advance.:smileyplain:

--Akber.

art297 · Posted 11-07-2011 11:27 AM

A more efficient way, I think, would be to remove the duplicates in a datastep. e.g.:

options datestyle=ymd;

data have;

informat var1 $10.;

informat var2-var6 12.;

informat var7 comma12.;

informat var8 anydtdte10.;

informat var9 $15.;

input var1-var8 var9 &;

cards;

ABCD 1234 1 1202 671 020 1,234.56 2011-01-31 Fees Paid

WXYZ 2323 1 3212 672 020 6,543.89 2011-02-28 Allowance

;

proc sort data=have;

by var1-var9;

run;

data want;

set have;

by var1-var9;

if first.var9 and last.var9;

run;

View solution in original post

FriedEgg · Posted 11-07-2011 11:22 AM

There is probably a more efficient way to do this but here is one way... Use proc sort with dupout option and nodupkey, merge the dupout file back to the original data to remove the blanks.

data foo;

*this contains my data;

run;

proc sort data=foo out=bar dupout=drops nodupkey; by id trans_id; run;

data bar;

merge drops(in=b) bar(in=a);

by id trans_id;

if a and not b;

run;

art297 · Posted 11-07-2011 11:27 AM

A more efficient way, I think, would be to remove the duplicates in a datastep. e.g.:

options datestyle=ymd;

data have;

informat var1 $10.;

informat var2-var6 12.;

informat var7 comma12.;

informat var8 anydtdte10.;

informat var9 $15.;

input var1-var8 var9 &;

cards;

ABCD 1234 1 1202 671 020 1,234.56 2011-01-31 Fees Paid

WXYZ 2323 1 3212 672 020 6,543.89 2011-02-28 Allowance

;

proc sort data=have;

by var1-var9;

run;

data want;

set have;

by var1-var9;

if first.var9 and last.var9;

run;

MikeZdeb · Posted 11-07-2011 09:00 PM

Hi ... another idea using Art's data ...

proc sql;

create table want as

select *

from have

group by var1,var2,var3,var4,var5,var6,var7,var8,var9

having count(*) eq 1;

quit;

Ksharp · Posted 11-07-2011 10:19 PM

Another way is SQL. If you like:

options datestyle ymd;
data have;
  informat var1 $10.;
  informat var2-var6 12.;
  informat var7 comma12.;
  informat var8 anydtdte10.;
  informat var9 $15.;
  input var1-var8 var9 &;
  cards;
ABCD  1234  1  1202  671  020       1,234.56   2011-01-31  Fees Paid
ABCD  1234  1  1202  671  020       1,234.56   2011-01-31  Fees Paid
WXYZ  2323  1  3212  672  020       6,543.89   2011-02-28  Allowance
;
run;
proc sql;
 create table want as
  select * 
   from have
    group by var1,var2,var3,var4,var5,var6,var7,var8,var9
     having count(*) eq 1;
quit;

Ksharp

art297 · Posted 11-07-2011 11:07 PM

If we're going to go a proc sql route I'd use:

proc sql noprint;

select name into :vars separated by ","

from dictionary.columns

where libname eq "WORK"

and memname eq "HAVE"

;

create table want as

select *

from have

group by &vars.

having count(*) eq 1

;

quit;

Ksharp · Posted 11-08-2011 03:23 AM

En. It is a better way to save typing.

MikeZdeb · Posted 11-08-2011 09:22 AM

hi ... depends, in this case ...

94 characters

select name into :vars separated by ","

from dictionary.columns

where libname eq "WORK"

and memname eq "HAVE"

;

(plus &vars = 99)

44 characters

var1,var2,var3,var4,var5,var6,var7,var8,var9

ps the "when to hold 'em versus when to fold 'em" rule

art297 · Posted 11-08-2011 09:28 AM

and/or the "when to make use of the SAS Macro/Abbreviations rule that advises to assign a hotkey to quickly insert frequently used code snippets"

akberali67 · Posted 11-08-2011 10:49 AM

Hi,

How can I simply remove TWO observations with equal Amount with a different sign, based just on amount and no other field?? Any two observations, no more no less.

ABCD 1234 1 1202 671 020 1,234.56 2011-01-31 Fees Paid

ABCD 1234 1 1202 671 020 -1,234.56 2011-01-31 Fees Paid

ABCD 2345 1 1203 672 030 1,234.56 2011-01-31 Fees Paid

ABCD 1234 1 1202 671 020 1,234.56 2011-01-31 Fees Paid

WXYZ 2323 1 3212 672 020 6,543.89 2011-02-28 Allowance

Like line 1 & 2 or line 2 & 3 or line 2 & 4 but not more than 2. I mean any two random lines which

Thanks,

Akber.

art297 · Posted 11-08-2011 11:31 AM

I'm not sure exactly what you are trying to do. You can always use the abs() function to make all values positive in your comparisons.

akberali67 · Posted 11-09-2011 05:45 AM

yes but that would then remove any two transactions. I only want a positive and negative with the same value removed

say 1234 and 2345 is the reference# then how do I remove the first and second observation (both 1234), both of which will add to net out to zero, how can I delete them from my dataset

ABCD 1234 1 1202 671 020 1,234.56 2011-01-31 Fees Paid

ABCD 1234 2 1203 671 020 -1,234.56 2011-01-31 Fees Paid

ABCD 2345 1 1203 672 030 1,234.56 2011-01-31 Fees Paid

art297 · Posted 11-09-2011 08:17 AM

You sort (order) the file, say with proc sql, by reference # and the abs() of the variable you want to use. Then, you can use the look ahead-look back methodology to see if any adjacent values add to zero (see: http://www.sascommunity.org/wiki/Look-Ahead_and_Look-Back )

OS2Rules · Posted 11-08-2011 11:54 AM

Art:

If you are going with the Proc SQL then you may be limited buy the size of th :into var.

A large dataset could blow the variable size limit whereas the first. / last. on the sort won't.

Just sayin'

art297 · Posted 11-08-2011 12:03 PM

My original suggestion didn't use either proc sql or macro variables. However, you are definitely correct, although that limit is now 65,534. If one has a collection of variable names that exceed that number of characters, I definitely wouldn't suggest a method that requires manually entering them.

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!