Hi,
I want to remove two instances of a complete duplicate observation. I mean using nodup I can remove one observation, how can I remove both of them, can I write a condition?
Example of data:
ABCD 1234 1 1202 671 020 1,234.56 2011-01-31 Fees Paid
ABCD 1234 1 1202 671 020 1,234.56 2011-01-31 Fees Paid
WXYZ 2323 1 3212 672 020 6,543.89 2011-02-28 Allowance
I want to remove both of the first two lines which duplicate each other. I am just not able to figure out the logic. I have 9 variables in the data. What I am trying to do is remove equal Amount of debit and a credit of a transaction. So it should not remove all the duplicates but any two perfect duplicates throughout the data.Please help.
Thanks in advance.:smileyplain:
--Akber.
A more efficient way, I think, would be to remove the duplicates in a datastep. e.g.:
options datestyle=ymd;
data have;
informat var1 $10.;
informat var2-var6 12.;
informat var7 comma12.;
informat var8 anydtdte10.;
informat var9 $15.;
input var1-var8 var9 &;
cards;
ABCD 1234 1 1202 671 020 1,234.56 2011-01-31 Fees Paid
ABCD 1234 1 1202 671 020 1,234.56 2011-01-31 Fees Paid
WXYZ 2323 1 3212 672 020 6,543.89 2011-02-28 Allowance
;
proc sort data=have;
by var1-var9;
run;
data want;
set have;
by var1-var9;
if first.var9 and last.var9;
run;
There is probably a more efficient way to do this but here is one way... Use proc sort with dupout option and nodupkey, merge the dupout file back to the original data to remove the blanks.
data foo;
*this contains my data;
run;
proc sort data=foo out=bar dupout=drops nodupkey; by id trans_id; run;
data bar;
merge drops(in=b) bar(in=a);
by id trans_id;
if a and not b;
run;
A more efficient way, I think, would be to remove the duplicates in a datastep. e.g.:
options datestyle=ymd;
data have;
informat var1 $10.;
informat var2-var6 12.;
informat var7 comma12.;
informat var8 anydtdte10.;
informat var9 $15.;
input var1-var8 var9 &;
cards;
ABCD 1234 1 1202 671 020 1,234.56 2011-01-31 Fees Paid
ABCD 1234 1 1202 671 020 1,234.56 2011-01-31 Fees Paid
WXYZ 2323 1 3212 672 020 6,543.89 2011-02-28 Allowance
;
proc sort data=have;
by var1-var9;
run;
data want;
set have;
by var1-var9;
if first.var9 and last.var9;
run;
Hi ... another idea using Art's data ...
proc sql;
create table want as
select *
from have
group by var1,var2,var3,var4,var5,var6,var7,var8,var9
having count(*) eq 1;
quit;
Another way is SQL. If you like:
options datestyle ymd; data have; informat var1 $10.; informat var2-var6 12.; informat var7 comma12.; informat var8 anydtdte10.; informat var9 $15.; input var1-var8 var9 &; cards; ABCD 1234 1 1202 671 020 1,234.56 2011-01-31 Fees Paid ABCD 1234 1 1202 671 020 1,234.56 2011-01-31 Fees Paid WXYZ 2323 1 3212 672 020 6,543.89 2011-02-28 Allowance ; run; proc sql; create table want as select * from have group by var1,var2,var3,var4,var5,var6,var7,var8,var9 having count(*) eq 1; quit;
Ksharp
If we're going to go a proc sql route I'd use:
proc sql noprint;
select name into :vars separated by ","
from dictionary.columns
where libname eq "WORK"
and memname eq "HAVE"
;
create table want as
select *
from have
group by &vars.
having count(*) eq 1
;
quit;
En. It is a better way to save typing.
hi ... depends, in this case ...
94 characters
select name into :vars separated by ","
from dictionary.columns
where libname eq "WORK"
and memname eq "HAVE"
;
(plus &vars = 99)
44 characters
var1,var2,var3,var4,var5,var6,var7,var8,var9
ps the "when to hold 'em versus when to fold 'em" rule
and/or the "when to make use of the SAS Macro/Abbreviations rule that advises to assign a hotkey to quickly insert frequently used code snippets"
Hi,
How can I simply remove TWO observations with equal Amount with a different sign, based just on amount and no other field?? Any two observations, no more no less.
ABCD 1234 1 1202 671 020 1,234.56 2011-01-31 Fees Paid
ABCD 1234 1 1202 671 020 -1,234.56 2011-01-31 Fees Paid
ABCD 2345 1 1203 672 030 1,234.56 2011-01-31 Fees Paid
ABCD 1234 1 1202 671 020 1,234.56 2011-01-31 Fees Paid
WXYZ 2323 1 3212 672 020 6,543.89 2011-02-28 Allowance
Like line 1 & 2 or line 2 & 3 or line 2 & 4 but not more than 2. I mean any two random lines which
Thanks,
Akber.
I'm not sure exactly what you are trying to do. You can always use the abs() function to make all values positive in your comparisons.
yes but that would then remove any two transactions. I only want a positive and negative with the same value removed
say 1234 and 2345 is the reference# then how do I remove the first and second observation (both 1234), both of which will add to net out to zero, how can I delete them from my dataset
ABCD 1234 1 1202 671 020 1,234.56 2011-01-31 Fees Paid
ABCD 1234 2 1203 671 020 -1,234.56 2011-01-31 Fees Paid
ABCD 2345 1 1203 672 030 1,234.56 2011-01-31 Fees Paid
You sort (order) the file, say with proc sql, by reference # and the abs() of the variable you want to use. Then, you can use the look ahead-look back methodology to see if any adjacent values add to zero (see: http://www.sascommunity.org/wiki/Look-Ahead_and_Look-Back )
Art:
If you are going with the Proc SQL then you may be limited buy the size of th :into var.
A large dataset could blow the variable size limit whereas the first. / last. on the sort won't.
Just sayin'
My original suggestion didn't use either proc sql or macro variables. However, you are definitely correct, although that limit is now 65,534. If one has a collection of variable names that exceed that number of characters, I definitely wouldn't suggest a method that requires manually entering them.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.