Hi, I have a list of accounts (some with duplicated rows by account # only), and I want to calculate an average between the duplicates (based on balance) and then delete one of the duplicates.
HAVE:
Acct_No Bor Fld1 Bal Score Fld2
123 B1 A 34 234 465
123 B2 A 40 356 837
156 B1 G 68 129 835
156 B2 G 77 985 293
189 B1 D 34 356 938
199 B1 E 45 891 922
My list has 200,000 accounts, but showing only a few for the sample.
WANT: (keep all info ffrom B1, except the 2 averages -> Bal and Score from both B1 and B2)
Acct_No Bor Fld1 Bal Score Fld2
123 B1 A 37 295 465 <-(using the average of the 2 rows for both Bal and Score)
156 B1 G 72.5 557 835 <-(using the average of the 2 rows for both Bal and Score)
189 B1 D 34 356 938
199 B1 E 45 891 922
Thanks for your help.
Minor tweak, but this give you the output desired (though that might not be specifically what you want).
data have;
input Acct_No Bor $ Fld1 $ Bal Score Fld2;
datalines;
123 B1 A 34 234 465
123 B2 A 40 356 837
156 B1 G 68 129 835
156 B2 G 77 985 293
189 B1 D 34 356 938
199 B1 E 45 891 922
;
proc sql;
CREATE TABLE want AS
SELECT DISTINCT acct_no, bor, fld1, mean(bal) AS bal, score, fld2
FROM have
GROUP BY acct_no
HAVING max(bor='B1') = (bor='B1');
quit;
The average is easy. Do you not care what which of the duplicates is kept? Based on your example, it looks like you want the dup that has the "B1" value, correct?
I guess I can give you a shot at a solution assuming that's the case...
Untested, though.
proc sql;
CREATE TABLE want AS
SELECT acct_no, bor, fld1, mean(bal) AS bal, score, fld2
FROM have
GROUP BY acct_no
HAVING max(bor='B1') = 1;
quit;
Minor tweak, but this give you the output desired (though that might not be specifically what you want).
data have;
input Acct_No Bor $ Fld1 $ Bal Score Fld2;
datalines;
123 B1 A 34 234 465
123 B2 A 40 356 837
156 B1 G 68 129 835
156 B2 G 77 985 293
189 B1 D 34 356 938
199 B1 E 45 891 922
;
proc sql;
CREATE TABLE want AS
SELECT DISTINCT acct_no, bor, fld1, mean(bal) AS bal, score, fld2
FROM have
GROUP BY acct_no
HAVING max(bor='B1') = (bor='B1');
quit;
You didn't say what you wanted when there are more than two matching records. I presume you still want to keep only one record per group (untested code). And that one record, in addition to having averages also keeps the FIRST instance of the variables not being averaged.
Untested:
data want (drop=i n sum:);
array sum {3};
array var{3} bal score fld2;
do n=1 by 1 until (last.acct_no);
set have;
by acct_no;
do i=1 to dim(sum); /*always wanted to say that in a sas program */
sum{i}=sum(sum{i},var{i});
end;
end;
do until (last.acct_no);
if first.acct_no then do;
do i=1 to dim(sum);
var{i}=sum{i}/n;
end;
output;
end;
end;
run;
data want (drop=_:); set have; by acct_no; retain _bor _fld1 _fld2; if first.acct_no then do; if last.acct_no then output; else do; _bor=bor; _fld1=fld1; _bal=bal; _score=score; _fld2=fld2; _n=1; end; end; else do; _bal+bal; _score+score; _n+1; if last.acct_no then do; bor=_bor; fld1=_fld1; bal=_bal/_n; score=_score/_n; fld2=_fld2; output; end; end; run;
HTH,
Art, CEO, AnalystFinder.com
A different approach to the problem.
data have;
input Acct_No Bor $ Fld1 $ Bal Score Fld2;
datalines;
123 B1 A 34 234 465
123 B2 A 40 356 837
156 B1 G 68 129 835
156 B2 G 77 985 293
189 B1 D 34 356 938
199 B1 E 45 891 922
;
run ;
data test1 test2 ;
set have ;
by Acct_No ;
if first.Acct_No = 1 then output test1 ;
if last.Acct_No = 1 then output test2 ;
run ;
proc sql ;
select test1.Acct_No,test1.Bor,test1.Fld1,test1.Bal,((test1.Score+test2.Score)/2) as Score,test1.fld2
from test1 as test1
inner join test2 as test2
on test1.acct_no = test2.acct_no ;
quit ;
As a "not a SQL guy", I would just replace all of them:
proc summary data=have;
by acct;
var score fld2;
output out=stats (keep=acct score fld2) mean=;
run;
If there was only one observation per acct, it doesn't hurt to replace values with the mean. Then merge back in:
data want;
merge have (drop=score fld2) stats;
by acct;
if first.acct;
run;
I didn't think it mattered, but I should of stated the whole story... There are about 300 fields in the dataset, didn't want to state them all, most of them are different between Bor1 and Bor2, but I still want to keep all the info from Bor1. Thanks
I'm not sure if the SQL approach expands easily to a large number of fields, unless you apply macro language. But PROC SUMMARY still makes it easy:
proc summary data=have;
by acct;
var _numeric_;
output out=stats (drop=_type_ _freq_) mean=;
run;
data want;
merge have (keep=acct bor fld1) stats;
by acct;
if first.acct;
run;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.