## calcuate new field from duplicated obs

Solved
Super Contributor
Posts: 409

# calcuate new field from duplicated obs

Hi,  I have a list of accounts (some with duplicated rows by account # only), and I want to calculate an average between the duplicates (based on balance) and then delete one of the duplicates.

HAVE:

Acct_No    Bor    Fld1   Bal    Score    Fld2

123             B1     A       34       234     465

123             B2     A       40       356     837

156             B1     G       68       129     835

156             B2     G       77       985     293

189             B1     D       34       356     938

199             B1     E       45       891     922

My list has 200,000 accounts, but showing only a few for the sample.

WANT:  (keep all info ffrom B1, except the 2 averages -> Bal and Score from both B1 and B2)

Acct_No    Bor    Fld1   Bal    Score    Fld2

123             B1     A       37       295     465          <-(using the average of the 2 rows for both Bal and Score)

156             B1     G     72.5      557     835          <-(using the average of the 2 rows for both Bal and Score)

189             B1     D       34       356     938

199             B1     E       45       891     922

Accepted Solutions
Solution
‎01-25-2017 09:05 PM
PROC Star
Posts: 311

## Re: calcuate new field from duplicated obs

Minor tweak, but this give you the output desired (though that might not be specifically what you want).

``````data have;
input Acct_No Bor \$ Fld1 \$ Bal Score Fld2;
datalines;
123 B1 A 34 234 465
123 B2 A 40 356 837
156 B1 G 68 129 835
156 B2 G 77 985 293
189 B1 D 34 356 938
199 B1 E 45 891 922
;

proc sql;
CREATE TABLE want AS
SELECT DISTINCT acct_no, bor, fld1, mean(bal) AS bal, score, fld2
FROM have
GROUP BY acct_no
HAVING max(bor='B1') = (bor='B1');
quit;``````

All Replies
PROC Star
Posts: 311

## Re: calcuate new field from duplicated obs

The average is easy. Do you not care what which of the duplicates is kept? Based on your example, it looks like you want the dup that has the "B1" value, correct?

PROC Star
Posts: 311

## Re: calcuate new field from duplicated obs

I guess I can give you a shot at a solution assuming that's the case...

Untested, though.

``````proc sql;
CREATE TABLE want AS
SELECT acct_no, bor, fld1, mean(bal) AS bal, score, fld2
FROM have
GROUP BY acct_no
HAVING max(bor='B1') = 1;
quit;``````
Solution
‎01-25-2017 09:05 PM
PROC Star
Posts: 311

## Re: calcuate new field from duplicated obs

Minor tweak, but this give you the output desired (though that might not be specifically what you want).

``````data have;
input Acct_No Bor \$ Fld1 \$ Bal Score Fld2;
datalines;
123 B1 A 34 234 465
123 B2 A 40 356 837
156 B1 G 68 129 835
156 B2 G 77 985 293
189 B1 D 34 356 938
199 B1 E 45 891 922
;

proc sql;
CREATE TABLE want AS
SELECT DISTINCT acct_no, bor, fld1, mean(bal) AS bal, score, fld2
FROM have
GROUP BY acct_no
HAVING max(bor='B1') = (bor='B1');
quit;``````
Posts: 1,394

## Re: calcuate new field from duplicated obs

You didn't say what you wanted when there are more than two matching records.  I presume you still want to keep only one record per group (untested code).  And that one record, in addition to having averages also keeps the FIRST instance of the variables not being averaged.

Untested:

data want (drop=i n sum;

array sum {3};

array var{3} bal score fld2;

do n=1 by 1 until (last.acct_no);

set have;

by acct_no;

do i=1 to dim(sum);   /*always wanted to say that in a sas program */

sum{i}=sum(sum{i},var{i});

end;

end;

do until (last.acct_no);

if first.acct_no then do;

do i=1 to dim(sum);

var{i}=sum{i}/n;

end;

output;

end;
end;

run;

Super User
Posts: 8,218

## Re: calcuate new field from duplicated obs

```data want (drop=_:);
set have;
by acct_no;
retain _bor _fld1 _fld2;
if first.acct_no then do;
if last.acct_no then output;
else do;
_bor=bor;
_fld1=fld1;
_bal=bal;
_score=score;
_fld2=fld2;
_n=1;
end;
end;
else do;
_bal+bal;
_score+score;
_n+1;
if last.acct_no then do;
bor=_bor;
fld1=_fld1;
bal=_bal/_n;
score=_score/_n;
fld2=_fld2;
output;
end;
end;
run;
```

HTH,

Art, CEO, AnalystFinder.com

Contributor
Posts: 44

## Re: calcuate new field from duplicated obs

A different approach to the problem.

``````data have;
input Acct_No Bor \$ Fld1 \$ Bal Score Fld2;
datalines;
123 B1 A 34 234 465
123 B2 A 40 356 837
156 B1 G 68 129 835
156 B2 G 77 985 293
189 B1 D 34 356 938
199 B1 E 45 891 922
;
run ;

data test1 test2 ;
set have ;
by Acct_No ;
if first.Acct_No = 1  then output test1 ;
if  last.Acct_No = 1 then output test2 ;
run ;

proc sql ;
select test1.Acct_No,test1.Bor,test1.Fld1,test1.Bal,((test1.Score+test2.Score)/2) as Score,test1.fld2
from test1 as test1
inner join test2 as test2
on test1.acct_no = test2.acct_no ;
quit ;``````
Super User
Posts: 6,934

## Re: calcuate new field from duplicated obs

As a "not a SQL guy", I would just replace all of them:

proc summary data=have;

by acct;

var score fld2;

output out=stats (keep=acct score fld2) mean=;

run;

If there was only one observation per acct, it doesn't hurt to replace values with the mean.  Then merge back in:

data want;

merge have (drop=score fld2) stats;

by acct;

if first.acct;

run;

Super Contributor
Posts: 409

## Re: calcuate new field from duplicated obs

[ Edited ]

I didn't think it mattered, but I should of stated the whole story... There are about 300 fields in the dataset, didn't want to state them all, most of them are different between Bor1 and Bor2, but I still want to keep all the info from Bor1.  Thanks

Super User
Posts: 6,934

## Re: calcuate new field from duplicated obs

I'm not sure if the SQL approach expands easily to a large number of fields, unless you apply macro language.  But PROC SUMMARY still makes it easy:

proc summary data=have;

by acct;

var _numeric_;

output out=stats (drop=_type_ _freq_) mean=;

run;

data want;

merge have (keep=acct bor fld1) stats;

by acct;

if first.acct;

run;

☑ This topic is solved.