Solved: Re: calcuate new field from duplicated obs

podarum · Posted 01-25-2017 05:13 PM

Hi, I have a list of accounts (some with duplicated rows by account # only), and I want to calculate an average between the duplicates (based on balance) and then delete one of the duplicates.

HAVE:

Acct_No Bor Fld1 Bal Score Fld2

123 B1 A 34 234 465

123 B2 A 40 356 837

156 B1 G 68 129 835

156 B2 G 77 985 293

189 B1 D 34 356 938

199 B1 E 45 891 922

My list has 200,000 accounts, but showing only a few for the sample.

WANT: (keep all info ffrom B1, except the 2 averages -> Bal and Score from both B1 and B2)

Acct_No Bor Fld1 Bal Score Fld2

123 B1 A 37 295 465 <-(using the average of the 2 rows for both Bal and Score)

156 B1 G 72.5 557 835 <-(using the average of the 2 rows for both Bal and Score)

189 B1 D 34 356 938

199 B1 E 45 891 922

Thanks for your help.

collinelliot · Posted 01-25-2017 06:00 PM

Minor tweak, but this give you the output desired (though that might not be specifically what you want).

data have;
input Acct_No Bor $ Fld1 $ Bal Score Fld2;
datalines;
123 B1 A 34 234 465
123 B2 A 40 356 837
156 B1 G 68 129 835
156 B2 G 77 985 293
189 B1 D 34 356 938
199 B1 E 45 891 922
;

proc sql;
    CREATE TABLE want AS
    SELECT DISTINCT acct_no, bor, fld1, mean(bal) AS bal, score, fld2
    FROM have
    GROUP BY acct_no
    HAVING max(bor='B1') = (bor='B1');
quit;

View solution in original post

collinelliot · Posted 01-25-2017 05:29 PM

The average is easy. Do you not care what which of the duplicates is kept? Based on your example, it looks like you want the dup that has the "B1" value, correct?

collinelliot · Posted 01-25-2017 05:31 PM

I guess I can give you a shot at a solution assuming that's the case...

Untested, though.

proc sql;
    CREATE TABLE want AS
    SELECT acct_no, bor, fld1, mean(bal) AS bal, score, fld2
    FROM have
    GROUP BY acct_no
    HAVING max(bor='B1') = 1;
quit;

collinelliot · Posted 01-25-2017 06:00 PM

Minor tweak, but this give you the output desired (though that might not be specifically what you want).

data have;
input Acct_No Bor $ Fld1 $ Bal Score Fld2;
datalines;
123 B1 A 34 234 465
123 B2 A 40 356 837
156 B1 G 68 129 835
156 B2 G 77 985 293
189 B1 D 34 356 938
199 B1 E 45 891 922
;

proc sql;
    CREATE TABLE want AS
    SELECT DISTINCT acct_no, bor, fld1, mean(bal) AS bal, score, fld2
    FROM have
    GROUP BY acct_no
    HAVING max(bor='B1') = (bor='B1');
quit;

mkeintz · Posted 01-25-2017 05:31 PM

You didn't say what you wanted when there are more than two matching records. I presume you still want to keep only one record per group (untested code). And that one record, in addition to having averages also keeps the FIRST instance of the variables not being averaged.

Untested:

data want (drop=i n sum:);

array sum {3};

array var{3} bal score fld2;

do n=1 by 1 until (last.acct_no);

set have;

by acct_no;

do i=1 to dim(sum); /*always wanted to say that in a sas program */

sum{i}=sum(sum{i},var{i});

end;

do until (last.acct_no);

if first.acct_no then do;

do i=1 to dim(sum);

var{i}=sum{i}/n;

end;

output;

end;
end;

run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

art297 · Posted 01-25-2017 05:40 PM

data want (drop=_:);
  set have;
  by acct_no;
  retain _bor _fld1 _fld2;
  if first.acct_no then do;
    if last.acct_no then output;
    else do;
      _bor=bor;
      _fld1=fld1;
      _bal=bal;
      _score=score;
      _fld2=fld2;
      _n=1;
    end;
  end;
  else do;
    _bal+bal;
    _score+score;
    _n+1;
    if last.acct_no then do;
      bor=_bor;
      fld1=_fld1;
      bal=_bal/_n;
      score=_score/_n;
      fld2=_fld2;
      output;
    end;
  end;
run;

HTH,

Art, CEO, AnalystFinder.com

anoopmohandas7 · Posted 01-25-2017 07:15 PM

A different approach to the problem.

data have;
input Acct_No Bor $ Fld1 $ Bal Score Fld2;
datalines;
123 B1 A 34 234 465
123 B2 A 40 356 837
156 B1 G 68 129 835
156 B2 G 77 985 293
189 B1 D 34 356 938
199 B1 E 45 891 922
;
run ;

data test1 test2 ;
set have ;
by Acct_No ;
if first.Acct_No = 1  then output test1 ;
if  last.Acct_No = 1 then output test2 ;
run ;

proc sql ;
select test1.Acct_No,test1.Bor,test1.Fld1,test1.Bal,((test1.Score+test2.Score)/2) as Score,test1.fld2
from test1 as test1
inner join test2 as test2
on test1.acct_no = test2.acct_no ;
quit ;

Astounding · Posted 01-25-2017 07:28 PM

As a "not a SQL guy", I would just replace all of them:

proc summary data=have;

by acct;

var score fld2;

output out=stats (keep=acct score fld2) mean=;

run;

If there was only one observation per acct, it doesn't hurt to replace values with the mean. Then merge back in:

data want;

merge have (drop=score fld2) stats;

by acct;

if first.acct;

run;

podarum · Posted 01-25-2017 08:40 PM

I didn't think it mattered, but I should of stated the whole story... There are about 300 fields in the dataset, didn't want to state them all, most of them are different between Bor1 and Bor2, but I still want to keep all the info from Bor1. Thanks

Astounding · Posted 01-25-2017 10:11 PM

I'm not sure if the SQL approach expands easily to a large number of fields, unless you apply macro language. But PROC SUMMARY still makes it easy:

proc summary data=have;

by acct;

var _numeric_;

output out=stats (drop=_type_ _freq_) mean=;

run;

data want;

merge have (keep=acct bor fld1) stats;

by acct;

if first.acct;

run;

SAS Innovate 2025: Register Now

SAS Training: Just a Click Away