DATA Step, Macro, Functions and more

calcuate new field from duplicated obs

Accepted Solution Solved
Reply
Super Contributor
Posts: 401
Accepted Solution

calcuate new field from duplicated obs

Hi,  I have a list of accounts (some with duplicated rows by account # only), and I want to calculate an average between the duplicates (based on balance) and then delete one of the duplicates.

 

HAVE:

Acct_No    Bor    Fld1   Bal    Score    Fld2

123             B1     A       34       234     465

123             B2     A       40       356     837

156             B1     G       68       129     835

156             B2     G       77       985     293

189             B1     D       34       356     938

199             B1     E       45       891     922

My list has 200,000 accounts, but showing only a few for the sample.

 

WANT:  (keep all info ffrom B1, except the 2 averages -> Bal and Score from both B1 and B2)

Acct_No    Bor    Fld1   Bal    Score    Fld2

123             B1     A       37       295     465          <-(using the average of the 2 rows for both Bal and Score)

156             B1     G     72.5      557     835          <-(using the average of the 2 rows for both Bal and Score)

189             B1     D       34       356     938

199             B1     E       45       891     922

 

Thanks for your help.


Accepted Solutions
Solution
‎01-25-2017 09:05 PM
PROC Star
Posts: 307

Re: calcuate new field from duplicated obs

Posted in reply to collinelliot

Minor tweak, but this give you the output desired (though that might not be specifically what you want).

 

data have;
input Acct_No Bor $ Fld1 $ Bal Score Fld2;
datalines;
123 B1 A 34 234 465
123 B2 A 40 356 837
156 B1 G 68 129 835
156 B2 G 77 985 293
189 B1 D 34 356 938
199 B1 E 45 891 922
;

proc sql;
    CREATE TABLE want AS
    SELECT DISTINCT acct_no, bor, fld1, mean(bal) AS bal, score, fld2
    FROM have
    GROUP BY acct_no
    HAVING max(bor='B1') = (bor='B1');
quit;

View solution in original post


All Replies
PROC Star
Posts: 307

Re: calcuate new field from duplicated obs

The average is easy. Do you not care what which of the duplicates is kept? Based on your example, it looks like you want the dup that has the "B1" value, correct?

PROC Star
Posts: 307

Re: calcuate new field from duplicated obs

I guess I can give you a shot at a solution assuming that's the case...

 

Untested, though.

 

 

proc sql;
    CREATE TABLE want AS
    SELECT acct_no, bor, fld1, mean(bal) AS bal, score, fld2
    FROM have
    GROUP BY acct_no
    HAVING max(bor='B1') = 1;
quit;
Solution
‎01-25-2017 09:05 PM
PROC Star
Posts: 307

Re: calcuate new field from duplicated obs

Posted in reply to collinelliot

Minor tweak, but this give you the output desired (though that might not be specifically what you want).

 

data have;
input Acct_No Bor $ Fld1 $ Bal Score Fld2;
datalines;
123 B1 A 34 234 465
123 B2 A 40 356 837
156 B1 G 68 129 835
156 B2 G 77 985 293
189 B1 D 34 356 938
199 B1 E 45 891 922
;

proc sql;
    CREATE TABLE want AS
    SELECT DISTINCT acct_no, bor, fld1, mean(bal) AS bal, score, fld2
    FROM have
    GROUP BY acct_no
    HAVING max(bor='B1') = (bor='B1');
quit;
Trusted Advisor
Posts: 1,022

Re: calcuate new field from duplicated obs

You didn't say what you wanted when there are more than two matching records.  I presume you still want to keep only one record per group (untested code).  And that one record, in addition to having averages also keeps the FIRST instance of the variables not being averaged.

 

Untested:

 

data want (drop=i n sumSmiley Happy;

  array sum {3};

  array var{3} bal score fld2;

  do n=1 by 1 until (last.acct_no);

     set have;

     by acct_no;

     do i=1 to dim(sum);   /*always wanted to say that in a sas program */

       sum{i}=sum(sum{i},var{i});

     end;

  end;

  do until (last.acct_no);

    if first.acct_no then do;

      do i=1 to dim(sum);

        var{i}=sum{i}/n;

      end;

      output;

    end;
  end;

run;

PROC Star
Posts: 7,492

Re: calcuate new field from duplicated obs

 

data want (drop=_:);
  set have;
  by acct_no;
  retain _bor _fld1 _fld2;
  if first.acct_no then do;
    if last.acct_no then output;
    else do;
      _bor=bor;
      _fld1=fld1;
      _bal=bal;
      _score=score;
      _fld2=fld2;
      _n=1;
    end;
  end;
  else do;
    _bal+bal;
    _score+score;
    _n+1;
    if last.acct_no then do;
      bor=_bor;
      fld1=_fld1;
      bal=_bal/_n;
      score=_score/_n;
      fld2=_fld2;
      output;
    end;
  end;
run;

HTH,

Art, CEO, AnalystFinder.com

 

Contributor
Posts: 44

Re: calcuate new field from duplicated obs

A different approach to the problem.

 

data have;
input Acct_No Bor $ Fld1 $ Bal Score Fld2;
datalines;
123 B1 A 34 234 465
123 B2 A 40 356 837
156 B1 G 68 129 835
156 B2 G 77 985 293
189 B1 D 34 356 938
199 B1 E 45 891 922
;
run ;

data test1 test2 ;
set have ;
by Acct_No ;
if first.Acct_No = 1  then output test1 ;
if  last.Acct_No = 1 then output test2 ;
run ;

proc sql ;
select test1.Acct_No,test1.Bor,test1.Fld1,test1.Bal,((test1.Score+test2.Score)/2) as Score,test1.fld2
from test1 as test1
inner join test2 as test2
on test1.acct_no = test2.acct_no ;
quit ;
Super User
Posts: 5,516

Re: calcuate new field from duplicated obs

As a "not a SQL guy", I would just replace all of them:

 

proc summary data=have;

by acct;

var score fld2;

output out=stats (keep=acct score fld2) mean=;

run;

 

If there was only one observation per acct, it doesn't hurt to replace values with the mean.  Then merge back in:

 

data want;

merge have (drop=score fld2) stats;

by acct;

if first.acct;

run;

Super Contributor
Posts: 401

Re: calcuate new field from duplicated obs

[ Edited ]

I didn't think it mattered, but I should of stated the whole story... There are about 300 fields in the dataset, didn't want to state them all, most of them are different between Bor1 and Bor2, but I still want to keep all the info from Bor1.  Thanks

Super User
Posts: 5,516

Re: calcuate new field from duplicated obs

I'm not sure if the SQL approach expands easily to a large number of fields, unless you apply macro language.  But PROC SUMMARY still makes it easy:

 

proc summary data=have;

by acct;

var _numeric_;

output out=stats (drop=_type_ _freq_) mean=;

run;

 

data want;

merge have (keep=acct bor fld1) stats;

by acct;

if first.acct;

run;

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 9 replies
  • 187 views
  • 2 likes
  • 6 in conversation