BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
podarum
Quartz | Level 8

Hi,  I have a list of accounts (some with duplicated rows by account # only), and I want to calculate an average between the duplicates (based on balance) and then delete one of the duplicates.

 

HAVE:

Acct_No    Bor    Fld1   Bal    Score    Fld2

123             B1     A       34       234     465

123             B2     A       40       356     837

156             B1     G       68       129     835

156             B2     G       77       985     293

189             B1     D       34       356     938

199             B1     E       45       891     922

My list has 200,000 accounts, but showing only a few for the sample.

 

WANT:  (keep all info ffrom B1, except the 2 averages -> Bal and Score from both B1 and B2)

Acct_No    Bor    Fld1   Bal    Score    Fld2

123             B1     A       37       295     465          <-(using the average of the 2 rows for both Bal and Score)

156             B1     G     72.5      557     835          <-(using the average of the 2 rows for both Bal and Score)

189             B1     D       34       356     938

199             B1     E       45       891     922

 

Thanks for your help.

1 ACCEPTED SOLUTION

Accepted Solutions
collinelliot
Barite | Level 11

Minor tweak, but this give you the output desired (though that might not be specifically what you want).

 

data have;
input Acct_No Bor $ Fld1 $ Bal Score Fld2;
datalines;
123 B1 A 34 234 465
123 B2 A 40 356 837
156 B1 G 68 129 835
156 B2 G 77 985 293
189 B1 D 34 356 938
199 B1 E 45 891 922
;

proc sql;
    CREATE TABLE want AS
    SELECT DISTINCT acct_no, bor, fld1, mean(bal) AS bal, score, fld2
    FROM have
    GROUP BY acct_no
    HAVING max(bor='B1') = (bor='B1');
quit;

View solution in original post

9 REPLIES 9
collinelliot
Barite | Level 11

The average is easy. Do you not care what which of the duplicates is kept? Based on your example, it looks like you want the dup that has the "B1" value, correct?

collinelliot
Barite | Level 11

I guess I can give you a shot at a solution assuming that's the case...

 

Untested, though.

 

 

proc sql;
    CREATE TABLE want AS
    SELECT acct_no, bor, fld1, mean(bal) AS bal, score, fld2
    FROM have
    GROUP BY acct_no
    HAVING max(bor='B1') = 1;
quit;
collinelliot
Barite | Level 11

Minor tweak, but this give you the output desired (though that might not be specifically what you want).

 

data have;
input Acct_No Bor $ Fld1 $ Bal Score Fld2;
datalines;
123 B1 A 34 234 465
123 B2 A 40 356 837
156 B1 G 68 129 835
156 B2 G 77 985 293
189 B1 D 34 356 938
199 B1 E 45 891 922
;

proc sql;
    CREATE TABLE want AS
    SELECT DISTINCT acct_no, bor, fld1, mean(bal) AS bal, score, fld2
    FROM have
    GROUP BY acct_no
    HAVING max(bor='B1') = (bor='B1');
quit;
mkeintz
PROC Star

You didn't say what you wanted when there are more than two matching records.  I presume you still want to keep only one record per group (untested code).  And that one record, in addition to having averages also keeps the FIRST instance of the variables not being averaged.

 

Untested:

 

data want (drop=i n sum:);

  array sum {3};

  array var{3} bal score fld2;

  do n=1 by 1 until (last.acct_no);

     set have;

     by acct_no;

     do i=1 to dim(sum);   /*always wanted to say that in a sas program */

       sum{i}=sum(sum{i},var{i});

     end;

  end;

  do until (last.acct_no);

    if first.acct_no then do;

      do i=1 to dim(sum);

        var{i}=sum{i}/n;

      end;

      output;

    end;
  end;

run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
art297
Opal | Level 21

 

data want (drop=_:);
  set have;
  by acct_no;
  retain _bor _fld1 _fld2;
  if first.acct_no then do;
    if last.acct_no then output;
    else do;
      _bor=bor;
      _fld1=fld1;
      _bal=bal;
      _score=score;
      _fld2=fld2;
      _n=1;
    end;
  end;
  else do;
    _bal+bal;
    _score+score;
    _n+1;
    if last.acct_no then do;
      bor=_bor;
      fld1=_fld1;
      bal=_bal/_n;
      score=_score/_n;
      fld2=_fld2;
      output;
    end;
  end;
run;

HTH,

Art, CEO, AnalystFinder.com

 

anoopmohandas7
Quartz | Level 8

A different approach to the problem.

 

data have;
input Acct_No Bor $ Fld1 $ Bal Score Fld2;
datalines;
123 B1 A 34 234 465
123 B2 A 40 356 837
156 B1 G 68 129 835
156 B2 G 77 985 293
189 B1 D 34 356 938
199 B1 E 45 891 922
;
run ;

data test1 test2 ;
set have ;
by Acct_No ;
if first.Acct_No = 1  then output test1 ;
if  last.Acct_No = 1 then output test2 ;
run ;

proc sql ;
select test1.Acct_No,test1.Bor,test1.Fld1,test1.Bal,((test1.Score+test2.Score)/2) as Score,test1.fld2
from test1 as test1
inner join test2 as test2
on test1.acct_no = test2.acct_no ;
quit ;
Astounding
PROC Star

As a "not a SQL guy", I would just replace all of them:

 

proc summary data=have;

by acct;

var score fld2;

output out=stats (keep=acct score fld2) mean=;

run;

 

If there was only one observation per acct, it doesn't hurt to replace values with the mean.  Then merge back in:

 

data want;

merge have (drop=score fld2) stats;

by acct;

if first.acct;

run;

podarum
Quartz | Level 8

I didn't think it mattered, but I should of stated the whole story... There are about 300 fields in the dataset, didn't want to state them all, most of them are different between Bor1 and Bor2, but I still want to keep all the info from Bor1.  Thanks

Astounding
PROC Star

I'm not sure if the SQL approach expands easily to a large number of fields, unless you apply macro language.  But PROC SUMMARY still makes it easy:

 

proc summary data=have;

by acct;

var _numeric_;

output out=stats (drop=_type_ _freq_) mean=;

run;

 

data want;

merge have (keep=acct bor fld1) stats;

by acct;

if first.acct;

run;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 1042 views
  • 2 likes
  • 6 in conversation