Solved: Re: summarized data: summing by group for some columns and report last...

krish_101 · Posted 07-17-2019 04:29 AM

I want to group by one column, by "id" in this example, to compute sum for some columns and report the last entry for some other columns (those columns would have to be specified).

Say I have this data:

data have;
input id time sales bonus commission toalstock totalcash totalvalue;
     datalines;
       1   1   10     1        1         100       50        200
       1   2   10     1        1         100       50        200
       1   3   15     2        1          80       50        200
       1   4   10     1        1          90       50        200
       1   5   20     2        2         100       50        200
       1   6   23     2        2         133       52        222
       ;

My desired output would be:

data want;
input id time sales bonus commission toalstock totalcash totalvalue;
       1    6    88    9        8        133        52        222
       ;

My main concern is computational efficiency as I'll be running this on tables with many more columns (about 200) and data entry (about 30000 rows) with different ids.

Proc SQL or native SAS are both welcome.

Rows might be shuffled if necessary.

Thanks,

Krish

Kurt_Bremser · Posted 07-17-2019 05:48 AM

I have found a way to utilize proc summary in taking the last value of a group:

data have;
input id time sales bonus commission toalstock totalcash totalvalue;
     datalines;
       1   1   10     1        1         100       50        200
       1   2   10     1        1         100       50        200
       1   3   15     2        1          80       50        200
       1   4   10     1        1          90       50        200
       1   5   20     2        2         100       50        200
       1   6   23     2        2         133       52        222
       ;

proc summary data=have;
by id;
output
  out=want (drop=_type_ _freq_)
  sum(sales bonus commission)=
  idgroup (last out (time toalstock totalcash totalvalue)=)
;
run;

proc print data=want noobs;
run;

Result:

id    sales    bonus    commission    time    toalstock    totalcash    totalvalue

 1      88       9           8          6        133           52           222

As you can see, you now "only" need to populate the sum()= and idgroup() options with the respective variable names.

If you want to keep the original horizontal order of variables, you will need another data step:

data want;
set
  have (obs=0) /* retrieves the order of variables */
  want /* retrieves the data */
;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

View solution in original post

krish_101 · Posted 07-17-2019 04:11 AM

I want to group by one column, by "id" in this example, to compute sum for some columns and report the last entry for some other columns (those columns would have to be specified).

Say I have this data:

data have;
input id time sales bonus commission toalstock totalcash totalvalue;
 datalines;
       1   1   10     1        1         100       50        200
       1   2   10     1        1         100       50        200
       1   3   15     2        1          80       50        200
       1   4   10     1        1          90       50        200
       1   5   20     2        2         100       50        200
       1   6   23     2        2         133       52        222
       
       2...                                                            
       ...
       3.. 
       ;

My desired output would be:

data want;
input id time sales bonus commission toalstock totalcash totalvalue;
       1    6    88    9        8        133        52        222
       2...
       3...
       ;

My main concern is computational efficiency as I'll be running this on tables with many more columns (about 200) and data entry (about 30000 rows).

Proc SQL or native SAS are both welcome.

Rows might be shuffled if necessary.

Thanks,

Krish

PeterClemmensen · Posted 07-17-2019 04:34 AM

data want(drop=_:);
    do until (last.id);
        set have;
        by id;
        _sales+sales; _bonus+bonus; _commission+commission;
    end;
    sales=_sales; bonus=_bonus; commission=_commission;
run;

The DATA to DATA Step Macro
Blog: SASnrd

Kurt_Bremser · Posted 07-17-2019 04:44 AM

My preferred way of doing this is a data step:

data have;
input id time sales bonus commission toalstock totalcash totalvalue;
     datalines;
       1   1   10     1        1         100       50        200
       1   2   10     1        1         100       50        200
       1   3   15     2        1          80       50        200
       1   4   10     1        1          90       50        200
       1   5   20     2        2         100       50        200
       1   6   23     2        2         133       52        222
       ;

data want;
retain id time sales bonus commission toalstock totalcash totalvalue;
set have (
  rename=(sales=_sales bonus=_bonus commission=_commission)
);
by id;
if first.id
then do;
  sales = _sales;
  bonus = _bonus;
  commission = _commission;
end;
else do;
  sales + _sales;
  bonus + _bonus;
  commission + _commission;
end;
if last.id;
drop _:;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

krish_101 · Posted 07-17-2019 05:06 AM

What if I have to do the same on 200+ columns of which some need to be summed and for others the last entry will be reported?

Do I need to define the columns which used last.id in a separate table or using hash object?

Kurt_Bremser · Posted 07-17-2019 05:26 AM

As you can see, my code does not use a hash object for 3 variables, so you won't need it for 200 either.

You have to play the rename-initialize-sum-drop game for all the variables that need summing.

Note:

The retain statement in my code is not necessary; it is mostly there to keep the horizontal order of columns. The retain for tzhe newly created variables is implicitly declared by using the increment statements.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

krish_101 · Posted 07-17-2019 04:55 AM

Thanks for the quick replies.

Please note, in this example the computation is done on 6 columns and the suggested solutions so far work fine in this case.

What if I have to do the same on 200+ columns of which some need to be summed and for others the last entry will be reported?

Do I need to define the columns which used last.id in a separate table or perhaps using hash object?

PeterClemmensen · Posted 07-17-2019 05:08 AM

How many of those +200 columns do you want to sum?

The DATA to DATA Step Macro
Blog: SASnrd

krish_101 · Posted 07-17-2019 05:20 AM

about 100 columns to be summed
and 100 other columns where last id is reported.

PeterClemmensen · Posted 07-17-2019 05:25 AM

Ok. Well, the logic posted by both me and @Kurt_Bremser holds even for several hundred variables. Obviously, you would have to type out the variables that you want to sum. So unless there is some naming convention that is similar among the sum variables, I don't think you will gain much from using an array or a hash object.

The DATA to DATA Step Macro
Blog: SASnrd

Kurt_Bremser · Posted 07-17-2019 05:42 AM

@krish_101 wrote:
about 100 columns to be summed
and 100 other columns where last id is reported.

You are the coder, so it's up to you to tell SAS what to do. If you have some resource that can make the code data-driven (list(s) of variable names in a dataset), then we can show you how to utilize that to automatically create the code.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

andreas_lds · Posted 07-17-2019 05:52 AM

The following step is a small modification of what @Kurt_Bremser posted: it uses two arrays for the variables the need to be summed. Wrapping the step in a macro to further reduce listing the variables should be the next step.

data want;
   retain id time sales bonus commission toalstock totalcash totalvalue;
   set have (
      rename=(sales=_sales bonus=_bonus commission=_commission)
      );
   by id;

   array vars _sales _bonus _commission;
   array sums sales bonus commission;

   if first.id then do;
      do _i = 1 to dim(vars);
         sums[_i] = vars[_i];
      end;
   end;
   else do;
      do _i = 1 to dim(vars);
         sums[_i] + vars[_i];
      end;
   end;

   if last.id;
   drop _:;
run;

Kurt_Bremser · Posted 07-17-2019 05:48 AM

I have found a way to utilize proc summary in taking the last value of a group:

data have;
input id time sales bonus commission toalstock totalcash totalvalue;
     datalines;
       1   1   10     1        1         100       50        200
       1   2   10     1        1         100       50        200
       1   3   15     2        1          80       50        200
       1   4   10     1        1          90       50        200
       1   5   20     2        2         100       50        200
       1   6   23     2        2         133       52        222
       ;

proc summary data=have;
by id;
output
  out=want (drop=_type_ _freq_)
  sum(sales bonus commission)=
  idgroup (last out (time toalstock totalcash totalvalue)=)
;
run;

proc print data=want noobs;
run;

Result:

id    sales    bonus    commission    time    toalstock    totalcash    totalvalue

 1      88       9           8          6        133           52           222

As you can see, you now "only" need to populate the sum()= and idgroup() options with the respective variable names.

If you want to keep the original horizontal order of variables, you will need another data step:

data want;
set
  have (obs=0) /* retrieves the order of variables */
  want /* retrieves the data */
;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

krish_101 · Posted 07-17-2019 06:22 AM

looks great
I created a list(s) of variable names in a dataset called valist which groups the columns to be summed under sumvars

data varlist;
input  sumvars         lastvars;
          name1         name2
          name5         name3

          ...           ...
          name70        name50
          name100       name102
;

proc sql noprint;
select sumvars into : sumvarlist separated by " "
from varlist
;


quit;
proc sql noprint;
select lastvars into : lastvarlist separated by " "
from varlist
;
quit;

modify this part of the code


 sum(&sumvarlist)= idgroup (last out (&lastvarlist)=)

summarized data: summing by group for some columns and report last entry for other columns

Re: summarized data: summing by group for some columns and report last entry for other columns

summing multiple variable by group and report last entry for other variables

Re: summarized data: summing by group for some columns and report last entry for other columns

Re: summarized data: summing by group for some columns and report last entry for other columns

Re: summarized data: summing by group for some columns and report last entry for other columns

Re: summarized data: summing by group for some columns and report last entry for other columns

Re: summarized data: summing by group for some columns and report last entry for other columns

Re: summarized data: summing by group for some columns and report last entry for other columns

Re: summarized data: summing by group for some columns and report last entry for other columns

Re: summarized data: summing by group for some columns and report last entry for other columns

Re: summarized data: summing by group for some columns and report last entry for other columns

Re: summarized data: summing by group for some columns and report last entry for other columns

Re: summarized data: summing by group for some columns and report last entry for other columns

Re: summarized data: summing by group for some columns and report last entry for other columns

Classroom Training Available!