BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
krish_101
Fluorite | Level 6

I want to group by one column, by "id" in this example, to compute sum for some columns and report the last entry for some other columns (those columns would have to be specified).

 

Say I have this data:

data have;
input id time sales bonus commission toalstock totalcash totalvalue;
     datalines;
       1   1   10     1        1         100       50        200
       1   2   10     1        1         100       50        200
       1   3   15     2        1          80       50        200
       1   4   10     1        1          90       50        200
       1   5   20     2        2         100       50        200
       1   6   23     2        2         133       52        222
       ;

 

My desired output would be:

data want;
input id time sales bonus commission toalstock totalcash totalvalue;
       1    6    88    9        8        133        52        222
       ;

 

My main concern is computational efficiency as I'll be running this on tables with many more columns (about 200) and data entry (about 30000 rows) with different ids.

 

Proc SQL or native SAS are both welcome.

Rows might be shuffled if necessary.

 

 

Thanks,

Krish

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

I have found a way to utilize proc summary in taking the last value of a group:

data have;
input id time sales bonus commission toalstock totalcash totalvalue;
     datalines;
       1   1   10     1        1         100       50        200
       1   2   10     1        1         100       50        200
       1   3   15     2        1          80       50        200
       1   4   10     1        1          90       50        200
       1   5   20     2        2         100       50        200
       1   6   23     2        2         133       52        222
       ;

proc summary data=have;
by id;
output
  out=want (drop=_type_ _freq_)
  sum(sales bonus commission)=
  idgroup (last out (time toalstock totalcash totalvalue)=)
;
run;

proc print data=want noobs;
run;

Result:

id    sales    bonus    commission    time    toalstock    totalcash    totalvalue

 1      88       9           8          6        133           52           222   

As you can see, you now "only" need to populate the sum()= and idgroup() options with the respective variable names.

If you want to keep the original horizontal order of variables, you will need another data step:

data want;
set
  have (obs=0) /* retrieves the order of variables */
  want /* retrieves the data */
;
run;

View solution in original post

13 REPLIES 13
krish_101
Fluorite | Level 6

I want to group by one column, by "id" in this example, to compute sum for some columns and report the last entry for some other columns (those columns would have to be specified).

 

Say I have this data:

data have;
input id time sales bonus commission toalstock totalcash totalvalue;
datalines; 1 1 10 1 1 100 50 200 1 2 10 1 1 100 50 200 1 3 15 2 1 80 50 200 1 4 10 1 1 90 50 200 1 5 20 2 2 100 50 200 1 6 23 2 2 133 52 222

2...
... 3.. ;

 

My desired output would be:

 

data want;
input id time sales bonus commission toalstock totalcash totalvalue;
       1    6    88    9        8        133        52        222
       2...
       3...
;

My main concern is computational efficiency as I'll be running this on tables with many more columns (about 200) and data entry (about 30000 rows).

 

Proc SQL or native SAS are both welcome.

Rows might be shuffled if necessary.

 

 

Thanks,

Krish

PeterClemmensen
Tourmaline | Level 20
data want(drop=_:);
    do until (last.id);
        set have;
        by id;
        _sales+sales; _bonus+bonus; _commission+commission;
    end;
    sales=_sales; bonus=_bonus; commission=_commission;
run;
Kurt_Bremser
Super User

My preferred way of doing this is a data step:

data have;
input id time sales bonus commission toalstock totalcash totalvalue;
     datalines;
       1   1   10     1        1         100       50        200
       1   2   10     1        1         100       50        200
       1   3   15     2        1          80       50        200
       1   4   10     1        1          90       50        200
       1   5   20     2        2         100       50        200
       1   6   23     2        2         133       52        222
       ;

data want;
retain id time sales bonus commission toalstock totalcash totalvalue;
set have (
  rename=(sales=_sales bonus=_bonus commission=_commission)
);
by id;
if first.id
then do;
  sales = _sales;
  bonus = _bonus;
  commission = _commission;
end;
else do;
  sales + _sales;
  bonus + _bonus;
  commission + _commission;
end;
if last.id;
drop _:;
run;
krish_101
Fluorite | Level 6
What if I have to do the same on 200+ columns of which some need to be summed and for others the last entry will be reported?

Do I need to define the columns which used last.id in a separate table or using hash object?
Kurt_Bremser
Super User

As you can see, my code does not use a hash object for 3 variables, so you won't need it for 200 either.

You have to play the rename-initialize-sum-drop game for all the variables that need summing.

 

Note:

The retain statement in my code is not necessary; it is mostly there to keep the horizontal order of columns. The retain for tzhe newly created variables is implicitly declared by using the increment statements.

krish_101
Fluorite | Level 6

Thanks for the quick replies.

 

Please note, in this example the computation is done on 6 columns and the suggested solutions so far work fine in this case.

 

What if I have to do the same on 200+ columns of which some need to be summed and for others the last entry will be reported?

Do I need to define the columns which used last.id in a separate table or perhaps using hash object?

 

krish_101
Fluorite | Level 6
about 100 columns to be summed
and 100 other columns where last id is reported.
PeterClemmensen
Tourmaline | Level 20

Ok. Well, the logic posted by both me and @Kurt_Bremser holds even for several hundred variables. Obviously, you would have to type out the variables that you want to sum. So unless there is some naming convention that is similar among the sum variables, I don't think you will gain much from using an array or a hash object.

Kurt_Bremser
Super User

@krish_101 wrote:
about 100 columns to be summed
and 100 other columns where last id is reported.

You are the coder, so it's up to you to tell SAS what to do. If you have some resource that can make the code data-driven (list(s) of variable names in a dataset), then we can show you how to utilize that to automatically create the code.

andreas_lds
Jade | Level 19

The following step is a small modification of what @Kurt_Bremser posted: it uses two arrays for the variables the need to be summed. Wrapping the step in a macro to further reduce listing the variables should be the next step.

 

data want;
   retain id time sales bonus commission toalstock totalcash totalvalue;
   set have (
      rename=(sales=_sales bonus=_bonus commission=_commission)
      );
   by id;

   array vars _sales _bonus _commission;
   array sums sales bonus commission;

   if first.id then do;
      do _i = 1 to dim(vars);
         sums[_i] = vars[_i];
      end;
   end;
   else do;
      do _i = 1 to dim(vars);
         sums[_i] + vars[_i];
      end;
   end;

   if last.id;
   drop _:;
run;
Kurt_Bremser
Super User

I have found a way to utilize proc summary in taking the last value of a group:

data have;
input id time sales bonus commission toalstock totalcash totalvalue;
     datalines;
       1   1   10     1        1         100       50        200
       1   2   10     1        1         100       50        200
       1   3   15     2        1          80       50        200
       1   4   10     1        1          90       50        200
       1   5   20     2        2         100       50        200
       1   6   23     2        2         133       52        222
       ;

proc summary data=have;
by id;
output
  out=want (drop=_type_ _freq_)
  sum(sales bonus commission)=
  idgroup (last out (time toalstock totalcash totalvalue)=)
;
run;

proc print data=want noobs;
run;

Result:

id    sales    bonus    commission    time    toalstock    totalcash    totalvalue

 1      88       9           8          6        133           52           222   

As you can see, you now "only" need to populate the sum()= and idgroup() options with the respective variable names.

If you want to keep the original horizontal order of variables, you will need another data step:

data want;
set
  have (obs=0) /* retrieves the order of variables */
  want /* retrieves the data */
;
run;
krish_101
Fluorite | Level 6

looks great
I created a list(s) of variable names in a dataset called valist which groups the columns to be summed under sumvars

data varlist;
input  sumvars         lastvars;
          name1         name2
          name5         name3

... ... name70 name50 name100 name102 ;
proc sql noprint;
select sumvars into : sumvarlist separated by " "
from varlist
;


quit;
proc sql noprint;
select lastvars into : lastvarlist separated by " "
from varlist
;
quit;

 

modify this part of the code


sum(&sumvarlist)= idgroup (last out (&lastvarlist)=)

 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 13 replies
  • 1441 views
  • 9 likes
  • 4 in conversation