Solved: How to summarize both character & numeric variables by id?

nop · Posted 12-10-2021 02:52 PM

Hi all,

May I ask for your help? Here's the data I have & want. I have difficulty in summarizing both character & numeric variables, as I need to combine the character variable "product_ID" which cannot use proc sql (I thought). Also, I have many numeric variables actually; is there any way that I don't need to list the variable names one by one? Many thanks for your help in advance!

data Have;
input
ID $ visit_date :mmddyy8. product_ID $ Plan_A Plan_B Plan_C Cost;
format visit_date mmddyy8.;
datalines;
AB 10/01/98 ABCD 50 10 20 30
AB 10/01/98 EFGH 50 10 20 30
AB 10/02/98 ABCD 50 10 20 30
AC 10/01/98 ABCD 50 10 20 30
AC 10/01/98 EFGH 50 10 20 30
AD 10/01/98 ABCD 50 10 20 30
AD 10/02/98 EFGH 50 10 20 30
AD 10/03/98 HIJK 50 10 20 30
;
run;

data Want;
input
ID $ visit_date :mmddyy8. product_ID $ Plan_A Plan_B Plan_C Cost;
format visit_date mmddyy8.;
datalines;
AB 10/01/98 ABCD&EFGH 100 20 40 60
AB 10/02/98 ABCD 50 10 20 30
AC 10/01/98 ABCD&EFGH 100 20 40 60
AD 10/01/98 ABCD 50 10 20 30
AD 10/02/98 EFGH 50 10 20 30
AD 10/03/98 HIJK 50 10 20 30
;
run;

Kurt_Bremser · Posted 12-11-2021 05:25 AM

The basic data step for summarizing looks like this:

/* set up the data step, with new and incoming dataset, and defining groups */
data want;
set have;
by id visit_date;
/* define new variables */
length
  all_id $100
  sum_plan_a 8
  sum_plan_b 8
  sum_plan_c 8
  sum_cost 8
;
/* add a FORMAT statement if formats are needed */
/* RETAIN them */
retain
  all_id
  sum_plan_a
  sum_plan_b
  sum_plan_c
  sum_cost
;
/* Initialize at beginning of group */
if first.visit_date
then do;
  all_id = product_id;
  sum_plan_a = plan_a;
  sum_plan_b = plan_b;
  sum_plan_c = plan_c;
  sum_cost = cost;
end;
/* otherwise summarize */
else do;
  all_id = catx('&',all_id,product_id);
  sum_plan_a + plan_a;
  sum_plan_b + plan_b;
  sum_plan_c + plan_c;
  sum_cost + cost;
end;
/* output at end of group only with a Subsetting IF */
if last.visit_date;
/* KEEP only variables of interest */
keep
  id
  visit_date
  all_id
  sum_plan_a
  sum_plan_b
  sum_plan_c
  sum_cost
;
run;

This is the code that the data step compiler needs to see; you now need to decide the more efficient approach: write it all out manually (the above code needed about 5 minutes to write), or expend the effort needed to automate it. Which only makes sense if this is to be used repeatedly on datasets with changing structure.

Automation needs a clear definition of variables to be automatically summarized, so you can pull their names from DICTIONARY.COLUMNS or SASHELP.VCOLUMN and use that to create code dynamically. This definition can be based on name patterns, types, formats, you name it.

You also need to clearly define what kind of summarization you want (just sums, or min/max/averages, or some kind of flags) for each variable.

You can also do a multiple-step approach by using PROC SUMMARY for the numeric variables:

data want1;
set have;
by id visit_date;
length
  all_id $100
;
length
  all_id $100
;
retain
  all_id
;
if first.visit_date
then do;
  all_id = product_id;
end;
else do;
  all_id = catx('&',all_id,product_id);
end;
if last.visit_date;
keep
  id
  visit_date
  all_id
;
run;

proc summary data=have;
by id visit_date;
var plan_a--cost;
output out=want2 (drop=_type_ _freq_) sum()=;
run;

data want;
merge
  want1
  want2
;
by id visit_date;
run;

Note that I could not use the keyword _NUMERIC_ in the VAR statement because of the presence of the numeric date.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

View solution in original post

Kurt_Bremser · Posted 12-10-2021 04:36 PM

Concatenating character values over a group is done in a DATA step with a RETAINed variable.

data want;
set have;
by id visit_date;
retain all_id;
length all_id $100;
if first.visit_date
then all_id = product_id;
else all_id = catx('&',all_id,product_id);
/* add similar code for summarizing the numeric variables */
if last.visit_date;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

nop · Posted 12-11-2021 04:12 AM

Thanks alot Kurt! It works very good for the first part of character variables. But for numeric variables, sorry as I am not very familiar with the summing in data-set (only know bit about proc sql). Is there any way I can summarize the variable by visit_date e.g. plan_A--Cost without listing each variable name one by one? Thanks so much!

Kurt_Bremser · Posted 12-11-2021 05:25 AM

The basic data step for summarizing looks like this:

/* set up the data step, with new and incoming dataset, and defining groups */
data want;
set have;
by id visit_date;
/* define new variables */
length
  all_id $100
  sum_plan_a 8
  sum_plan_b 8
  sum_plan_c 8
  sum_cost 8
;
/* add a FORMAT statement if formats are needed */
/* RETAIN them */
retain
  all_id
  sum_plan_a
  sum_plan_b
  sum_plan_c
  sum_cost
;
/* Initialize at beginning of group */
if first.visit_date
then do;
  all_id = product_id;
  sum_plan_a = plan_a;
  sum_plan_b = plan_b;
  sum_plan_c = plan_c;
  sum_cost = cost;
end;
/* otherwise summarize */
else do;
  all_id = catx('&',all_id,product_id);
  sum_plan_a + plan_a;
  sum_plan_b + plan_b;
  sum_plan_c + plan_c;
  sum_cost + cost;
end;
/* output at end of group only with a Subsetting IF */
if last.visit_date;
/* KEEP only variables of interest */
keep
  id
  visit_date
  all_id
  sum_plan_a
  sum_plan_b
  sum_plan_c
  sum_cost
;
run;

This is the code that the data step compiler needs to see; you now need to decide the more efficient approach: write it all out manually (the above code needed about 5 minutes to write), or expend the effort needed to automate it. Which only makes sense if this is to be used repeatedly on datasets with changing structure.

Automation needs a clear definition of variables to be automatically summarized, so you can pull their names from DICTIONARY.COLUMNS or SASHELP.VCOLUMN and use that to create code dynamically. This definition can be based on name patterns, types, formats, you name it.

You also need to clearly define what kind of summarization you want (just sums, or min/max/averages, or some kind of flags) for each variable.

You can also do a multiple-step approach by using PROC SUMMARY for the numeric variables:

data want1;
set have;
by id visit_date;
length
  all_id $100
;
length
  all_id $100
;
retain
  all_id
;
if first.visit_date
then do;
  all_id = product_id;
end;
else do;
  all_id = catx('&',all_id,product_id);
end;
if last.visit_date;
keep
  id
  visit_date
  all_id
;
run;

proc summary data=have;
by id visit_date;
var plan_a--cost;
output out=want2 (drop=_type_ _freq_) sum()=;
run;

data want;
merge
  want1
  want2
;
by id visit_date;
run;

Note that I could not use the keyword _NUMERIC_ in the VAR statement because of the presence of the numeric date.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Ksharp · Posted 12-11-2021 05:54 AM


data Have;
input
ID $ visit_date :mmddyy8. product_ID $ Plan_A Plan_B Plan_C Cost;
format visit_date mmddyy8.;
datalines;
AB 10/01/98 ABCD 50 10 20 30
AB 10/01/98 EFGH 50 10 20 30
AB 10/02/98 ABCD 50 10 20 30
AC 10/01/98 ABCD 50 10 20 30
AC 10/01/98 EFGH 50 10 20 30
AD 10/01/98 ABCD 50 10 20 30
AD 10/02/98 EFGH 50 10 20 30
AD 10/03/98 HIJK 50 10 20 30
;

proc summary data=have nway;
class id visit_date;
var _numeric_;
output out=part1 sum=;
run;

data part2;
 do until(last.visit_date);
  set have;
  by id visit_date;
  length products $ 200;
  products=catx('&',products,product_ID);
 end;
keep id visit_date products;
run;

data want;
 merge part2 part1;
 by id visit_date;
 drop _type_ _freq_;
run;

nop · Posted 12-11-2021 10:47 AM

Thank you so much Kurt and Ksharp!! Both solutions worked so well and I have used them for my real data. I actually spent 2 days but without proper training, I really couldn't figure that out. Million thanks for the guidance and illustration from Kurt and Ksharp! Have a good day!

Tom · Posted 12-11-2021 10:48 AM

Use the FINDW() function to tell whether the PRODUCT_ID is duplicated.

data Have;
  input ID $ visit_date :mmddyy8. product_ID $ Plan_A Plan_B Plan_C Cost;
  format visit_date yymmdd10.;
datalines;
AB 10/01/98 ABCD 50 10 20 30
AB 10/01/98 EFGH 50 10 20 30
AB 10/02/98 ABCD 50 10 20 30
AC 10/01/98 ABCD 50 10 20 30
AC 10/01/98 EFGH 50 10 20 30
AD 10/01/98 ABCD 50 10 20 30
AD 10/02/98 EFGH 50 10 20 30
AD 10/03/98 HIJK 50 10 20 30
;

data want;
  if 0 then set have;
  length product_list $200;
  array raw plan_a plan_b plan_c cost;
  array total sum_a sum_b sum_c sum_cost ;
  do until (last.visit_date);
    set have;
    by id visit_date ;
    if not findw(product_list,product_id,',','it') then product_list=catx(',',product_list,product_id);
    do index=1 to dim(raw);
      total[index]=sum(total[index],raw[index]);
    end;
  end;
  drop index product_id plan_a plan_b plan_c cost;
  rename
    product_list=product_ID
    sum_a=Plan_A
    sum_b=Plan_B
    sum_c=Plan_C
    sum_cost=Cost
  ; 
run;

Results

                           product_
Obs    ID    visit_date    ID           Plan_A    Plan_B    Plan_C    Cost

 1     AB    1998-10-01    ABCD,EFGH      100       20        40       60
 2     AB    1998-10-02    ABCD            50       10        20       30
 3     AC    1998-10-01    ABCD,EFGH      100       20        40       60
 4     AD    1998-10-01    ABCD            50       10        20       30
 5     AD    1998-10-02    EFGH            50       10        20       30
 6     AD    1998-10-03    HIJK            50       10        20       30

PS Do NOT use only two years for dates. Also avoid using MDY or DMY order to display dates as either choice will confuse half of your audience.

Tom · Posted 12-11-2021 10:56 AM

Do you really want to write your own summary using data step?

Why not let PROC SUMMARY do it?

proc summary data=have nway;
  class id visit_date ;
  var _numeric_;
  output out=want(drop=_type_ ) sum=
         idgroup (out[5] (product_id)=)
  ;
run;

If you really want the PRODUCT_ID stuffed into a single variable then add a post-processing step.

data want ;
  set want ;
  length product_ID $200;
  product_id=catx(',',of product_id_:);
  drop product_id_: ;
run;

Results

                                                                           product_
Obs    ID    visit_date    _FREQ_    Plan_A    Plan_B    Plan_C    Cost    ID

 1     AB    1998-10-01       2        100       20        40       60     ABCD,EFGH
 2     AB    1998-10-02       1         50       10        20       30     ABCD
 3     AC    1998-10-01       2        100       20        40       60     ABCD,EFGH
 4     AD    1998-10-01       1         50       10        20       30     ABCD
 5     AD    1998-10-02       1         50       10        20       30     EFGH
 6     AD    1998-10-03       1         50       10        20       30     HIJK

nop · Posted 12-11-2021 01:34 PM

Thanks alot Tom! I agree; like you, Kurt and Ksharp mentioned, proc summary is definitely a much better way as I only knew proc sql before. Also, thanks for pointing out about the date format and will sure note to use "yymmdd10." in the future. Great thanks Tom!

acordes · Posted 12-13-2021 12:11 PM

Proc IML solution.

@Rick_SAS how can I concatenate more efficiently a char vector like a={"abcd", "xyz"} into a single string with delimiter like "|" between them? So I want to get "abcd|xyz".

proc iml;
use have;
read all var _CHAR_  into y[colname=varny1];
read all var _num_  into x[colname=varny2];
close;

_date_id=catx("_", putn(x[,{"visit_date"}], 'ddmmyy10.'), y[,{"id"}]);

u=unique(_date_id);

res=j(ncol(u), 5,.);
_prod=j(ncol(u),2,"                               ");

do i=1 to ncol(u);
idx=loc(element(_date_id, u[i]));
res[i, ]= x[idx[1], 1] || x[idx, 2:5] [+,];
_prod[i,1]=u[i];
temp=y[idx, {"product_id"}]`;
if ncol(temp)=1 then _prod[i,2]=temp[1];
if ncol(temp) > 1 then do;
do n=1 to ncol(temp);
_prod[i,2]=strip(_prod[i,2]) + "&" + temp[n];
end;
end;
end;

create _x from res  [colname=varny2];
append from res;
close;

create _y from _prod[colname={"date+id", "product_combi"}];
append from _prod;
close;

data want;
merge _x _y;
format visit_date date9.;
run;

Rick_SAS · Posted 12-13-2021 01:58 PM

Hi @acordes The easiest way to see if I have solved a problem is to use the SITE: in your favorite search engine. For example, type the following into a search engine:

concatenate string delimiter site:blogs.sas.com/content/iml

You will find "Tips for concatenating strings in SAS/IML," which describes this problem and provides a solution.

How to summarize both character & numeric variables by id?

Re: How to summarize both character & numeric variables by id?

Re: How to summarize both character & numeric variables by id?

Re: How to summarize both character & numeric variables by id?

Re: How to summarize both character & numeric variables by id?

Re: How to summarize both character & numeric variables by id?

Re: How to summarize both character & numeric variables by id?

Re: How to summarize both character & numeric variables by id?

Re: How to summarize both character & numeric variables by id?

Re: How to summarize both character & numeric variables by id?

Re: How to summarize both character & numeric variables by id?

Re: How to summarize both character & numeric variables by id?

Registration is open

SAS Training: Just a Click Away