Hi all,
May I ask for your help? Here's the data I have & want. I have difficulty in summarizing both character & numeric variables, as I need to combine the character variable "product_ID" which cannot use proc sql (I thought). Also, I have many numeric variables actually; is there any way that I don't need to list the variable names one by one? Many thanks for your help in advance!
data Have;
input
ID $ visit_date :mmddyy8. product_ID $ Plan_A Plan_B Plan_C Cost;
format visit_date mmddyy8.;
datalines;
AB 10/01/98 ABCD 50 10 20 30
AB 10/01/98 EFGH 50 10 20 30
AB 10/02/98 ABCD 50 10 20 30
AC 10/01/98 ABCD 50 10 20 30
AC 10/01/98 EFGH 50 10 20 30
AD 10/01/98 ABCD 50 10 20 30
AD 10/02/98 EFGH 50 10 20 30
AD 10/03/98 HIJK 50 10 20 30
;
run;
data Want;
input
ID $ visit_date :mmddyy8. product_ID $ Plan_A Plan_B Plan_C Cost;
format visit_date mmddyy8.;
datalines;
AB 10/01/98 ABCD&EFGH 100 20 40 60
AB 10/02/98 ABCD 50 10 20 30
AC 10/01/98 ABCD&EFGH 100 20 40 60
AD 10/01/98 ABCD 50 10 20 30
AD 10/02/98 EFGH 50 10 20 30
AD 10/03/98 HIJK 50 10 20 30
;
run;
The basic data step for summarizing looks like this:
/* set up the data step, with new and incoming dataset, and defining groups */
data want;
set have;
by id visit_date;
/* define new variables */
length
all_id $100
sum_plan_a 8
sum_plan_b 8
sum_plan_c 8
sum_cost 8
;
/* add a FORMAT statement if formats are needed */
/* RETAIN them */
retain
all_id
sum_plan_a
sum_plan_b
sum_plan_c
sum_cost
;
/* Initialize at beginning of group */
if first.visit_date
then do;
all_id = product_id;
sum_plan_a = plan_a;
sum_plan_b = plan_b;
sum_plan_c = plan_c;
sum_cost = cost;
end;
/* otherwise summarize */
else do;
all_id = catx('&',all_id,product_id);
sum_plan_a + plan_a;
sum_plan_b + plan_b;
sum_plan_c + plan_c;
sum_cost + cost;
end;
/* output at end of group only with a Subsetting IF */
if last.visit_date;
/* KEEP only variables of interest */
keep
id
visit_date
all_id
sum_plan_a
sum_plan_b
sum_plan_c
sum_cost
;
run;
This is the code that the data step compiler needs to see; you now need to decide the more efficient approach: write it all out manually (the above code needed about 5 minutes to write), or expend the effort needed to automate it. Which only makes sense if this is to be used repeatedly on datasets with changing structure.
Automation needs a clear definition of variables to be automatically summarized, so you can pull their names from DICTIONARY.COLUMNS or SASHELP.VCOLUMN and use that to create code dynamically. This definition can be based on name patterns, types, formats, you name it.
You also need to clearly define what kind of summarization you want (just sums, or min/max/averages, or some kind of flags) for each variable.
You can also do a multiple-step approach by using PROC SUMMARY for the numeric variables:
data want1;
set have;
by id visit_date;
length
all_id $100
;
length
all_id $100
;
retain
all_id
;
if first.visit_date
then do;
all_id = product_id;
end;
else do;
all_id = catx('&',all_id,product_id);
end;
if last.visit_date;
keep
id
visit_date
all_id
;
run;
proc summary data=have;
by id visit_date;
var plan_a--cost;
output out=want2 (drop=_type_ _freq_) sum()=;
run;
data want;
merge
want1
want2
;
by id visit_date;
run;
Note that I could not use the keyword _NUMERIC_ in the VAR statement because of the presence of the numeric date.
Concatenating character values over a group is done in a DATA step with a RETAINed variable.
data want;
set have;
by id visit_date;
retain all_id;
length all_id $100;
if first.visit_date
then all_id = product_id;
else all_id = catx('&',all_id,product_id);
/* add similar code for summarizing the numeric variables */
if last.visit_date;
run;
Thanks alot Kurt! It works very good for the first part of character variables. But for numeric variables, sorry as I am not very familiar with the summing in data-set (only know bit about proc sql). Is there any way I can summarize the variable by visit_date e.g. plan_A--Cost without listing each variable name one by one? Thanks so much!
The basic data step for summarizing looks like this:
/* set up the data step, with new and incoming dataset, and defining groups */
data want;
set have;
by id visit_date;
/* define new variables */
length
all_id $100
sum_plan_a 8
sum_plan_b 8
sum_plan_c 8
sum_cost 8
;
/* add a FORMAT statement if formats are needed */
/* RETAIN them */
retain
all_id
sum_plan_a
sum_plan_b
sum_plan_c
sum_cost
;
/* Initialize at beginning of group */
if first.visit_date
then do;
all_id = product_id;
sum_plan_a = plan_a;
sum_plan_b = plan_b;
sum_plan_c = plan_c;
sum_cost = cost;
end;
/* otherwise summarize */
else do;
all_id = catx('&',all_id,product_id);
sum_plan_a + plan_a;
sum_plan_b + plan_b;
sum_plan_c + plan_c;
sum_cost + cost;
end;
/* output at end of group only with a Subsetting IF */
if last.visit_date;
/* KEEP only variables of interest */
keep
id
visit_date
all_id
sum_plan_a
sum_plan_b
sum_plan_c
sum_cost
;
run;
This is the code that the data step compiler needs to see; you now need to decide the more efficient approach: write it all out manually (the above code needed about 5 minutes to write), or expend the effort needed to automate it. Which only makes sense if this is to be used repeatedly on datasets with changing structure.
Automation needs a clear definition of variables to be automatically summarized, so you can pull their names from DICTIONARY.COLUMNS or SASHELP.VCOLUMN and use that to create code dynamically. This definition can be based on name patterns, types, formats, you name it.
You also need to clearly define what kind of summarization you want (just sums, or min/max/averages, or some kind of flags) for each variable.
You can also do a multiple-step approach by using PROC SUMMARY for the numeric variables:
data want1;
set have;
by id visit_date;
length
all_id $100
;
length
all_id $100
;
retain
all_id
;
if first.visit_date
then do;
all_id = product_id;
end;
else do;
all_id = catx('&',all_id,product_id);
end;
if last.visit_date;
keep
id
visit_date
all_id
;
run;
proc summary data=have;
by id visit_date;
var plan_a--cost;
output out=want2 (drop=_type_ _freq_) sum()=;
run;
data want;
merge
want1
want2
;
by id visit_date;
run;
Note that I could not use the keyword _NUMERIC_ in the VAR statement because of the presence of the numeric date.
data Have;
input
ID $ visit_date :mmddyy8. product_ID $ Plan_A Plan_B Plan_C Cost;
format visit_date mmddyy8.;
datalines;
AB 10/01/98 ABCD 50 10 20 30
AB 10/01/98 EFGH 50 10 20 30
AB 10/02/98 ABCD 50 10 20 30
AC 10/01/98 ABCD 50 10 20 30
AC 10/01/98 EFGH 50 10 20 30
AD 10/01/98 ABCD 50 10 20 30
AD 10/02/98 EFGH 50 10 20 30
AD 10/03/98 HIJK 50 10 20 30
;
proc summary data=have nway;
class id visit_date;
var _numeric_;
output out=part1 sum=;
run;
data part2;
do until(last.visit_date);
set have;
by id visit_date;
length products $ 200;
products=catx('&',products,product_ID);
end;
keep id visit_date products;
run;
data want;
merge part2 part1;
by id visit_date;
drop _type_ _freq_;
run;
Thank you so much Kurt and Ksharp!! Both solutions worked so well and I have used them for my real data. I actually spent 2 days but without proper training, I really couldn't figure that out. Million thanks for the guidance and illustration from Kurt and Ksharp! Have a good day!
Use the FINDW() function to tell whether the PRODUCT_ID is duplicated.
data Have;
input ID $ visit_date :mmddyy8. product_ID $ Plan_A Plan_B Plan_C Cost;
format visit_date yymmdd10.;
datalines;
AB 10/01/98 ABCD 50 10 20 30
AB 10/01/98 EFGH 50 10 20 30
AB 10/02/98 ABCD 50 10 20 30
AC 10/01/98 ABCD 50 10 20 30
AC 10/01/98 EFGH 50 10 20 30
AD 10/01/98 ABCD 50 10 20 30
AD 10/02/98 EFGH 50 10 20 30
AD 10/03/98 HIJK 50 10 20 30
;
data want;
if 0 then set have;
length product_list $200;
array raw plan_a plan_b plan_c cost;
array total sum_a sum_b sum_c sum_cost ;
do until (last.visit_date);
set have;
by id visit_date ;
if not findw(product_list,product_id,',','it') then product_list=catx(',',product_list,product_id);
do index=1 to dim(raw);
total[index]=sum(total[index],raw[index]);
end;
end;
drop index product_id plan_a plan_b plan_c cost;
rename
product_list=product_ID
sum_a=Plan_A
sum_b=Plan_B
sum_c=Plan_C
sum_cost=Cost
;
run;
Results
product_ Obs ID visit_date ID Plan_A Plan_B Plan_C Cost 1 AB 1998-10-01 ABCD,EFGH 100 20 40 60 2 AB 1998-10-02 ABCD 50 10 20 30 3 AC 1998-10-01 ABCD,EFGH 100 20 40 60 4 AD 1998-10-01 ABCD 50 10 20 30 5 AD 1998-10-02 EFGH 50 10 20 30 6 AD 1998-10-03 HIJK 50 10 20 30
PS Do NOT use only two years for dates. Also avoid using MDY or DMY order to display dates as either choice will confuse half of your audience.
Do you really want to write your own summary using data step?
Why not let PROC SUMMARY do it?
proc summary data=have nway;
class id visit_date ;
var _numeric_;
output out=want(drop=_type_ ) sum=
idgroup (out[5] (product_id)=)
;
run;
If you really want the PRODUCT_ID stuffed into a single variable then add a post-processing step.
data want ;
set want ;
length product_ID $200;
product_id=catx(',',of product_id_:);
drop product_id_: ;
run;
Results
product_ Obs ID visit_date _FREQ_ Plan_A Plan_B Plan_C Cost ID 1 AB 1998-10-01 2 100 20 40 60 ABCD,EFGH 2 AB 1998-10-02 1 50 10 20 30 ABCD 3 AC 1998-10-01 2 100 20 40 60 ABCD,EFGH 4 AD 1998-10-01 1 50 10 20 30 ABCD 5 AD 1998-10-02 1 50 10 20 30 EFGH 6 AD 1998-10-03 1 50 10 20 30 HIJK
Thanks alot Tom! I agree; like you, Kurt and Ksharp mentioned, proc summary is definitely a much better way as I only knew proc sql before. Also, thanks for pointing out about the date format and will sure note to use "yymmdd10." in the future. Great thanks Tom!
Proc IML solution.
@Rick_SAS how can I concatenate more efficiently a char vector like a={"abcd", "xyz"} into a single string with delimiter like "|" between them? So I want to get "abcd|xyz".
proc iml;
use have;
read all var _CHAR_ into y[colname=varny1];
read all var _num_ into x[colname=varny2];
close;
_date_id=catx("_", putn(x[,{"visit_date"}], 'ddmmyy10.'), y[,{"id"}]);
u=unique(_date_id);
res=j(ncol(u), 5,.);
_prod=j(ncol(u),2," ");
do i=1 to ncol(u);
idx=loc(element(_date_id, u[i]));
res[i, ]= x[idx[1], 1] || x[idx, 2:5] [+,];
_prod[i,1]=u[i];
temp=y[idx, {"product_id"}]`;
if ncol(temp)=1 then _prod[i,2]=temp[1];
if ncol(temp) > 1 then do;
do n=1 to ncol(temp);
_prod[i,2]=strip(_prod[i,2]) + "&" + temp[n];
end;
end;
end;
create _x from res [colname=varny2];
append from res;
close;
create _y from _prod[colname={"date+id", "product_combi"}];
append from _prod;
close;
data want;
merge _x _y;
format visit_date date9.;
run;
Hi @acordes The easiest way to see if I have solved a problem is to use the SITE: in your favorite search engine. For example, type the following into a search engine:
concatenate string delimiter site:blogs.sas.com/content/iml
You will find "Tips for concatenating strings in SAS/IML," which describes this problem and provides a solution.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.