I need to restructure a data set in order to produce a cross tab output/data set. . The question I'm trying to answer is what additional products were purchased after initial sign up purchases and what are the product combinations - So, if a customer bought prod3 at signup they also eventually bought prod2 and prod4 .
The data set I start with looks like this. Notice some of the product purchases dates are the same as the startup date.
ID | Signup_dt | Prod_type | Purchase_Date |
2232 | 4-Jun-14 | prod1 | 4-Jun-14 |
2232 | 4-Jun-14 | prod2 | 4-Jun-14 |
2232 | 4-Jun-14 | prod3 | 4-Jun-14 |
2232 | 4-Jun-14 | prod4 | |
2232 | 4-Jun-14 | prod5 | 4-Aug-14 |
4545 | 12-Jun-14 | prod1 | |
4545 | 12-Jun-14 | prod2 | 13-Sep-14 |
4545 | 12-Jun-14 | prod3 | 12-Jun-14 |
4545 | 12-Jun-14 | prod4 | 12-Nov-14 |
4545 | 12-Jun-14 | prod5 | 12-Jun-14 |
I need to restructure the data set so it looks like this:
ID | startup_month | Poducts | Purchase_at_Start_Up | prod1 | prod2 | prod3 | prod4 | prod5 |
2232 | June | prod1 | 1 | 1 | ||||
2232 | June | prod2 | 1 | 1 | ||||
2232 | June | prod3 | 1 | 1 | ||||
2232 | June | prod4 | ||||||
2232 | June | prod5 | ||||||
4545 | June | prod1 | ||||||
4545 | June | prod2 | ||||||
4545 | June | prod3 | 1 | 1 | 1 | |||
4545 | June | prod4 | ||||||
4545 | June | prod5 | 1 | 1 | 1 |
And then a summary data set which gives the totals/counts of the product combinations
Startup_month | Poducts | Purchase_Start_up_count | prod1_dt | prod2_dt | prod3_dt | prod4_dt | prod5_dt |
June | prod1 | 1 | 1 | ||||
June | prod2 | 1 | 1 | ||||
June | prod3 | 2 | 1 | 2 | |||
June | prod4 | ||||||
June | prod5 | 1 | 1 | 1 |
I'm having trouble visualizing how to program for this. I'm been using different combinations of proc transpose and proc freq but I can't come close to the desired output. Any help is greatly appreciated.
Since you mention that you already tried PROC TRANSPOSE, I'll give you the steps without a tremendous amount of detail ...
1. Remove any records with a missing purchase date.
2. For each customer, find the earliest date.
3. Create two SAS data sets. One holds all purchases on the earliest date (and includes the calculated variable Startup_month). The other holds all later purchases.
4. Transpose PROD_TYPE in each data set, using different prefixes (such as "first_" and "later_").
5. Merge by customer.
6. Take a look at that point ... and make decisions about how to handle multiple purchases of the same product. Array processing should be fairly straightforward but will have to pay attention to missing values caused by transposing.
Good luck.
Thank you astounding. I'm a little confused as to what to transpose. Do I transpose both the first and later data sets or just the later data set?? Below is what I have so far - any additional feedback will be greatly appreciated. Thanks!!!
/* Remove any records with a missing purchase date.*/
/* Identify earliest purchase date */
/*The earliest purchase date is the same as the signup date */
proc sql;
create table have1 as
select
ID,
Signup_dt,
Prod_type,
Purchase_Date
case when Signup_dt=Purchase_Date
then 1 else 0 end as Frist
from have where Purchase_Date ne . ;
quit;
/*create data set with just earliest Purchase_Date .*/
proc sql;
create table earliest as
select
ID,
month(Signup_dt) as Startup_month,
Prod_type,
from have2
where Frist=1;
quit;
/*create data set with all later purchases .*/
proc sql;
create table later as
select
ID,
Prod_type,
Frist
from have2
where Frist=0;
quit;
proc sort data=later;
by Prod_type ID ;
run;
proc transpose data=later out=later2;
by Product_type ;
run;
Robert,
Both data sets get transposed:
BY ID;
VAR PROD_TYPE;
The "first" data set should keep the month around, probably adding an ID statement to do that.
Both transposed data sets should have one observation per ID.
When transposing, you will need to add the PREFIX= option so the new variables in the transposed "first" data set are named something like "First_1", "First_2", "First_3", etc. And the new variables in the transposed "later" data set should be named "Later_1", "Later_2", "Later_3", etc.
It will always help to test a PROC TRANSPOSE and examine the results, then tweak the program a little (such as adding a PREFIX= option) and repeat.
Ok... I'm getting there...I'm not sure what to do after the merge??
proc sort data=earliest;
BY ID Startup_month;
run;
proc transpose data=earliest out=first PREFIX=first;
BY ID Startup_month;
VAR Product_type
;run;
proc sort data=later;
BY ID ;
run;
proc transpose data=later out=later2 PREFIX=later;
BY ID;
VAR Product_type
;run;
Proc sort data= later2;
by id;
Proc sort data= earliest;
by id; run;
data both;
merge first later2;
by id;
run;
Definitely on its way ...
Look at the transposed data sets. You will see the maximum number of "first" products you have for any ID, and the maximum number of "later" products for any ID. Let's say it's 5 and 9. You will need to add to your final DATA step after the MERGE statement. Use arrays for find all the pairs.
array firsts {5} first1-first5;
array laters {9} later1-later5;
array flags {9} flag1-flag9;
do i=1 to 5;
if firsts{i} > ' ' then do j=1 to 9;
if later{j} = 'prod1' then flag1=1;
else if later{j} = 'prod2' then flag2=1;
...
else if later{j} = 'prod9' then flag9=1;
if j=9 then output;
end;
end;
This might give you the middle result you need, or it might just be close ... you will have to take a look and decide.
Good luck.
Thanks again astounding. I used your array example and yes I'm getting close to the output . Also, I tried to program it a different way as well (below) and I think I'm getting the correct output.
For your example, I still need to condense all the "first" variables into one and transform the "later" variables - I could not figure out how to do that.
data want;
set have;
array first {11} first1-first11;
array later {11} later1-later11;
array flags {16} flag1-flag16;
do i=1 to 11;
if first{i} > ' ' then do j=1 to 11;
if later{j} = 'Prod1' then flag1=1;
else if later{j} = 'Prod2' then flag2=1;
else if later{j} = 'Prod3' then flag3=1;
else if later{j} = 'Prod4' then flag4=1;
else if later{j} = 'Prod5' then flag5=1;
else if later{j} = 'Prod6' then flag6=1;
else if later{j} = 'prod7' then flag7=1;
else if later{j} = 'prod8' then flag8=1;
else if later{j} = 'prod9' then flag9=1;
else if later{j} = 'prod10' then flag10=1;
else if later{j} = 'prod11' then flag11=1;
else if later{j} = 'prod12' then flag12=1;
else if later{j} = 'prod13' then flag12=1;
else if later{j} = 'prod14' then flag13=1;
else if later{j} = 'prod15' then flag14=1;
else if later{j} = 'prod16' then flag15=1;
else if later{j} = 'prod17' then flag14=1;
else if later{j} = 'prod18' then flag15=1;
else if later{j} = 'prod19' then flag16=1;
output;
end;run;
I seem to be getting the output I'm looking for (before summarizing the results) just by using the program below
/* Create month variable and check if purchased at signup */
data want1;
set have;
startup_month = strip(put(Signup_dt, monname9.));
if Signup_dt = Purchase_Date then Purchase_at_Start_Up = 1;
drop Purchase_Date Signup_dt;
run;
/* Create dummy flag for transpose*/
data have1;
set have;
if Purchase_Date then flag = 1;
run;
/* Transpose to get product variables by ID */
proc transpose data = have1 out = want2 (drop = _NAME_);
by ID;
id Prod_type;
var flag;
run;
/* Combine the two */
data want;
merge want1 want2;
by ID;
run;
proc sort data=want;
by id;
run;
Well, actually. When I sum my results I don't get the totals of the combinations of products just totals by month for each product. So, I've got to keep on with this.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.