Re: Restructuring a Data Set for Cross Tabulation

RobertNYC · Posted 09-29-2014 02:15 PM

I need to restructure a data set in order to produce a cross tab output/data set. . The question I'm trying to answer is what additional products were purchased after initial sign up purchases and what are the product combinations - So, if a customer bought prod3 at signup they also eventually bought prod2 and prod4 .

The data set I start with looks like this. Notice some of the product purchases dates are the same as the startup date.

ID	Signup_dt	Prod_type	Purchase_Date
2232	4-Jun-14	prod1	4-Jun-14
2232	4-Jun-14	prod2	4-Jun-14
2232	4-Jun-14	prod3	4-Jun-14
2232	4-Jun-14	prod4
2232	4-Jun-14	prod5	4-Aug-14
4545	12-Jun-14	prod1
4545	12-Jun-14	prod2	13-Sep-14
4545	12-Jun-14	prod3	12-Jun-14
4545	12-Jun-14	prod4	12-Nov-14
4545	12-Jun-14	prod5	12-Jun-14

I need to restructure the data set so it looks like this:

ID	startup_month	Poducts	Purchase_at_Start_Up	prod1	prod2	prod3	prod4	prod5
2232	June	prod1	1					1
2232	June	prod2	1					1
2232	June	prod3	1					1
2232	June	prod4
2232	June	prod5
4545	June	prod1
4545	June	prod2
4545	June	prod3	1		1			1
4545	June	prod4
4545	June	prod5	1		1			1

And then a summary data set which gives the totals/counts of the product combinations

Startup_month	Poducts	Purchase_Start_up_count	prod1_dt	prod2_dt	prod3_dt	prod4_dt	prod5_dt
June	prod1	1					1
June	prod2	1					1
June	prod3	2		1			2
June	prod4
June	prod5	1		1			1

I'm having trouble visualizing how to program for this. I'm been using different combinations of proc transpose and proc freq but I can't come close to the desired output. Any help is greatly appreciated.

Astounding · Posted 09-29-2014 03:49 PM

Since you mention that you already tried PROC TRANSPOSE, I'll give you the steps without a tremendous amount of detail ...

1. Remove any records with a missing purchase date.

2. For each customer, find the earliest date.

3. Create two SAS data sets. One holds all purchases on the earliest date (and includes the calculated variable Startup_month). The other holds all later purchases.

4. Transpose PROD_TYPE in each data set, using different prefixes (such as "first_" and "later_").

5. Merge by customer.

6. Take a look at that point ... and make decisions about how to handle multiple purchases of the same product. Array processing should be fairly straightforward but will have to pay attention to missing values caused by transposing.

Good luck.

RobertNYC · Posted 09-29-2014 05:03 PM

Thank you astounding. I'm a little confused as to what to transpose. Do I transpose both the first and later data sets or just the later data set?? Below is what I have so far - any additional feedback will be greatly appreciated. Thanks!!!

/* Remove any records with a missing purchase date.*/

/* Identify earliest purchase date */

/*The earliest purchase date is the same as the signup date */

proc sql;

create table have1 as

select

ID,

Signup_dt,

Prod_type,

Purchase_Date

case when Signup_dt=Purchase_Date

then 1 else 0 end as Frist

from have where Purchase_Date ne . ;

quit;

/*create data set with just earliest Purchase_Date .*/

proc sql;

create table earliest as

select

ID,

month(Signup_dt) as Startup_month,

Prod_type,

from have2

where Frist=1;

quit;

/*create data set with all later purchases .*/

proc sql;

create table later as

select

ID,

Prod_type,

Frist

from have2

where Frist=0;

quit;

proc sort data=later;

by Prod_type ID ;

run;

proc transpose data=later out=later2;

by Product_type ;

run;

Astounding · Posted 09-29-2014 05:20 PM

Robert,

Both data sets get transposed:

BY ID;

VAR PROD_TYPE;

The "first" data set should keep the month around, probably adding an ID statement to do that.

Both transposed data sets should have one observation per ID.

When transposing, you will need to add the PREFIX= option so the new variables in the transposed "first" data set are named something like "First_1", "First_2", "First_3", etc. And the new variables in the transposed "later" data set should be named "Later_1", "Later_2", "Later_3", etc.

It will always help to test a PROC TRANSPOSE and examine the results, then tweak the program a little (such as adding a PREFIX= option) and repeat.

RobertNYC · Posted 09-29-2014 06:00 PM

Ok... I'm getting there...I'm not sure what to do after the merge??

proc sort data=earliest;

BY ID Startup_month;

run;

proc transpose data=earliest out=first PREFIX=first;

BY ID Startup_month;

VAR Product_type

;run;

proc sort data=later;

BY ID ;

run;

proc transpose data=later out=later2 PREFIX=later;

BY ID;

VAR Product_type

;run;

Proc sort data= later2;

by id;

Proc sort data= earliest;

by id; run;

data both;

merge first later2;

by id;

run;

Astounding · Posted 09-29-2014 09:37 PM

Definitely on its way ...

Look at the transposed data sets. You will see the maximum number of "first" products you have for any ID, and the maximum number of "later" products for any ID. Let's say it's 5 and 9. You will need to add to your final DATA step after the MERGE statement. Use arrays for find all the pairs.

array firsts {5} first1-first5;

array laters {9} later1-later5;

array flags {9} flag1-flag9;

do i=1 to 5;

if firsts{i} > ' ' then do j=1 to 9;

if later{j} = 'prod1' then flag1=1;

else if later{j} = 'prod2' then flag2=1;

...

else if later{j} = 'prod9' then flag9=1;

if j=9 then output;

end;

This might give you the middle result you need, or it might just be close ... you will have to take a look and decide.

Good luck.

RobertNYC · Posted 09-30-2014 10:54 AM

Thanks again astounding. I used your array example and yes I'm getting close to the output . Also, I tried to program it a different way as well (below) and I think I'm getting the correct output.

For your example, I still need to condense all the "first" variables into one and transform the "later" variables - I could not figure out how to do that.

data want;

set have;

array first {11} first1-first11;

array later {11} later1-later11;

array flags {16} flag1-flag16;

do i=1 to 11;

if first{i} > ' ' then do j=1 to 11;

if later{j} = 'Prod1' then flag1=1;

else if later{j} = 'Prod2' then flag2=1;

else if later{j} = 'Prod3' then flag3=1;

else if later{j} = 'Prod4' then flag4=1;

else if later{j} = 'Prod5' then flag5=1;

else if later{j} = 'Prod6' then flag6=1;

else if later{j} = 'prod7' then flag7=1;

else if later{j} = 'prod8' then flag8=1;

else if later{j} = 'prod9' then flag9=1;

else if later{j} = 'prod10' then flag10=1;

else if later{j} = 'prod11' then flag11=1;

else if later{j} = 'prod12' then flag12=1;

else if later{j} = 'prod13' then flag12=1;

else if later{j} = 'prod14' then flag13=1;

else if later{j} = 'prod15' then flag14=1;

else if later{j} = 'prod16' then flag15=1;

else if later{j} = 'prod17' then flag14=1;

else if later{j} = 'prod18' then flag15=1;

else if later{j} = 'prod19' then flag16=1;

output;

end;run;

I seem to be getting the output I'm looking for (before summarizing the results) just by using the program below

/* Create month variable and check if purchased at signup */

data want1;

set have;

startup_month = strip(put(Signup_dt, monname9.));

if Signup_dt = Purchase_Date then Purchase_at_Start_Up = 1;

drop Purchase_Date Signup_dt;

run;

/* Create dummy flag for transpose*/

data have1;

set have;

if Purchase_Date then flag = 1;

run;

/* Transpose to get product variables by ID */

proc transpose data = have1 out = want2 (drop = _NAME_);

by ID;

id Prod_type;

var flag;

run;

/* Combine the two */

data want;

merge want1 want2;

by ID;

run;

proc sort data=want;

by id;

run;

RobertNYC · Posted 09-30-2014 11:05 AM

Well, actually. When I sum my results I don't get the totals of the combinations of products just totals by month for each product. So, I've got to keep on with this.

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away