DATA Step, Macro, Functions and more

Restructuring a Data Set for Cross Tabulation

Reply
Frequent Contributor
Posts: 101

Restructuring a Data Set for Cross Tabulation

I need to restructure a data set in order to produce a cross tab output/data set. . The question I'm trying to answer is  what additional products were purchased after initial sign up purchases and what are the product combinations -  So, if  a customer bought prod3 at signup they also eventually bought prod2 and prod4 .

The data set I start with looks like this. Notice some of the product purchases dates are the same as the startup date. 

ID Signup_dtProd_typePurchase_Date
22324-Jun-14prod14-Jun-14
22324-Jun-14prod24-Jun-14
22324-Jun-14prod34-Jun-14
22324-Jun-14prod4
22324-Jun-14prod54-Aug-14
454512-Jun-14prod1
454512-Jun-14prod213-Sep-14
454512-Jun-14prod312-Jun-14
454512-Jun-14prod412-Nov-14
454512-Jun-14prod512-Jun-14

I need to restructure the data set so it looks like this:

ID startup_monthPoductsPurchase_at_Start_Upprod1prod2prod3prod4prod5
2232Juneprod111
2232Juneprod211
2232Juneprod311
2232Juneprod4
2232Juneprod5
4545Juneprod1
4545Juneprod2
4545Juneprod3111
4545Juneprod4
4545Juneprod5111

And then a summary data set which gives the totals/counts  of the product combinations

Startup_monthPoductsPurchase_Start_up_countprod1_dtprod2_dtprod3_dtprod4_dtprod5_dt
Juneprod111
Juneprod211
Juneprod3212
Juneprod4
Juneprod5111

I'm having trouble visualizing how to program for this. I'm been using different combinations of proc transpose and proc freq but  I can't come close to the desired output.  Any help is greatly appreciated.

Super User
Posts: 5,518

Re: Restructuring a Data Set for Cross Tabulation

Posted in reply to RobertNYC

Since you mention that you already tried PROC TRANSPOSE, I'll give you the steps without a tremendous amount of detail ...

1. Remove any records with a missing purchase date.

2. For each customer, find the earliest date.

3. Create two SAS data sets.  One holds all purchases on the earliest date (and includes the calculated variable Startup_month).  The other holds all later purchases.

4. Transpose PROD_TYPE in each data set, using different prefixes (such as "first_" and "later_").

5. Merge by customer.

6. Take a look at that point ... and make decisions about how to handle multiple purchases of the same product.  Array processing should be fairly straightforward but will have to pay attention to missing values caused by transposing.

Good luck.

Frequent Contributor
Posts: 101

Re: Restructuring a Data Set for Cross Tabulation

Posted in reply to Astounding

Thank you astounding. I'm a little confused as to what to transpose. Do I transpose both the first and later data sets or just the later data set?? Below is what I have so far -  any additional feedback will be greatly appreciated.  Thanks!!!

/* Remove any records with a missing purchase date.*/

/* Identify earliest purchase date */

/*The earliest purchase date is the same as the  signup date */

proc sql;

create table have1 as

select

ID,

Signup_dt,

Prod_type,

Purchase_Date

case when Signup_dt=Purchase_Date

then 1 else 0 end as Frist

from have where Purchase_Date ne . ;

quit;

/*create data set with just earliest Purchase_Date .*/

proc sql;

create table earliest as

select

ID,

month(Signup_dt) as Startup_month,

Prod_type,

from have2

where Frist=1;

quit;

/*create data set with  all later purchases .*/

proc sql;

create table later  as

select

ID,

Prod_type,

Frist

from have2

where Frist=0;

quit;

proc sort data=later;

by Prod_type ID ;

run;

proc transpose data=later out=later2;

by Product_type ;

run;

Super User
Posts: 5,518

Re: Restructuring a Data Set for Cross Tabulation

Posted in reply to RobertNYC

Robert,

Both data sets get transposed:

BY ID;

VAR PROD_TYPE;

The "first" data set should keep the month around, probably adding an ID statement to do that.

Both transposed data sets should have one observation per ID.

When transposing, you will need to add the PREFIX= option so the new variables in the transposed "first" data set are named something like "First_1", "First_2", "First_3", etc.  And the new variables in the transposed "later" data set should be named "Later_1", "Later_2", "Later_3", etc.

It will always help to test a PROC TRANSPOSE and examine the results, then tweak the program a little (such as adding a PREFIX= option) and repeat.


Frequent Contributor
Posts: 101

Re: Restructuring a Data Set for Cross Tabulation

Posted in reply to Astounding

Ok... I'm getting there...I'm not sure what to do after the merge??

proc sort data=earliest;

BY  ID Startup_month;

run;

proc transpose data=earliest out=first PREFIX=first;

BY ID Startup_month;

VAR Product_type

;run;

proc sort data=later;

BY  ID ;

run;

proc transpose data=later out=later2 PREFIX=later;

BY ID;

VAR Product_type

;run;

Proc sort data= later2;

by id;

Proc sort data= earliest;

by id; run;

data both;

   merge first later2;

   by id;

run;

Super User
Posts: 5,518

Re: Restructuring a Data Set for Cross Tabulation

Posted in reply to RobertNYC

Definitely on its way ...

Look at the transposed data sets.  You will see the maximum number of "first" products you have for any ID, and the maximum number of "later" products for any ID.  Let's say it's 5 and 9.  You will need to add to your final DATA step after the MERGE statement.  Use arrays for find all the pairs.

array firsts {5} first1-first5;

array laters {9} later1-later5;

array flags {9} flag1-flag9;

do i=1 to 5;

   if firsts{i} > ' '  then do j=1 to 9;

        if later{j} = 'prod1' then flag1=1;

        else if later{j} = 'prod2' then flag2=1;

        ...

        else if later{j} = 'prod9' then flag9=1;

        if j=9 then output;

  end;

end;

This might give you the middle result you need, or it might just be close ... you will have to take a look and decide.

Good luck.

Frequent Contributor
Posts: 101

Re: Restructuring a Data Set for Cross Tabulation

Posted in reply to Astounding

Thanks again astounding. I used your array example and yes I'm getting close to the output .  Also, I tried to program it a different way as well (below)  and I think  I'm getting the correct output.

For your example, I still need to condense all the "first" variables into one and transform  the "later" variables -  I could not figure out how to do that.

data want;

set have;

array first {11} first1-first11;

array later {11} later1-later11;

array flags {16} flag1-flag16;

do i=1 to 11;

if first{i} > ' '  then do j=1 to 11;

if later{j} = 'Prod1' then flag1=1;

        else if later{j} = 'Prod2' then flag2=1;

        else if later{j} = 'Prod3' then flag3=1;

        else if later{j} = 'Prod4' then flag4=1;

        else if later{j} = 'Prod5' then flag5=1;

        else if later{j} = 'Prod6' then flag6=1;

        else if later{j} = 'prod7' then flag7=1;

        else if later{j} = 'prod8' then flag8=1;

  else if later{j} = 'prod9' then flag9=1;

        else if later{j} = 'prod10' then flag10=1;

  else if later{j} = 'prod11' then flag11=1;

        else if later{j} = 'prod12' then flag12=1;

  else if later{j} = 'prod13' then flag12=1;

        else if later{j} = 'prod14' then flag13=1;

     else if later{j} = 'prod15' then flag14=1;

        else if later{j} = 'prod16' then flag15=1;

  else if later{j} = 'prod17' then flag14=1;

        else if later{j} = 'prod18' then flag15=1;

  else if later{j} = 'prod19' then flag16=1;

        output;

  end;run;

I seem to be getting the output I'm looking for (before summarizing the results)  just by using the program below

/* Create month variable and check if purchased at signup */

data want1;

    set have;

    startup_month = strip(put(Signup_dt, monname9.));

    if Signup_dt = Purchase_Date then Purchase_at_Start_Up = 1;

    drop Purchase_Date Signup_dt;

run;

/* Create dummy flag for transpose*/

data have1;

    set have;

    if Purchase_Date then flag = 1;

run;

/* Transpose to get product variables by ID */

proc transpose data = have1 out = want2 (drop =  _NAME_); 

    by ID;

    id Prod_type;

    var flag;

run;

/* Combine the two */

data want;

    merge want1 want2;

    by ID;

run;

proc sort data=want;

by id;

run;

Frequent Contributor
Posts: 101

Re: Restructuring a Data Set for Cross Tabulation

Posted in reply to RobertNYC

Well, actually. When I sum my results I don't get the totals of the combinations of products just  totals by month for each product. So, I've got to keep on with this.

Ask a Question
Discussion stats
  • 7 replies
  • 349 views
  • 3 likes
  • 2 in conversation