Solved: Restructure Dataset

sasmaverick · Posted 01-04-2018 09:57 AM

I have the below dataset at a customer level-

data have;
infile datalines;
input CUSTOMER_ID : $1. Channel_R : $15. Channel_O : $15. Channel_W : $15. Brand_Brand1 : $15. Brand_Brand2 : $15.;
datalines;
1 Missing $0.01-$25 $100.01-$150 $150.01-$200 $200+
2 <$0 $25.01-$50 Missing $100.01-$150 $50.01-$75
3 Missing $0.01-$25 $100.01-$150 $150.01-$200 $200+
4 <$0 $25.01-$50 Missing $100.01-$150 $50.01-$75
5 Missing $0.01-$25 $100.01-$150 $150.01-$200 $200+
6 <$0 $25.01-$50 Missing $100.01-$150 $50.01-$75
7 $25.01-$50 $0.01-$25 $100.01-$150 $150.01-$200 $200+
8 <$0 $25.01-$50 $25.01-$50 $100.01-$150 $50.01-$75
9 Missing $0.01-$25 $100.01-$150 $150.01-$200 $200+
10 <$0 $25.01-$50 Missing $100.01-$150 $50.01-$75
;
run;

I need the output as follows-

data want;
infile datalines;
input Amount : $15. Customer_Count : $5. Brand : $15. Channel : $10.;
datalines;
Missing 100 Brand1 W
<$0 200 Brand1 W
$.01-$25 300 Brand1 W
$25.01-$50 400 Brand1 W
$50.01-$75 500 Brand1 W
$75.01-$100 600 Brand1 W
$100.01-$150 700 Brand1 W
$150.01-$200 800 Brand1 W
$200+ 900 Brand1 W
Missing 100 Brand2 R
<$0 200 Brand2 R
$0.01-$25 300 Brand2 R
$25.01-$50 400 Brand2 R
$50.01-$75 500 Brand2 R
$75.01-$100 600 Brand2 R
$100.01-$150 700 Brand2 R
$150.01-$200 800 Brand2 R
$200+ 900 Brand2 R
;
run;

I tried using PROC Transpose but since the dataset is very large, it takes a huge amount of time..

RW9 · Posted 01-04-2018 10:38 AM

Sorry, only meant the Have dataset as a datastep. Anyways, this goes someway to what you want, not sure how you get the customer_count, as it doesn't match the test data you provide, maybe clarify.

data have;
  infile datalines;
  input CUSTOMER_ID : $1. Channel_R : $15. Channel_O : $15. Channel_W : $15. Brand_Brand1 : $15. Brand_Brand2 : $15.;
datalines; 
1 Missing $0.01-$25 $100.01-$150 $150.01-$200 $200+
2 <$0 $25.01-$50 Missing $100.01-$150 $50.01-$75
3 Missing $0.01-$25 $100.01-$150 $150.01-$200 $200+
4 <$0 $25.01-$50 Missing $100.01-$150 $50.01-$75
5 Missing $0.01-$25 $100.01-$150 $150.01-$200 $200+
6 <$0 $25.01-$50 Missing $100.01-$150 $50.01-$75
7 $25.01-$50 $0.01-$25 $100.01-$150 $150.01-$200 $200+
8 <$0 $25.01-$50 $25.01-$50 $100.01-$150 $50.01-$75
9 Missing $0.01-$25 $100.01-$150 $150.01-$200 $200+
10 <$0 $25.01-$50 Missing $100.01-$150 $50.01-$75
;
run;

data inter (keep=customer_id amount channel brand);
  set have;
  length amount channel brand $200;
  array channels{3} channel_r--channel_w;
  array brands{2} brand_brand1 brand_brand2;
  do j=1 to 3;
    amount=channels{j};
    channel=vname(channels{j});
    do i=1 to 2;
      brand=vname(brands{i});
      output;
    end;
  end;
run;

proc sql;
  create table WANT as 
  select  AMOUNT,
          count(distinct CUSTOMER_ID) as CUSTOMER_COUNT,
          CHANNEL,
          BRAND
  from    INTER
  group by AMOUNT,
           CHANNEL,
           BRAND;
quit;

View solution in original post

RW9 · Posted 01-04-2018 10:01 AM

Post test data in the form of a datastep!!

As such I am not typing that in. If you have a large amount of data, it will take a large amount of time, irrespective of what code you use. Although from your output want data I don't see how transpose could get your there unless you have multiple procs. A datastep, which loops of an array of the variables and outputs a row for each item, then left join a count onto that, may be a bit quicker.

sasmaverick · Posted 01-04-2018 10:25 AM

@RW9: Sorry. Below my restructured question. It's a bit challenging for me hence I was looking for a proper code:

data have;
infile datalines;
input CUSTOMER_ID : $1. Channel_R : $15. Channel_O : $15. Channel_W : $15. Brand_Brand1 : $15. Brand_Brand2 : $15.;
datalines;
1 Missing $0.01-$25 $100.01-$150 $150.01-$200 $200+
2 <$0 $25.01-$50 Missing $100.01-$150 $50.01-$75
3 Missing $0.01-$25 $100.01-$150 $150.01-$200 $200+
4 <$0 $25.01-$50 Missing $100.01-$150 $50.01-$75
5 Missing $0.01-$25 $100.01-$150 $150.01-$200 $200+
6 <$0 $25.01-$50 Missing $100.01-$150 $50.01-$75
7 $25.01-$50 $0.01-$25 $100.01-$150 $150.01-$200 $200+
8 <$0 $25.01-$50 $25.01-$50 $100.01-$150 $50.01-$75
9 Missing $0.01-$25 $100.01-$150 $150.01-$200 $200+
10 <$0 $25.01-$50 Missing $100.01-$150 $50.01-$75
;
run;

data want;
infile datalines;
input Amount : $15. Customer_Count : $5. Brand : $15. Channel : $10.;
datalines;
Missing 100 Brand1 W
<$0 200 Brand1 W
$.01-$25 300 Brand1 W
$25.01-$50 400 Brand1 W
$50.01-$75 500 Brand1 W
$75.01-$100 600 Brand1 W
$100.01-$150 700 Brand1 W
$150.01-$200 800 Brand1 W
$200+ 900 Brand1 W
Missing 100 Brand2 R
<$0 200 Brand2 R
$0.01-$25 300 Brand2 R
$25.01-$50 400 Brand2 R
$50.01-$75 500 Brand2 R
$75.01-$100 600 Brand2 R
$100.01-$150 700 Brand2 R
$150.01-$200 800 Brand2 R
$200+ 900 Brand2 R
;
run;

RW9 · Posted 01-04-2018 10:38 AM

Sorry, only meant the Have dataset as a datastep. Anyways, this goes someway to what you want, not sure how you get the customer_count, as it doesn't match the test data you provide, maybe clarify.

data have;
  infile datalines;
  input CUSTOMER_ID : $1. Channel_R : $15. Channel_O : $15. Channel_W : $15. Brand_Brand1 : $15. Brand_Brand2 : $15.;
datalines; 
1 Missing $0.01-$25 $100.01-$150 $150.01-$200 $200+
2 <$0 $25.01-$50 Missing $100.01-$150 $50.01-$75
3 Missing $0.01-$25 $100.01-$150 $150.01-$200 $200+
4 <$0 $25.01-$50 Missing $100.01-$150 $50.01-$75
5 Missing $0.01-$25 $100.01-$150 $150.01-$200 $200+
6 <$0 $25.01-$50 Missing $100.01-$150 $50.01-$75
7 $25.01-$50 $0.01-$25 $100.01-$150 $150.01-$200 $200+
8 <$0 $25.01-$50 $25.01-$50 $100.01-$150 $50.01-$75
9 Missing $0.01-$25 $100.01-$150 $150.01-$200 $200+
10 <$0 $25.01-$50 Missing $100.01-$150 $50.01-$75
;
run;

data inter (keep=customer_id amount channel brand);
  set have;
  length amount channel brand $200;
  array channels{3} channel_r--channel_w;
  array brands{2} brand_brand1 brand_brand2;
  do j=1 to 3;
    amount=channels{j};
    channel=vname(channels{j});
    do i=1 to 2;
      brand=vname(brands{i});
      output;
    end;
  end;
run;

proc sql;
  create table WANT as 
  select  AMOUNT,
          count(distinct CUSTOMER_ID) as CUSTOMER_COUNT,
          CHANNEL,
          BRAND
  from    INTER
  group by AMOUNT,
           CHANNEL,
           BRAND;
quit;

sasmaverick · Posted 01-04-2018 11:10 AM

Hi @RW9 Thanks for the effort. I think this works. The customer count in output data shown was just for reference and not exact numbers that we would get. What I am trying to achieve is to get the count of customers unique across an amount-channel-brand group. and I think your solutions addresses that.

Regards,

ballardw · Posted 01-04-2018 10:34 AM

Please describe what (and how as a rule) to get the "amount" for:

Channel_R and Brand_Brand1

Channel_R and Brand_Brand2

Channel_O and Brand_Brand1

Channel_O and Brand_Brand2

Channel_W and Brand_Brand1

Channel_W and Brand_Brand2

from the data in your first row.

I do NOT see any obvious way to accomplish that.

I thought it might be possibly to generate a count data set first to count customers and transpose that set but the issue is the intersection of "channel" and "brand' to generate a single amount is not intuitively obvious to me.

Also, are the values shown actually text or the display of a numeric value using a format?

Restructure Dataset

Re: Restructure Dataset

Re: Restructure Dataset

Re: Restructure Dataset

Re: Restructure Dataset

Re: Restructure Dataset

Re: Restructure Dataset

SAS Innovate 2026 Registration is Open

SAS Training: Just a Click Away