DATA Step, Macro, Functions and more

re: Data Selection

Accepted Solution Solved
Reply
Regular Contributor
Posts: 222
Accepted Solution

re: Data Selection

Hi....I am trying to determine whether or not a client is a new customer or repeat customer based on whether the customer had purchased at least one (1) product on their current purchsed from a previous purchase. The data set looks like this:

Client_ID Product_Number Date

37543 00653 20130403

37543 00345 20130403

37543 00765 20130403

37543 00653 20130812

37543 00345 20130812

37543 00765 20130812

37543 00831 20130812

37543 00765 20131222

37543 00949 20131222

And I was hoping to have dataset as:

Client_ID Date Type

37543 20130403 New

37543 20130812 Repeat

37543 20131222 Repeat

Any suggestions would bee greatly appreciated. Thanks


Accepted Solutions
Solution
‎08-17-2015 11:09 AM
Valued Guide
Posts: 765

Re: re: Data Selection

Hi.  Try this (there is probably a hash approach with a little less code, but I understand this).  I added a second client with slightly different purchases to show how the code behaves when you get a new client (I think it works) ...

data x;

input client product date : yymmdd.;

format date mmddyy10.;

datalines;

37543 00653 20130403

37543 00345 20130403

37543 00765 20130403

37543 00653 20130812

37543 00345 20130812

37543 00765 20130812

37543 00831 20130812

37543 00765 20131222

37543 00949 20131222

97543 00653 20130403

97543 00345 20130812

97543 00765 20130812

97543 00831 20130812

97543 00765 20131222

97543 00949 20131222

;

data y (keep=client date type);

length type $6;

* keep track of products bought within a client;

array bought(30) _temporary_ (30*0);

do until (last.client);

* assume a NEW type;

   type = 'new';

do until (last.date);

  set x;

  by client date;

* check if product within each date is in the array (yes/change type to repeat, no/add to array);

  if product in bought then type = 'repeat';

  else do; j+1; bought(j) = product; end;

end;

output;

end;

* clear array and counter for each new client;

do i = 1 to 30; bought(i) = 0; end;

j=0;

run;

DATA SET Y ...

Obs    type      client          date

1     new        37543    04/03/2013

2     repeat     37543    08/12/2013

3     repeat     37543    12/22/2013

4     new        97543    04/03/2013

5     new        97543    08/12/2013

6     repeat     97543    12/22/2013

View solution in original post


All Replies
Contributor ndp
Contributor
Posts: 61

Re: re: Data Selection

proc sort data=... nodupkey;

     by Client_ID Date;

run;

data ...;

     set ...;

     by Client_ID Date;

    length type $6;

     if first.client_id then type="New";

     else type="Repeat";

run;


Super User
Super User
Posts: 7,401

Re: re: Data Selection

You may want to just add an output so the final dataset only has one record per group:

data ...;

     set ...;

     by Client_ID Date;

    length type $6;

     if first.client_id then type="New";

     else type="Repeat";

     if last.client_id then output;

run;

Regular Contributor
Posts: 222

Re: re: Data Selection

Hi...Thanks for the quick response. I should have mentioned that it is possible that the client may not have purchased a product that they had previous purchased and in this case they would be considered "New" even though they made purchases before. That is why I need to check to see if the client had purchased any of the items in their previous purchases. This is the part that I am having difficulty with. Thanks

Contributor ndp
Contributor
Posts: 61

Re: re: Data Selection

I am not sure i follow your algorithm please provide appropriate example.

Solution
‎08-17-2015 11:09 AM
Valued Guide
Posts: 765

Re: re: Data Selection

Hi.  Try this (there is probably a hash approach with a little less code, but I understand this).  I added a second client with slightly different purchases to show how the code behaves when you get a new client (I think it works) ...

data x;

input client product date : yymmdd.;

format date mmddyy10.;

datalines;

37543 00653 20130403

37543 00345 20130403

37543 00765 20130403

37543 00653 20130812

37543 00345 20130812

37543 00765 20130812

37543 00831 20130812

37543 00765 20131222

37543 00949 20131222

97543 00653 20130403

97543 00345 20130812

97543 00765 20130812

97543 00831 20130812

97543 00765 20131222

97543 00949 20131222

;

data y (keep=client date type);

length type $6;

* keep track of products bought within a client;

array bought(30) _temporary_ (30*0);

do until (last.client);

* assume a NEW type;

   type = 'new';

do until (last.date);

  set x;

  by client date;

* check if product within each date is in the array (yes/change type to repeat, no/add to array);

  if product in bought then type = 'repeat';

  else do; j+1; bought(j) = product; end;

end;

output;

end;

* clear array and counter for each new client;

do i = 1 to 30; bought(i) = 0; end;

j=0;

run;

DATA SET Y ...

Obs    type      client          date

1     new        37543    04/03/2013

2     repeat     37543    08/12/2013

3     repeat     37543    12/22/2013

4     new        97543    04/03/2013

5     new        97543    08/12/2013

6     repeat     97543    12/22/2013

Regular Contributor
Posts: 222

Re: re: Data Selection

Thanks MikeZdeb....it works perfect...exactly what I needed....thanks to everyone who made suggestions as well.

Valued Guide
Posts: 765

Re: re: Data Selection

Hi ... another thought that eliminates the array (since you can concatenate numeric variables) ...

data y (keep=client date type);

length type $6 bought $100;

do until (last.client);

  type = 'new';

do until (last.date);

  set x;

  by client date;

* CAT(PRODUCT) gets rid of warning message in the LOG about data type conversion;

  if findw(bought,cat(product)) then type = 'repeat';

  else bought = catx('/',bought,product);

end;

output;

end;

call missing (bought);

run;

Regular Contributor
Posts: 222

Re: re: Data Selection

Hi MikeZdeb...thanks for taking the time to respond to my request. I am getting an eror message (

 

ERROR: Array subscript out of range at line 187 column 4.)....From the log, it seems there is a problem with " bought(j) = product" statement. Any suggestion why this is happening?

Valued Guide
Posts: 765

Re: re: Data Selection

Hi.   I wrote the code assuming 30 products.  Easy to change, find ...

array bought(30) _temporary_ (30*0);

change to ...

array bought(100) _temporary_ (30*0);

then change ...

do i = 1 to 30; bought(i) = 0; end;

to

do i = 1 to 100; bought(i) = 0; end;

If you get that message again, make the array (and loop) larger.

Regular Contributor
Posts: 222

Re: re: Data Selection

Hi MikeZdeb.....Thanks for your suggestions. I made the changes as you had suggested and your suggestion with the array and loop worked perfectly. The suggestion without the array after making the corrections as suggested, seemed to give slightly different results when I compared the results from both approaches.

Valued Guide
Posts: 765

Re: re: Data Selection

Hi.  Send me a bit of data that produced the different results and I'll figure it out (msz03@albany.edu).

Valued Guide
Posts: 765

Re: re: Data Selection

ps  If you use the other solution I posted that has no array, you could change this line ...

length type $6 bought $100;

to ...

length type $6 bought $1000;

to accommodate more product numbers in the variable BOUGHT.

Super User
Posts: 9,681

Re: re: Data Selection

Code: Program

data x;
input client product date : yymmdd.;
format date mmddyy10.;
datalines;
37543 00653 20130403
37543 00345 20130403
37543 00765 20130403
37543 00653 20130812
37543 00345 20130812
37543 00765 20130812
37543 00831 20130812
37543 00765 20131222
37543 00949 20131222
97543 00653 20130403
97543 00345 20130812
97543 00765 20130812
97543 00831 20130812
97543 00765 20131222
97543 00949 20131222
;
run;
data want;
if _n_ eq 1 then do;
if 0 then set x;
declare hash ha();
ha.definekey('product');
ha.definedone();
end;
set x;
by client date ;
retain type 'new ';
if first.client then ha.clear();
if first.date then type='new';
if ha.check()=0 then type='repeat';
ha.replace();
if last.date;
drop product ;
run;
Super User
Posts: 9,681

Re: re: Data Selection

A problem is if there are missing value in PRODUCT, what you gotta do ?

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 15 replies
  • 431 views
  • 6 likes
  • 5 in conversation