BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Haemoglobin17
Obsidian | Level 7

Hi everyone!

I have a table with multiple observations for customer, each with different date and product variables, like this:

 

Picture 1

A.PNG

 

I would like to group and concatenate Products if the date is within a range of 20 days from the previous date for the same customer, like this:

 

Picture2

B.PNG

 

For this end, I was thinking to create a second date variable which is equal but disarrayed of one row (like the picture below) and then to apply the intck function

 

Picture3

C.PNG

 

The problem is that I don't know how to arrive from picture 1 to picture 3. How would you solve this problem? Would you do differently? 

Thank you very much 😃

1 ACCEPTED SOLUTION

Accepted Solutions
PaigeMiller
Diamond | Level 26

Hello @Haemoglobin17 

I said combining the products as you are trying to do is unnecessary. I will explain in a minute. However, it would be best in the future if you explain the desired result, rather than specifying the method to get there. What we have here is the XY Problem, where you specify a method to get there which is inefficient and hard to do and this leads us down a path that is inefficient and frustrating.

 

If we ignore your request to combine the product codes on one line, we can get a solution pretty easily. (This solution ignores cases where you have a sequence like AABA, I don't know if it works in that case, but it probably can be easily modified).

 

Here is my code, and fake data (which you need to learn how to create, so now you have an example) that determines how many different products are present within a 20 day time period, and from there I think you can compute whatever you want. Note: I do not try to combine the products as ABAA or anything like that.

 

data fakedata;
    input customer $ date :ddmmyy10. product $;
    cards;
111 01/01/2020 a
111 06/01/2020 b
111 15/06/2020 a
111 18/06/2020 a
111 30/06/2020 b
111 10/07/2020 a
222 18/03/2020 c
222 25/03/2020 a
;

data want;
    set fakedata;
    by customer;
    retain start_of_20_days;
    if first.customer then do;
        seq_in_group=0;
        group+1;
        start_of_20_days=date;
        delta=.;
    end;
    else delta=date-start_of_20_days;
    if delta>20 then do;
        seq_in_group=0;
        group+1;
        start_of_20_days=date;
    end;
    seq_in_group+1;
run;
proc sql;
    create table counts as select distinct customer,group,start_of_20_days, count(distinct product) as count from want
        group by group;
quit;
proc sql;
    create table want_plus_counts as select a.*,b.count
    from want as a left join counts as b
    on a.group=b.group;
quit;

 

 

I also call your attention to Maxim 19 ... Long Beats Wide ... this long data set is usually easier to work with than a wide variable containing all the products. 

--
Paige Miller

View solution in original post

11 REPLIES 11
PaigeMiller
Diamond | Level 26

Are the products always a single letter? How many rows/products could get combined this way? What if there are two rows for a customer that both show product A, should it show AA or just A? The method shown in your Picture 3 fails if there are more than two products in a time interval.


What happens if a purchase of product A is on Jan 1, 2022 and product B is on Jan 15, 2022 (within 20 days of purchase of product A) and product C is purchased on Jan 30, 2022 (within 20 days of purchase of product B but not within 20 days of purchase of product A)?

 

Please provide sample data as working SAS data step code. (Instructions)

--
Paige Miller
Haemoglobin17
Obsidian | Level 7

Hi and thank you all for answering!

@PaigeMiller , I have no problem in describing what I am doing 🙂.  The images that I created are fictitious excel tables just to give an idea of my problem. I am working on health prescription data and I would like to  piece prescription cycles from the single prescription drugs that compose a treatment cycle. The customers are patients, the dates are prescription dates and the product are  prescription codes (five digits alphanumeric codes). I want to group all the prescriptions within 20 days because, according to what I observed, it is very likely that they are part of the same treatment cycle.

It is not important the order, AB is equivalent to BA as well as is not a problem to have a repetition of the same code, like AAA or BBB. If there are three B in a row, I don't want B, but BBB. If B is within 20 days from A, I would like to obtain AB, but If after, C is from 20 days from B, I don't want ABC or AB and BC, I just want that all that it is within 20 days is grouped together, and then, the day after it starts another cycle of 20 days. Unfortunately I can't share data for privacy reasons, but I hope I have answered to your questions 😉

 

@ballardw it is not possible that there are several products for each customer every day because I already grouped for the same date, but you can have two or three products within a couple of weeks.

PaigeMiller
Diamond | Level 26

@Haemoglobin17 wrote:

Hi and thank you all for answering!

@PaigeMiller , I have no problem in describing what I am doing 🙂.  The images that I created are fictitious excel tables just to give an idea of my problem. I am working on health prescription data and I would like to  piece prescription cycles from the single prescription drugs that compose a treatment cycle. The customers are patients, the dates are prescription dates and the product are  prescription codes (five digits alphanumeric codes). I want to group all the prescriptions within 20 days because, according to what I observed, it is very likely that they are part of the same treatment cycle.

It is not important the order, AB is equivalent to BA as well as is not a problem to have a repetition of the same code, like AAA or BBB. If there are three B in a row, I don't want B, but BBB. If B is within 20 days from A, I would like to obtain AB, but If after, C is from 20 days from B, I don't want ABC or AB and BC, I just want that all that it is within 20 days is grouped together, and then, the day after it starts another cycle of 20 days. Unfortunately I can't share data for privacy reasons, but I hope I have answered to your questions 😉

This doesn't really answer my questions. What will you do with the ABB or AC or whatever once you have it? What analysis? What reports? What additional computations? Here is what I actually said: "People sometimes are reluctant to answer questions about how they are going to use it after the data set gets created. So @Haemoglobin17 , don't ignore this question ... what will you do with this data set after it gets created? (Don't just say "analysis", we need specifics, there are likely better ways to do this)." @ballardw said something similar, and I would like an answer before proceeding.

 

As far as not sharing the data for privacy reasons, that's fine, you have shown us made up data, that's also fine, but we need the data as working SAS data step code (I gave instructions) and not as a screen capture.

 

What about the question about the real data (not the made up data) — are the products always identified by a single character? It makes a difference as to what code we create; as code we create for your made up data with products identified by a single character might not be useful if the real product name is "Ora Funky Cat" (which is a real product, apparently).

--
Paige Miller
Haemoglobin17
Obsidian | Level 7

When I will have grouped drugs in treatment cycles I want to flag them into cycle 1 or 2; I will be able to do it because cycle 1 is composed by one prescription, while cycle 2 is composed by multiple prescriptions. After that I will just track for this two different kind of treatment cycles prescription trends in the last years and across different geographical regions. This is the end goal of my analysis.

With regard to the value that the prescription can have, it is an alphanumeric with a format like for example "J01AA02".

Lastly, I read the guide about the working SAS data step code and I find it very difficult, unfortunately I am not able to do it, sorry.

PaigeMiller
Diamond | Level 26

@Haemoglobin17 wrote:

Lastly, I read the guide about the working SAS data step code and I find it very difficult, unfortunately I am not able to do it, sorry.


Then ask questions about the part you don't understand.

 

When I will have grouped drugs in treatment cycles I want to flag them into cycle 1 or 2; I will be able to do it because cycle 1 is composed by one prescription, while cycle 2 is composed by multiple prescriptions. After that I will just track for this two different kind of treatment cycles prescription trends in the last years and across different geographical regions.

 

I don't think combining the prescriptions into AB if there is A followed by B is necessary or helpful. I believe if the above is what you want to do, it can be done without doing this kind of combination.

--
Paige Miller
PaigeMiller
Diamond | Level 26

Hello @Haemoglobin17 

I said combining the products as you are trying to do is unnecessary. I will explain in a minute. However, it would be best in the future if you explain the desired result, rather than specifying the method to get there. What we have here is the XY Problem, where you specify a method to get there which is inefficient and hard to do and this leads us down a path that is inefficient and frustrating.

 

If we ignore your request to combine the product codes on one line, we can get a solution pretty easily. (This solution ignores cases where you have a sequence like AABA, I don't know if it works in that case, but it probably can be easily modified).

 

Here is my code, and fake data (which you need to learn how to create, so now you have an example) that determines how many different products are present within a 20 day time period, and from there I think you can compute whatever you want. Note: I do not try to combine the products as ABAA or anything like that.

 

data fakedata;
    input customer $ date :ddmmyy10. product $;
    cards;
111 01/01/2020 a
111 06/01/2020 b
111 15/06/2020 a
111 18/06/2020 a
111 30/06/2020 b
111 10/07/2020 a
222 18/03/2020 c
222 25/03/2020 a
;

data want;
    set fakedata;
    by customer;
    retain start_of_20_days;
    if first.customer then do;
        seq_in_group=0;
        group+1;
        start_of_20_days=date;
        delta=.;
    end;
    else delta=date-start_of_20_days;
    if delta>20 then do;
        seq_in_group=0;
        group+1;
        start_of_20_days=date;
    end;
    seq_in_group+1;
run;
proc sql;
    create table counts as select distinct customer,group,start_of_20_days, count(distinct product) as count from want
        group by group;
quit;
proc sql;
    create table want_plus_counts as select a.*,b.count
    from want as a left join counts as b
    on a.group=b.group;
quit;

 

 

I also call your attention to Maxim 19 ... Long Beats Wide ... this long data set is usually easier to work with than a wide variable containing all the products. 

--
Paige Miller
Haemoglobin17
Obsidian | Level 7

Hi @PaigeMiller ,

Thank you very much, your brilliant code works perfectly well!! 😀

I want very much to learn the working SAS data step code, I will study better the guide and in case I will ask you questions, Thanks again Paige!!

ballardw
Super User

@Haemoglobin17 wrote:

Hi and thank you all for answering!

@PaigeMiller , I have no problem in describing what I am doing 🙂.  The images that I created are fictitious excel tables just to give an idea of my problem. I am working on health prescription data and I would like to  piece prescription cycles from the single prescription drugs that compose a treatment cycle. The customers are patients, the dates are prescription dates and the product are  prescription codes (five digits alphanumeric codes). I want to group all the prescriptions within 20 days because, according to what I observed, it is very likely that they are part of the same treatment cycle.

It is not important the order, AB is equivalent to BA as well as is not a problem to have a repetition of the same code, like AAA or BBB. If there are three B in a row, I don't want B, but BBB. If B is within 20 days from A, I would like to obtain AB, but If after, C is from 20 days from B, I don't want ABC or AB and BC, I just want that all that it is within 20 days is grouped together, and then, the day after it starts another cycle of 20 days. Unfortunately I can't share data for privacy reasons, but I hope I have answered to your questions 😉

 

@ballardw it is not possible that there are several products for each customer every day because I already grouped for the same date, but you can have two or three products within a couple of weeks.


Prescriptions tells me that you have values based on NDC codes for medications perhaps? How sure are you about the "five digits alphanumeric codes". NDC's come in several flavors. From one simple description: https://www.drugs.com/ndc.html

How is the NDC formatted?

The 10-digit NDC will be in one of the following configurations: 4-4-2, 5-3-2, or 5-4-1, meaning that there are 4 or 5 digits for the labeler code, 4 or 3 digits for the product code and 2 or 1 digit(s) for the package code.

IF you have NDC's 5 digits would be the Labeler, which is a manufacturer and has nothing to do with actual treatment (generally) but the second group of 4 or 3 digits is the actual drug.

 

I would be extremely leery of just smooshing all the digits together for those codes and strongly recommend inserting a separation character .

 

HOW you use the combined value is likely going to be very critical if you have AB and BA combinations they are not the same value for any usage. So you need some extra layers of complexity to ensure that you always get AB for any given combination.

 

I would also be a tad concerned about " I already grouped for the same date". I know I have had multiple medications provided on the same date so are your A B C individual NDCs or some made up result of "grouping" or something else???

 

 

Haemoglobin17
Obsidian | Level 7

I'm not using the NDC system, but the ATC classification;  can use 5 digits to identify a family of drugs or a 7 digits to identify the single drug, like for example "J01AA02".  As I wrote to @PaigeMiller , this is because I want to know whether the cycles are composed by one drug or more. When I cat these code I connect them with a - as a separation character in the between one code and the other. Because I just want to class cycles in those with one drug and those with several drugs, it is indifferent whether I have AB or BA. I hope I have answered to your questions.

ballardw
Super User

I agree with @PaigeMiller that your definition of "within" is lacking some details.

 

Since your are using "Customer" then what happens when your customer has a transaction every day? And have way more than two or three products within a 20 day window?

Do you need to know the order of products? If customer 111 has an AB and customer 999 has a BA combination are they to be treated the same? Are any of the products supposed to duplicate if used multiple times in the same time period? Example if your custome 111 had a Product A on 07/01/2020 (the day after the 06/01/2020), would the result be ABA? AAB? AB? or something else?

 

 

What is the actual purpose of the resulting data set? How will it be used? Things like placing multiple "products" into a single variable quite often lead to much more complexity later for little gain.

PaigeMiller
Diamond | Level 26

@ballardw wrote:

 

What is the actual purpose of the resulting data set? How will it be used? Things like placing multiple "products" into a single variable quite often lead to much more complexity later for little gain.


Yeah, I wish I had said that. People sometimes are reluctant to answer questions about how they are going to use it after the data set gets created, resulting in (often) failure to come up with a better way to do things. So @Haemoglobin17 , don't ignore this question ... what will you do with this data set after it gets created? (Don't just say "analysis", we need specifics, there are likely better ways to do this).

--
Paige Miller

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 11 replies
  • 1878 views
  • 3 likes
  • 3 in conversation