## How to streamline this data generating process?

Solved
Occasional Contributor
Posts: 17

# How to streamline this data generating process?

[ Edited ]

Hello everyone, recently, I have encountered a problem that involves numerous if-else statements. Given the sheer volume of my data, if I were to follow this route, I would have spent days or even weeks keying in if else statements. I am wondering if there is a way to iterate this process through a combinantion of array and macro techniques. An example is shown below:

Let's say there is a data set named "account" that contains 100 observations and 11 variables.

Account_ID M1 M2 M3 M4 M5 M6 M7 M8 M9 M10

A00001       C    M   M   M   M   M   M   M   M  P

A00002       C    P   C    M   M   M   M   M   M  M

A00003       P    P   M    C   M   M   M   M   M  M

.

.

.

A00100      C    M   C    M   M   M   M   M   M  M

What I tried to do is illustated in my codes below:

Data Account_Final;

Set Account;

if M1="C" and M2="M" and M3="M" and M4="M" then P1=3;

else P1=0;

if M2="C" and M3="M" and M4="M" and M5="M" then P2=3;

else P2=0;

if M3="C" and M4="M" and M5="M" and M6="M" then P3=3;

else P3=0;

if M4="C" and M5="M" and M6="M" and M7="M" then P4=3;

else P4=0;

.

.

.

if M7="C" and M8="M" and M9="M" and M10="M" then P7=3;

else P7=0;

/***codes below are slightly different from those above***/

if M1="C" and M2="M" and M3="M" and M4="M"  M5="M" then P8=4;

else P8=0;

.

.

.

if M6="C" and M7="M" and M8="M" and M9="M"  M10then P14=4;

else P7=0;

/*** the "if statement" expands by increment of one "and info" until it reaches the last variable "M10" in the data set***/

if M1="C" and M2="M" and M3="M" and M4="M"  and M5="M" and  ... and M10="M" then PX=X;

else PX=0;

run;

As you can see, I have to write a lot of if else statements to accomplish my goal, but if the variable goes from 11 to 100, I am not able to write the lines of codes like I did above. I am wondering if there is another way for me to efficiently accomplish my goal with way fewer codes.

Thanks so much.

Best,

Accepted Solutions
Solution
‎04-10-2016 09:24 PM
Occasional Contributor
Posts: 17

## Re: How to streamline this data generating process?

Hi SASkiwi, sorry about this delayed reply, it worked out perfectly except some missing values, I changed the codes a bit to make it look like exactly what i wanted.Thanks so much for you help. I will copy this idea and try to expand my variables from M1-M10 to M1-M100.

``````data have;
informat account_id \$8. m1-m10 \$1.;
input Account_ID \$ M1-M10 \$;
cards;
A00001 C M   M    M   M   M   M   M   M  P
A00002 C P   C    M   M   M   M   M   M  M
A00003 P P   M    C   M   M   M   M   M  M
;
run;

data want;
set have;
array m(10) m1-m10;
array p(13)  p1-p13 ;
do I = 1 to 7;
text=catt(m(i), m(i+1), m(i+2), m(i+3));
if index(text, 'CMMM') then p(i) = 3;
else p(i)=0;
end;

do I= 1 to 6;
text=catt(m(i), m(i+1), m(i+2), m(i+3), m(i+4));
if index(text, 'CMMMM') then p(i +7) = 4;
else p(i+7)=0;
end;
drop i text;
run;``````

Best,

All Replies
Super User
Posts: 23,776

## Re: How to streamline this data generating process?

What's the logic behind your rules?

Occasional Contributor
Posts: 17

## Re: How to streamline this data generating process?

[ Edited ]

Hi Reeza, thanks for your reply, let's think of M1- M10 as the first month all the way to the 10th month, and C stands for Core, M for Mid and P for Premier.

My logic is to figure out two things: 1, How many accounts moved from Core to Mid, and they also must stay in Mid for at least three consecutive months, that means "CMMM" is the minimum criteria to consider.2, Once it qualifies for consideration, I also want to figure out how many months those qualified accounts have stayed in Mid?Such as "CMMMMM"should be interpreted as 5, because 1. It qualifies for consideration when it moves from Core to Mid. 2. It has stayed in Mid for 5 months.

I hope I explained myself clearly to you, please let me know if you have any questions on it.

Thanks a lot,

Best,

Super User
Posts: 23,776

## Re: How to streamline this data generating process?

You don't need to post the same thing twice

A slight modification to my code. Instead of all the IF's I count the number of consecutive M, after the initial C. If you want, you can delete 3 from the number to account for the initial requirement.

``````data want;
set have;
array m(10) m1-m10;
array p(10) p1-p10 (10*0);
consecutiveM=0;
text=catt(of m1-m10);
x= index(text, 'CMMM') ;
if x>0 then p(x)=1;

if x>0 then do i=x+1 to dim(m);
if char(text, i)='M' then consecutiveM+1;
else leave;
end;

drop x i;
run;``````
Super User
Posts: 23,776

## Re: How to streamline this data generating process?

[ Edited ]

If you can have multiple occurences of 'CMMM' then you need to modify this code to loop through once you first find your X.

Otherwise you can concatenate the results and then search for the string CMMM. The index will be the first position and you can set the P variable to 1.

``````data have;
informat account_id \$8. m1-m10 \$1.;
input Account_ID \$ M1-M10 \$;
cards;
A00001 C M   M    M   M   M   M   M   M  P
A00002 C P   C    M   M   M   M   M   M  M
A00003 P P   M    C   M   M   M   M   M  M
;
run;

data want;
set have;
array m(10) m1-m10;
array p(10) p1-p10 (10*0);
text=catt(of m1-m10);
x= index(text, 'CMMM') ;
if x>0 then p(x)=1;
run;
``````

EDIT: re-reading this I don't think I got your rule, which is why you may want to explain it

Occasional Contributor
Posts: 17

## Re: How to streamline this data generating process?

Hi Reeza, I am very new to this forum, I just want to be respectful to everyone who responded to me. Clearly I see this is not the right way to interact with anyone in this forum, thanks for the heads up.

I ran your codes, it has done 50% of what I wanted, the desired result I would like to see is:

Account_ID P1 P2 P3 P4 P5 P6 P7

A00001        3   0    0   0    0   0   0

A00002        0   0    3   0    0   0   0

A00003        0   0    0   3    0   0   0

After I got these result, and I would just want to figure out the maximum and minimum months each qualified account has stayed by max/min (of p1-p7).

Thanks

Super User
Posts: 3,927

## Re: How to streamline this data generating process?

I've borrowed @Reeza's code to do just the "3" scenario. This can be enhanced to do the other combinations.

``````data have;
informat account_id \$8. m1-m10 \$1.;
input Account_ID \$ M1-M10 \$;
cards;
A00001 C M   M    M   M   M   M   M   M  P
A00002 C P   C    M   M   M   M   M   M  M
A00003 P P   M    C   M   M   M   M   M  M
;
run;

data want;
set have;
array m(10) m1-m10;
array p(10) p1-p10 (10*0);
do I = 1 to 7;
text=catt(m(i), m(i+1), m(i+2), m(i+3);
if index(text, 'CMMM') then p(i) = 3;
end;
run;``````
Occasional Contributor
Posts: 17

## Re: How to streamline this data generating process?

[ Edited ]

Hi SASkiwi, thanks for your reply, I probably didn't explain my thoughts clearly in my codes previously.

In my previous dataset, let's think of M1- M10 as the first month to the10th month, and C stands for Core, M for Mid and P for Premier.

My logic is to figure out two things: 1, How many accounts moved from Core to Mid, and they also must stay in Mid for at least three consecutive months, that means "CMMM" is the minimum criteria to consider. 2, Once it qualifies for consideration, I also want to figure out how many months those qualified accounts have stayed in Mid? Such as "CMMMMM" should be interpreted as 5, becasue 1. It qualifies for consideration when moving from Core to Mid. 2. It has stayed in Mid for 5 months.

I hope I explained myself clearly to you this time, please let me know if you have any questions.

Thanks a lot,

Best,

Super User
Posts: 3,927

## Re: How to streamline this data generating process?

Thanks for the detailed explanation.

Can you try @Reeza and my example code to see if gives the right result for the 3 consecutive months or not? If it does then we can try the second part.

Occasional Contributor
Posts: 17

## Re: How to streamline this data generating process?

[ Edited ]

Hi SASkiwi, I ran your codes, and below are the results I got.

I think I am not getting the resuts as I expected, let's look at one example by focusing on the values for P1 variable. What I expected to see is P1 for first observation is 3 ( becasue its "CMMM" , and 0 for second observation (for it is "CPCM" and 0 for third observation (for it is "PPMC").On the contrary, I see 3 for all three observations.

Thanks.

Note: Sorry, for some reason, I am not able to line the P variables  correctly with its respective value in the table.

P1       P2         P3

 1 A00001 C M M M M M M M M P 3 0 0
 2 A00002 C P C M M M M M M M 3 0 3
 3 A00003 P P M C M M M M M M 3 0 3
Super User
Posts: 23,776

## Re: How to streamline this data generating process?

I didn't bother with the indicator variables just the total count. Based on your description of the problem it seems to make sense.

You could easily store the values into an array for a running count but that doesn't seem very useful.

PROC Star
Posts: 2,371

## Re: How to streamline this data generating process?

[ Edited ]

Regular expressions easily bring this kind of flexibility.

``````data HAVE;
input (X1-X10) (: \$1.);
cards;
P C M M M M M M M P
P C M M C M M M M P
P C M M C M M C M P
run;
data WANT;
set HAVE;
array P [10];
RE=prxparse('/CM{3,}/');                     %* This means match a C followed by a least 3 Ms;
call prxsubstr(RE,cat(of X1-X10), POS, LEN); %* and return the match position and length;
if POS then P[POS]=LEN-1;
run;
proc print noobs;
var P1-P10;
run;``````

P1 P2 P3 P4 P5 P6 P7 P8 P9 P10
. 7 . . . . . . . .
. . . . 4 . . . . .
. . . . . . . . . .

Exhaustive list of regular expressions supported by SAS in:
http://www.amazon.com/High-Performance-SAS-Coding-Christian-Graffeuille/dp/1512397490

Occasional Contributor
Posts: 17

## Re: How to streamline this data generating process?

Hi SASkiwi, I drew on your insights, and added some other codes to finally create the result i desired, however,  i have to ran several times to arrive at the final result, please see my codes below:

``````data have;
informat account_id \$8. m1-m10 \$1.;
input Account_ID \$ M1-M10 \$;
cards;
A00001 C M   M    M   M   M   M   M   M  P
A00002 C P   C    M   M   M   M   M   M  M
A00003 P P   M    C   M   M   M   M   M  M
;
run;

data want;
set have;
array m(10) m1-m10;
array p(7)  p1-p7 ;
do I = 1 to 7;
text=catt(m(i), m(i+1), m(i+2), m(i+3));
if index(text, 'CMMM') then p(i) = 3;
else p(i)=0;
end;
drop i text;
run;

data want1;
set want;
array m(10) m1-m10;
array p(6)  p8-p13 ;
do I= 1 to 6;
text=catt(m(i), m(i+1), m(i+2), m(i+3), m(i+4));
if index(text, 'CMMMM') then p(i) = 4;
else p(i)=0;
end;
drop i text;
run;

data want2;
set want1;
array m(10) m1-m10;
array p(5)  p14-p18 ;
do I= 1 to 5;
text=catt(m(i), m(i+1), m(i+2), m(i+3), m(i+4), m(i+5));
if index(text, 'CMMMMM') then p(i) = 5;
else p(i)=0;
end;
drop i text;
run;

Data want3;
set want2;
array m(10) m1-m10;
array p(4)  p19-p22 ;
do I= 1 to 4;
text=catt(m(i), m(i+1), m(i+2), m(i+3), m(i+4), m(i+5),m(i+6));
if index(text, 'CMMMMMM') then p(i) = 6;
else p(i)=0;
end;
drop i text;
run;

Data want4;
set want3;
array m(10) m1-m10;
array p(3)  p23-p25 ;
do I= 1 to 3;
text=catt(m(i), m(i+1), m(i+2), m(i+3), m(i+4), m(i+5),m(i+6),m(i+7));
if index(text, 'CMMMMMMM') then p(i) = 7;
else p(i)=0;
end;
drop i text;
run;

Data want_final;
set want4;
array m(10) m1-m10;
array p(2)  p26-p27 ;
do I= 1 to 2;
text=catt(m(i), m(i+1), m(i+2), m(i+3), m(i+4), m(i+5),m(i+6),m(i+7),m(1+8));
if index(text, 'CMMMMMMMM') then p(i) = 8;
else p(i)=0;
end;
drop i text;
Max=max(of p1-p27);
Min=min(of p1-p27);
run;``````

I am wondering if there is a "magic wand" that you can help me wave  that turns them into a series of  coherent steps to produce the final result with one fell swoop.Thanks!

Super User
Posts: 3,927

## Re: How to streamline this data generating process?

[ Edited ]

Well this might be one way of combining 2 of the rules as an example. Please test it.

``````data have;
informat account_id \$8. m1-m10 \$1.;
input Account_ID \$ M1-M10 \$;
cards;
A00001 C M   M    M   M   M   M   M   M  P
A00002 C P   C    M   M   M   M   M   M  M
A00003 P P   M    C   M   M   M   M   M  M
;
run;

data want;
set have;
array m(10) m1-m10;
array p(13)  p1-p13 ;
do I = 1 to 7;
text=catt(m(i), m(i+1), m(i+2), m(i+3));
if index(text, 'CMMM') then p(i) = 3;
else p(i)=0;
end;

do I= 1 to 6;

text=catt(m(i), m(i+1), m(i+2), m(i+3), m(i+4));
if p(i) = 3 and index(text, 'CMMMM') then p(i +7) = 4;
else p(i)=0;
end;

drop i text;
run;
``````

Solution
‎04-10-2016 09:24 PM
Occasional Contributor
Posts: 17

## Re: How to streamline this data generating process?

Hi SASkiwi, sorry about this delayed reply, it worked out perfectly except some missing values, I changed the codes a bit to make it look like exactly what i wanted.Thanks so much for you help. I will copy this idea and try to expand my variables from M1-M10 to M1-M100.

``````data have;
informat account_id \$8. m1-m10 \$1.;
input Account_ID \$ M1-M10 \$;
cards;
A00001 C M   M    M   M   M   M   M   M  P
A00002 C P   C    M   M   M   M   M   M  M
A00003 P P   M    C   M   M   M   M   M  M
;
run;

data want;
set have;
array m(10) m1-m10;
array p(13)  p1-p13 ;
do I = 1 to 7;
text=catt(m(i), m(i+1), m(i+2), m(i+3));
if index(text, 'CMMM') then p(i) = 3;
else p(i)=0;
end;

do I= 1 to 6;
text=catt(m(i), m(i+1), m(i+2), m(i+3), m(i+4));
if index(text, 'CMMMM') then p(i +7) = 4;
else p(i+7)=0;
end;
drop i text;
run;``````

Best,

🔒 This topic is solved and locked.