I have a table with a column called as flag with values (1,2,3,4). I want to create column called as Category based on certain conditions:
1) if flag 1, 2, 3, 4 appears consecutively for the first time will have their category as c1, c2, c3 and c4.
2) if consecutive flag=3 appears after 1), then the category will change to c5 - This condition can be accepted 2 times
3) if consecutive flag=4 appears after 2), then the category will change to c5 - This condition can be accepted 2 times
4) After 3), if flag=3 appears then category will be c7
5) After 5) if flag=4 appears then category will be c8
So for e.g.
data have;
input Id YearMonth Flag;
datalines;
1 200101 1
1 200102 1
1 200103 2
1 200104 3
1 200105 3
1 200106 4
1 200107 4
1 200108 3
1 200109 4
1 200110 4
1 200111 3
1 200112 4
1 200201 3
1 200202 4
;
run;
data want;
_n_=0;
do until(last.id);
do until(last.flag);
set have;
by id flag notsorted;
if first.id and first.flag=2 then _n_=2;
else if first.flag and _n_<4 then _n_=sum(_n_,1);
if _n_=4 then category=ifn(flag,4,3);
else category=_n_;
output;
end;
end;
run;
I have tried using case statement and comparing each YearMonth to the previous YearMonth. It works but repeats the category. So in 11th row and 13th row, it gives c3 instead of c5 and c7. I think I have to use counter logic and group by but not sure how it will work.
I do not clearly understand the statement of your objective.
But I DO understand that you need to track sequences of values, which means the CASE statement, (or proc sql in general) is not a helpful technique.
Instead, it looks like you need a data step, which can easily be used to detect the beginning and ending of a series, as well as looking back (and forward) to neighboring values in a sequence.
Perhaps if you should your code, your objective might become clearer.
data have;
input Id YearMonth Flag;
datalines;
1 200101 1
1 200102 1
1 200103 2
1 200104 3
1 200105 3
1 200106 4
1 200107 4
1 200108 3
1 200109 4
1 200110 4
1 200111 3
1 200112 4
1 200201 3
1 200202 4
;
run;
data Want;
input Id YearMonth Flag Category;
datalines;
1 200101 1 c1
1 200102 1 c1
1 200103 2 c2
1 200104 3 c3
1 200105 3 c3
1 200106 4 c4
1 200107 4 c4
1 200108 3 c5
1 200109 4 c6
1 200110 4 c6
1 200111 3 c5
1 200112 4 c6
1 200201 3 c7
1 200202 4 c8
;
run;
You apparently did not actually test your data want step to show the desired output. It will fail because it attempts to read character values into the numeric variable named category.
Let's see if I can tease out your intended rules. I'm going to use the term "flag group" to mean a series of records with a constant value for flag. A flag group can be any size, from 1 up.
So if I assume (1) that every id starts with, in order, flag groups for flag=1,2,3,4, and (2) the remaining flag groups are only for flag=3 or 4 then your rules look to me like this table:
Flag group Counts | 1 | 2 | 3 | 4 | 5 |
Flag=3 | C3 | C5 | C5 | C7 | C7 |
Flag=4 | C4 | C6 | C6 | C8 | C8 |
This means that all you have to do is track the flag-group counts for each flag level, as in:
So the general formula is flag+2*ceil(group_count-1)/2).
This makes the program straightforward:
data have;
input Id YearMonth Flag ;
datalines;
1 200101 1 c1
1 200102 1 c1
1 200103 2 c2
1 200104 3 c3
1 200105 3 c3
1 200106 4 c4
1 200107 4 c4
1 200108 3 c5
1 200109 4 c6
1 200110 4 c6
1 200111 3 c5
1 200112 4 c6
1 200201 3 c7
1 200202 4 c8
run;
data want (drop=_:);
set have;
by id flag notsorted;
array group_counts {4} _temporary_ ;
retain category;
if first.flag then do;
if first.id then call missing(of group_counts{*});
group_counts{flag}+1;
if group_counts{flag}=1 then _cnum=flag;
else _cnum=flag+2*ceil((group_counts{flag}-1)/2);
category=cats('C',_cnum);
end;
run;
The approach is to track counts of flag-groups for flag=1, 2, 3, and 4, kept above in a 4-element array, indexed by the value of flag. Because the array is declared _temporary_, its values are not reset to missing automatically.
So at the start of each flag group, first check if the counts need to be reset to missing, then increment the appropriate count, followed by generating the category. The category variable is retained, so the entire flag-group gets the same category value.
I apologies if I wasn't much clear on my ask or description.
Also, Category is not much important to be called as C1,C2.... It can be 1,2,3,4,5,6.7,8
Note:
1)Stage for a id will start with either 1 or 2.
2)If stage starts with 1 then it can be either 1/2 for next YearMonth. Once Stage becomes 2 (which is possible only when stage-1 already exists or stage for an id start with 2 -It also means that If for any id, stage=2 then it won't be stage=1 again.
3)Once stage changes from 2 to 3, it wont be stage 1/2 again and it will only be 4
4) Once stage changes from 3 to 4, it wont be stage 1/2 again and it will only be 4
5) After 2), 3) and 4) will only occur
e.g.
id | YearMonth | Stage |
1 | 200002 | 1 |
1 | 200003 | 2 |
1 | 200004 | 3 |
1 | 200005 | 3 |
1 | 200006 | 4 |
1 | 200007 | 3 |
1 | 200008 | 4 |
2 | 200001 | 2 |
2 | 200002 | 2 |
2 | 200003 | 3 |
2 | 200004 | 3 |
2 | 200005 | 4 |
2 | 200006 | 4 |
2 | 200007 | 3 |
2 | 200008 | 4 |
So for id=1, it starts with 1 and move forward. Once it reaches 4 it can be 3 and similarly once it reaches 3 it can only be 4
for id=2, it starts with 2 and then move forward. Again, once it reaches 4 it can be 3 and similarly once it reaches 3 it can only be 4.
This will make rule-1 not valid.
Save $250 on SAS Innovate and get a free advance copy of the new SAS For Dummies book! Use the code "SASforDummies" to register. Don't miss out, May 6-9, in Orlando, Florida.
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.