BookmarkSubscribeRSS Feed
John04
Fluorite | Level 6

I have a table with a column called as flag with values (1,2,3,4). I want to create column called as Category based on certain conditions:

 

1) if flag 1, 2, 3, 4 appears consecutively for the first time  will have their category as c1, c2, c3 and c4.

2) if consecutive flag=3  appears after 1), then the category will change to c5 - This condition can be accepted 2 times

3) if consecutive flag=4  appears after 2), then the category will change to c5 - This condition can be accepted 2 times

4) After 3), if flag=3 appears then category will be c7

5) After 5) if flag=4 appears then category will be c8

 

So for e.g.

 

data have;
	input Id YearMonth Flag;
	datalines;
1  200101    1
1  200102    1
1  200103    2 
1  200104    3
1  200105    3
1  200106    4
1  200107    4
1  200108    3
1  200109    4
1  200110    4
1  200111    3
1  200112    4
1  200201    3
1  200202    4
;
run;
data want;
_n_=0;
do until(last.id);
do until(last.flag);
set have;
by id flag notsorted;
if first.id and first.flag=2 then _n_=2;
else if first.flag and _n_<4 then _n_=sum(_n_,1);
if _n_=4 then category=ifn(flag,4,3);
else category=_n_;
output;
end;
end;
run;

I have tried using case statement and comparing each YearMonth to the previous YearMonth. It works but repeats the category. So in 11th row and 13th row, it gives c3 instead of c5 and c7. I think I have to use counter logic and group by but not sure how it will work.

4 REPLIES 4
mkeintz
PROC Star

I do not clearly understand the statement of your objective.

 

But I DO understand that you need to track sequences of values, which means the CASE statement, (or proc sql in general) is not a helpful technique.

 

Instead, it looks like you need a data step, which can easily be used to detect the beginning and ending of a series, as well as looking back (and forward) to neighboring values in a sequence.

 

Perhaps if you should your code, your objective might become clearer.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
John04
Fluorite | Level 6
data have;
	input Id YearMonth Flag;
	datalines;
1  200101    1
1  200102    1
1  200103    2 
1  200104    3
1  200105    3
1  200106    4
1  200107    4
1  200108    3
1  200109    4
1  200110    4
1  200111    3
1  200112    4
1  200201    3
1  200202    4
;
run;

data Want;
	input Id YearMonth Flag Category;
	datalines;
1  200101    1 c1
1  200102    1 c1
1  200103    2 c2
1  200104    3 c3
1  200105    3 c3
1  200106    4 c4
1  200107    4 c4
1  200108    3 c5
1  200109    4 c6
1  200110    4 c6
1  200111    3 c5
1  200112    4 c6
1  200201    3 c7
1  200202    4 c8
;
run;
mkeintz
PROC Star

You apparently did not actually test your data want step to show the desired output.  It will fail because it attempts to read character values into the numeric variable named category.

 

Let's see if I can tease out your intended rules.  I'm going to use the term "flag group" to mean a series of records with a constant value for flag.  A flag group can be any size, from 1 up.

 

  1. "if flag 1, 2, 3, 4 appears consecutively for the first time  will have their category as c1, c2, c3 and c4.".  I take this to mean a series of flag groups with ascending flag values from 1 through 4.

    question:  what if an ID starts with, say   2,2,3,1,5.  I don't see any rule for an id that does not start out with flag-group ascending from 1 to 4.  What category gets assigned to these records?

  2. Then "if consecutive flag=3  appears after" [sequence in rule 1], "then the category will change to c5 - This condition can be accepted 2 times.".

    Question: I take "consecutive flag=3" to mean a flag group with value 3.  Does that mean you could have, starting in 200108, the sequence 3,3,4,4,3,4.  Since the three flag=3 records appear in two groups, that suggests to me that all three of those flag=3 records would have category C5, correct?

  3. The "if consecutive flag=4  appears after" [event in rule 2], "then the category will change to c5 - This condition can be accepted 2 times"

    This rule, and rule 2 suggests to me that you want to increment the category  (C5, then C6, then C7, etc.)  every second time a flag group for flag=4 or flag=3 starts, correct?


  4. You don't say what should be done if a 1 or a 2 reappear in the series.

 

So if I assume (1) that every id starts with, in order, flag groups for flag=1,2,3,4,  and (2) the remaining flag groups are only for flag=3 or 4 then your rules look to me like this table:

 

Flag group Counts 1 2 3 4 5
Flag=3 C3 C5 C5 C7 C7
Flag=4 C4 C6 C6 C8 C8

 

This means that all you have to do is track the flag-group counts for each flag level, as in:

  1. Take the group count, subtract 1, divide by 2, and round up:
      ceil((group count-1)/2)
    1. So for group count = 2 or 3, this generates a 1
    2. for group count=4 or 5, this generates a 2
      6 or 7==>3,   8 or 9==>4, etc.
  2. Multiply the results by 2.
    1. For group count=2 or 3, this makes a 2
    2. For group count=4 or 5, this makes a 4, etc.
  3. Add the value of flag  (i.e. add 3 or add 4) and prefix with a "C".

So the general formula is    flag+2*ceil(group_count-1)/2).  

 

This makes the program straightforward:

 


data have;
	input Id YearMonth Flag ;
	datalines;
1  200101    1 c1
1  200102    1 c1
1  200103    2 c2
1  200104    3 c3
1  200105    3 c3
1  200106    4 c4
1  200107    4 c4
1  200108    3 c5
1  200109    4 c6
1  200110    4 c6
1  200111    3 c5
1  200112    4 c6
1  200201    3 c7
1  200202    4 c8
run;

data want (drop=_:);
  set have;
  by id flag notsorted;
  array group_counts {4} _temporary_ ;

  retain category;

  if first.flag then do;
    if first.id then call missing(of group_counts{*});
    group_counts{flag}+1;
    if group_counts{flag}=1 then  _cnum=flag;
    else _cnum=flag+2*ceil((group_counts{flag}-1)/2);
    category=cats('C',_cnum);
  end;
run;

The approach is to track counts of flag-groups for flag=1, 2, 3, and 4, kept above in a 4-element array, indexed by the value of flag.  Because the array is declared _temporary_, its values are not reset to missing automatically.

 

So at the start of each flag group, first check if the counts need to be reset to missing, then increment the appropriate count, followed by generating the category.  The category variable is retained, so the entire flag-group gets the same category value.

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
John04
Fluorite | Level 6

I apologies if I wasn't much clear on my ask or description.

Also, Category is not much important to be called as C1,C2.... It can be 1,2,3,4,5,6.7,8

 

Note:

1)Stage for a id will start with either 1 or 2.

2)If stage starts with 1 then it can be either 1/2 for next YearMonth. Once Stage becomes 2 (which is possible only when stage-1 already exists or stage for an id start with 2 -It also means that If for any id, stage=2 then it won't be stage=1 again.

3)Once stage changes from 2 to 3, it wont be stage 1/2 again and it will only be 4 

4) Once stage changes from 3 to 4, it wont be stage 1/2 again and it will only be 4

5) After 2), 3) and 4) will only occur

 

 e.g. 

idYearMonthStage
12000021
12000032
12000043
12000053
12000064
12000073
12000084
22000012
2000022
2000033
2000043
2000054
2000064
2000073
2000084

So for id=1, it starts with 1 and move forward. Once it reaches 4 it can be 3 and similarly once it reaches 3 it can only be 4

for id=2, it starts with 2 and then move forward. Again, once it reaches 4 it can be 3 and similarly once it reaches 3 it can only be 4.

 

This will make rule-1 not valid.

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 4 replies
  • 934 views
  • 0 likes
  • 2 in conversation