I have a dataset consisting of multiple records per ID (patient). For each ID I want to code the "Group" variable with 0, 1, 2 or 9. The coding would be based on values in the "Comment" variable. Below are the possible values for the "Comments" variable:
CANCELLED BY CLINIC
NO ACTION TAKEN
CANCELLED BY PATIENT
The dataset is sorted by id then CDATE. TBI POS will always be the first "Comment" value for each ID. Sometimes it may not, but should be ingored in this case. The "Group" variable should be coded according to the following:
For each ID (patient)
If TBI POS is the only field then "Group" = 0
If TBI POS immediately followed by COMPLETE then "Goup" = 1
IF TBI POS followed by anything except COMPLETE but eventually has COMPLETE then "Group" = 2.
IF TBI POS followed by anything except COMPLETE and COMPLETE nvever shows up then "Group" = 0.
After an ID is determined to be in Group 2 then all values after the COMPLETE should be coded "Group" = 9. So "Group" 9 is a subset of "Group" 2.
After reading and sorting your input, use a DATA step to input your sorted data, and assign your GROUP (based on using RETAIN statement variable), as your have explained. And if you know all possible COMMENT values, consider using a PROC FORMAT to read-up your input data, which would minimize your need for IF/THEN logic to assign GROUP.
Remember that PROC FORMAT can handle character-data range, such as your "starts with", but the VALUE statement or CNTLIN= coding is a bit tricky. If you do not have a large-volume of unique COMMENT strings, maybe the DATA step IF/THEN approach is sufficient, rather than the time required to setup a PROC FORMAT approach.
Scott's approach (with retain and data step and probably first. and last. variables) is one way to go.
I'd probably approach this in a different way -- you have long/skinny data. I would be tempted to use PROC TRANSPOSE to "flip" the data into WIDE data -- so that it would look like this:
Obs id _NAME_ COL1 COL2 COL3 COL4 COL5
That way, you could use an ARRAY statement in a DATA step program and make COL1-COL? the variables that composed the array. This would lead to the ability to have IF statements something like this:
data setgroup (keep=ID comment group)
set transout; /* output from PROC TRANSPOSE */
ARRAY stmt for transposed comment variables;
... do loop to find out how many comments per ID, and set CNTR variable and also
set an indicator EVERCOMPLETE for whether there's ever a "COMPLETE"
in the comments...
* then have IF statements like these:;
if cmnt(1) = "TBI POS" then do;
if cntr = 1 then do;
group = 0;
/* If TBI POS is the only field then "Group" = 0 */
else if cmnt(2) = "COMPLETE" then do;
group = 1;
/* If TBI POS immediately followed by COMPLETE
then "Group" = 1 */
else if cmnt(2) ne "COMPLETE" then do;
if evercomplete = 1 then group = 2;
else if evercomplete = 0 then group = 0;
IF TBI POS followed by anything except COMPLETE
... eventually has COMPLETE then "Group" = 2
... COMPLETE never shows up then "Group" = 0
else if cmnt(1) ne "TBI POS" then output error_obs;
...then another do loop to write out the ID, the comment and the group
as a long skinny data set again. But the logic for the DO loop depends on
whether the 9 coding is applied to just group 2 or to group 1 and 2....
I did find something inconsistent in your description of the coding and what you showed as your desired results -- you said that if there were comments after the group had been determined to be 2, then those comments should be coded as 9. But in your desired results above, you showed these 2 IDs as having a 9 coded after a GROUP of 1 had been assigned.
ID Comment Group
912 TBI POS 1
912 COMPLETE 1
912 UNSCHEDULED 9
Also, you do not say whether there would EVER be anything other than TBI POS as the first comment, but if it's possible, then you might want to catch that ID as an error observation.
Another approach to think about -- since you do not know whether to code the FIRST obs for an ID as a 1 or a 0 or a 2 until you know what the value of the second comment is. There are ways around that, but this is a possible approach. And, you could just build an array in the DATA step without ever using PROC TRANSPOSE, too. It really depends on your comfort level with SAS programming.
I think it could be done like this
if ~first.id and ~last.id then group=1;
if id=lagid and comment~="complete" then group=lag(group);
if id=lagid and comment="complete" and lagc=TBI POS then group=lag(group)+1;
if id=lagid and comment="complete" and lagc=~TBI POS then group=2;
merge yourdata new;
if group=2 and id=lagid and lagc="complete" then group=9;
if id=lagid and lagg=9 then group=9;
I donot know whether it works , my idea is to use lag function.
Thanks for the detailed explanations. I manually coded the groups since my dataset was so small but when I have some extra time I will try the suggested methods because I will run into this problem again.