Re: Filter out wrong patterns within an ID group

Wouter · Posted 09-11-2010 05:43 AM

Hi,

I was wondering if anyone could help me with the following problem I've got.
I've a set of ID's, linked to a set of statusses (and a date field). What I want is to get a dataset in which all ID's that follow a simple pattern of statusses (ID has to begin with 1 and the following (end) code has to be 5). Let me make this more clear with the following example:

ID------Date-------Status
1-----01-01-10------1
1-----01-05-10------5
2-----01-01-10------2
2-----01-03-10------1
2-----01-04-10------5
2-----01-05-10------9
3-----01-03-10------1
3-----01-04-10------3
3-----01-05-10------5
4-----01-01-10------3
4-----01-02-10------1
4-----01-08-10------7
4-----01-10-10------5
5-----01-03-10------1
5-----01-06-10------7
5-----01-08-10------4

So, in this case, I want a set of ID's that begin with status 1, directly followed by status 5 for the same ID, like this:

ID------Date-------Status
1-----01-01-10------1
1-----01-05-10------5
2-----01-03-10------1
2-----01-04-10------5

Not ID 3, because this one has status 3 between the status 1 and 5, not ID 4 because this one has status 7 between the 1 and 5, and not ID 5 because there's no status 5 (after status 1).

I tried some things (linking the ID to the status with "CATS", I tried some things with first.ID / last.ID and "do while last.ID", but I can't figur it out. Can anyone help me? Thanks in advance!!

data_null__ · Posted 09-11-2010 08:40 AM

I think a "simple" look ahead will suffice. The code seems a bit awkward with the flag and count but it seems to get the job done.

[pre]
data test;
input id $1. date mmddyy8. status 1.;
format date date.;
cards;
101-01-101
101-05-105
201-01-102
201-03-101
201-04-105
201-05-109
301-03-101
301-04-103
301-05-105
401-01-103
401-02-101
401-08-107
401-10-105
501-03-101
501-06-107
501-08-104
;;;;
run;
proc print;
run;
data filter;
flag = 0;
count = 0;
do until(last.id);
set test end=eof;
by id;
if not eof then set test(firstobs=2 keep=status rename=(status=ns));
if status eq 1 and ns = 5 then flag = 1;
if flag then do;
count + 1;
if count le 2 then output;
if count eq 2 then call missing(flag,count);
end;
end;
drop ns flag count;
run;
proc print;
run;
[/pre]

Wouter · Posted 09-13-2010 05:29 AM

Hi,

Thanks for your reply! I've just tried the code, but it ends with 0 observations. Could it be possible that it has to do with the following:

"if status eq 1 and ns = 5 then flag = 1;"

I think, since there's no retain function or something like that (to 'remember' the last value of status), the variable 'status' and 'ns' are always the same. Could this be correct. And if so, how to retain the former value...? Thanks for any support in advance!

Cynthia_sas · Posted 09-13-2010 07:01 AM

Hi:
As the above example shows, you will have to essentiallly read the data twice to accomplish this task. I generally do not use/show SET inside a loop, because it is hard to explain to beginners -- however it can be a very efficient way to read and process large datasets. Alternately, if you used the LAG function, you could accomplish the same task in more steps, but possibly would end up with a program that was easier to understand.

If you use the LAG function to "look behind", you would get data as shown below (note that I created the LAG_STAT and the ORIG_ORD variables in addition to reading the original variables that you showed):
[pre]
Show how lag works -- but do not really need this dataset

Obs ID date status lag_stat orig_ord

1 1 01/01/10 1 . 1
2 1 01/05/10 5 1 2
3 2 01/01/10 2 . 3
4 2 01/03/10 1 2 4
5 2 01/04/10 5 1 5
6 2 01/05/10 9 5 6
7 3 01/03/10 1 . 7
8 3 01/04/10 3 1 8
9 3 01/05/10 5 3 9
10 4 01/01/10 3 . 10
11 4 01/02/10 1 3 11
12 4 01/08/10 7 1 12
13 4 01/10/10 5 7 13
14 5 01/03/10 1 . 14
15 5 01/06/10 7 1 15
16 5 01/08/10 4 7 16
[/pre]

If you look at the rows where LAG_STAT = 1, that means that the value for the PREVIOUS observation was 1 so if the current STATUS variable is equal to 5, you now know that you have an ID that you want. In the report above, you see this pattern in obs #2 for ID=1 and in obs#5 for ID=2. So, if you output ONLY the ID for the obs that meet this pattern, then you will have a dataset of the IDs you want.

In my program below, where the LAG function is used, I have 1 DATA step program that creates 2 datasets -- WANTID and LAGSTATUS. The dataset I really need to use is the WANTID dataset -- the LAGSTATUS dataset is just to get the above shot of how the logic works with LAG.

Next, the WANTID dataset is merged with the original dataset (I call the original dataset WORK.CK_PATTERN) -- you probably only want 1 dataset from the merge, but my program creates 3 -- 1 dataset to hold all the obs that DO meet the pattern, 1 dataset to hold all the obs that do NOT meet the pattern, and 1 dataset to hold the obs that are other than 1 or 5 for the IDs that meet the pattern (such as the obs with status=2 and status=9 for ID=2)

You can experiment with data_null_'s program and try to get it working with 2 set statements and the loop and flags or you can use a technique similar to the one I show in the program below.

cynthia
[pre]
** 1) make data;
data ck_pattern;
infile datalines dlm=' ';
input ID date :anydtdte8. status;
orig_ord=_n_;
format date mmddyy8.;
return;
datalines;
1 01-01-10 1
1 01-05-10 5
2 01-01-10 2
2 01-03-10 1
2 01-04-10 5
2 01-05-10 9
3 01-03-10 1
3 01-04-10 3
3 01-05-10 5
4 01-01-10 3
4 01-02-10 1
4 01-08-10 7
4 01-10-10 5
5 01-03-10 1
5 01-06-10 7
5 01-08-10 4
;
run;

proc sort data=ck_pattern;
by id orig_ord;
run;

ods listing;
proc print data=ck_pattern;
title 'Original Data';
run;

** 2) Use the LAG function to determine which IDs are wanted;
** only output ID when current row is 5 and prev row had status of 1;
** lag_stat is status value from prev row and status is value from current row;
** on first.id need to set lag_stat to missing -- because first.id has no previous row;
data wantid(keep=id)
lagstatus (keep=id date status lag_stat orig_ord);
set ck_pattern;
by id;
lag_stat = lag(status);

if first.id then lag_stat=.;
output lagstatus;
if lag_stat = 1 and status = 5 then output wantid;
run;

proc print data=lagstatus;
title 'Show how lag works -- but do not really need this dataset';
var id date status lag_stat orig_ord;
run;

proc print data=wantid;
title 'Just want these ids';
run;

** 3) do merge to get only IDs that meet pattern where orig obs has status of 1 or 5 and;
** put these obs into the dataset, HAVEPATTERN ;
** other datasets created are DONOTMEET -- for IDs that do not meet pattern at all;
** and EXTRAOBS for the other statuses from the IDs that do meet the pattern;
** you may want to just put EXTRAOBS rows into DONOTMEET -- not sure about this;
data havepattern donotmeet extraobs;
merge wantid(in=want)
ck_pattern(in=inorig);
by id;
if want=1 and inorig=1 then do;
if status in (1,5) then output havepattern;
else if status not in (1,5) then output extraobs;
end;
else if inorig=1 and want=0 then output donotmeet;
run;

proc print data=havepattern;
title 'IDs that Meet Pattern';
run;

proc print data=extraobs;
title 'IDs that meet Pattern but these status obs not needed';
run;

proc print data=donotmeet;
title 'Obs that Do Not Meet Pattern';
run;
[/pre]

Wouter · Posted 09-13-2010 09:32 AM

Thank you very much for this extensive answer!! I'll take a look at this later (right now, I'm on work), because this looks very interesting too!

Ksharp · Posted 09-13-2010 09:59 PM

Hi. I also have a way.
But honestly the lag() function really make me embarrassed, and I am not sure
how to properly use it,so I have to make temp-variables to save the lag obs.

[pre]
data ck_pattern;
infile datalines dlm=' ';
input ID date :anydtdte8. status;
format date mmddyy8.;
return;
datalines;
1 01-01-10 1
1 01-05-10 5
2 01-01-10 2
2 01-03-10 1
2 01-04-10 5
2 01-05-10 9
3 01-03-10 1
3 01-04-10 3
3 01-05-10 5
4 01-01-10 3
4 01-02-10 1
4 01-08-10 7
4 01-10-10 5
5 01-03-10 1
5 01-06-10 7
5 01-08-10 4
;
run;
proc print;
run;
data result;
set ck_pattern;
retain _id _date _status;
if status=5 and lag(status)=1 and id=lag(id) then do;
output;
_id=id;
_date=date;
_status=status;
output;
end;
_id=id; _date=date; _status=status; *to save the previous obs;
keep _id _date _status;
format _date mmddyy8.;
run;
proc print noobs;
run;
[/pre]

Ksharp

chang_y_chung_hotmail_com · Posted 09-14-2010 10:37 AM

Cynthia's solution involves reading the data 3+ times and writing 2+ times for the task that can be accomplished by reading the data once and writing less than once. And I don't think her code is particularly more readable or cleaner than others.

I like Ksharp's idea. This enables us to rely on lagging (which is natural in a data step), instead of leading (which is not). The only minor problem is that we not only have to lag id and status, but also need a temporary storage for a whole observation to output in the proper order (status 1 obs first followed by status 5 obs).

Here is my try using a well-known lead technique called "merge without by," with a bit of twist of using a view instead of dataset options. Although this reads the data twice, with proper names, the data step becomes rather straight forward. At least I think it is... 🙂

[pre]
/* test data */
data one;
input id :$1. date :mmddyy8. status :1.;
format date date.;
cards;
1 01-01-10 1
1 01-05-10 5
2 01-01-10 2
2 01-03-10 1
2 01-04-10 5
2 01-05-10 9
3 01-03-10 1
3 01-04-10 3
3 01-05-10 5
4 01-01-10 3
4 01-02-10 1
4 01-08-10 7
4 01-10-10 5
5 01-03-10 1
5 01-06-10 7
5 01-08-10 4
;
run;

data next/view=next;
set one(firstobs=2);
rename id=nextId date=nextDate status=nextStatus;
run;

proc optsave; run;
options mergenoby=nowarn;

data two;
merge one next;
/* no by intentional */
if status = 1 and nextId = id and nextStatus = 5;
output;
date = nextDate;
status = 5;
output;
keep id date status;
run;

proc optload; run;

/* check */
proc print data=two;
run;
/* on lst
Obs id date status

1 1 01JAN10 1
2 1 05JAN10 5
3 2 03JAN10 1
4 2 04JAN10 5
*/
[/pre]

data_null__ · Posted 09-13-2010 07:43 AM

I can't confirm your finding. I just ran the program again it produced this output.

[pre]
bs id date status

1 1 01JAN10 1
2 1 05JAN10 5
3 2 03JAN10 1
4 2 04JAN10 5
[/pre]

Wouter · Posted 09-13-2010 09:31 AM

Hi,

Sorry, it DID the job! But I was working on an existing dataset, so I assumed a firstobs=1 in stead of beginning at the second row (in fact, I removed this statement). Putting it back provided me with the results I wanted.

Many, many thanks!! Message was edited by: Wouter

deleted_user · Posted 09-14-2010 08:29 AM

Hi,

Without using the lag function but with a look ahead to the table test :

data filter;

/* have a look ahead to test */
if not eof then set test (firstobs=2 keep=status rename=(status=status2)) end=eof ;

/* load test */
set test;

retain following 0;
by id;

/* output status 1 if status 5 is following */
if not last.id and status eq 1 and status2 eq 5 then do;
output;
following=1;
end;

/* output the following status 5 */
else if following eq 1 then do;
output;
following=0;
end;

keep id date status;
run;

SAS Innovate 2025: Call for Content

Classroom Training Available!