Re: Identifying a sequence of events

vomer · Posted 06-20-2013 02:24 PM

Hi all, I have a question on how the following can be achieved:

Sample Data:

Event	Reason
S1	Reason 1
S5	Reason 1
S8	Reason 1
S2	Reason 2
S1	Reason 4
S5	Reason 1
S8	Reason 5
S0	Reason 0
S3	Reason 7
S7	Reason 1
S1	Reason 1
S5	Reason 1
S8	Reason 1

Basically, I am trying to identify when a certain set of conditions are satisfied -> give me the count of this senario. So in the example above - I want to pull out the data where the sequence of "Event" goes from S1 to S5 to S8 and the Reason is Reason 1. Looking at the example above - I should get a total count of 2 and if I were to pull the data out into a new data set it would pull out the 2 sets where that sequence occurred and the reason was "Reason 1".

Thanks for your help!

Reeza · Posted 06-20-2013 02:55 PM

Interesting question.

How big is your data?

Haikuo · Posted 06-20-2013 03:03 PM

I maybe seeing where Reeza is going. If your data is NOT that big, then try the following simple approach (3x pass):

data have;

input event$ reason $&9.;

cards;

S1 Reason 1

S5 Reason 1

S8 Reason 1

S2 Reason 2

S1 Reason 4

S5 Reason 1

S8 Reason 5

S0 Reason 0

S3 Reason 7

S7 Reason 1

S1 Reason 1

S5 Reason 1

S8 Reason 1

;

data want;

merge have

have(firstobs=2 keep=event reason rename=(event=_e2 reason=_r2))

have(firstobs=3 keep=event reason rename=(event=_e3 reason=_r3));

RETAIN N;

if event='S1'

AND _E2='S5'

AND _E3='S8'

AND REASON='Reason 1'

AND _R2 ='Reason 1'

AND _R3 ='Reason 1'

THEN N=3;

IF N>0 THEN OUTPUT;

N=N-1;

drop n _:;

RUN;

Haikuo

Reeza · Posted 06-20-2013 03:23 PM

Mines not nearly as elegant

Could probably combine steps 1/2 but here's mine.

data have;

input Event $ Reason $&9. ;

cards;

S1 Reason 1

S5 Reason 1

S8 Reason 1

S2 Reason 2

S1 Reason 4

S5 Reason 1

S8 Reason 5

S0 Reason 0

S3 Reason 7

S7 Reason 1

S1 Reason 1

S5 Reason 1

S8 Reason 1

;

run;

data have2;

set have;

row=_n_;

run;

proc sql;

create table have3 as

select a.*, b.reason as reason2, b.event as event2, c.reason as reason3, c.event as event3

from have2 a

left join have2 b

on a.row=b.row+1

left join have2 c

on a.row=c.row+2;

quit;

proc sql;

create table want as

select * from have3

where row in

(select row from have3 a1

where a1.reason="Reason 1" and a1.Reason2="Reason 1" and a1.Reason3="Reason 1 "

and a1.event="S8" and a1.event2="S5" and a1.event3="S1")

or row in (

select row-1 as row from have3 a2

where a2.reason="Reason 1" and a2.Reason2="Reason 1" and a2.Reason3="Reason 1"

and a2.event="S8" and a2.event2="S5" and a2.event3="S1")

or row in (

select row-2 as row from have3 a3

where a3.reason="Reason 1" and a3.Reason2="Reason 1" and a3.Reason3="Reason 1"

and a3.event="S8" and a3.event2="S5" and a3.event3="S1")

;

quit;

vomer · Posted 06-24-2013 12:33 PM

Hi there, I have a question around Haikou's code:

data want;

merge have

have(firstobs=2 keep=event reason rename=(event=_e2 reason=_r2))

have(firstobs=3 keep=event reason rename=(event=_e3 reason=_r3));

RETAIN N;

if event='S1'

AND _E2='S5'

AND _E3='S8'

AND REASON='Reason 1'

AND _R2 ='Reason 1'

AND _R3 ='Reason 1'

THEN N=3;

IF N>0 THEN OUTPUT;

N=N-1;

drop n _:;

RUN;

I would like to know how I can add in an "ID" variable to consider in this code. For example - right now this code only considers the veriables "event" and "reason". What if I also have a customer ID variable that I would like to chunk this result by?

For example: If I only want Ccustomer #1's data that follows that pattern? Obviously I would not know every customer's number in a large data set but rather would like the query about to "sort" by a customer ID variable.

vomer · Posted 06-20-2013 04:03 PM

Hello Reeza, my dataset is quite large (close to a million rows of data). I am working with a smaller subset to see what solutions are possible in SAS as a testing ground.

Thanks Reeza and Haikuo - I will attempt both these solutions and report back on my findings.

Any other tips/solutions for large data are always welcome as I am still learning this scenario.

KarlK · Posted 06-20-2013 04:43 PM

You could read it sequentially, but it gets a little code heavy and is not nearly as elegant as the solutions you already have. You set a (retained) flag for each criteria and, if something happens out of sequence, you reset them and start over. It looks ugly but is something like (using Hai.Kuo's "have"):

data want (keep = Scenario1);

set have end = eof;

retain Scenario1 got1 got5 got8 0;

select;

when (event = 'S1' and Reason = 'Reason 1')

do;

got1 = 1;

got5 = 0;

got8 = 0;

end;

when (event = 'S5' and Reason = 'Reason 1')

do;

if got1 = 1 then got5 = 1;

else do; got1 = 0; got5 = 0; got8 = 0; end;

end;

when (event = 'S8' and Reason = 'Reason 1')

do;

if got1 = 1 and got5 = 1 then got8 = 1;

else do; got1 = 0; got5 = 0; got8 = 0; end;

end;

otherwise

do;

got1 = 0; got5 = 0; got8 = 0;

end;

if sum(got1, got5, got8) = 3 then do;

Scenario1 + 1;

got1 = 0; got5 = 0; got8 = 0;

end;

if eof then output;

run;

The advantage to this ugly thing is that you could potentially test for multiple scenarios with a single pass through the data, in which case you'd have multiple sets of "gotx" flags. (Above is not only ugly but also not fully tested. It's meant to be illustrative, not production quality.)

HTH

Karl

Astounding · Posted 06-20-2013 04:39 PM

You may want to consider adding another variable as an ID variable. If you save this information in a separate data set, you will not have unique matches:

Reason = 1 and Event goes from S1 to S5

Reason = 1 and Event goes from S5 to S8

These conditions appear more than once in your data, so how will you know which set to retrieve?

ballardw · Posted 06-20-2013 05:31 PM

You say the count should be 2 but there are 3 highlighted?!?

Haikuo · Posted 06-20-2013 05:59 PM

By using Hash(), We can have a viable one-pass data step solution, therefore, theoretically more efficient. I believe you can do the same using Array() (after all, Hash is a type of array), but it will be not as easy as using Hash(), if you have both Char and Number variables. I have also added two additional variables to be closer to your real scenario.

data have;

input event$ reason$&8. OTHER_1 OTHER_2;

cards;

S1 Reason 1 2 3

S5 Reason 1 3 4

S8 Reason 1 4 5

S2 Reason 2 5 6

S1 Reason 4 6 8

S5 Reason 1 4 4

S8 Reason 5 7 7

S0 Reason 0 8 8

S3 Reason 7 4 3

S7 Reason 1 3 2

S1 Reason 1 3 5

S5 Reason 1 4 6

S8 Reason 1 1 0

;

PROC SQL NOPRINT;

SELECT QUOTE(CATS(NAME)) INTO :NAME SEPARATED BY ',' FROM DICTIONARY.COLUMNS WHERE LIBNAME='WORK' AND MEMNAME='HAVE';

QUIT;

DATA WANT_HASH;

IF _N_=1 THEN DO;

IF 0 THEN SET HAVE;

DECLARE HASH H(ORDERed:'Y');

H.DEFINEKEY('EVENT');

H.DEFINEDATA(&NAME.);

H.DEFINEDONE();

DECLARE HITER HI('H');

END;

SET HAVE;

N+1;

IF EVENT='S1' AND REASON='Reason 1' THEN DO;

H.CLEAR();

N=1;

H.REPLACE();

END;

ELSE IF EVENT='S5' AND REASON='Reason 1' AND N=2 THEN H.REPLACE();

ELSE IF EVENT='S8' AND REASON='Reason 1' AND N=3 THEN H.REPLACE();

IF H.NUM_ITEMS=3 THEN DO;

DO RC=HI.FIRST() BY 0 WHILE (RC=0);

OUTPUT;

RC=HI.NEXT();

END;

h.clear();

END;

drop N RC;

RUN;

Haikuo

PGStats · Posted 06-20-2013 06:10 PM

I think it can be done this way. Put the sequences and reasons that you are looking for in one dataset and the full sequences in another dataset. Transform the sequences into strings and use the INDEX function to check if the expected sequence occurs (is a sub-sequence) in the observed sequence.

data lookfor;
input Event $   Reason $&;
datalines;
S1    Reason 1
S5    Reason 1
S8    Reason 1
S2    Reason 2
S6    Reason 2
S9    Reason 2
;

data conditions;
length sequence $200;
eventCount = 0;
do until (last.reason);
set lookFor; by reason notsorted;
sequence = catx("-",sequence,event);
eventCount + 1;
end;
if eventCount > 1;
run;

data seq;
input Event $   Reason $&;
datalines;
S1    Reason 1
S5    Reason 1
S8    Reason 1
S2    Reason 2
S1    Reason 4
S5    Reason 1
S8    Reason 5
S0    Reason 0
S3    Reason 7
S7    Reason 1
S1    Reason 1
S5    Reason 1
S8    Reason 1
;

data sequences;
length sequence $200;
eventCount = 0;
do until (last.reason);
set seq; by reason notsorted;
sequence = catx("-",sequence,event);
eventCount + 1;
end;
if eventCount > 1;
run;

proc sql;
create table seqFreq as
select reason, sequence, eventCount, count(*) as n
from sequences
group by reason, sequence, eventCount;

create table condFreq as
select reason, sequence, eventCount, count(*) as n
from conditions
group by reason, sequence, eventCount;

create table condSeqFreq as
select a.reason, a.sequence, a.eventCount, sum(b.n) as nTot
from condFreq as a left join seqFreq as b
on a.reason=b.reason and index(b.sequence, trim(a.sequence)) > 0
group by a.reason, a.sequence, a.eventCount;
quit;

proc print noobs; run;

PG

vomer · Posted 06-20-2013 11:09 PM

Thanks everyone. There are some great suggestions here and I will take some time to explore them all.

Haikou & PGStats - your updated solutions are new concepts for me - so many thanks for sharing these. I will have to read/experiment into these further.

I will try to report back on my experience as I think it can help others like me in similar scenarios.

PGStats · Posted 08-16-2013 10:51 AM

vomer wrote:

Thanks everyone. There are some great suggestions here and I will take some time to explore them all.

Haikou & PGStats - your updated solutions are new concepts for me - so many thanks for sharing these. I will have to read/experiment into these further.

I will try to report back on my experience as I think it can help others like me in similar scenarios.

That was more than a month ago.

PG

Ksharp · Posted 06-20-2013 11:46 PM

OK.

data have;
input event$ reason $&9.;
cards;
S1    Reason 1
S5    Reason 1
S8    Reason 1
S2    Reason 2
S1    Reason 4
S5    Reason 1
S8    Reason 5
S0    Reason 0
S3    Reason 7
S7    Reason 1
S1    Reason 1
S5    Reason 1
S8    Reason 1
;
run;
data have;
 set have;
 n+1;
run;
data id(keep=n);
 set have;
 if event='S8' and lag(event)='S5' and lag2(event)='S1' and
    reason='Reason 1' and lag(reason)='Reason 1' and lag2(reason)='Reason 1'  then do;
     group+1;call symputx('group',group);
     output;
 end;
run;
%put NOTE: How many groups : &group ;
data id;
 set id;
 output;
 n=n-1;output;
 n=n-1;output;
run;

data want;
 if _n_ eq 1 then do;
  declare hash ha(dataset:'id',hashexp:20);
   ha.definekey('n');
   ha.definedone();
 end;
 set have;
if ha.check()=0;
run;

Ksharp

vomer · Posted 08-16-2013 09:40 AM

anyone^ ??

Registration is open

SAS Training: Just a Click Away