BookmarkSubscribeRSS Feed
vomer
Obsidian | Level 7

Hi all, I have a question on how the following can be achieved:

Sample Data:

EventReason
S1Reason 1
S5Reason 1
S8Reason 1
S2Reason 2
S1Reason 4
S5Reason 1
S8Reason 5
S0Reason 0
S3Reason 7
S7Reason 1
S1Reason 1
S5Reason 1
S8Reason 1

Basically, I am trying to identify when a certain set of conditions are satisfied -> give me the count of this senario. So in the example above - I want to pull out the data where the sequence of "Event" goes from S1 to S5 to S8 and the Reason is Reason 1. Looking at the example above - I should get a total count of 2 and if I were to pull the data out into a new data set it would pull out the 2 sets where that sequence occurred and the reason was "Reason 1".

Thanks for your help!

15 REPLIES 15
Reeza
Super User

Interesting question.

How big is your data? 

Haikuo
Onyx | Level 15

I maybe seeing where Reeza is going. If your data is NOT that big, then try the following simple approach (3x pass):

data have;

input event$ reason $&9.;

cards;

S1    Reason 1

S5    Reason 1

S8    Reason 1

S2    Reason 2

S1    Reason 4

S5    Reason 1

S8    Reason 5

S0    Reason 0

S3    Reason 7

S7    Reason 1

S1    Reason 1

S5    Reason 1

S8    Reason 1

;

data want;

  merge have

       have(firstobs=2 keep=event reason rename=(event=_e2 reason=_r2))

have(firstobs=3 keep=event reason rename=(event=_e3 reason=_r3));

         RETAIN N;

if event='S1'

   AND _E2='S5'

   AND _E3='S8'

   AND REASON='Reason 1'

   AND _R2 ='Reason 1'

   AND _R3 ='Reason 1'

   THEN N=3;

      IF N>0 THEN OUTPUT;

   N=N-1;

   drop n _:;

RUN;

Haikuo

Reeza
Super User

Mines not nearly as elegant Smiley Happy

Could probably combine steps 1/2 but here's mine.

data have;

input Event $    Reason $&9. ;

cards;

S1    Reason 1

S5    Reason 1

S8    Reason 1

S2    Reason 2

S1    Reason 4

S5    Reason 1

S8    Reason 5

S0    Reason 0

S3    Reason 7

S7    Reason 1

S1    Reason 1

S5    Reason 1

S8    Reason 1

;

run;

data have2;

set have;

row=_n_;

run;

proc sql;

    create table have3 as

    select a.*, b.reason as reason2, b.event as event2, c.reason as reason3, c.event as event3

    from have2 a

    left join have2 b

    on a.row=b.row+1

    left join have2 c

    on a.row=c.row+2;

quit;

proc sql;

    create table want as

    select * from have3

    where row in

        (select row from have3 a1

        where a1.reason="Reason 1" and a1.Reason2="Reason 1" and a1.Reason3="Reason 1 "

        and a1.event="S8" and a1.event2="S5" and a1.event3="S1")

    or row in (

        select row-1 as row from have3 a2

        where a2.reason="Reason 1" and a2.Reason2="Reason 1" and a2.Reason3="Reason 1"

        and a2.event="S8" and a2.event2="S5" and a2.event3="S1")

    or row in (

        select row-2 as row from have3 a3

        where a3.reason="Reason 1" and a3.Reason2="Reason 1" and a3.Reason3="Reason 1"

        and a3.event="S8" and a3.event2="S5" and a3.event3="S1")

    ;

quit;

vomer
Obsidian | Level 7

Hi there, I have a question around Haikou's code:

data want;

  merge have

       have(firstobs=2 keep=event reason rename=(event=_e2 reason=_r2))

have(firstobs=3 keep=event reason rename=(event=_e3 reason=_r3));

         RETAIN N;

if event='S1'

   AND _E2='S5'

   AND _E3='S8'

   AND REASON='Reason 1'

   AND _R2 ='Reason 1'

   AND _R3 ='Reason 1'

   THEN N=3;

      IF N>0 THEN OUTPUT;

   N=N-1;

   drop n _:;

RUN;


I would like to know how I can add in an "ID" variable to consider in this code. For example - right now this code only considers the veriables "event" and "reason". What if I also have a customer ID variable that I would like to chunk this result by?


For example: If I only want Ccustomer #1's data that follows that pattern? Obviously I would not know every customer's number in a large data set but rather would like the query about to "sort" by a customer ID variable.

vomer
Obsidian | Level 7

Hello Reeza, my dataset is quite large (close to a million rows of data). I am working with a smaller subset to see what solutions are possible in SAS as a testing ground.

Thanks Reeza and Haikuo - I will attempt both these solutions and report back on my findings.

Any other tips/solutions for large data are always welcome as I am still learning this scenario.

KarlK
Fluorite | Level 6

You could read it sequentially, but it gets a little code heavy and is not nearly as elegant as the solutions you already have. You set a (retained) flag for each criteria and, if something happens out of sequence, you reset them and start over. It looks ugly but is something like (using Hai.Kuo's "have"):

data want (keep = Scenario1);

set have end = eof;

retain Scenario1 got1 got5 got8 0;

select;

  when (event = 'S1' and Reason = 'Reason 1')

   do;

    got1 = 1;

  got5 = 0;

  got8 = 0;

   end;

  when (event = 'S5' and Reason = 'Reason 1')

   do;

    if got1 = 1 then got5 = 1;

  else do; got1 = 0; got5 = 0; got8 = 0; end;

   end;

  when (event = 'S8' and Reason = 'Reason 1')

   do;

    if got1 = 1 and got5 = 1 then got8 = 1;

  else do; got1 = 0; got5 = 0; got8 = 0; end;

   end;

  otherwise

   do;

    got1 = 0; got5 = 0; got8 = 0;

   end;

end;

if sum(got1, got5, got8) = 3 then do;

  Scenario1 + 1;

  got1 = 0; got5 = 0; got8 = 0;

end;

if eof then output;

run;

The advantage to this ugly thing is that you could potentially test for multiple scenarios with a single pass through the data, in which case you'd have multiple sets of "gotx" flags. (Above is not only ugly but also not fully tested. It's meant to be illustrative, not production quality.)

HTH

Karl

Astounding
PROC Star

You may want to consider adding another variable as an ID variable.  If you save this information in a separate data set, you will not have unique matches:

Reason = 1 and Event goes from S1 to S5

Reason = 1 and Event goes from S5 to S8

These conditions appear more than once in your data, so how will you know which set to retrieve?

ballardw
Super User

You say the count should be 2 but there are 3 highlighted?!?

Haikuo
Onyx | Level 15

By using Hash(), We can have a viable one-pass data step solution, therefore, theoretically more efficient.  I believe you can do the same using Array() (after all, Hash is a type of array), but it will be not as easy as using Hash(), if you have both Char and Number variables. I have also added two additional variables to be closer to your real scenario.

data have;

input event$ reason$&8. OTHER_1 OTHER_2;

cards;

S1    Reason 1 2 3

S5    Reason 1 3 4

S8    Reason 1 4 5

S2    Reason 2 5 6

S1    Reason 4 6 8

S5    Reason 1 4 4

S8    Reason 5 7 7

S0    Reason 0 8 8

S3    Reason 7 4 3

S7    Reason 1 3 2

S1    Reason 1 3 5

S5    Reason 1 4 6 

S8    Reason 1 1 0

;

PROC SQL NOPRINT;

  SELECT QUOTE(CATS(NAME)) INTO :NAME SEPARATED BY ',' FROM DICTIONARY.COLUMNS WHERE LIBNAME='WORK' AND MEMNAME='HAVE';

  QUIT;

DATA WANT_HASH;

  IF _N_=1 THEN DO;

     IF 0 THEN SET HAVE;

       DECLARE HASH H(ORDERed:'Y');

       H.DEFINEKEY('EVENT');

       H.DEFINEDATA(&NAME.);

       H.DEFINEDONE();

       DECLARE HITER HI('H');

  END;

  SET HAVE;

  N+1;

      IF EVENT='S1' AND REASON='Reason 1' THEN DO;

H.CLEAR();

        N=1;

            H.REPLACE();

      END;

    ELSE IF EVENT='S5' AND REASON='Reason 1' AND N=2 THEN H.REPLACE();

      ELSE IF EVENT='S8' AND REASON='Reason 1' AND N=3 THEN H.REPLACE();

         

    IF H.NUM_ITEMS=3 THEN DO;

         DO RC=HI.FIRST() BY 0 WHILE (RC=0);

OUTPUT;

RC=HI.NEXT();

         END;

h.clear();

       END;

       drop N RC;

RUN;

Haikuo

PGStats
Opal | Level 21

I think it can be done this way. Put the sequences and reasons that you are looking for in one dataset and the full sequences in another dataset. Transform the sequences into strings and use the INDEX function to check if the expected sequence occurs (is a sub-sequence) in the observed sequence.

data lookfor;
input Event $   Reason $&;
datalines;
S1    Reason 1
S5    Reason 1
S8    Reason 1
S2    Reason 2
S6    Reason 2
S9    Reason 2
;

data conditions;
length sequence $200;
eventCount = 0;
do until (last.reason);
set lookFor; by reason notsorted;
sequence = catx("-",sequence,event);
eventCount + 1;
end;
if eventCount > 1;
run;

data seq;
input Event $   Reason $&;
datalines;
S1    Reason 1
S5    Reason 1
S8    Reason 1
S2    Reason 2
S1    Reason 4
S5    Reason 1
S8    Reason 5
S0    Reason 0
S3    Reason 7
S7    Reason 1
S1    Reason 1
S5    Reason 1
S8    Reason 1
;

data sequences;
length sequence $200;
eventCount = 0;
do until (last.reason);
set seq; by reason notsorted;
sequence = catx("-",sequence,event);
eventCount + 1;
end;
if eventCount > 1;
run;

proc sql;
create table seqFreq as
select reason, sequence, eventCount, count(*) as n
from sequences
group by reason, sequence, eventCount;

create table condFreq as
select reason, sequence, eventCount, count(*) as n
from conditions
group by reason, sequence, eventCount;

create table condSeqFreq as
select a.reason, a.sequence, a.eventCount, sum(b.n) as nTot
from condFreq as a left join seqFreq as b
on a.reason=b.reason and index(b.sequence, trim(a.sequence)) > 0
group by a.reason, a.sequence, a.eventCount;
quit;

proc print noobs; run;

 

PG

PG
vomer
Obsidian | Level 7

Thanks everyone. There are some great suggestions here and I will take some time to explore them all.

Haikou & PGStats - your updated solutions are new concepts for me - so many thanks for sharing these. I will have to read/experiment into these further.

I will try to report back on my experience as I think it can help others like me in similar scenarios.

PGStats
Opal | Level 21

vomer wrote:

Thanks everyone. There are some great suggestions here and I will take some time to explore them all.

Haikou & PGStats - your updated solutions are new concepts for me - so many thanks for sharing these. I will have to read/experiment into these further.

I will try to report back on my experience as I think it can help others like me in similar scenarios.

That was more than a month ago.

PG

PG
Ksharp
Super User

OK.

data have;
input event$ reason $&9.;
cards;
S1    Reason 1
S5    Reason 1
S8    Reason 1
S2    Reason 2
S1    Reason 4
S5    Reason 1
S8    Reason 5
S0    Reason 0
S3    Reason 7
S7    Reason 1
S1    Reason 1
S5    Reason 1
S8    Reason 1
;
run;
data have;
 set have;
 n+1;
run;
data id(keep=n);
 set have;
 if event='S8' and lag(event)='S5' and lag2(event)='S1' and
    reason='Reason 1' and lag(reason)='Reason 1' and lag2(reason)='Reason 1'  then do;
     group+1;call symputx('group',group);
     output;
 end;
run;
%put NOTE: How many groups : &group ;
data id;
 set id;
 output;
 n=n-1;output;
 n=n-1;output;
run;

data want;
 if _n_ eq 1 then do;
  declare hash ha(dataset:'id',hashexp:20);
   ha.definekey('n');
   ha.definedone();
 end;
 set have;
if ha.check()=0;
run;

Ksharp

vomer
Obsidian | Level 7

anyone^ ??

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 15 replies
  • 4291 views
  • 6 likes
  • 8 in conversation