Hi all, I have a question on how the following can be achieved:
Sample Data:
Event | Reason |
S1 | Reason 1 |
S5 | Reason 1 |
S8 | Reason 1 |
S2 | Reason 2 |
S1 | Reason 4 |
S5 | Reason 1 |
S8 | Reason 5 |
S0 | Reason 0 |
S3 | Reason 7 |
S7 | Reason 1 |
S1 | Reason 1 |
S5 | Reason 1 |
S8 | Reason 1 |
Basically, I am trying to identify when a certain set of conditions are satisfied -> give me the count of this senario. So in the example above - I want to pull out the data where the sequence of "Event" goes from S1 to S5 to S8 and the Reason is Reason 1. Looking at the example above - I should get a total count of 2 and if I were to pull the data out into a new data set it would pull out the 2 sets where that sequence occurred and the reason was "Reason 1".
Thanks for your help!
Interesting question.
How big is your data?
I maybe seeing where Reeza is going. If your data is NOT that big, then try the following simple approach (3x pass):
data have;
input event$ reason $&9.;
cards;
S1 Reason 1
S5 Reason 1
S8 Reason 1
S2 Reason 2
S1 Reason 4
S5 Reason 1
S8 Reason 5
S0 Reason 0
S3 Reason 7
S7 Reason 1
S1 Reason 1
S5 Reason 1
S8 Reason 1
;
data want;
merge have
have(firstobs=2 keep=event reason rename=(event=_e2 reason=_r2))
have(firstobs=3 keep=event reason rename=(event=_e3 reason=_r3));
RETAIN N;
if event='S1'
AND _E2='S5'
AND _E3='S8'
AND REASON='Reason 1'
AND _R2 ='Reason 1'
AND _R3 ='Reason 1'
THEN N=3;
IF N>0 THEN OUTPUT;
N=N-1;
drop n _:;
RUN;
Haikuo
Mines not nearly as elegant
Could probably combine steps 1/2 but here's mine.
data have;
input Event $ Reason $&9. ;
cards;
S1 Reason 1
S5 Reason 1
S8 Reason 1
S2 Reason 2
S1 Reason 4
S5 Reason 1
S8 Reason 5
S0 Reason 0
S3 Reason 7
S7 Reason 1
S1 Reason 1
S5 Reason 1
S8 Reason 1
;
run;
data have2;
set have;
row=_n_;
run;
proc sql;
create table have3 as
select a.*, b.reason as reason2, b.event as event2, c.reason as reason3, c.event as event3
from have2 a
left join have2 b
on a.row=b.row+1
left join have2 c
on a.row=c.row+2;
quit;
proc sql;
create table want as
select * from have3
where row in
(select row from have3 a1
where a1.reason="Reason 1" and a1.Reason2="Reason 1" and a1.Reason3="Reason 1 "
and a1.event="S8" and a1.event2="S5" and a1.event3="S1")
or row in (
select row-1 as row from have3 a2
where a2.reason="Reason 1" and a2.Reason2="Reason 1" and a2.Reason3="Reason 1"
and a2.event="S8" and a2.event2="S5" and a2.event3="S1")
or row in (
select row-2 as row from have3 a3
where a3.reason="Reason 1" and a3.Reason2="Reason 1" and a3.Reason3="Reason 1"
and a3.event="S8" and a3.event2="S5" and a3.event3="S1")
;
quit;
Hi there, I have a question around Haikou's code:
data want;
merge have
have(firstobs=2 keep=event reason rename=(event=_e2 reason=_r2))
have(firstobs=3 keep=event reason rename=(event=_e3 reason=_r3));
RETAIN N;
if event='S1'
AND _E2='S5'
AND _E3='S8'
AND REASON='Reason 1'
AND _R2 ='Reason 1'
AND _R3 ='Reason 1'
THEN N=3;
IF N>0 THEN OUTPUT;
N=N-1;
drop n _:;
RUN;
I would like to know how I can add in an "ID" variable to consider in this code. For example - right now this code only considers the veriables "event" and "reason". What if I also have a customer ID variable that I would like to chunk this result by?
For example: If I only want Ccustomer #1's data that follows that pattern? Obviously I would not know every customer's number in a large data set but rather would like the query about to "sort" by a customer ID variable.
Hello Reeza, my dataset is quite large (close to a million rows of data). I am working with a smaller subset to see what solutions are possible in SAS as a testing ground.
Thanks Reeza and Haikuo - I will attempt both these solutions and report back on my findings.
Any other tips/solutions for large data are always welcome as I am still learning this scenario.
You could read it sequentially, but it gets a little code heavy and is not nearly as elegant as the solutions you already have. You set a (retained) flag for each criteria and, if something happens out of sequence, you reset them and start over. It looks ugly but is something like (using Hai.Kuo's "have"):
data want (keep = Scenario1);
set have end = eof;
retain Scenario1 got1 got5 got8 0;
select;
when (event = 'S1' and Reason = 'Reason 1')
do;
got1 = 1;
got5 = 0;
got8 = 0;
end;
when (event = 'S5' and Reason = 'Reason 1')
do;
if got1 = 1 then got5 = 1;
else do; got1 = 0; got5 = 0; got8 = 0; end;
end;
when (event = 'S8' and Reason = 'Reason 1')
do;
if got1 = 1 and got5 = 1 then got8 = 1;
else do; got1 = 0; got5 = 0; got8 = 0; end;
end;
otherwise
do;
got1 = 0; got5 = 0; got8 = 0;
end;
end;
if sum(got1, got5, got8) = 3 then do;
Scenario1 + 1;
got1 = 0; got5 = 0; got8 = 0;
end;
if eof then output;
run;
The advantage to this ugly thing is that you could potentially test for multiple scenarios with a single pass through the data, in which case you'd have multiple sets of "gotx" flags. (Above is not only ugly but also not fully tested. It's meant to be illustrative, not production quality.)
HTH
Karl
You may want to consider adding another variable as an ID variable. If you save this information in a separate data set, you will not have unique matches:
Reason = 1 and Event goes from S1 to S5
Reason = 1 and Event goes from S5 to S8
These conditions appear more than once in your data, so how will you know which set to retrieve?
You say the count should be 2 but there are 3 highlighted?!?
By using Hash(), We can have a viable one-pass data step solution, therefore, theoretically more efficient. I believe you can do the same using Array() (after all, Hash is a type of array), but it will be not as easy as using Hash(), if you have both Char and Number variables. I have also added two additional variables to be closer to your real scenario.
data have;
input event$ reason$&8. OTHER_1 OTHER_2;
cards;
S1 Reason 1 2 3
S5 Reason 1 3 4
S8 Reason 1 4 5
S2 Reason 2 5 6
S1 Reason 4 6 8
S5 Reason 1 4 4
S8 Reason 5 7 7
S0 Reason 0 8 8
S3 Reason 7 4 3
S7 Reason 1 3 2
S1 Reason 1 3 5
S5 Reason 1 4 6
S8 Reason 1 1 0
;
PROC SQL NOPRINT;
SELECT QUOTE(CATS(NAME)) INTO :NAME SEPARATED BY ',' FROM DICTIONARY.COLUMNS WHERE LIBNAME='WORK' AND MEMNAME='HAVE';
QUIT;
DATA WANT_HASH;
IF _N_=1 THEN DO;
IF 0 THEN SET HAVE;
DECLARE HASH H(ORDERed:'Y');
H.DEFINEKEY('EVENT');
H.DEFINEDATA(&NAME.);
H.DEFINEDONE();
DECLARE HITER HI('H');
END;
SET HAVE;
N+1;
IF EVENT='S1' AND REASON='Reason 1' THEN DO;
H.CLEAR();
N=1;
H.REPLACE();
END;
ELSE IF EVENT='S5' AND REASON='Reason 1' AND N=2 THEN H.REPLACE();
ELSE IF EVENT='S8' AND REASON='Reason 1' AND N=3 THEN H.REPLACE();
IF H.NUM_ITEMS=3 THEN DO;
DO RC=HI.FIRST() BY 0 WHILE (RC=0);
OUTPUT;
RC=HI.NEXT();
END;
h.clear();
END;
drop N RC;
RUN;
Haikuo
I think it can be done this way. Put the sequences and reasons that you are looking for in one dataset and the full sequences in another dataset. Transform the sequences into strings and use the INDEX function to check if the expected sequence occurs (is a sub-sequence) in the observed sequence.
data lookfor;
input Event $ Reason $&;
datalines;
S1 Reason 1
S5 Reason 1
S8 Reason 1
S2 Reason 2
S6 Reason 2
S9 Reason 2
;
data conditions;
length sequence $200;
eventCount = 0;
do until (last.reason);
set lookFor; by reason notsorted;
sequence = catx("-",sequence,event);
eventCount + 1;
end;
if eventCount > 1;
run;
data seq;
input Event $ Reason $&;
datalines;
S1 Reason 1
S5 Reason 1
S8 Reason 1
S2 Reason 2
S1 Reason 4
S5 Reason 1
S8 Reason 5
S0 Reason 0
S3 Reason 7
S7 Reason 1
S1 Reason 1
S5 Reason 1
S8 Reason 1
;
data sequences;
length sequence $200;
eventCount = 0;
do until (last.reason);
set seq; by reason notsorted;
sequence = catx("-",sequence,event);
eventCount + 1;
end;
if eventCount > 1;
run;
proc sql;
create table seqFreq as
select reason, sequence, eventCount, count(*) as n
from sequences
group by reason, sequence, eventCount;
create table condFreq as
select reason, sequence, eventCount, count(*) as n
from conditions
group by reason, sequence, eventCount;
create table condSeqFreq as
select a.reason, a.sequence, a.eventCount, sum(b.n) as nTot
from condFreq as a left join seqFreq as b
on a.reason=b.reason and index(b.sequence, trim(a.sequence)) > 0
group by a.reason, a.sequence, a.eventCount;
quit;
proc print noobs; run;
PG
Thanks everyone. There are some great suggestions here and I will take some time to explore them all.
Haikou & PGStats - your updated solutions are new concepts for me - so many thanks for sharing these. I will have to read/experiment into these further.
I will try to report back on my experience as I think it can help others like me in similar scenarios.
vomer wrote:
Thanks everyone. There are some great suggestions here and I will take some time to explore them all.
Haikou & PGStats - your updated solutions are new concepts for me - so many thanks for sharing these. I will have to read/experiment into these further.
I will try to report back on my experience as I think it can help others like me in similar scenarios.
That was more than a month ago.
PG
OK.
data have; input event$ reason $&9.; cards; S1 Reason 1 S5 Reason 1 S8 Reason 1 S2 Reason 2 S1 Reason 4 S5 Reason 1 S8 Reason 5 S0 Reason 0 S3 Reason 7 S7 Reason 1 S1 Reason 1 S5 Reason 1 S8 Reason 1 ; run; data have; set have; n+1; run; data id(keep=n); set have; if event='S8' and lag(event)='S5' and lag2(event)='S1' and reason='Reason 1' and lag(reason)='Reason 1' and lag2(reason)='Reason 1' then do; group+1;call symputx('group',group); output; end; run; %put NOTE: How many groups : &group ; data id; set id; output; n=n-1;output; n=n-1;output; run; data want; if _n_ eq 1 then do; declare hash ha(dataset:'id',hashexp:20); ha.definekey('n'); ha.definedone(); end; set have; if ha.check()=0; run;
Ksharp
anyone^ ??
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.