I really appreciate you taking your time, and creating a code! I am having trouble creating a dataset with the criteria and any ideas of how to do it would be greatly appreciate 🙂
Here is an approach to detect sequence of events patterns using string pattern matching.
data have;
input ID Category :&$24. Date :mmddyy10.;
format date yymmdd10.;
datalines;
001 initial interview 02/01/2016
001 service delivered 07/01/2016
001 re-interview 08/01/2016
001 service delivered 10/01/2016
001 re-interview 01/01/2017
001 service delivered 03/01/2017
001 re-interview 08/01/2017
002 initial interview 02/01/2016
002 service delivered 03/01/2016
002 re-interview 08/01/2016
002 service delivered 10/01/2016
002 re-interview 01/01/2017
002 service delivered 03/01/2017
002 re-interview 08/01/2017
003 initial interview 02/01/2016
003 re-interview 08/01/2016
003 service delivered 10/01/2016
003 re-interview 01/01/2017
003 service delivered 03/01/2017
003 re-interview 08/01/2017
004 initial interview 02/01/2016
004 service delivered 03/01/2016
004 re-interview 08/01/2016
004 service delivered 10/01/2016
004 re-interview 01/01/2017
004 service delivered 03/01/2017
004 re-interview 08/01/2017
004 service delivered 09/01/2017
004 re-interview 12/01/2017
;
data str;
/*
The daily sequence of events will be represented as a string with one
letter per day, as follows
i: initial interview,
r: re-interview,
s: service delivered
x: nothing
*/
/*
Prepare a pattern to match all the following conditions:
•The client must have an initial interview.
•The client must have at least one service (cleaning house,
delivering groceries, etc.) delivered within 6 months of the
initial interview.
•The client must have at least 3 re-interviews 5-7 months apart
from each other.
The number of days used are:
182 = 6 months - 1 day,
152 = 5 months - 1 day,
212 = 7 months - 1 day
*/
if not prxId1 then
prxId1 + prxParse("/ix{0,182}s[sx]*(r[sx]{152,212}r[sx]{152,212}r)/");
/*
Prepare a pattern to match the following extra condition:
•The client must have at least one service delivered between
the re-interviews.
*/
if not prxId2 then
prxId2 + prxParse("/r[sx]*s[sx]*r[sx]*s[sx]*r/");
/* Transform the sequence of events for a given ID into a string with
one letter per day */
length str $2000; /* Max sequence length: more than 5 years */
do until(last.id);
pDate = date;
set have; by id;
if not first.id then do;
pDate = intnx("day", pDate, 1);
do while (pDate < date);
str = cats(str, "x");
pDate = intnx("day", pDate, 1);
end;
end;
str = cats(str, first(Category));
end;
/* Check if the string matches the patterns */
start = 1;
stop = length(str);
call prxNext(prxId1, start, stop, str, pos, len);
if pos > 0 then
ok = prxMatch(prxId2, prxPosn(prxId1, 1, str));
/* Recover the events corresponding to the matched substring */
i = 0;
do until(last.id);
pDate = date;
set have; by id;
if not first.id then do;
pDate = intnx("day", pDate, 1);
do while (pDate < date);
i = i + 1;
pDate = intnx("day", pDate, 1);
end;
end;
i = i + 1;
if ok and i >= pos and i < pos + len then output;
end;
keep id category date;
run;
proc print noobs; var id date category; run;
Hello @PGStats,
Thanks a lot for the code! I will apply the suggestions tomorrow (due to SAS availability) and let you know.
Hello, here is a new optimized version meeting your latest criteria
data str;
/*
The daily sequence of events will be represented as a string with one
letter per day, as follows
i: initial interview,
r: re-interview,
s: service delivered
x: nothing
*/
/*
Prepare a pattern to match all the following conditions:
•The client must have an initial interview.
•The client must have three subsequent re-interviews
•The client must have at least one service (cleaning house,
delivering groceries, etc.) delivered between interviews.
*/
if not prxId1 then
prxId1 + prxParse("/i((x*)s[sx]*)r(x*[sx]*)r(x*[sx]*)r/");
/* Transform the sequence of events for a given clientID into a string with
one letter per day */
length str $2000; /* Max sequence length: more than 5 years */
do until(last.clientId);
pDate = date;
set have; by clientId;
if not first.clientId then do;
days = intck("day", pDate, date);
if days >= 2 then str = cats(str, repeat("x", days-2));
end;
str = cats(str, first(Category));
end;
/* Check if the string matches the pattern */
start = 1;
stop = length(str);
call prxNext(prxId1, start, stop, str, pos, len);
/* Check the inter-event durations (nb of chars = nb of days) */
if pos > 0 then ok =
/* first service after initial interview within 6 months */
length(prxPosn(prxId1,2,str)) <= 182 and
/* inter-interview times between 5 and 7 months */
152 <= length(prxPosn(prxId1,1,str)) <= 212 and
152 <= length(prxPosn(prxId1,3,str)) <= 212 and
152 <= length(prxPosn(prxId1,4,str)) <= 212;
/* Recover the events corresponding to the matched substring */
i = 0;
do until(last.clientId);
pDate = date;
set have; by clientId;
if not first.clientId then do;
days = intck("day", pDate, date);
if days >= 2 then i = i + days - 1;
end;
i = i + 1;
if ok and i >= pos and i < pos + len then output;
end;
keep clientId category date;
run;
proc print data=str noobs; var clientId date category; run;
To summarize, the pattern is used to check the sequence of events and the lengths of the matched buffers (expressions in parentheses) are used to validate the inter event durations.
Hello @PGStats,
I used your program, and I got o observations. My dataset has about a million observations.
Please find a log attached.
Thanks!
The problem is most likely with the input data, namely the Category values. The first letter of the Category is used to build the string in
str = cats(str, first(Category));
and the expected letters are 'i', 'r', and 's'. Depending on the Category values in your data you may need a more robust letter assignment such as
length c $1;
c = lowcase(first(left(Category)));
if c in ("i", "r", "s") then str = cats(str, c);
else str = cats(str, "x");
Hello @PGStats
I added the code provided above and I got the error below:
Thanks!
2128 data str;
2129 /* The daily sequence of events will be represented as a string with one
2130 letter per day, as follows
2131
2132 i: Initial Interview,
2133 r: Re-Interview,
2134 s: Services Delivered
2135 x: Nothing
2136 */
2137
2138 /*Prepare a pattern to match all the following conditions:
2139
2140 •The client must have an Initial Interview.
2141 •The client must have three subsequent Re-Interviews.
2142 •The client must have at least one service delivered between the Assessments (within 6
2142! months from the Initial Interview)
2143 •The Re-Interviews should be 5-7 months from the previous one
2144 */
2145
2146 if not prxId1 then
2147 prxId1 + prxParse("/i((x*)s[sx]*)r(x*[sx]*)r(x*[sx]*)r/");
2148
2149 /* Transform the sequence of events for a given clientid into a string with
2150 one letter per day */
2151 length str $30000; /* Max sequence length: more than 5 years */
2152 do until(last.clientid);
2153 pdate = date;
2154 set have; by clientid;
ERROR: Variable str has been defined as both character and numeric.
2155 if not first.clientid then do;
2156 days = intck("day", pdate, date);
2157 if days >= 2 then str = cats(str, repeat("x", days-2));
2158 end;
2159 str = cats(str, first(form_final));
2160 end;
2161
2162 length c $1;*/
2163 c = lowcase(first(left(form_final)));
2164 if c in ("i", "r", "s") then str = cats(str, c);
2165 else str = cats(str, "x");
2166
2167 /* Check if the string matches the pattern */
2168 start = 1;
2169 stop = length(str);
2170 call prxNext(prxId1, start, stop, str, pos, len);
2171
2172 /* Check the inter-event durations (nb of chars = nb of days) */
2173 if pos > 0 then ok =
2174 /* first service after Initial Interview within 6 months */
2175 length(prxPosn(prxId1,2,str)) <= 182 and
2176 /* inter-Re-Interview times between 5 and 7 months */
2177 152 <= length(prxPosn(prxId1,1,str)) <= 212 and
2178 152 <= length(prxPosn(prxId1,3,str)) <= 212 and
2179 152 <= length(prxPosn(prxId1,4,str)) <= 212;
2180
2181 /* Recover the events corresponding to the matched substring */
2182 i = 0;
2183 do until(last.clientid);
2184 pdate = date;
2185 set have; by clientid;
ERROR: Variable str has been defined as both character and numeric.
2186 if not first.clientid then do;
2187 days = intck("day", pdate, date);
2188 if days >= 2 then i = i + days - 1;
2189 end;
2190 i = i + 1;
2191 if ok and i >= pos and i < pos + len then output;
2192 end;
2193
2194 keep clientid category date;
2195 run;
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK.STR may be incomplete. When this step was stopped there were 0
observations and 3 variables.
WARNING: Data set WORK.STR was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.03 seconds
You obviously didn't understand the code I provided.
What is the full set of form_final values?
proc sql; select unique form_final from have; quit;
Hi @PGStats
"form_final" is the same as "category". I replaced my variables' names with the variables' names in the original example so it could match, but I forgot to replace form_final. In other words, the log I provided is what I got with my variables, then I replaced the names (except for form_final).
Thanks
I'll assume that none of the proposed solutions so far worked for your real data is because your category values are not really 'initial interview', 're-interview' and 'service delivered'. Everybody is looking for these values are noone finds them. Adjust the following code to what your actual category values are, and it should work, just as it does for example data.
/* Define single letter codes for uppercased Category values.
Change the values on the left of equal signs to suit your needs. */
proc format;
value $ Category
"INITIAL INTERVIEW" = "i"
"RE-INTERVIEW" = "r"
"SERVICE DELIVERED" = "s"
OTHER = "x";
run;
data str;
/*
The daily sequence of events will be represented as a string with one
letter per day, as follows
i: initial interview,
r: re-interview,
s: service delivered
x: nothing
*/
/*
Prepare a pattern to match all the following conditions:
•The client must have an initial interview.
•The client must have three subsequent re-interviews
•The client must have at least one service (cleaning house,
delivering groceries, etc.) delivered between interviews.
*/
if not prxId1 then
prxId1 + prxParse("/i((x*)s[sx]*)r(x*[sx]*)r(x*[sx]*)r/");
/* Transform the sequence of events for a given clientID into a string with
one letter per day */
length str $2000; /* Max sequence length: more than 5 years */
do until(last.clientId);
pDate = date;
set have; by clientId;
if not first.clientId then do;
days = intck("day", pDate, date);
if days >= 2 then str = cats(str, repeat("x", days-2));
end;
str = cats(str, put(upcase(Category), $Category.));
end;
/* Check if the string matches the pattern */
start = 1;
stop = length(str);
call prxNext(prxId1, start, stop, str, pos, len);
/* Check the inter-event durations (nb of chars = nb of days) */
if pos > 0 then do;
call prxposn(prxId1, 1, dum, len1);
call prxposn(prxId1, 2, dum, len2);
call prxposn(prxId1, 3, dum, len3);
call prxposn(prxId1, 4, dum, len4);
ok =
/* first service after initial interview within 6 months */
len2 <= 182 and
/* inter-interview times between 5 and 7 months */
152 <= len1 <= 212 and
152 <= len3 <= 212 and
152 <= len4 <= 212;
end;
/* Recover the events corresponding to the matched substring */
i = 0;
do until(last.clientId);
pDate = date;
set have; by clientId;
if not first.clientId then do;
days = intck("day", pDate, date);
if days >= 2 then i = i + days - 1;
end;
i = i + 1;
if ok and i >= pos and i < pos + len then output;
end;
keep clientId category date;
run;
Hi @PGStats,
Thanks for the input. I fix it using your recommendation of the c = lowcase(first(left(Category) and your code worked with my data, I was just double checking the results because I got very few ClientIDs and was expecting to have more. I will also try using the updated suggestion and see if I get the same results. I accept your code as a solution, thanks a lot!
I also tried @art297 and it worked too, the only thing is that I do not know how to get the initial interview in the output (the observation with the initial interview is very important in determining other factors in the dataset), the code is only giving me re-interviews and services (my data is very complicated).
Thank you both for your great input and helping me tremendously to find a way!!!
I don't care which solution you accept, but I had sent you the code for including whichever records you want in the output. I'll repeat it here:
data want; merge client_engage_2 (where=(lowcase(first(left(category))) in ('i','r'))) client_engage_3 (in=in3); by ClientId; if in3; run;
Art, CEO, AnalystFinder.com
Thank you
data have; infile datalines dlm=','; informat ClientID 8.; informat category $20.; informat date mmddyy10.; input ClientID Category Date; format date mmddyy10.; datalines; 1,initial interview,2/1/2016 1,service delivered,8/25/2016 1,re-interview,8/26/2016 1,service delivered,10/1/2016 1,re-interview,1/1/2017 1,service delivered,3/1/2017 1,re-interview,8/1/2017 2,initial interview,2/1/2016 2,service delivered,3/1/2016 2,re-interview,8/1/2016 2,service delivered,10/1/2016 2,re-interview,1/1/2017 2,service delivered,3/1/2017 2,re-interview,8/1/2017 3,initial interview,2/1/2016 3,re-interview,8/1/2016 3,service delivered,10/1/2016 3,re-interview,1/1/2017 3,service delivered,3/1/2017 3,re-interview,8/1/2017 4,initial interview,2/1/2016 4,service delivered,3/1/2016 4,re-interview,8/1/2016 4,service delivered,10/1/2016 4,re-interview,1/1/2017 4,service delivered,3/1/2017 4,re-interview,8/1/2017 4,service delivered,9/1/2017 4,re-interview,12/1/2017 5,initial interview,2/1/2016 5,service delivered,3/1/2016 5,re-interview,8/1/2016 5,service delivered,10/1/2016 5,re-interview,1/1/2017 5,service delivered,3/1/2017 5,re-interview,8/1/2017 6,initial interview,2/1/2016 6,service delivered,8/2/2016 6,re-interview,8/3/2016 6,service delivered,10/1/2016 6,re-interview,1/1/2017 6,service delivered,3/1/2017 6,re-interview,8/1/2017 7,initial interview,2/29/2016 7,service delivered,8/29/2016 7,re-interview,9/3/2016 7,service delivered,10/1/2016 7,re-interview,2/3/2017 7,service delivered,3/1/2017 7,re-interview,9/1/2017 8,initial interview,2/29/2016 8,service delivered,8/29/2016 8,re-interview,9/3/2016 8,re-interview,10/3/2016 8,re-interview,11/3/2016 8,re-interview,12/3/2016 8,service delivered,12/4/2016 8,re-interview,6/3/2017 8,service delivered,6/5/2017 8,re-interview,12/2/2017 ; data temp; set have; by ClientID; if first.ClientID then n=0; n+1; run; data temp1; set temp; by ClientID Category notsorted; if first.Category; run; data temp2; set temp1; by ClientID; retain yes 1; if first.ClientID then yes=1; lag_Date=lag(Date); lag2_Date=lag2(Date); if n=1 then do; if Category ne 'initial interview' then yes=0; end; else if n=2 then do; if Category ne 'service delivered' then yes=0; else if intck('month',lag_date,date,'c')>6 then yes=0;; end; else if mod(n,2)=1 then do; if Category ne 're-interview' then yes=0; else do; if n=3 then do; if yes then output; end; else do; if 5<=intck('month',lag2_date,date,'c')<=7 then do; if yes then output; end; else yes=0; end; end; end; else if mod(n,2)=0 then do; if Category ne 'service delivered' then yes=0; end; run; data temp3; set temp2; by ClientID; if last.ClientID; keep ClientID n; run; data temp4; merge temp temp3(rename=(n=_n)); by ClientID; if n<=_n; drop n _n; run; proc sql; create table want as select * from temp4 group by ClientID having sum(Category='re-interview')>2 order by ClientID,date; quit;
@Ksharp: I see, in your subsequent post, that you modified the code so that ClientID 4 would be accepted. ClientID 4 meets all of the conditions that @Asileon has mentioned thus far. A more complicated one is shown in ClientID 9, below.
Nothing in @Asileon's criteria would eliminate ClientID 9, but both your code and the one proposed by @PGStats fail to select it. If it shouldn't be selected then it's for a criterion that hasn't yet been specified.
Art, CEO, AnalystFinder.com
data have; infile datalines dlm=','; informat ClientID 8.; informat category $20.; informat date mmddyy10.; input ClientID Category Date; format date mmddyy10.; datalines; 1,initial interview,2/1/2016 1,service delivered,8/25/2016 1,re-interview,8/26/2016 1,service delivered,10/1/2016 1,re-interview,1/1/2017 1,service delivered,3/1/2017 1,re-interview,8/1/2017 2,initial interview,2/1/2016 2,service delivered,3/1/2016 2,re-interview,8/1/2016 2,service delivered,10/1/2016 2,re-interview,1/1/2017 2,service delivered,3/1/2017 2,re-interview,8/1/2017 3,initial interview,2/1/2016 3,re-interview,8/1/2016 3,service delivered,10/1/2016 3,re-interview,1/1/2017 3,service delivered,3/1/2017 3,re-interview,8/1/2017 4,initial interview,2/1/2016 4,service delivered,3/1/2016 4,re-interview,8/1/2016 4,service delivered,10/1/2016 4,re-interview,1/1/2017 4,service delivered,3/1/2017 4,re-interview,8/1/2017 4,service delivered,9/1/2017 4,re-interview,12/1/2017 5,initial interview,2/1/2016 5,service delivered,3/1/2016 5,re-interview,8/1/2016 5,service delivered,10/1/2016 5,re-interview,1/1/2017 5,service delivered,3/1/2017 5,re-interview,8/1/2017 6,initial interview,2/1/2016 6,service delivered,8/2/2016 6,re-interview,8/3/2016 6,service delivered,10/1/2016 6,re-interview,1/1/2017 6,service delivered,3/1/2017 6,re-interview,8/1/2017 7,initial interview,2/29/2016 7,service delivered,8/29/2016 7,re-interview,9/3/2016 7,service delivered,10/1/2016 7,re-interview,2/3/2017 7,service delivered,3/1/2017 7,re-interview,9/1/2017 8,initial interview,2/29/2016 8,service delivered,8/29/2016 8,re-interview,9/3/2016 8,re-interview,10/3/2016 8,re-interview,11/3/2016 8,re-interview,12/3/2016 8,service delivered,12/4/2016 8,re-interview,6/3/2017 8,service delivered,6/5/2017 8,re-interview,12/2/2017 9,initial interview,2/29/2016 9,service delivered,8/29/2016 9,re-interview,10/3/2016 9,service delivered,10/4/2016 9,re-interview,12/3/2016 9,service delivered,12/4/2016 9,re-interview,6/3/2017 9,service delivered,6/5/2017 9,re-interview,12/2/2017 9,service delivered,12/4/2017 9,re-interview,6/2/2018 9,service delivered,6/3/2018 9,re-interview,6/4/2018 ;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.