SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

Accepted Solution

Counting the number of observations between rows (same ID) and output the ID that meet the criteria?

Reply
Contributor
Posts: 32

Re: Counting the number of observations between rows (same ID) and output the ID that meet the crite

Posted in reply to novinosrin

I really appreciate you taking your time, and creating a code! I am having trouble creating a dataset with the criteria and any ideas of how to do it would be greatly appreciate :-)

Esteemed Advisor
Posts: 5,129

Re: Counting the number of observations between rows (same ID) and output the ID that meet the crite

Here is an approach to detect sequence of events patterns using string pattern matching.

 

data have;
input ID Category :&$24. Date :mmddyy10.;
format date yymmdd10.;
datalines;
001   initial interview   02/01/2016
001   service delivered   07/01/2016
001   re-interview   08/01/2016
001   service delivered   10/01/2016
001   re-interview   01/01/2017
001   service delivered   03/01/2017
001   re-interview   08/01/2017
002   initial interview   02/01/2016
002   service delivered   03/01/2016
002   re-interview   08/01/2016
002   service delivered   10/01/2016
002   re-interview   01/01/2017
002   service delivered   03/01/2017
002   re-interview   08/01/2017
003   initial interview   02/01/2016
003   re-interview   08/01/2016
003   service delivered   10/01/2016
003   re-interview   01/01/2017
003   service delivered   03/01/2017
003   re-interview   08/01/2017
004   initial interview   02/01/2016
004   service delivered   03/01/2016
004   re-interview   08/01/2016
004   service delivered   10/01/2016
004   re-interview   01/01/2017
004   service delivered   03/01/2017
004   re-interview   08/01/2017
004   service delivered   09/01/2017
004   re-interview   12/01/2017
;

data str;
/*
The daily sequence of events will be represented as a string with one 
letter per day, as follows

    i: initial interview, 
    r: re-interview, 
    s: service delivered
    x: nothing 
*/
/*
Prepare a pattern to match all the following conditions:

•The client must have an initial interview.
•The client must have at least one service (cleaning house, 
 delivering groceries, etc.) delivered within 6 months of the 
 initial interview.
•The client must have at least 3 re-interviews 5-7 months apart 
 from each other.

The number of days used are:
    182 = 6 months - 1 day, 
    152 = 5 months - 1 day, 
    212 = 7 months - 1 day
*/

if not prxId1 then 
    prxId1 + prxParse("/ix{0,182}s[sx]*(r[sx]{152,212}r[sx]{152,212}r)/");

/*
Prepare a pattern to match the following extra condition:

•The client must have at least one service delivered between 
 the re-interviews.
*/

if not prxId2 then 
    prxId2 + prxParse("/r[sx]*s[sx]*r[sx]*s[sx]*r/");

/* Transform the sequence of events for a given ID into a string with 
    one letter per day */
length str $2000;  /* Max sequence length: more than 5 years */
do until(last.id);
    pDate = date;
    set have; by id;
    if not first.id then do;
        pDate = intnx("day", pDate, 1);
        do while (pDate < date);
            str = cats(str, "x");
            pDate = intnx("day", pDate, 1);
            end;
        end;
    str = cats(str, first(Category));
    end;

/* Check if the string matches the patterns */
start = 1;
stop = length(str);
call prxNext(prxId1, start, stop, str, pos, len);
if pos > 0 then 
    ok = prxMatch(prxId2, prxPosn(prxId1, 1, str));

/* Recover the events corresponding to the matched substring */
i = 0;
do until(last.id);
    pDate = date;
    set have; by id;
    if not first.id then do;
        pDate = intnx("day", pDate, 1);
        do while (pDate < date);
            i = i + 1;
            pDate = intnx("day", pDate, 1);
            end;
        end;
    i = i + 1;
    if ok and i >= pos and i < pos + len then output;
    end;

keep id category date;
run;


proc print noobs; var id date category; run;
PG
Contributor
Posts: 32

Re: Counting the number of observations between rows (same ID) and output the ID that meet the crite

Hello @PGStats,

 

Thanks a lot for the code! I will apply the suggestions tomorrow (due to SAS availability) and let you know.

Highlighted
Esteemed Advisor
Posts: 5,129

Re: Counting the number of observations between rows (same ID) and output the ID that meet the crite

Hello, here is a new optimized version meeting your latest criteria

 


data str;
/*
The daily sequence of events will be represented as a string with one 
letter per day, as follows

    i: initial interview, 
    r: re-interview, 
    s: service delivered
    x: nothing 
*/
/*
Prepare a pattern to match all the following conditions:

•The client must have an initial interview.
•The client must have three subsequent re-interviews
•The client must have at least one service (cleaning house, 
 delivering groceries, etc.) delivered between interviews.
*/

if not prxId1 then 
    prxId1 + prxParse("/i((x*)s[sx]*)r(x*[sx]*)r(x*[sx]*)r/");

/* Transform the sequence of events for a given clientID into a string with 
    one letter per day */
length str $2000;  /* Max sequence length: more than 5 years */
do until(last.clientId);
    pDate = date;
    set have; by clientId;
    if not first.clientId then do;
        days = intck("day", pDate, date);
        if days >= 2 then str = cats(str, repeat("x", days-2));
        end;
    str = cats(str, first(Category));
    end;

/* Check if the string matches the pattern */
start = 1;
stop = length(str);
call prxNext(prxId1, start, stop, str, pos, len);

/* Check the inter-event durations (nb of chars = nb of days) */
if pos > 0 then ok =
    /* first service after initial interview within 6 months */
    length(prxPosn(prxId1,2,str)) <= 182 and
    /* inter-interview times between 5 and 7 months */ 
    152 <= length(prxPosn(prxId1,1,str)) <= 212 and
    152 <= length(prxPosn(prxId1,3,str)) <= 212 and
    152 <= length(prxPosn(prxId1,4,str)) <= 212;

/* Recover the events corresponding to the matched substring */
i = 0;
do until(last.clientId);
    pDate = date;
    set have; by clientId;
    if not first.clientId then do;
        days = intck("day", pDate, date);
        if days >= 2 then i = i + days - 1;
        end;
    i = i + 1;
    if ok and i >= pos and i < pos + len then output;
    end;

keep clientId category date;
run;


proc print data=str noobs; var clientId date category; run;

To summarize, the pattern is used to check the sequence of events and the lengths of the matched buffers (expressions in parentheses) are used to validate the inter event durations.

 

PG
Contributor
Posts: 32

Re: Counting the number of observations between rows (same ID) and output the ID that meet the crite

Hello @PGStats,

 

I used your program, and I got o observations. My dataset has about a million observations.

Please find a log attached.

 

Thanks!

Esteemed Advisor
Posts: 5,129

Re: Counting the number of observations between rows (same ID) and output the ID that meet the crite

The problem is most likely with the input data, namely the Category values. The first letter of the Category is used to build the string in

 

str = cats(str, first(Category));

 

and the expected letters are 'i', 'r', and 's'. Depending on the Category values in your data you may need a more robust letter assignment such as

 

length c $1;

c = lowcase(first(left(Category)));

if c in ("i", "r", "s") then str = cats(str, c);

else str = cats(str, "x");

PG
Contributor
Posts: 32

Re: Counting the number of observations between rows (same ID) and output the ID that meet the crite

Hello @PGStats

 

I added the code provided above and I got the error below:

Thanks!

 

 

2128 data str;

2129 /* The daily sequence of events will be represented as a string with one

2130     letter per day, as follows

2131

2132     i: Initial Interview,

2133     r: Re-Interview,

2134     s: Services Delivered

2135     x: Nothing

2136 */

2137

2138 /*Prepare a pattern to match all the following conditions:

2139

2140 •The client must have an Initial Interview.

2141 •The client must have three subsequent Re-Interviews.

2142 •The client must have at least one service delivered between the Assessments (within 6

2142! months from the Initial Interview)

2143 •The Re-Interviews should be 5-7 months from the previous one

2144 */

2145

2146 if not prxId1 then

2147     prxId1 + prxParse("/i((x*)s[sx]*)r(x*[sx]*)r(x*[sx]*)r/");

2148

2149 /* Transform the sequence of events for a given clientid into a string with

2150     one letter per day */

2151 length str $30000;  /* Max sequence length: more than 5 years */

2152 do until(last.clientid);

2153     pdate = date;

2154     set have; by clientid;

ERROR: Variable str has been defined as both character and numeric.

2155     if not first.clientid then do;

2156         days = intck("day", pdate, date);

2157         if days >= 2 then str = cats(str, repeat("x", days-2));

2158         end;

2159     str = cats(str, first(form_final));

2160     end;

2161

2162  length c $1;*/

2163 c = lowcase(first(left(form_final)));

2164 if c in ("i", "r", "s") then str = cats(str, c);

2165 else str = cats(str, "x");

2166

2167 /* Check if the string matches the pattern */

2168 start = 1;

2169 stop = length(str);

2170 call prxNext(prxId1, start, stop, str, pos, len);

2171

2172 /* Check the inter-event durations (nb of chars = nb of days) */

2173 if pos > 0 then ok =

2174     /* first service after Initial Interview within 6 months */

2175     length(prxPosn(prxId1,2,str)) <= 182 and

2176     /* inter-Re-Interview times between 5 and 7 months */

2177     152 <= length(prxPosn(prxId1,1,str)) <= 212 and

2178     152 <= length(prxPosn(prxId1,3,str)) <= 212 and

2179     152 <= length(prxPosn(prxId1,4,str)) <= 212;

2180

2181 /* Recover the events corresponding to the matched substring */

2182 i = 0;

2183 do until(last.clientid);

2184     pdate = date;

2185     set have; by clientid;

ERROR: Variable str has been defined as both character and numeric.

2186     if not first.clientid then do;

2187         days = intck("day", pdate, date);

2188         if days >= 2 then i = i + days - 1;

2189         end;

2190     i = i + 1;

2191     if ok and i >= pos and i < pos + len then output;

2192     end;

2193

2194 keep clientid category date;

2195 run;

 

NOTE: The SAS System stopped processing this step because of errors.

WARNING: The data set WORK.STR may be incomplete. When this step was stopped there were 0

         observations and 3 variables.

WARNING: Data set WORK.STR was not replaced because this step was stopped.

NOTE: DATA statement used (Total process time):

      real time           0.01 seconds

      cpu time            0.03 seconds

Esteemed Advisor
Posts: 5,129

Re: Counting the number of observations between rows (same ID) and output the ID that meet the crite

You obviously didn't understand the code I provided.

 

What is the full set of form_final values?

 

proc sql; select unique form_final from have; quit;

 

PG
Contributor
Posts: 32

Re: Counting the number of observations between rows (same ID) and output the ID that meet the crite

Hi @PGStats

 

"form_final" is the same as "category". I replaced my variables' names with the variables' names in the original example so it could match, but I forgot to replace form_final. In other words, the log I provided is what I got with my variables, then I replaced the names (except for form_final).

 

Thanks

Esteemed Advisor
Posts: 5,129

Re: Counting the number of observations between rows (same ID) and output the ID that meet the crite

I'll assume that none of the proposed solutions so far worked for your real data is because your category values are not really 'initial interview', 're-interview' and 'service delivered'. Everybody is looking for these values are noone finds them. Adjust the following code to what your actual category values are, and it should work, just as it does for example data.

 


/* Define single letter codes for uppercased Category values. 
Change the values on the left of equal signs to suit your needs. */
proc format;
value $ Category
"INITIAL INTERVIEW" = "i"
"RE-INTERVIEW"      = "r"
"SERVICE DELIVERED" = "s"
OTHER               = "x";
run;

data str;
/*
The daily sequence of events will be represented as a string with one 
letter per day, as follows

    i: initial interview, 
    r: re-interview, 
    s: service delivered
    x: nothing 
*/
/*
Prepare a pattern to match all the following conditions:

•The client must have an initial interview.
•The client must have three subsequent re-interviews
•The client must have at least one service (cleaning house, 
 delivering groceries, etc.) delivered between interviews.
*/

if not prxId1 then 
    prxId1 + prxParse("/i((x*)s[sx]*)r(x*[sx]*)r(x*[sx]*)r/");

/* Transform the sequence of events for a given clientID into a string with 
    one letter per day */
length str $2000;  /* Max sequence length: more than 5 years */
do until(last.clientId);
    pDate = date;
    set have; by clientId;
    if not first.clientId then do;
        days = intck("day", pDate, date);
        if days >= 2 then str = cats(str, repeat("x", days-2));
        end;
    str = cats(str, put(upcase(Category), $Category.));
    end;

/* Check if the string matches the pattern */
start = 1;
stop = length(str);
call prxNext(prxId1, start, stop, str, pos, len);

/* Check the inter-event durations (nb of chars = nb of days) */
if pos > 0 then do;
    call prxposn(prxId1, 1, dum, len1);
    call prxposn(prxId1, 2, dum, len2);
    call prxposn(prxId1, 3, dum, len3);
    call prxposn(prxId1, 4, dum, len4);
    ok =
        /* first service after initial interview within 6 months */
        len2 <= 182 and
        /* inter-interview times between 5 and 7 months */ 
        152 <= len1 <= 212 and
        152 <= len3 <= 212 and
        152 <= len4 <= 212;
    end;

/* Recover the events corresponding to the matched substring */
i = 0;
do until(last.clientId);
    pDate = date;
    set have; by clientId;
    if not first.clientId then do;
        days = intck("day", pDate, date);
        if days >= 2 then i = i + days - 1;
        end;
    i = i + 1;
    if ok and i >= pos and i < pos + len then output;
    end;

keep clientId category date;
run;
PG
Contributor
Posts: 32

Re: Counting the number of observations between rows (same ID) and output the ID that meet the crite

Hi @PGStats,

 

Thanks for the input. I fix it using your recommendation of the c = lowcase(first(left(Category) and your code worked with my data, I was just double checking the results because I got very few ClientIDs and was expecting to have more. I will also try using the updated suggestion and see if I get the same results. I accept your code as a solution, thanks a lot!

 

I also tried @art297 and it worked too, the only thing is that I do not know how to get the initial interview in the output (the observation with the initial interview is very important in determining other factors in the dataset), the code is only giving me re-interviews and services (my data is very complicated).

 

Thank you both for your great input and helping me tremendously to find a way!!!

PROC Star
Posts: 7,804

Re: Counting the number of observations between rows (same ID) and output the ID that meet the crite

[ Edited ]

I don't care which solution you accept, but I had sent you the code for including whichever records you want in the output. I'll repeat it here:

data want;
  merge client_engage_2 (where=(lowcase(first(left(category))) in ('i','r')))
        client_engage_3 (in=in3);
  by ClientId;
  if in3;
run;

Art, CEO, AnalystFinder.com

 

Contributor
Posts: 32

Re: Counting the number of observations between rows (same ID) and output the ID that meet the crite

Thank you

Super User
Posts: 10,323

Re: Counting the number of observations between rows (same ID) and output the ID that meet the crite

I cann't agree Arthur.T any more. Anybody could make a mistake. That is the reason why we are here to polish up and make us better and better and better ......... BTW, Arthur.T , Does 4 re-interview should be 08/01/2017 not 12/01/2017 on account of 08/01/2017 -> 12/01/2017 is only 4 month not 5-7 as OP required ?


data have;
  infile datalines dlm=',';
  informat ClientID 8.;
  informat category $20.;
  informat date mmddyy10.;
  input ClientID Category Date;
  format date mmddyy10.;
  datalines;
1,initial interview,2/1/2016
1,service delivered,8/25/2016
1,re-interview,8/26/2016
1,service delivered,10/1/2016
1,re-interview,1/1/2017
1,service delivered,3/1/2017
1,re-interview,8/1/2017
2,initial interview,2/1/2016
2,service delivered,3/1/2016
2,re-interview,8/1/2016
2,service delivered,10/1/2016
2,re-interview,1/1/2017
2,service delivered,3/1/2017
2,re-interview,8/1/2017
3,initial interview,2/1/2016
3,re-interview,8/1/2016
3,service delivered,10/1/2016
3,re-interview,1/1/2017
3,service delivered,3/1/2017
3,re-interview,8/1/2017
4,initial interview,2/1/2016
4,service delivered,3/1/2016
4,re-interview,8/1/2016
4,service delivered,10/1/2016
4,re-interview,1/1/2017
4,service delivered,3/1/2017
4,re-interview,8/1/2017
4,service delivered,9/1/2017
4,re-interview,12/1/2017
5,initial interview,2/1/2016
5,service delivered,3/1/2016
5,re-interview,8/1/2016
5,service delivered,10/1/2016
5,re-interview,1/1/2017
5,service delivered,3/1/2017
5,re-interview,8/1/2017
6,initial interview,2/1/2016
6,service delivered,8/2/2016
6,re-interview,8/3/2016
6,service delivered,10/1/2016
6,re-interview,1/1/2017
6,service delivered,3/1/2017
6,re-interview,8/1/2017
7,initial interview,2/29/2016
7,service delivered,8/29/2016
7,re-interview,9/3/2016
7,service delivered,10/1/2016
7,re-interview,2/3/2017
7,service delivered,3/1/2017
7,re-interview,9/1/2017
8,initial interview,2/29/2016
8,service delivered,8/29/2016
8,re-interview,9/3/2016
8,re-interview,10/3/2016
8,re-interview,11/3/2016
8,re-interview,12/3/2016
8,service delivered,12/4/2016
8,re-interview,6/3/2017
8,service delivered,6/5/2017
8,re-interview,12/2/2017
;


data temp;
 set have;
 by ClientID;
 if first.ClientID then n=0;
 n+1;
run;

data temp1;
 set temp;
 by  ClientID  Category notsorted;
 if first.Category;
run;

data temp2;
 set temp1;
 by ClientID;
 retain yes 1;
 if first.ClientID then yes=1;

 lag_Date=lag(Date);
 lag2_Date=lag2(Date);

 if n=1 then do;
  if Category ne 'initial interview' then yes=0;
 end;
 else if n=2 then do;
  if Category ne 'service delivered' then yes=0;
   else if intck('month',lag_date,date,'c')>6 then yes=0;;
 end;
 else if mod(n,2)=1 then do;
    if Category ne 're-interview' then yes=0;
	 else do;
       if n=3 then do; if yes then output; end;
	     else do; 
                if 	5<=intck('month',lag2_date,date,'c')<=7 then do; if yes then output; end;
				 else yes=0;
              end;
	 end;
 end;
 else if mod(n,2)=0 then do;
    if Category ne 'service delivered' then yes=0;
end;
 run;

data temp3;
 set temp2;
 by ClientID;
 if last.ClientID;
 keep ClientID n;
run;

data temp4;
 merge temp temp3(rename=(n=_n));
 by ClientID;
 if n<=_n;
 drop n _n;
run;

proc sql;
create table want as
 select *
  from temp4 
   group by ClientID
    having sum(Category='re-interview')>2
     order by ClientID,date;
quit;

PROC Star
Posts: 7,804

Re: Counting the number of observations between rows (same ID) and output the ID that meet the crite

@Ksharp: I see, in your subsequent post, that you modified the code so that ClientID 4 would be accepted. ClientID 4 meets all of the conditions that @Asileon has mentioned thus far. A more complicated one is shown in ClientID 9, below.

 

Nothing in @Asileon's criteria would eliminate ClientID 9, but both your code and the one proposed by @PGStats fail to select it. If it shouldn't be selected then it's for a criterion that hasn't yet been specified.

 

Art, CEO, AnalystFinder.com

data have;
  infile datalines dlm=',';
  informat ClientID 8.;
  informat category $20.;
  informat date mmddyy10.;
  input ClientID Category Date;
  format date mmddyy10.;
  datalines;
1,initial interview,2/1/2016
1,service delivered,8/25/2016
1,re-interview,8/26/2016
1,service delivered,10/1/2016
1,re-interview,1/1/2017
1,service delivered,3/1/2017
1,re-interview,8/1/2017
2,initial interview,2/1/2016
2,service delivered,3/1/2016
2,re-interview,8/1/2016
2,service delivered,10/1/2016
2,re-interview,1/1/2017
2,service delivered,3/1/2017
2,re-interview,8/1/2017
3,initial interview,2/1/2016
3,re-interview,8/1/2016
3,service delivered,10/1/2016
3,re-interview,1/1/2017
3,service delivered,3/1/2017
3,re-interview,8/1/2017
4,initial interview,2/1/2016
4,service delivered,3/1/2016
4,re-interview,8/1/2016
4,service delivered,10/1/2016
4,re-interview,1/1/2017
4,service delivered,3/1/2017
4,re-interview,8/1/2017
4,service delivered,9/1/2017
4,re-interview,12/1/2017
5,initial interview,2/1/2016
5,service delivered,3/1/2016
5,re-interview,8/1/2016
5,service delivered,10/1/2016
5,re-interview,1/1/2017
5,service delivered,3/1/2017
5,re-interview,8/1/2017
6,initial interview,2/1/2016
6,service delivered,8/2/2016
6,re-interview,8/3/2016
6,service delivered,10/1/2016
6,re-interview,1/1/2017
6,service delivered,3/1/2017
6,re-interview,8/1/2017
7,initial interview,2/29/2016
7,service delivered,8/29/2016
7,re-interview,9/3/2016
7,service delivered,10/1/2016
7,re-interview,2/3/2017
7,service delivered,3/1/2017
7,re-interview,9/1/2017
8,initial interview,2/29/2016
8,service delivered,8/29/2016
8,re-interview,9/3/2016
8,re-interview,10/3/2016
8,re-interview,11/3/2016
8,re-interview,12/3/2016
8,service delivered,12/4/2016
8,re-interview,6/3/2017
8,service delivered,6/5/2017
8,re-interview,12/2/2017
9,initial interview,2/29/2016
9,service delivered,8/29/2016
9,re-interview,10/3/2016
9,service delivered,10/4/2016
9,re-interview,12/3/2016
9,service delivered,12/4/2016
9,re-interview,6/3/2017
9,service delivered,6/5/2017
9,re-interview,12/2/2017
9,service delivered,12/4/2017
9,re-interview,6/2/2018
9,service delivered,6/3/2018
9,re-interview,6/4/2018
;
☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 71 replies
  • 547 views
  • 12 likes
  • 8 in conversation