Solved: Re: Flag the presence of substring

vrushankshah · Posted 07-15-2018 01:26 AM

I have 4 columns: Column1-It has all the clicks made on the website 2) Unique ObjectId for each user,3rd column-time of the click and 4th column-group which indicates the pattern of clicks followed by the user. If the pattern starts with the click a,c,e then it comes under different group. I want to find if all the words of a group are part of the pattern followed by a different user or no? If it is a part of string then return the objectid , group and flag it as repeated. For example:

Notetext  ObjectId  group  CreatedOn
a            123      1       3:22pm
b            123      1       3:23pm
c            124      2       4:00pm
d            124      2       4:02pm
a            124      3       4:03pm
b            124      3       4:04pm
e            125      4       10:00am
f            125      4       10:02am
g            125      4       10:07am
c            126      5       11:06am
t            126      5       11:07am
l            126      5       11:09am
c            127      6       11:13am
d            127      6       11:15am
a            127      7       12:09pm
b            127      7       12:09pm
s            127      7       12:11pm

OUTPUT EXPECTED:

NoteText       ObjectId  group   Part_of_string 
    a            123      1           N
    b            123      1           N
    c            124      2           N
    d            124      2           N
    a            124      3           Y
    b            124      3           Y
    e            125      4           N
    f            125      4           N
    g            125      4           N
    c            126      5           N
    t            126      5           N
    l            126      5           N
    c            127      6           Y
    d            127      6           Y
    a            127      7           N
    b            127      7           N
    s            127      7           N

In the above example a,b was the part of the pattern followed by user 124 hence it should be flagged Y but 'a,b,s' is not repeated hence it is flagged N.

EDIT:

I want this final thing it will be great if someone can help me . The main thing is I don't want that if the pattern is the only thing present in the objectId then don't count it only count it if it is the PART of another big string under a objectId.

I want this final thing it will be great if you can help me .

NoteText    ObjectId   Group       

Inquiry      1           1            

Benefits     2           2            

Verify       2           2            

Claims       2           3            

Type         2           3            

Inquiry      2           4             

Benefits     3           5            

Verify       3           5           

Benefits     4           6            

Expected Output:
Notetext                  Count

Benefits,verify             1
Inquiry                     1
Claims,Type                 1
Benefits                    0

For example in the above example in the above case Benefits,verify should be counted only once since it was the PART of objectID '2'.

ChrisNZ · Posted 07-17-2018 12:42 AM

It works fine for me. What are you doing wrong?

data HAVE(index=(GROUP));
  input (NOTETEXT  OBJECTID  GROUP  ) ($);
  ROW=_N_;
cards;
a 1 1
b 1 1
d 1 2
f 1 3
g 1 4
t 2 5
y 2 5
f 3 6
a 4 7
b 4 7
d 4 8
f 5 9
r 6 10
s 6 10
t 6 10
a 6 11
b 6 11
s 6 11
q 7 12
w 7 12
e 7 12
r 7 12
d 7 13
f 7 14
g 7 15
run;
data PATTERNS;
  retain PATTERN ;
  length PATTERN $16;
  set HAVE;
  by GROUP;
  if first.GROUP then PATTERN=' ';
  PATTERN=cats(PATTERN,NOTETEXT);
  if last.GROUP then output;
  keep GROUP PATTERN;
run;
proc sql;
  create table WANT as
  select trans.*
       , case when COUNT>1 then 'Y' else 'N' end as REPEATED
       , pattern.PATTERN
       , PATTERN_NO
      , COUNT
  from HAVE                                      trans
      ,PATTERNS                                  pattern
      ,(select PATTERN
             , count(*)    as COUNT
             , monotonic() as PATTERN_NO
        from PATTERNS 
        group by PATTERN)                        count
  where trans.GROUP     = pattern.GROUP
    and pattern.PATTERN = count.PATTERN
  order by ROW;
quit;

NOTETEXT	OBJECTID	GROUP	ROW	REPEATED	PATTERN	PATTERN_NO	COUNT
a	1	1	1	Y	ab	1	2
b	1	1	2	Y	ab	1	2
d	1	2	3	Y	d	5	3
f	1	3	4	Y	f	15	4
g	1	4	5	Y	g	7	2
t	2	5	6	N	ty	11	1
y	2	5	7	N	ty	11	1
f	3	6	8	Y	f	15	4
a	4	7	9	Y	ab	1	2
b	4	7	10	Y	ab	1	2
d	4	8	11	Y	d	5	3
f	5	9	12	Y	f	15	4
r	6	10	13	N	rst	2	1
s	6	10	14	N	rst	2	1
t	6	10	15	N	rst	2	1
a	6	11	16	N	abs	3	1
b	6	11	17	N	abs	3	1
s	6	11	18	N	abs	3	1
q	7	12	19	N	qwer	4	1
w	7	12	20	N	qwer	4	1
e	7	12	21	N	qwer	4	1
r	7	12	22	N	qwer	4	1
d	7	13	23	Y	d	5	3
f	7	14	24	Y	f	15	4
g	7	15	25	Y	g	7	2

High-Performance SAS Coding - Third Edition

View solution in original post

ChrisNZ · Posted 07-16-2018 07:15 PM

Like this?

data HAVE;
  input (NOTETEXT  OBJECTID  GROUP  CREATEDON) ($);
  ROW=_N_;
cards;
a            123      1       3:22pm
b            123      1       3:23pm
c            124      2       4:00pm
d            124      2       4:02pm
a            124      3       4:03pm
b            124      3       4:04pm
e            125      4       10:00am
f            125      4       10:02am
g            125      4       10:07am
c            126      5       11:06am
t            126      5       11:07am
l            126      5       11:09am
c            127      6       11:13am
d            127      6       11:15am
a            127      7       12:09pm
b            127      7       12:09pm
s            127      7       12:11pm
run;
data PATTERNS;
  retain PATTERN ;
  length PATTERN $16;
  set HAVE;
  by GROUP;
  if first.GROUP then PATTERN=' ';
  PATTERN=cats(PATTERN,NOTETEXT);
  if last.GROUP then output;
  keep GROUP PATTERN;
run;
proc sql;
  create table WANT as
  select trans.*
       , case when COUNT>1 then 'Y' else 'N' end as REPEATED
       , COUNT
  from HAVE                                                               trans
      ,PATTERNS                                                           pattern
      ,(select PATTERN, count(*) as COUNT from PATTERNS group by PATTERN) count
  where trans.GROUP     = pattern.group
    and pattern.PATTERN = count.PATTERN
  order by ROW;
quit;

NOTETEXT	OBJECTID	GROUP	CREATEDON	ROW	REPEATED	COUNT
a	123	1	3:22pm	1	Y	2
b	123	1	3:23pm	2	Y	2
c	124	2	4:00pm	3	Y	2
d	124	2	4:02pm	4	Y	2
a	124	3	4:03pm	5	Y	2
b	124	3	4:04pm	6	Y	2
e	125	4	10:00am	7	N	1
f	125	4	10:02am	8	N	1
g	125	4	10:07am	9	N	1
c	126	5	11:06am	10	N	1
t	126	5	11:07am	11	N	1
l	126	5	11:09am	12	N	1
c	127	6	11:13am	13	Y	2
d	127	6	11:15am	14	Y	2
a	127	7	12:09pm	15	N	1
b	127	7	12:09pm	16	N	1
s	127	7	12:11pm	17	N	1

High-Performance SAS Coding - Third Edition

vrushankshah · Posted 07-16-2018 08:25 PM

ya I wanted something like this but when I am trying on other data your code is not working properly.
Data used to test:
Column1 Objectid Group
a 1 1
b 1 1
d 1 2
f 1 3
g 1 4
t 2 5
y 2 5
f 3 6
a 4 7
b 4 7
d 4 8
f 5 9
r 6 10
s 6 10
t 6 10
a 6 11
b 6 11
s 6 11
q 7 12
w 7 12
e 7 12
r 7 12
d 7 13
f 7 14
g 7 15

ChrisNZ · Posted 07-16-2018 08:31 PM

"Not working" is very vague. What didn't work?

High-Performance SAS Coding - Third Edition

vrushankshah · Posted 07-16-2018 08:32 PM

Hey sorry for not explaining properly! I think the logic is going a bit wrong I am not getting correct count and it flags Y to most of the pattern which it shouldn’t

ChrisNZ · Posted 07-16-2018 10:02 PM

I don't see how.

All the groups with a Y have repeated patterns.

proc sql;
  create table WANT as
  select trans.*
       , case when COUNT>1 then 'Y' else 'N' end as REPEATED
       , COUNT
       , PATTERN_NO
  from HAVE                                      trans
      ,PATTERNS                                  pattern
      ,(select PATTERN
             , count(*)    as COUNT
             , monotonic() as PATTERN_NO
        from PATTERNS 
        group by PATTERN)                        count
  where trans.GROUP     = pattern.group
    and pattern.PATTERN = count.PATTERN
  order by ROW;
quit;

NOTETEXT	OBJECTID	GROUP	CREATEDON	ROW	REPEATED	COUNT	PATTERN_NO
a	123	1	3:22pm	1	Y	2	1
b	123	1	3:23pm	2	Y	2	1
c	124	2	4:00pm	3	Y	2	6
d	124	2	4:02pm	4	Y	2	6
a	124	3	4:03pm	5	Y	2	1
b	124	3	4:04pm	6	Y	2	1
e	125	4	10:00am	7	N	1	4
f	125	4	10:02am	8	N	1	4
g	125	4	10:07am	9	N	1	4
c	126	5	11:06am	10	N	1	5
t	126	5	11:07am	11	N	1	5
l	126	5	11:09am	12	N	1	5
c	127	6	11:13am	13	Y	2	6
d	127	6	11:15am	14	Y	2	6
a	127	7	12:09pm	15	N	1	7
b	127	7	12:09pm	16	N	1	7
s	127	7	12:11pm	17	N	1	7

High-Performance SAS Coding - Third Edition

vrushankshah · Posted 07-16-2018 11:05 PM

I was just testing the code on a new data and it didnt worked for that :
New test data :
Column1 Objectid Group
a 1 1
b 1 1
d 1 2
f 1 3
g 1 4
t 2 5
y 2 5
f 3 6
a 4 7
b 4 7
d 4 8
f 5 9
r 6 10
s 6 10
t 6 10
a 6 11
b 6 11
s 6 11
q 7 12
w 7 12
e 7 12
r 7 12
d 7 13
f 7 14
g 7 15

ChrisNZ · Posted 07-17-2018 12:42 AM

It works fine for me. What are you doing wrong?

data HAVE(index=(GROUP));
  input (NOTETEXT  OBJECTID  GROUP  ) ($);
  ROW=_N_;
cards;
a 1 1
b 1 1
d 1 2
f 1 3
g 1 4
t 2 5
y 2 5
f 3 6
a 4 7
b 4 7
d 4 8
f 5 9
r 6 10
s 6 10
t 6 10
a 6 11
b 6 11
s 6 11
q 7 12
w 7 12
e 7 12
r 7 12
d 7 13
f 7 14
g 7 15
run;
data PATTERNS;
  retain PATTERN ;
  length PATTERN $16;
  set HAVE;
  by GROUP;
  if first.GROUP then PATTERN=' ';
  PATTERN=cats(PATTERN,NOTETEXT);
  if last.GROUP then output;
  keep GROUP PATTERN;
run;
proc sql;
  create table WANT as
  select trans.*
       , case when COUNT>1 then 'Y' else 'N' end as REPEATED
       , pattern.PATTERN
       , PATTERN_NO
      , COUNT
  from HAVE                                      trans
      ,PATTERNS                                  pattern
      ,(select PATTERN
             , count(*)    as COUNT
             , monotonic() as PATTERN_NO
        from PATTERNS 
        group by PATTERN)                        count
  where trans.GROUP     = pattern.GROUP
    and pattern.PATTERN = count.PATTERN
  order by ROW;
quit;

NOTETEXT	OBJECTID	GROUP	ROW	REPEATED	PATTERN	PATTERN_NO	COUNT
a	1	1	1	Y	ab	1	2
b	1	1	2	Y	ab	1	2
d	1	2	3	Y	d	5	3
f	1	3	4	Y	f	15	4
g	1	4	5	Y	g	7	2
t	2	5	6	N	ty	11	1
y	2	5	7	N	ty	11	1
f	3	6	8	Y	f	15	4
a	4	7	9	Y	ab	1	2
b	4	7	10	Y	ab	1	2
d	4	8	11	Y	d	5	3
f	5	9	12	Y	f	15	4
r	6	10	13	N	rst	2	1
s	6	10	14	N	rst	2	1
t	6	10	15	N	rst	2	1
a	6	11	16	N	abs	3	1
b	6	11	17	N	abs	3	1
s	6	11	18	N	abs	3	1
q	7	12	19	N	qwer	4	1
w	7	12	20	N	qwer	4	1
e	7	12	21	N	qwer	4	1
r	7	12	22	N	qwer	4	1
d	7	13	23	Y	d	5	3
f	7	14	24	Y	f	15	4
g	7	15	25	Y	g	7	2

High-Performance SAS Coding - Third Edition

Registration is open

SAS Training: Just a Click Away