Re: Iteratively extracting the word before a specific index word that ...

DR_Jorg · Posted 02-01-2019 08:54 AM

Hi Everyone,

I am trying to parse a comment section into variables. I want the word before, and two words before a trigger word. The difficulty I am having is that sometimes the trigger word appears more than once (up to five times). I want to pull it each time it is stated. The scan function only returns the last occurrence. My current code below is not pulling the next occurrence, only the last. Does anyone know of a way to pull each iteration before the trigger?

Thanks!

data have;

input ID $ Comments $;

datalines;

1 "word word word word word WORD2_BEFORE_1 WORD_BEFORE_1 trigger WORD_AFTER word word word word WORD_BEFORE_2 trigger WORD_AFTER WORD_BEFORE_3 trigger WORD_AFTER"

2 "word word word WORD2_BEFORE_1 WORD_BEFORE_1 trigger word"

3 "word word word word word WORD2_BEFORE_1 WORD_BEFORE_1 trigger WORD_BEFORE_2 trigger";

Data want; set have;

count_CI = 0;

do i = 1 to countw(comments, " ");

count_CI = count_CI + (scan(comments,i) = "trigger");

end;

delims= ' (';

delims2= ': ';

do i=1 to countw(comments, " ") while (count_CI<=i);

if Class_I_p=1 and count_CI=1 and scan(comments, i, " ")="trigger" then do;

WORD_BEFORE_1=scan(comments, i-1, delims);

WORD2_BEFORE_1=scan(comments,i-2, delims);

WORD_BEFORE_2= (scan(comments, i-1, delims))-1); -does not work

end;

Astounding · Posted 02-01-2019 09:00 AM

An easy way would be to split up the COMMENTS field ahead of time into separate words:

data want;

set have;

array word {99} $ 50;

do k=1 to 99;

word{k} = scan(comments, k, " ");

end;

..................

That gives you up to 99 variables to work with. Each one can be inspected so it should be easy to find all occurrences. It should also be easy to pull out the words before (and even to check whether those words before contain an additional occurrence of the key word you are searching for).

drjorg · Posted 02-01-2019 10:19 AM

Thanks, the issue being that I have 1000+ observations, so going variable by variable and line by line, is not very feasible.

Astounding · Posted 02-01-2019 10:38 AM

Well, it's very feasible IF you first change all the instances of "ice cream" to "icecream". That would let the key phrase be captured as a single word. After that:

array word {99} $ 50;

do i=1 to 99;

word{i} = scan(comments, i, " ");

end;

do i=1 to 99;

if word{i} = "icecream" then do;

........

end;

There are other methods if this doesn't seem like a good way. The search functions (find, findw, index, indexw) generally have extra parameters that let you specify where to begin the search within COMMENTS. So if you find "ice cream" once and keep track of where that is, the next search could continue where the first search left off.

VDD · Posted 02-01-2019 09:06 AM

in your dataset have does comments equal more that the first word on the line?

drjorg · Posted 02-01-2019 09:27 AM

Yes, it is not the same word repeated. I guess an example would be below. I want the flavor and amount each time. I am using something like ice cream as the trigger.

in the past week can you tell me all the flavors of ice cream you had?

ID	Comments
1	On Monday I had 1.5 scoops of vanilla ice cream, Tuesday we had 2 scoops of chocolate ice cream, one day we also had 3 scoops of strawberry ice cream, but I can't remember when.
2	Over the week I had 4 scoops of raspberry ice cream, 3 scoops of vanilla ice cream
3	On 1/20/2019, I had 3 scoops of chocolate ice cream.

kiranv_ · Posted 02-01-2019 09:43 AM

what your answer will be. something like below.

id      icream   scoop

1      vanilla     1.5

1      chocolate    2

1      strawberry   3

2      raspberry    4

2      vanilla      3

3      chocolate    3

drjorg · Posted 02-01-2019 10:30 AM

I would like it to be wide format, so icream_1 scoop_1
icream_2 scoop_2, icream_3 scoop_3

kiranv_ · Posted 02-01-2019 12:57 PM

one way to do this.

\w+ =alphabet or number or underscore

(?=) is zero length lookahead assertion which means just see what is next to word of interest in our case it is \w+ which can be alphabet or number or underscore

(?=\sice cream) -- find the word with space and icecream

[0-9\.]+ is number or dot

(\sscoops)/') -- find the word with space and scoops

data have;
length id 8. comments $500.;
infile datalines dlm='09'x;
input ID	Comments $ ;
datalines;
1	On Monday I had 1.5 scoops of vanilla ice cream, Tuesday we had 2 scoops of chocolate ice cream, one day we also had 3 scoops of strawberry ice cream, but I can't remember when. 
2	Over the week I had 4 scoops of raspberry ice cream, 3 scoops of vanilla ice cream
3	On 1/20/2019, I had 3 scoops of chocolate ice cream. 
;

data have1(keep= id icecream amount);
length val val1 icecream amount $4200.;
set have;
 start = 1;
   stop = length(comments);
 
   re = prxparse('/\w+(?=\sice cream)/');
   set have;
   call prxnext(re, start, stop, trim(comments), position, length);
      do while (position > 0 );
         val = substr(comments, position, length); 
           icecream = catx(" ",  icecream, val);
         call prxnext(re, start, stop, trim(comments), position, length);
      end;
set have;
start1 = 1;
   stop1 = length(comments);
   re1 = prxparse('/[0-9\.]+(?=\sscoops)/');
call prxnext(re1, start1, stop1, trim(comments), position1, length1);
      do while (position1 > 0);
         val1 = substr(comments, position1, length1); 
           amount = catx(" ",  amount, val1);
         call prxnext(re1, start1, stop1, trim(comments), position1, length1);
         end;
run;

proc sql noprint;
select cats("ice", max(countw(icecream," "))),
       cats("amount", max(countw(amount," ")))

into :maxice, :maxamount
from have1;




data want;
set have1;
length ice1-&maxice $10.;
array ice[*] $ ice1-&maxice;
array amounts[*] amount1-&maxamount;
 do i = 1 to countw(icecream," "); 
do j = 1 to countw(amount," ");
 ice[i] = scan(icecream, i, ' '); 
amounts[j] = scan(amount, j, ' '); 
end;
end;
drop i j amount icecream;
run;


proc print;


Obs	id	ice1	ice2	ice3	amount1	amount2	amount3
1	1	vanilla	chocolate	strawberry	1.5	2	3
2	2	raspberry	vanilla		4.0	3	.
3	3	chocolate			3.0	.	.

art297 · Posted 02-01-2019 11:00 AM

If searching for 'ice' rather than 'ice cream' would suffice, then here is one method:

data have (keep=id j scoops flavor);
  informat comments $255.;
  length scoops flavor $10;
  
  input id comments &;
  i=0;
  j=0;
  ixp=0;
  ixs=-2;
  do while (findw(comments,"ice",' ,','e',ixs+3));
    i+1;
    if i eq 1 then ixs+3;
    ixp=findw(comments,"ice",' ,','e',ixs)+ixp;
    if (scan(comments,ixp-3,' ') eq 'scoops' or
        scan(comments,ixp-3,' ') eq 'scoop') and
       scan(comments,ixp-2,' ') eq 'of' then do;
      j+1;
      scoops=catx(' ',scan(comments,ixp-4,' '),scan(comments,ixp-3,' '));
      flavor = scan(comments,ixp-1,' ');
      output;
    end;
    ixs=findw(comments,"ice",ixs)+3;
  end;
  cards;
1   On Monday I had 1.5 scoops of vanilla ice cream, Tuesday we had 2 scoops of chocolate ice cream, one day we also had 3 scoops of strawberry ice cream, but I can't remember when. 
2   Over the week I had 4 scoops of raspberry ice cream, 3 scoops of vanilla ice cream
3   On 1/20/2019, I had 3 scoops of chocolate ice cream.
4   On 2/1/2019 I didn't want any ice cream
5   On 2/1/2019 I didn't want any ice cream, but the day before I had 1 scoop of lemon ice
;

%transpose(data=have,out=want, var=scoops flavor, id=j, by=id)

/*
uses transpose macro which can be downloaded from
http://www.sascommunity.org/wiki/A_Better_Way_to_Flip_(Transpose)_a_SAS_Dataset
*/

Art, CEO, AnalystFinder.com

drjorg · Posted 02-01-2019 11:58 AM

Thanks, I think that this would work. The variables appear to be correctly identified...

but for some reason when I run it, it is only giving me 83 results. I expect there to be more than 454 (since that is how many report eating ice cream... I am not sure what I am doing wrong. CLass_1_p is an indicator for those who reported eating ice cream. Date is the time they were asked (some were asked multiple times to report).

data try6;
set limit(keep=ID Class_I_p RESULT_DATE comments);

i=0;
j=0;
ixp=0;
ixs=-2;
do while (findw(comments,"ICE",' ,','e',ixs+3));
i+1;
if i eq 1 then ixs+3;
ixp=findw(comments,"ICE",' ,','e',ixs)+ixp;
if Class_I_p=1 then do;
j+1;
SCOOPS=catx(' ',scan(comments,ixp-4,' '),scan(comments,ixp-3,' '));
Flavor = scan(comments,ixp-1,' ');

output;
end;
ixs=findw(comments,"ice",ixs)+3;
end;
run;

NOTE: Numeric values have been converted to character values at the places given by: (Line):(Column).
1617:44
NOTE: There were 7699 observations read from the data set WORK.LIMIT.
NOTE: The data set WORK.TRY6 has 83 observations and 13 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds

art297 · Posted 02-01-2019 12:13 PM

@DR_Jorg:

You'd have to provide some sample data that reflects your actual data. My first guess is that you have to add the 'i' modifier to the findw calls to ignore the case of the string 'ice'. I.e., to capture any variation of capitalization.

However, I'd have to run your code against actual data to identify why you are getting the numeric to character conversion note.

Art, CEO, AnalystFinder.com

art297 · Posted 02-01-2019 05:06 PM

@drjorg : As I mentioned, one can only guess unless you provide some example records that closely approximate your actual data.

However, that said, there was a logical error in the way you modified the code, and I added the probably necessary 'i' modifier.

See if the following comes closer to what you want/expected:

data limit;
  informat id 8.;
  informat Class_I_p $1.;
  informat Result_Date date9.;
  format Result_Date date9.;
  informat comments $255.;
  input ID Class_I_p  RESULT_DATE comments &;
  cards;
1 1 02feb2019  On Monday I had 1.5 scoops of vanilla ICE cream, Tuesday we had 2 scoops of chocolate Ice cream, one day we also had 3 scoops of strawberry icE cream, but I can't remember when. 
2 1 02feb2019  Over the week I had 4 scoops of raspberry ICE cream, 3 scoops of vanilla Ice cream
3 1 02feb2019  On 1/20/2019, I had 3 scoops of chocolate ice cream.
4 0 02feb2019  On 2/1/2019 I didn't want any ice cream
5 1 02feb2019  On 2/1/2019 I had 1 scoop of lemon iCe
;

data try6;
  set limit(keep=ID Class_I_p RESULT_DATE comments);
  i=0;
  j=0;
  ixp=0;
  ixs=-2;
  if Class_I_p=1 then do;
    do while (findw(comments,"ICE",' ,','ei',ixs+3));
      i+1;
      if i eq 1 then ixs+3;
      ixp=findw(comments,"ICE",' ,','ei',ixs)+ixp;
      j+1;
      SCOOPS=catx(' ',scan(comments,ixp-4,' '),scan(comments,ixp-3,' '));
      Flavor = scan(comments,ixp-1,' ');
      output;
      ixs=findw(comments,"ice",' ,','i',ixs)+3;
    end;
  end;
run;

%transpose(data=try6,out=want, var=scoops flavor, id=j, by=id RESULT_DATE)

Art, CEO, AnalystFinder.com

Iteratively extracting the word before a specific index word that occurs multiple times

Re: Iteratively extracting the word before a specific index word that occurs multiple times

Re: Iteratively extracting the word before a specific index word that occurs multiple times

Re: Iteratively extracting the word before a specific index word that occurs multiple times

Re: Iteratively extracting the word before a specific index word that occurs multiple times

Re: Iteratively extracting the word before a specific index word that occurs multiple times

Re: Iteratively extracting the word before a specific index word that occurs multiple times

Re: Iteratively extracting the word before a specific index word that occurs multiple times

Re: Iteratively extracting the word before a specific index word that occurs multiple times

Re: Iteratively extracting the word before a specific index word that occurs multiple times

Re: Iteratively extracting the word before a specific index word that occurs multiple times

Re: Iteratively extracting the word before a specific index word that occurs multiple times

Re: Iteratively extracting the word before a specific index word that occurs multiple times

Ready to join fellow brilliant minds for the SAS Hackathon?