BookmarkSubscribeRSS Feed
DR_Jorg
Fluorite | Level 6

Hi Everyone,

I am trying to parse a comment section into variables. I want the word before, and two words before a trigger word. The difficulty I am having is that sometimes the trigger word appears more than once (up to five times). I want to pull it each time it is stated. The scan function only returns the last occurrence.  My current code below is not pulling the next occurrence, only the last. Does anyone know of a way to pull each iteration before the trigger?

 

Thanks!

 

 

data have;

input ID $ Comments $;

datalines;

1 "word word word word word WORD2_BEFORE_1 WORD_BEFORE_1 trigger WORD_AFTER word word word word WORD_BEFORE_2 trigger WORD_AFTER WORD_BEFORE_3 trigger WORD_AFTER"

2 "word word word WORD2_BEFORE_1 WORD_BEFORE_1 trigger word"

"word word word word word WORD2_BEFORE_1 WORD_BEFORE_1 trigger WORD_BEFORE_2 trigger";

 

 

 

 Data want; set have;

count_CI = 0;

do i = 1 to countw(comments, " ");

  count_CI = count_CI + (scan(comments,i) = "trigger");

end;

 

delims= ' (';   

delims2= ': ';

 

 

do i=1 to countw(comments, " ") while (count_CI<=i);

    if Class_I_p=1 and count_CI=1 and scan(comments, i, " ")="trigger" then do;

             WORD_BEFORE_1=scan(comments, i-1, delims);

              WORD2_BEFORE_1=scan(comments,i-2, delims);

             WORD_BEFORE_2= (scan(comments, i-1, delims))-1); -does not work

              end;

 

end;

 

 

12 REPLIES 12
Astounding
PROC Star

An easy way would be to split up the COMMENTS field ahead of time into separate words:

 

data want;

set have;

array word {99} $ 50;

do k=1 to 99;

   word{k} = scan(comments, k, " ");

end;

..................

 

That gives you up to 99 variables to work with.  Each one can be inspected so it should be easy to find all occurrences.  It should also be easy to pull out the words before (and even to check whether those words before contain an additional occurrence of the key word you are searching for).

drjorg
Calcite | Level 5
Thanks, the issue being that I have 1000+ observations, so going variable by variable and line by line, is not very feasible.
Astounding
PROC Star

Well, it's very feasible IF you first change all the instances of "ice cream" to "icecream".  That would let the key phrase be captured as a single word.  After that:

 

array word {99} $ 50;

do i=1 to 99;

   word{i} = scan(comments, i, " ");

end;

do i=1 to 99;

   if word{i} = "icecream" then do;

      ........

   end;

end;

 

There are other methods if this doesn't seem like a good way.  The search functions (find, findw, index, indexw) generally have extra parameters that let you specify where to begin the search within COMMENTS.  So if you find "ice cream" once and keep track of where that is, the next search could continue where the first search left off.

VDD
Ammonite | Level 13 VDD
Ammonite | Level 13

in your dataset have does comments equal more that the first word on the line?

 

drjorg
Calcite | Level 5

Yes, it is not the same word repeated. I guess an example would be below. I want the flavor and amount each time. I am using something like ice cream as the trigger. 

 

in the past week can you tell me all the flavors of ice cream you had?

 

IDComments
1On Monday I had 1.5 scoops of vanilla ice cream, Tuesday we had 2 scoops of chocolate ice cream, one day we also had 3 scoops of strawberry ice cream, but I can't remember when. 
2Over the week I had 4 scoops of raspberry ice cream, 3 scoops of vanilla ice cream
3On 1/20/2019, I had 3 scoops of chocolate ice cream
kiranv_
Rhodochrosite | Level 12

what your answer will be. something like below.

 

id      icream   scoop

1      vanilla     1.5

1      chocolate    2

1      strawberry   3

2      raspberry    4

2      vanilla      3

3      chocolate    3

 

drjorg
Calcite | Level 5
I would like it to be wide format, so icream_1 scoop_1
icream_2 scoop_2, icream_3 scoop_3
kiranv_
Rhodochrosite | Level 12

one  way to do this. 

 

 \w+ =alphabet or number or underscore

(?=) is  zero length lookahead assertion which means just see what is next to word of interest in our case it is \w+ which can be alphabet or number or underscore

(?=\sice cream) -- find the word with space and icecream

[0-9\.]+ is number or dot

(\sscoops)/') -- find the word with space and scoops

 

data have;
length id 8. comments $500.;
infile datalines dlm='09'x;
input ID	Comments $ ;
datalines;
1	On Monday I had 1.5 scoops of vanilla ice cream, Tuesday we had 2 scoops of chocolate ice cream, one day we also had 3 scoops of strawberry ice cream, but I can't remember when. 
2	Over the week I had 4 scoops of raspberry ice cream, 3 scoops of vanilla ice cream
3	On 1/20/2019, I had 3 scoops of chocolate ice cream. 
;

data have1(keep= id icecream amount);
length val val1 icecream amount $4200.;
set have;
 start = 1;
   stop = length(comments);
 
   re = prxparse('/\w+(?=\sice cream)/');
   set have;
   call prxnext(re, start, stop, trim(comments), position, length);
      do while (position > 0 );
         val = substr(comments, position, length); 
           icecream = catx(" ",  icecream, val);
         call prxnext(re, start, stop, trim(comments), position, length);
      end;
set have;
start1 = 1;
   stop1 = length(comments);
   re1 = prxparse('/[0-9\.]+(?=\sscoops)/');
call prxnext(re1, start1, stop1, trim(comments), position1, length1);
      do while (position1 > 0);
         val1 = substr(comments, position1, length1); 
           amount = catx(" ",  amount, val1);
         call prxnext(re1, start1, stop1, trim(comments), position1, length1);
         end;
run;

proc sql noprint;
select cats("ice", max(countw(icecream," "))),
       cats("amount", max(countw(amount," ")))

into :maxice, :maxamount
from have1;




data want;
set have1;
length ice1-&maxice $10.;
array ice[*] $ ice1-&maxice;
array amounts[*] amount1-&maxamount;
 do i = 1 to countw(icecream," "); 
do j = 1 to countw(amount," ");
 ice[i] = scan(icecream, i, ' '); 
amounts[j] = scan(amount, j, ' '); 
end;
end;
drop i j amount icecream;
run;


proc print;

 
Obs id ice1 ice2 ice3 amount1 amount2 amount3
1 1 vanilla chocolate strawberry 1.5 2 3
2 2 raspberry vanilla   4.0 3 .
3 3 chocolate     3.0 . .




 

art297
Opal | Level 21

If searching for 'ice' rather than 'ice cream' would suffice, then here is one method:

data have (keep=id j scoops flavor);
  informat comments $255.;
  length scoops flavor $10;
  
  input id comments &;
  i=0;
  j=0;
  ixp=0;
  ixs=-2;
  do while (findw(comments,"ice",' ,','e',ixs+3));
    i+1;
    if i eq 1 then ixs+3;
    ixp=findw(comments,"ice",' ,','e',ixs)+ixp;
    if (scan(comments,ixp-3,' ') eq 'scoops' or
        scan(comments,ixp-3,' ') eq 'scoop') and
       scan(comments,ixp-2,' ') eq 'of' then do;
      j+1;
      scoops=catx(' ',scan(comments,ixp-4,' '),scan(comments,ixp-3,' '));
      flavor = scan(comments,ixp-1,' ');
      output;
    end;
    ixs=findw(comments,"ice",ixs)+3;
  end;
  cards;
1   On Monday I had 1.5 scoops of vanilla ice cream, Tuesday we had 2 scoops of chocolate ice cream, one day we also had 3 scoops of strawberry ice cream, but I can't remember when. 
2   Over the week I had 4 scoops of raspberry ice cream, 3 scoops of vanilla ice cream
3   On 1/20/2019, I had 3 scoops of chocolate ice cream.
4   On 2/1/2019 I didn't want any ice cream
5   On 2/1/2019 I didn't want any ice cream, but the day before I had 1 scoop of lemon ice
;

%transpose(data=have,out=want, var=scoops flavor, id=j, by=id)

/*
uses transpose macro which can be downloaded from
http://www.sascommunity.org/wiki/A_Better_Way_to_Flip_(Transpose)_a_SAS_Dataset
*/

Art, CEO, AnalystFinder.com

 

drjorg
Calcite | Level 5

Thanks, I think that this would work. The variables appear to be correctly identified...

but for some reason when I run it, it is only giving me 83 results. I expect there to be more than 454 (since that is how many report eating ice cream... I am not sure what I am doing wrong. CLass_1_p is an indicator for those who reported eating ice cream. Date is the time they were asked (some were asked multiple times to report). 

 


data try6;
set limit(keep=ID Class_I_p RESULT_DATE comments);

i=0;
j=0;
ixp=0;
ixs=-2;
do while (findw(comments,"ICE",' ,','e',ixs+3));
i+1;
if i eq 1 then ixs+3;
ixp=findw(comments,"ICE",' ,','e',ixs)+ixp;
if Class_I_p=1 then do;
j+1;
SCOOPS=catx(' ',scan(comments,ixp-4,' '),scan(comments,ixp-3,' '));
Flavor = scan(comments,ixp-1,' ');



output;
end;
ixs=findw(comments,"ice",ixs)+3;
end;
run;

NOTE: Numeric values have been converted to character values at the places given by: (Line):(Column).
1617:44
NOTE: There were 7699 observations read from the data set WORK.LIMIT.
NOTE: The data set WORK.TRY6 has 83 observations and 13 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds

 

art297
Opal | Level 21

@DR_Jorg:

 

You'd have to provide some sample data that reflects your actual data. My first guess is that you have to add the 'i' modifier to the findw calls to ignore the case of the string 'ice'. I.e., to capture any variation of capitalization.

However, I'd have to run your code against actual data to identify why you are getting the numeric to character conversion note.

 

Art, CEO, AnalystFinder.com

 

art297
Opal | Level 21

@drjorg : As I mentioned, one can only guess unless you provide some example records that closely approximate your actual data.

 

However, that said, there was a logical error in the way you modified the code, and I added the probably necessary 'i' modifier.

 

See if the following comes closer to what you want/expected:

 

data limit;
  informat id 8.;
  informat Class_I_p $1.;
  informat Result_Date date9.;
  format Result_Date date9.;
  informat comments $255.;
  input ID Class_I_p  RESULT_DATE comments &;
  cards;
1 1 02feb2019  On Monday I had 1.5 scoops of vanilla ICE cream, Tuesday we had 2 scoops of chocolate Ice cream, one day we also had 3 scoops of strawberry icE cream, but I can't remember when. 
2 1 02feb2019  Over the week I had 4 scoops of raspberry ICE cream, 3 scoops of vanilla Ice cream
3 1 02feb2019  On 1/20/2019, I had 3 scoops of chocolate ice cream.
4 0 02feb2019  On 2/1/2019 I didn't want any ice cream
5 1 02feb2019  On 2/1/2019 I had 1 scoop of lemon iCe
;

data try6;
  set limit(keep=ID Class_I_p RESULT_DATE comments);
  i=0;
  j=0;
  ixp=0;
  ixs=-2;
  if Class_I_p=1 then do;
    do while (findw(comments,"ICE",' ,','ei',ixs+3));
      i+1;
      if i eq 1 then ixs+3;
      ixp=findw(comments,"ICE",' ,','ei',ixs)+ixp;
      j+1;
      SCOOPS=catx(' ',scan(comments,ixp-4,' '),scan(comments,ixp-3,' '));
      Flavor = scan(comments,ixp-1,' ');
      output;
      ixs=findw(comments,"ice",' ,','i',ixs)+3;
    end;
  end;
run;

%transpose(data=try6,out=want, var=scoops flavor, id=j, by=id RESULT_DATE)

Art, CEO, AnalystFinder.com

 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 12 replies
  • 2224 views
  • 0 likes
  • 6 in conversation