Hi Everyone,
I am trying to parse a comment section into variables. I want the word before, and two words before a trigger word. The difficulty I am having is that sometimes the trigger word appears more than once (up to five times). I want to pull it each time it is stated. The scan function only returns the last occurrence. My current code below is not pulling the next occurrence, only the last. Does anyone know of a way to pull each iteration before the trigger?
Thanks!
data have;
input ID $ Comments $;
datalines;
1 "word word word word word WORD2_BEFORE_1 WORD_BEFORE_1 trigger WORD_AFTER word word word word WORD_BEFORE_2 trigger WORD_AFTER WORD_BEFORE_3 trigger WORD_AFTER"
2 "word word word WORD2_BEFORE_1 WORD_BEFORE_1 trigger word"
3 "word word word word word WORD2_BEFORE_1 WORD_BEFORE_1 trigger WORD_BEFORE_2 trigger";
Data want; set have;
count_CI = 0;
do i = 1 to countw(comments, " ");
count_CI = count_CI + (scan(comments,i) = "trigger");
end;
delims= ' (';
delims2= ': ';
do i=1 to countw(comments, " ") while (count_CI<=i);
if Class_I_p=1 and count_CI=1 and scan(comments, i, " ")="trigger" then do;
WORD_BEFORE_1=scan(comments, i-1, delims);
WORD2_BEFORE_1=scan(comments,i-2, delims);
WORD_BEFORE_2= (scan(comments, i-1, delims))-1); -does not work
end;
end;
An easy way would be to split up the COMMENTS field ahead of time into separate words:
data want;
set have;
array word {99} $ 50;
do k=1 to 99;
word{k} = scan(comments, k, " ");
end;
..................
That gives you up to 99 variables to work with. Each one can be inspected so it should be easy to find all occurrences. It should also be easy to pull out the words before (and even to check whether those words before contain an additional occurrence of the key word you are searching for).
Well, it's very feasible IF you first change all the instances of "ice cream" to "icecream". That would let the key phrase be captured as a single word. After that:
array word {99} $ 50;
do i=1 to 99;
word{i} = scan(comments, i, " ");
end;
do i=1 to 99;
if word{i} = "icecream" then do;
........
end;
end;
There are other methods if this doesn't seem like a good way. The search functions (find, findw, index, indexw) generally have extra parameters that let you specify where to begin the search within COMMENTS. So if you find "ice cream" once and keep track of where that is, the next search could continue where the first search left off.
in your dataset have does comments equal more that the first word on the line?
Yes, it is not the same word repeated. I guess an example would be below. I want the flavor and amount each time. I am using something like ice cream as the trigger.
in the past week can you tell me all the flavors of ice cream you had?
ID | Comments |
1 | On Monday I had 1.5 scoops of vanilla ice cream, Tuesday we had 2 scoops of chocolate ice cream, one day we also had 3 scoops of strawberry ice cream, but I can't remember when. |
2 | Over the week I had 4 scoops of raspberry ice cream, 3 scoops of vanilla ice cream |
3 | On 1/20/2019, I had 3 scoops of chocolate ice cream. |
what your answer will be. something like below.
id icream scoop
1 vanilla 1.5
1 chocolate 2
1 strawberry 3
2 raspberry 4
2 vanilla 3
3 chocolate 3
one way to do this.
\w+ =alphabet or number or underscore
(?=) is zero length lookahead assertion which means just see what is next to word of interest in our case it is \w+ which can be alphabet or number or underscore
(?=\sice cream) -- find the word with space and icecream
[0-9\.]+ is number or dot
(\sscoops)/') -- find the word with space and scoops
data have;
length id 8. comments $500.;
infile datalines dlm='09'x;
input ID Comments $ ;
datalines;
1 On Monday I had 1.5 scoops of vanilla ice cream, Tuesday we had 2 scoops of chocolate ice cream, one day we also had 3 scoops of strawberry ice cream, but I can't remember when.
2 Over the week I had 4 scoops of raspberry ice cream, 3 scoops of vanilla ice cream
3 On 1/20/2019, I had 3 scoops of chocolate ice cream.
;
data have1(keep= id icecream amount);
length val val1 icecream amount $4200.;
set have;
start = 1;
stop = length(comments);
re = prxparse('/\w+(?=\sice cream)/');
set have;
call prxnext(re, start, stop, trim(comments), position, length);
do while (position > 0 );
val = substr(comments, position, length);
icecream = catx(" ", icecream, val);
call prxnext(re, start, stop, trim(comments), position, length);
end;
set have;
start1 = 1;
stop1 = length(comments);
re1 = prxparse('/[0-9\.]+(?=\sscoops)/');
call prxnext(re1, start1, stop1, trim(comments), position1, length1);
do while (position1 > 0);
val1 = substr(comments, position1, length1);
amount = catx(" ", amount, val1);
call prxnext(re1, start1, stop1, trim(comments), position1, length1);
end;
run;
proc sql noprint;
select cats("ice", max(countw(icecream," "))),
cats("amount", max(countw(amount," ")))
into :maxice, :maxamount
from have1;
data want;
set have1;
length ice1-&maxice $10.;
array ice[*] $ ice1-&maxice;
array amounts[*] amount1-&maxamount;
do i = 1 to countw(icecream," ");
do j = 1 to countw(amount," ");
ice[i] = scan(icecream, i, ' ');
amounts[j] = scan(amount, j, ' ');
end;
end;
drop i j amount icecream;
run;
proc print;
Obs | id | ice1 | ice2 | ice3 | amount1 | amount2 | amount3 |
---|---|---|---|---|---|---|---|
1 | 1 | vanilla | chocolate | strawberry | 1.5 | 2 | 3 |
2 | 2 | raspberry | vanilla | 4.0 | 3 | . | |
3 | 3 | chocolate | 3.0 | . | . |
If searching for 'ice' rather than 'ice cream' would suffice, then here is one method:
data have (keep=id j scoops flavor);
informat comments $255.;
length scoops flavor $10;
input id comments &;
i=0;
j=0;
ixp=0;
ixs=-2;
do while (findw(comments,"ice",' ,','e',ixs+3));
i+1;
if i eq 1 then ixs+3;
ixp=findw(comments,"ice",' ,','e',ixs)+ixp;
if (scan(comments,ixp-3,' ') eq 'scoops' or
scan(comments,ixp-3,' ') eq 'scoop') and
scan(comments,ixp-2,' ') eq 'of' then do;
j+1;
scoops=catx(' ',scan(comments,ixp-4,' '),scan(comments,ixp-3,' '));
flavor = scan(comments,ixp-1,' ');
output;
end;
ixs=findw(comments,"ice",ixs)+3;
end;
cards;
1 On Monday I had 1.5 scoops of vanilla ice cream, Tuesday we had 2 scoops of chocolate ice cream, one day we also had 3 scoops of strawberry ice cream, but I can't remember when.
2 Over the week I had 4 scoops of raspberry ice cream, 3 scoops of vanilla ice cream
3 On 1/20/2019, I had 3 scoops of chocolate ice cream.
4 On 2/1/2019 I didn't want any ice cream
5 On 2/1/2019 I didn't want any ice cream, but the day before I had 1 scoop of lemon ice
;
%transpose(data=have,out=want, var=scoops flavor, id=j, by=id)
/*
uses transpose macro which can be downloaded from
http://www.sascommunity.org/wiki/A_Better_Way_to_Flip_(Transpose)_a_SAS_Dataset
*/
Art, CEO, AnalystFinder.com
Thanks, I think that this would work. The variables appear to be correctly identified...
but for some reason when I run it, it is only giving me 83 results. I expect there to be more than 454 (since that is how many report eating ice cream... I am not sure what I am doing wrong. CLass_1_p is an indicator for those who reported eating ice cream. Date is the time they were asked (some were asked multiple times to report).
data try6;
set limit(keep=ID Class_I_p RESULT_DATE comments);
i=0;
j=0;
ixp=0;
ixs=-2;
do while (findw(comments,"ICE",' ,','e',ixs+3));
i+1;
if i eq 1 then ixs+3;
ixp=findw(comments,"ICE",' ,','e',ixs)+ixp;
if Class_I_p=1 then do;
j+1;
SCOOPS=catx(' ',scan(comments,ixp-4,' '),scan(comments,ixp-3,' '));
Flavor = scan(comments,ixp-1,' ');
output;
end;
ixs=findw(comments,"ice",ixs)+3;
end;
run;
NOTE: Numeric values have been converted to character values at the places given by: (Line):(Column).
1617:44
NOTE: There were 7699 observations read from the data set WORK.LIMIT.
NOTE: The data set WORK.TRY6 has 83 observations and 13 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
You'd have to provide some sample data that reflects your actual data. My first guess is that you have to add the 'i' modifier to the findw calls to ignore the case of the string 'ice'. I.e., to capture any variation of capitalization.
However, I'd have to run your code against actual data to identify why you are getting the numeric to character conversion note.
Art, CEO, AnalystFinder.com
@drjorg : As I mentioned, one can only guess unless you provide some example records that closely approximate your actual data.
However, that said, there was a logical error in the way you modified the code, and I added the probably necessary 'i' modifier.
See if the following comes closer to what you want/expected:
data limit;
informat id 8.;
informat Class_I_p $1.;
informat Result_Date date9.;
format Result_Date date9.;
informat comments $255.;
input ID Class_I_p RESULT_DATE comments &;
cards;
1 1 02feb2019 On Monday I had 1.5 scoops of vanilla ICE cream, Tuesday we had 2 scoops of chocolate Ice cream, one day we also had 3 scoops of strawberry icE cream, but I can't remember when.
2 1 02feb2019 Over the week I had 4 scoops of raspberry ICE cream, 3 scoops of vanilla Ice cream
3 1 02feb2019 On 1/20/2019, I had 3 scoops of chocolate ice cream.
4 0 02feb2019 On 2/1/2019 I didn't want any ice cream
5 1 02feb2019 On 2/1/2019 I had 1 scoop of lemon iCe
;
data try6;
set limit(keep=ID Class_I_p RESULT_DATE comments);
i=0;
j=0;
ixp=0;
ixs=-2;
if Class_I_p=1 then do;
do while (findw(comments,"ICE",' ,','ei',ixs+3));
i+1;
if i eq 1 then ixs+3;
ixp=findw(comments,"ICE",' ,','ei',ixs)+ixp;
j+1;
SCOOPS=catx(' ',scan(comments,ixp-4,' '),scan(comments,ixp-3,' '));
Flavor = scan(comments,ixp-1,' ');
output;
ixs=findw(comments,"ice",' ,','i',ixs)+3;
end;
end;
run;
%transpose(data=try6,out=want, var=scoops flavor, id=j, by=id RESULT_DATE)
Art, CEO, AnalystFinder.com
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.