Hi everyone,
here is my issue:
I have two available tables:
-a table list_words with one string variable named words) and 10 individuals (or p individuals in general, with p supposed as known)
-a table raw_data with one string variable (maned text) and 20 individuals (or n individuals in general, with n supposed as known)
I would like to create a new variable main_word within the raw_data table.
For each individual i of the raw_data table, if one of the words (i.e. a value of the variable words of the raw_data table) can be found within the text,
then the i-th element of raw_data takes the found word as a value for main_word.
If no word is found within the text for the i-th individual, then main_word receives the value "" for that individual.
is it possible to do such a thing without having to write a DATA step on the raw_data table for each of the words?
(with the word manually entered as a parameter for an indexw function)?
What I tried to do (and which didn't work):
1: Use an array and put each word in a macro-variable.
DATA _NULL_ ; SET list_words; CALL SYMPUT(COMPRESS("wordsmacro"||_N_),words) ; RUN ; %LET array_word = array array_word $30 wordsmacro&1-wordsmacro&10 ; DATA raw_data ; SET raw_data ; LENGTH main_word $ 30 ; DO i = 1 to 10 ; IF INDEXW(text, array_word[i]) > 0 THEN DO ; main_word = array_word[i] ; LEAVE ; END ; END ; RUN ;
The issue here is that array_word apparently isn't defined inside the DATA step.
The log extract with the mistakes is the following:
27 IF INDEXW(text, array_word[i]) > 0
ERROR: Undeclared array referenced: array_word.
ERROR: Variable array_word has not been declared as an array.
28 THEN DO ;
29 main_word = array_word[i] ;
ERROR: Undeclared array referenced: array_word.
ERROR: Variable array_word has not been declared as an array.
30 LEAVE ;
2: Put each word in a macro-variable and do a loop on the table:
DATA _NULL_ ; SET list_words ; CALL SYMPUT(COMPRESS("wordsmacro"||_N_),words) ; RUN ; %MACRO parcours ; DATA raw_data ; SET raw_data ; LENGTH main_word $ 30 ; %DO i = 1 %TO 10 : IF INDEXW(text, &&wordsmacro&i) > 0 THEN DO ; main_word = &&wordsmacro&i ; LEAVE ; END ; %END ; RUN ; %MEND ; %parcours ;
The log extract with the mistakes is the following:
23 %parcours ;
WARNING: Apparent symbolic reference I not resolved.
WARNING: Apparent symbolic reference MOTSMACRO not resolved.
WARNING: Apparent symbolic reference I not resolved.
ERROR: A character operand was found in the %EVAL function or %IF condition where a numeric operand is required. The condition was:
10 : IF INDEXW(text, &&wordsmacro&i) > 0 THEN DO
ERROR: The %TO value of the %DO I loop is invalid.
ERROR: The macro PARCOURS will stop executing.
24
25 GOPTIONS NOACCESSIBLE;
26 %LET _CLIENTTASKLABEL=;
27 %LET _CLIENTPROJECTPATH=;
28 %LET _CLIENTPROJECTNAME=;
29 %LET _SASPROGRAMFILE=;
30
31 ;*';*";*/;quit;run;
____
180
ERROR 180-322: Statement is not valid or it is used out of proper order.
3: I don't know how to code it here, but maybe a hash object could work too?
Some possible tables:
DATA list_words ; INPUT mots $ 1-40 ; DATALINES ; Lorem ipsum dolor sit amet consectetur adipiscing elit sed do RUN ; DATA raw_data ; INPUT texte $ 1-600 ; DATALINES ; Lorem ipsum dolor sit amet consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Lorem ipsum dolor sit amet consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Lorem ipsum dolor sit amet consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Lorem ipsum dolor sit amet consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Lorem ipsum dolor sit amet consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. RUN ;
First off: Array_word created with the %let statement creates a macro variable that would be referenced when used with &array_word.
The Array_word in the data step is treated as an array because you use the [i] construct and you did not define an array named Array_word. You cannot assign an array multiple times which your attempt would do.
%LET array_word = array array_word $30 wordsmacro&1-wordsmacro&10 ; DATA raw_data ; SET raw_data ; LENGTH main_word $ 30 ; DO i = 1 to 10 ; IF INDEXW(text, array_word[i]) > 0
Before going to the data step examine the result of
%put &array_word.
You will not be using the text of the values of the variable "words" in data set test. You also are using macro variables &1 and &10. Which are not valid macro variable names. Macro variable names have the same naming requirement as data set variables.
Did you get this to work in any form before attempting to use a macro?
Here's one approach using macros and arrays if your idea is to find the first word in each line of text.
data test; input words $; datalines; first second third ; run; data raw_data; infile datalines missover; input text $ 1-50 ; datalines; The sentence contains the word third first is on the second line ; run; proc sql noprint; select quote(strip(words)) into : wordlist separated by ',' from test ; quit; %put &wordlist; %let wordcount=&sqlobs; DATA new_data ; SET raw_data ; LENGTH main_word $ 30 ; array array_word (&wordcount) $ 30 _temporary_(&wordlist); DO i = 1 to 10 ; IF INDEXW(text,array_word[i] ) > 0 THEN DO ; main_word = array_word[i] ; LEAVE ; END ; END ; RUN ;
First off: Array_word created with the %let statement creates a macro variable that would be referenced when used with &array_word.
The Array_word in the data step is treated as an array because you use the [i] construct and you did not define an array named Array_word. You cannot assign an array multiple times which your attempt would do.
%LET array_word = array array_word $30 wordsmacro&1-wordsmacro&10 ; DATA raw_data ; SET raw_data ; LENGTH main_word $ 30 ; DO i = 1 to 10 ; IF INDEXW(text, array_word[i]) > 0
Before going to the data step examine the result of
%put &array_word.
You will not be using the text of the values of the variable "words" in data set test. You also are using macro variables &1 and &10. Which are not valid macro variable names. Macro variable names have the same naming requirement as data set variables.
Did you get this to work in any form before attempting to use a macro?
Here's one approach using macros and arrays if your idea is to find the first word in each line of text.
data test; input words $; datalines; first second third ; run; data raw_data; infile datalines missover; input text $ 1-50 ; datalines; The sentence contains the word third first is on the second line ; run; proc sql noprint; select quote(strip(words)) into : wordlist separated by ',' from test ; quit; %put &wordlist; %let wordcount=&sqlobs; DATA new_data ; SET raw_data ; LENGTH main_word $ 30 ; array array_word (&wordcount) $ 30 _temporary_(&wordlist); DO i = 1 to 10 ; IF INDEXW(text,array_word[i] ) > 0 THEN DO ; main_word = array_word[i] ; LEAVE ; END ; END ; RUN ;
Not sure what you are trying to do, but I presume that you want your word search to be case insensitive. If so, the following example should provide one way of approaching the task:
data list_words; informat words $11.; input words; cards; Lorem ipsum dolor sit amet consectetur adipiscing elit sed do ; data raw_data; informat text $50.; input text &; cards; Lorem ipsum dolor sit amet consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Lorem ipsum dolor sit amet consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Lorem ipsum dolor sit amet consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Lorem ipsum dolor sit amet consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Lorem ipsum dolor sit amet consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. ; DATA want (keep=text found); array array_word(999) $ _temporary_; length found $5; do until(eof1); set list_words end=eof1; n+1; array_word[n]=words; end; do until(eof2); SET raw_data end=eof2; call missing(found); DO i=1 to n; test=array_word[i]; if findw(text,strip(array_word[i]),' ','i') then do; found='Yes'; leave; end; end; IF missing(found) then found='no'; output; end; RUN ;
Art, CEO, AnalystFinder.com
Thanks to both, both of your programs worked and solved my issue, so the help was much appreciated.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.