BookmarkSubscribeRSS Feed
keen_sas
Quartz | Level 8

I have 2 data-sets
Keyword data set is having keywords to find and replace the sub strings in the variable.Second data set will have multiple character variables in which we have to identify the keywords present in  these character variables and replace them with the substring present in keyword data set.
i am using the below code to identify the keywords in TEST data set using the keywords present in KEYWORD data set using do loop. Since this is simple data set present only few keywords in the KEYWORD data set and the code executes swiftly.
But if the KEYWORD data set has more than 400 keywords and second data set has more character variables (assume 10 variable), the code will take more time to execute. Since the loop checks for each observation for
400(keywords)*10(variables)*No. of observations present in the data set.

Another disadvantage with the below code is TRANWRD function is case sensitive. 
Is there any alternative method to avoid this do loop and execute the code faster may be using PERL functions or any easy method.

 

/*dataset with multiple character variables*/
data test;
length text $200. ;
input seq text $3 - 40 ;
cards ;
1 Actual date Of The introduction
2 Change consisted In The Insertion
3 Hard And Fast Rule
4 Do OR Die
5 Many A times
6 2d echo tests
7 ast and Bili are the tests
8 Attention To Orthography
9 Conclusions Were closed
;
run;

data test;
set test ;
original=text;
text2=text;
run ;

 

/*KEYWORD dataset to find and replace the substring*/
data keyword ;
length have want $20. ;
input have want ;
cards;
a A
an AN
And and
Or or
To to
Ast AST
bili BILI
were Were
2d 2D
;
run ;

 

/*creating macro variables for keywords*/
proc sql noprint;
select have,want into :source separated by ' ',:target separated by ' ' from keyword ;
quit;

%let cnt_wrd=%sysfunc(countw(&source,' ' )) ;


/*identifying the keyword and replacing*/
data final;
set test;
do w= 1 to &cnt_wrd ;
tgt_=scan("&source",w,' ') ;
rep_=scan("&target",w,' ') ;
/* checking for multiple variables*/
text=tranwrd(text,strip(tgt_)||" ",strip(rep_)||" ") ;
text2=tranwrd(text2,strip(tgt_)||" ",strip(rep_)||" ") ;
end;
run;

 

Any alternative to avoid this do loop .

4 REPLIES 4
Shmuel
Garnet | Level 18

There is no need to enter twice same word but different case.

You can use either FIND or FINDW function with modifier to ignore case.

See sas documentation of those functions.

 

ballardw
Super User

This writes code that you can test this approach with:

data _null_;
   set work.keyword end=last;
   file print;
   length str $ 100;
   if _n_= 1 then do;
      put "data want; Set test; array a text;";
      put "do i = 1 to dim(a);";
   end;
   str = cats('a[i]= tranwrd(a[i],',quote(strip(have)||' '),',' ,quote(strip(want)||' '),');');
   put str;
   if last then do;
   put 'end; run;';
   end;
run;

File Print will write the output or results window. So copy that back to the editor to test.

obviously change the name of the input and output data set names of the first put statement. Replace the word text with the list of variables to apply the rules to.

You could write the code directly to a program file by specifying the name on the File statement. Then use %include to execute the code.

Or look up CALL EXECUTE. it would be relatively easy to transform the above code to CALL Execute. Basically replace the Put with

Call Execute ( ); with either the literal text or the variable Str.

ChrisNZ
Tourmaline | Level 20

>Is there any alternative method to avoid this do loop and execute the code faster may be using PERL functions or any easy method.

Perl expressions are much more expensive than more common functions like TRANWRD, and are not very suited to a list of replacement values.

 

Since you want to compare every word in every string with every keyword, you have little choice but to iterate.

Your logic might be tweakable, but potential gains are limited.

 

Another approach (and I am usure how much faster it is -if at all- you have to test on your data) is to match the words like so:

data WORDS (keep=SEQ WORD WORDU WORDNB);
  set TEST;
  length WORD $20;
  do WORDNB=1 to 99 ;
    WORD =scan(TEXT,WORDNB);
    if WORD=' ' then leave;  
    WORDU=upcase(WORD);    
    output;
  end;
run;

data KEYWORDS; 
  set KEYWORD;
  HAVE =upcase(HAVE);
run;

proc sql _method magic=103;
  create table WORDS_REPLACED as
  select SEQ, coalesce(WANT, WORD) as WORD_REPLACED
  from WORDS
         left join
       KEYWORDS
         on WORDU = HAVE
  order by SEQ, WORDNB;
quit;

data WANT;
  set WORDS_REPLACED;
  by SEQ;
  length WORD $200;
  retain WORD;
  if first.SEQ then call missing(WORD);
  WORD=catx(' ',WORD,WORD_REPLACED);
  if last.SEQ then output;
run;

 

ChrisNZ
Tourmaline | Level 20

Yet another way to possibly reduce the looping could be to to something like this:

 

select quote(upcase(trim(WANT))) into :upcase_list from KEYWORDS where lengthn(compress(WANT,,'l')=0;
select quote(upcase(trim(WANT))) into :locase_list from KEYWORDS where lengthn(compress(WANT,,'u')=0;

do I= 1 to countw(TEXT);
  WORD=scan(TEXT,I) ;
  if upcase(WORD) in (&upcase_list) then substr(TEXT, indexw(TEXT,WORD), ...) = upcase(WORD);
  if upcase(WORD) in (&locase_list) then substr(TEXT, indexw(TEXT,WORD), ...) = lowcase(WORD);
  if upcase(WORD) in (&prcase_list) then substr(TEXT, indexw(TEXT,WORD), ...) = propcase(WORD);
end;

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 493 views
  • 0 likes
  • 4 in conversation