Splitting of sentence into variables with meaningful words

Contributor SKK
Contributor
Posts: 35

Re: Splitting of sentence into variables with meaningful words

Hi, I want to split a sentence into variables of the same length, but I don't want words to be truncated in the new variables. If a word would be truncated, it must move on to the next variable.

For example: if I have a sentence of 200 bytes, I want to split it into 4 variables of 50 bytes each, and none of them should contain a truncated word.

I have written a complex macro to do this, but I am looking for a much simpler method.
Super User
Posts: 7,977

Re: Splitting of sentence into variables with meaningful words

Hi,

You can do it in several ways, here is a datastep using substr to chop up the string:

data have (drop=find_last_space);
  attrib really_long_string format=$200.;
  attrib only_50_chars format=$50.;
  attrib find_last_space format=best.;
  really_long_string="This is a long sentance which will be trimmed at this word as it splits over the row and also at this record as that splits as well.";
  /* Loop until nothing is left; testing length=1 would also stop on (and lose) a one-character remainder */
  do until (missing(really_long_string));
    find_last_space=index(reverse(substr(really_long_string,1,50))," ");
    only_50_chars=substr(really_long_string,1,50-find_last_space);
    really_long_string=substr(really_long_string,52-find_last_space);  /* Note use 52, or strip the result to get rid of blank */
    output;
  end;
run;
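
If the pieces are wanted side by side in one row (the 4 x 50 bytes layout from the question) rather than as one row per piece, a greedy fill over an array is another option. This is only a minimal sketch, not the code above: the names want, var1-var4, word, i and n are made up for illustration, and it assumes no single word is longer than 50 characters (a longer word would be truncated). Words that do not fit into the last variable are dropped.

```sas
data want (drop=i n word);
  length sentence $200 var1-var4 $50 word $50;
  array v(4) $ var1-var4;
  sentence = "This is a long sentence which will be split at word boundaries so that no word is cut in half across two variables.";
  n = 1;                                   /* index of the variable currently being filled */
  do i = 1 to countw(sentence, ' ');
    word = scan(sentence, i, ' ');
    /* If this word plus a separating blank would overflow, move to the next variable */
    if lengthn(v(n)) > 0 and lengthn(v(n)) + 1 + lengthn(word) > vlength(v(n)) then n = n + 1;
    if n > dim(v) then leave;              /* out of variables: remaining words are dropped */
    v(n) = catx(' ', v(n), word);
  end;
run;
```

Because each variable is filled as far as it will go before the next one is started, the number of variables actually used depends on the text, not on the word count.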

Super User
Posts: 10,041

Re: Splitting of sentence into variables with meaningful words

data have;
x="Hi i want to split a sentence into variables with same length, but i dont want the words to be truncated in the new variable,if it is going to be truncated then it must move on to the next variable.For eg: If i have sentence of length";
output;
x="For eg: If i have sentence of length 200 bytes, i want to split the sentence into 4 variables each of 50 bytes and they should not have truncated words.I have done a complex macro to do this but i am looking for a much simpler method";
output;
run;
data x;
 set have;
 length token $ 100;
 group+1;
 do i=1 to countw(x,' ');
  token=scan(x,i,' ');
  output;
 end;
 drop x;
run;
proc rank data=x out=x1 groups=4;
by group;
var i;
ranks r;
run;
data x2;
 set x1;
 by group r;
 length x $ 200;
 retain x;
 x=catx(' ',x,token);
 if last.r then do; output; call missing(x);end;
 keep group r x;
run;
proc transpose data=x2 out=want prefix=var;
by group;
var x;
id r;
run;


Xia Keshan
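
One thing to keep in mind with this approach: PROC RANK with GROUPS=4 divides the words by their position, so each piece gets roughly the same number of words, not the same number of characters. A quick check of the resulting lengths could look like the sketch below. It assumes the x2 data set from the code above and a 50-byte target; check, piece_len and too_long are names made up for illustration.

```sas
/* Assumes data set x2 from the code above (columns group, r, x) */
data check;
  set x2;
  piece_len = lengthn(x);          /* characters actually used in this piece */
  too_long  = (piece_len > 50);    /* 1 if the piece would not fit in a $50 variable */
run;

proc print data=check;
  where too_long;                  /* list only the pieces that exceed 50 characters */
run;
```

If any rows are printed, more groups (or a wider target variable) would be needed for that sentence.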

Contributor SKK
Contributor
Posts: 35

Re: Splitting of sentence into variables with meaningful words

Thank you guys for your ideas...

But I wanted a simple, macro-like program that accepts an input sentence and creates sub variables of a short length, such as 50 bytes.

PROC RANK seems to be an excellent idea, but the number of groups has to be found by trial and error, based on the sub variable length, to get accurate results. So I don't think it is useful for automation.

Please go through my program below and give me some ideas to improve it.

***Input sentence***;

%let sentence=We mostly use the default behavior of the DATA STEP to create working code. However, certain common tasks are made easier by overriding the default behavior. In the Pharmaceutical Industry one such common task is LOCF.;

***Counting the words***;

%let nn=%sysfunc(countw("&sentence"));

***Macro for splitting the sentence into separate variable***;

%macro word_split ( sub_var_name=txt, /*sub variable name*/
     sub_var_len=50,  /*length of sub variable*/
     No_sub_var=4);  /*No of sub variable to be created*/
   data new;
   length s $200  &sub_var_name.1-&sub_var_name.&No_sub_var $&sub_var_len;
   Sentence="&sentence";
   array a(*) $20 w1-w&nn;  /*Splitting the sentence into individual words*/
   do i=  1 to &nn;
    a(i)=scan("&sentence",i);
   end;
   do i=  1 to &nn;
    retain s;
    a1=s;
    s=catx (" ",trim(a1),trim(a(i)));  /*Concatenating the words one by one and*/
    nlen=length(s);      /*calculating its length*/
     if nlen gt %sysevalf((&sub_var_len)-10) then do;
      %macro split_var ();  /*Macro for creating sub variable with values*/
       %do i= 1 %to &No_sub_var;
        if (length(trim(&sub_var_name.&i)) le 1) then
         do;
          &sub_var_name.&i = s;
          s=" ";
          nlen=.;
         end;
       %end;
      %mend;
      %split_var ();
     end;
    if nlen ne . then &sub_var_name.&No_sub_var = s;
   end;
  keep Sentence &sub_var_name.1-&sub_var_name.&No_sub_var;
run;

%mend;
***end of Macro***;

%word_split (sub_var_name=var,
     sub_var_len=50, 
     No_sub_var=5)

Super User
Posts: 10,041

Re: Splitting of sentence into variables with meaningful words

You can get everything you want by adapting my code, if you understand it. Creating the new variables and setting their length is not a big deal; you can work those out before the DATA step with macro functions. I believe PROC RANK can handle all of this as well.

Super User
Posts: 7,977

Re: Splitting of sentence into variables with meaningful words

Pre-process the string and add your split character, then datastep split the string out:

%macro Chop (inds=,invar=,num_output_cols=,max_length=40);
data &inds. (drop=text_to_process);
  set &inds.;
  attrib &invar._new_text format=$2000.;
  attrib text_to_process format=$2000.;
  text_to_process=&invar.;
  do while (length(trim(text_to_process)) > &max_length.);
   &invar._new_text=trim(&invar._new_text)||'$'||substr(text_to_process,1,prxmatch("/\/[^\/]*$/",substr(tranwrd(text_to_process,' ','/'),1,&max_length.)));
   text_to_process=substr(text_to_process,prxmatch("/\/[^\/]*$/",substr(tranwrd(text_to_process,' ','/'),1,&max_length.)));
  end;
  &invar._new_text=trim(&invar._new_text)||'$'||trim(text_to_process);
run;
  data &inds.;
    set &inds.;
    attrib
    %do i=1 %to &num_output_cols.;
      split_&i.
    %end;
      format=$&max_length..;
    %do i=1 %to &num_output_cols.;
      split_&i.=scan(&invar._new_text,&i.+1,"$");
    %end;
  run;
%mend Chop;

data my_test;
  attrib really_long_string format=$200.;
  really_long_string="This is a long sentance which will be trimmed at this word as it splits over the row and also at this record as that splits as well.";
  output;
run;

%Chop (inds=work.my_test,num_output_cols=4,invar=really_long_string);

Respected Advisor
Posts: 4,173

Re: Splitting of sentence into variables with meaningful words

You could use a regular expression as shown below.

data have;
string="Hi i want to split a sentence into variables with same length, but i dont want the words to be truncated in the new variable,if it is going to be truncated then it must move on to the next variable.For eg: If i have sentence of length";
output;
string="For eg: If i have sentence of length 200 bytes, i want to split the sentence into 4 variables each of 50 bytes and they should not have truncated words.I have done a complex macro to do this but i am looking for a much simpler method";
output;
run;

data prep /*(drop=_:)*/;
  length sub_string $50;
  retain _re;
  length _start _stop _pos _len 8;

  if _N_ = 1 then
    _re = prxparse('/(?<=\b)[\w].{0,48}([^\w]|$)/');
  set have;
  _start=1;
  _stop = lengthn(string);
  do until (_start>_stop);
    call prxnext(_re, _start, _stop, string, _pos, _len);
    sub_string = substrn(string, _pos, _len);
    output;
    if missing(sub_string) then leave;
  end;
run;

proc transpose data=prep out=want(drop=_:) prefix=var;
  by string notsorted;
  var sub_string;
run;
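
To see what the pattern does on its own: it starts at a word boundary on a word character, takes up to 48 further characters, and must end on a non-word character or the end of the string, so a match is at most 50 characters long and never stops in the middle of a word. The sketch below applies the same pattern once with CALL PRXSUBSTR and writes the first match to the log; the variable names re, text, pos, len and piece are made up for illustration.

```sas
data _null_;
  length piece $50;
  /* Same pattern as above: word start, up to 48 more characters, then a non-word character or end of string */
  re = prxparse('/(?<=\b)[\w].{0,48}([^\w]|$)/');
  text = "Hi i want to split a sentence into variables with same length, but i dont want the words to be truncated";
  call prxsubstr(re, text, pos, len);   /* position and length of the first match */
  piece = substrn(text, pos, len);
  put pos= len= piece=;
run;
```

CALL PRXNEXT in the code above does the same thing repeatedly, moving the start position past each match.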

Super User
Posts: 10,041

Re: Splitting of sentence into variables with meaningful words

Here is a macro version.

%let sentence=We mostly use the default behavior of the DATA STEP to create working code. However, certain common tasks are made easier by overriding the default behavior. In the Pharmaceutical Industry one such common task is LOCF.;

%macro word_split ( sub_var_name=txt, /*sub variable name*/
     sub_var_len=50,  /*length of sub variable*/
     No_sub_var=4);  /*No of sub variable to be created*/
data x;
 length token $ 100;
 x="&sentence";
 do i=1 to countw(x,' ');
  token=scan(x,i,' ');
  length=1+length(token);
  output;
 end;
 drop x;
run;
proc rank data=x out=x1 groups=&No_sub_var ;
var i;
ranks r;
run;
proc sql;
select  max(range) into : max_len from (
 select sum(length)     as range
  from x1
   group by r
);
quit;
%if &max_len lt &sub_var_len %then %do;
  data x2;
 set x1;
 by r;
 length x $ &sub_var_len ;
 retain x;
 x=catx(' ',x,token);
 if last.r then do; output; call missing(x);end;
 keep  r x;
run;
proc transpose data=x2 out=want prefix=&sub_var_name ;
var x;
id r;
run;
 %end;
 %else %put ERROR: variable length is too short;
%mend word_split;

options mprint mlogic symbolgen;
%word_split (sub_var_name=var, 
     sub_var_len=50,  
     No_sub_var=5)

Xia Keshan
