SKK
Calcite | Level 5

Hi, I want to split a sentence into variables of the same length, but I don't want words to be truncated in the new variables. If a word would be truncated, it must move on to the next variable.

For example: if I have a sentence of 200 bytes, I want to split it into 4 variables of 50 bytes each, and none of them should contain a truncated word.

I have written a complex macro to do this, but I am looking for a much simpler method.

RW9
Diamond | Level 26

Hi,

You can do it in several ways. Here is a data step that uses SUBSTR to chop up the string:

data have (drop=find_last_space);
  attrib really_long_string format=$200.;
  attrib only_50_chars format=$50.;
  attrib find_last_space format=best.;
  really_long_string="This is a long sentance which will be trimmed at this word as it splits over the row and also at this record as that splits as well.";
  do until (length(strip(really_long_string))=1);
    find_last_space=index(reverse(substr(really_long_string,1,50))," ");
    only_50_chars=substr(really_long_string,1,50-find_last_space);
    really_long_string=substr(really_long_string,52-find_last_space);  /* Note use 52, or strip the result to get rid of blank */
    output;
  end;
run;
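
The key step in the data step above is finding the last blank within the first 50 characters by reversing the substring and searching for the first blank. A minimal sketch of that arithmetic on a shorter string, assuming a hypothetical width of 12; the names `s`, `width`, `pos_from_end`, `last_space`, `piece` and `rest` are illustrative and not part of the original code:

```sas
data _null_;
  s = "alpha beta gamma";
  width = 12;                                   /* hypothetical width, 50 in the post above */
  /* position of the first blank counted from the END of the first WIDTH characters */
  pos_from_end = index(reverse(substr(s, 1, width)), " ");
  /* convert to a position counted from the start of the string */
  last_space = width + 1 - pos_from_end;
  piece = substr(s, 1, last_space - 1);         /* whole words that fit */
  rest  = substr(s, last_space + 1);            /* text carried to the next piece */
  put pos_from_end= last_space= piece= rest=;
  /* log: pos_from_end=2 last_space=11 piece=alpha beta rest=gamma */
run;
```

INDEX returns 0 when the first `width` characters contain no blank at all (a single word longer than the width), so that case needs separate handling.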

Ksharp
Super User
data have;
x="Hi i want to split a sentence into variables with same length, but i dont want the words to be truncated in the new variable,if it is going to be truncated then it must move on to the next variable.For eg: If i have sentence of length";
output;
x="For eg: If i have sentence of length 200 bytes, i want to split the sentence into 4 variables each of 50 bytes and they should not have truncated words.I have done a complex macro to do this but i am looking for a much simpler method";
output;
run;
data x;
 set have;
 length token $ 100;
 group+1;
 do i=1 to countw(x,' ');
  token=scan(x,i,' ');
  output;
 end;
 drop x;
run;
proc rank data=x out=x1 groups=4;
by group;
var i;
ranks r;
run;
data x2;
 set x1;
 by group r;
 length x $ 200;
 retain x;
 x=catx(' ',x,token);
 if last.r then do; output; call missing(x);end;
 keep group r x;
run;
proc transpose data=x2 out=want prefix=var;
by group;
var x;
id r;
run;


Xia Keshan

SKK
Calcite | Level 5

Thank you both for your ideas.

But I wanted a simple, macro-like program that accepts an input sentence and creates sub-variables of a small length, such as 50 bytes.

PROC RANK seems to be an excellent idea, but the number of groups has to be found by trial and error for a given sub-variable length to get accurate results, so I don't think it is useful for automation.

Please go through my program below and give me some ideas to improve it.

***Input sentence***;

%let sentence=We mostly use the default behavior of the DATA STEP to create working code. However, certain common tasks are made easier by overriding the default behavior. In the Pharmaceutical Industry one such common task is LOCF.;

***Counting the words***;

%let nn=%sysfunc(countw("&sentence"));

***Macro for splitting the sentence into separate variable***;

%macro word_split ( sub_var_name=txt, /*sub variable name*/
     sub_var_len=50,  /*length of sub variable*/
     No_sub_var=4);  /*No of sub variable to be created*/
   data new;
   length s $200  &sub_var_name.1-&sub_var_name.&No_sub_var $&sub_var_len;
   Sentence="&sentence";
   array a(*) $20 w1-w&nn;  /*Splitting the sentence into individual words*/
   do i=  1 to &nn;
    a(i)=scan("&sentence",i);
   end;
   do i=  1 to &nn;
    retain s;
    a1=s;
    s=catx (" ",trim(a1),trim(a(i)));  /*Concatenating the words one by one and*/
    nlen=length(s);      /*calculating its length*/
     if nlen gt %sysevalf((&sub_var_len)-10) then do;
      %macro split_var ();  /*Macro for creating sub variable with values*/
       %do i= 1 %to &No_sub_var;
        if (length(trim(&sub_var_name.&i)) le 1) then
         do;
          &sub_var_name.&i = s;
          s=" ";
          nlen=.;
         end;
       %end;
      %mend;
      %split_var ();
     end;
    if nlen ne . then &sub_var_name.&No_sub_var = s;
   end;
  keep Sentence &sub_var_name.1-&sub_var_name.&No_sub_var;
run;

%mend;
***end of Macro***;

%word_split (sub_var_name=var,
     sub_var_len=50, 
     No_sub_var=5)

Ksharp
Super User

You can get everything you want by adapting my code, once you understand it. Creating the new variables and setting their length is not a big deal; you can work both out before the data step with macro functions. I also believe PROC RANK can handle all of this.
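
One way to read this suggestion is to fill the sub-variables word by word inside a single data step, with the width and the number of variables supplied as macro variables. The following is a minimal sketch of that idea, not the code from any earlier post: it uses a greedy fill instead of PROC RANK, and the names `want`, `word`, `k` and `nwords` are illustrative assumptions. It also assumes that no single word is longer than the sub-variable width; a longer word would be truncated.

```sas
%let sentence=We mostly use the default behavior of the DATA STEP to create working code. However, certain common tasks are made easier by overriding the default behavior. In the Pharmaceutical Industry one such common task is LOCF.;
%let sub_var_len=50;   /* length of each sub variable  */
%let no_sub_var=5;     /* number of sub variables      */

data want;
  length sentence $400 var1-var&no_sub_var $&sub_var_len word $&sub_var_len;
  array v(*) var1-var&no_sub_var;
  sentence = "&sentence";
  nwords = countw(sentence, ' ');
  k = 1;                                   /* index of the variable being filled */
  do i = 1 to nwords while (k <= dim(v));
    word = scan(sentence, i, ' ');
    /* move to the next variable when this word (plus a blank) would overflow */
    if lengthn(v(k)) > 0 and lengthn(v(k)) + 1 + lengthn(word) > &sub_var_len
      then k = k + 1;
    if k <= dim(v) then v(k) = catx(' ', v(k), word);
  end;
  if k > dim(v) then
    put "WARNING: the sentence does not fit in &no_sub_var variables of &sub_var_len bytes.";
  drop i k word nwords;
run;
```

Because each word is only appended when it fits, no trial and error on the number of groups is needed; the warning fires when the requested number of variables turns out to be too small.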

RW9
Diamond | Level 26

Pre-process the string to insert a split character, then split the string out in a data step:

%macro Chop (inds=,invar=,num_output_cols=,max_length=40);
data &inds. (drop=text_to_process);
  set &inds.;
  attrib &invar._new_text format=$2000.;
  attrib text_to_process format=$2000.;
  text_to_process=&invar.;
  do while (length(trim(text_to_process)) > &max_length.);
   &invar._new_text=trim(&invar._new_text)||'$'||substr(text_to_process,1,prxmatch("/\/[^\/]*$/",substr(tranwrd(text_to_process,' ','/'),1,&max_length.)));
   text_to_process=substr(text_to_process,prxmatch("/\/[^\/]*$/",substr(tranwrd(text_to_process,' ','/'),1,&max_length.)));
  end;
  &invar._new_text=trim(&invar._new_text)||'$'||trim(text_to_process);
run;
  data &inds.;
    set &inds.;
    attrib
    %do i=1 %to &num_output_cols.;
      split_&i.
    %end;
      format=$&max_length..;
    %do i=1 %to &num_output_cols.;
      split_&i.=scan(&invar._new_text,&i.+1,"$");
    %end;
  run;
%mend Chop;

data my_test;
  attrib really_long_string format=$200.;
  really_long_string="This is a long sentance which will be trimmed at this word as it splits over the row and also at this record as that splits as well.";
  output;
run;

%Chop (inds=work.my_test,num_output_cols=4,invar=really_long_string);

Patrick
Opal | Level 21

You could use a regular expression as shown below.

data have;
string="Hi i want to split a sentence into variables with same length, but i dont want the words to be truncated in the new variable,if it is going to be truncated then it must move on to the next variable.For eg: If i have sentence of length";
output;
string="For eg: If i have sentence of length 200 bytes, i want to split the sentence into 4 variables each of 50 bytes and they should not have truncated words.I have done a complex macro to do this but i am looking for a much simpler method";
output;
run;

data prep /*(drop=_:)*/;
  length sub_string $50;
  retain _re;
  length _start _stop _pos _len 8;

  if _N_ = 1 then
    _re = prxparse('/(?<=\b)[\w].{0,48}([^\w]|$)/');
  set have;
  _start=1;
  _stop = lengthn(string);
  do until (_start>_stop);
    call prxnext(_re, _start, _stop, string, _pos, _len);
    sub_string = substrn(string, _pos, _len);
    output;
    if missing(sub_string) then leave;
  end;
run;

proc transpose data=prep out=want(drop=_:) prefix=var;
  by string notsorted;
  var sub_string;
run;
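
The pattern passed to PRXPARSE above starts at a word boundary, takes one word character plus up to 48 further characters, and must end on a non-word character or at the end of the string. Each match is therefore at most 50 characters long (the closing delimiter counts toward that limit) and ends just after a blank or punctuation mark. A small sketch of the same pattern with the limit reduced to 8, so that pieces are at most 10 characters, written in the loop form where CALL PRXNEXT is repeated until it returns position 0; the sample text and the names `text`, `piece`, `re`, `pos` and `len` are illustrative assumptions:

```sas
data _null_;
  length piece $10;
  text  = "alpha beta gamma delta";
  re    = prxparse('/(?<=\b)[\w].{0,8}([^\w]|$)/');
  start = 1;
  stop  = lengthn(text);
  /* PRXNEXT advances START past each match and sets POS to 0 when none is left */
  call prxnext(re, start, stop, text, pos, len);
  do while (pos > 0);
    piece = substrn(text, pos, len);
    put pos= len= piece=;
    call prxnext(re, start, stop, text, pos, len);
  end;
run;
```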

Ksharp
Super User

Here it is.

%let sentence=We mostly use the default behavior of the DATA STEP to create working code. However, certain common tasks are made easier by overriding the default behavior. In the Pharmaceutical Industry one such common task is LOCF.;

%macro word_split ( sub_var_name=txt, /*sub variable name*/
     sub_var_len=50,  /*length of sub variable*/
     No_sub_var=4);  /*No of sub variable to be created*/
data x;
 length token $ 100;
 x="&sentence";
 do i=1 to countw(x,' ');
  token=scan(x,i,' ');
  length=1+length(token);
  output;
 end;
 drop x;
run;
proc rank data=x out=x1 groups=&No_sub_var ;
var i;
ranks r;
run;
proc sql;
select  max(range) into : max_len from (
 select sum(length)     as range
  from x1
   group by r
);
quit;
%if &max_len lt &sub_var_len %then %do;
  data x2;
 set x1;
 by r;
 length x $ &sub_var_len ;
 retain x;
 x=catx(' ',x,token);
 if last.r then do; output; call missing(x);end;
 keep  r x;
run;
proc transpose data=x2 out=want prefix=&sub_var_name ;
var x;
id r;
run;
 %end;
 %else %put ERROR: variable length is too short;
%mend word_split;

options mprint mlogic symbolgen;
%word_split (sub_var_name=var, 
     sub_var_len=50,  
     No_sub_var=5)

Xia Keshan

Message was edited by: xia keshan



Discussion stats
  • 9 replies
  • 5096 views
  • 3 likes
  • 5 in conversation