Hi, I want to split a sentence into variables of the same length, but I don't want words to be truncated in the new variables.
If a word would be truncated, it must move on to the next variable.
For example: if I have a sentence of 200 bytes, I want to split it into 4 variables of 50 bytes each, and none of them
should contain a truncated word.
I have written a complex macro to do this, but I am looking for a much simpler method.
Hi,
You can do it in several ways. Here is a data step that uses SUBSTR to chop up the string, writing one row per chunk:
data have (drop=find_last_space);
  attrib really_long_string format=$200.;
  attrib only_50_chars format=$50.;
  attrib find_last_space format=best.;
  really_long_string="This is a long sentence which will be trimmed at this word as it splits over the row and also at this record as that splits as well.";
  /* Loop until nothing is left; LENGTH returns 1 for a blank string */
  do until (length(strip(really_long_string))=1);
    /* Offset of the last blank within the first 50 characters, counted back from position 50 */
    find_last_space=index(reverse(substr(really_long_string,1,50))," ");
    only_50_chars=substr(really_long_string,1,50-find_last_space);
    really_long_string=substr(really_long_string,52-find_last_space); /* Note: use 52, or strip the result, to get rid of the leading blank */
    output;
  end;
run;
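To see how the REVERSE/INDEX arithmetic in the step above finds the cut point, here is a small worked sketch that uses a 10-character window instead of 50. The text and the names cut_demo, window_len, head and rest are made up for illustration; they are not part of the answer above.

```sas
data cut_demo;
  s = "one two three four";   /* illustrative text, not from the thread */
  window_len = 10;            /* stands in for the 50 used above */
  /* The first 10 characters are "one two th"; reversed they read "ht owt eno". */
  /* The first blank in the reversed text is the last blank in the window.      */
  find_last_space = index(reverse(substr(s, 1, window_len)), " ");
  /* Keep everything before that blank ... */
  head = substr(s, 1, window_len - find_last_space);
  /* ... and restart just after it (window_len + 2 mirrors 52 = 50 + 2). */
  rest = substr(s, window_len + 2 - find_last_space);
  put find_last_space= head= rest=;
run;
```

The log should show find_last_space=3, head=one two and rest=three four. The same arithmetic assumes the window contains at least one blank; a single word longer than the window is not handled.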
Try http://support.sas.com/kb/24/672.html
BR,
n4n
data have;
 x="Hi i want to split a sentence into variables with same length, but i dont want the words to be truncated in the new variable,if it is going to be truncated then it must move on to the next variable.For eg: If i have sentence of length";
 output;
 x="For eg: If i have sentence of length 200 bytes, i want to split the sentence into 4 variables each of 50 bytes and they should not have truncated words.I have done a complex macro to do this but i am looking for a much simpler method";
 output;
run;

data x;
 set have;
 length token $ 100;
 group+1;
 do i=1 to countw(x,' ');
  token=scan(x,i,' ');
  output;
 end;
 drop x;
run;

proc rank data=x out=x1 groups=4;
 by group;
 var i;
 ranks r;
run;

data x2;
 set x1;
 by group r;
 length x $ 200;
 retain x;
 x=catx(' ',x,token);
 if last.r then do;
  output;
  call missing(x);
 end;
 keep group r x;
run;

proc transpose data=x2 out=want prefix=var;
 by group;
 var x;
 id r;
run;
Xia Keshan
Thank you, guys, for your ideas.
What I wanted, though, is a simple macro-like program that accepts an input sentence and creates sub-variables of a small length, such as 50 bytes.
PROC RANK seems to be an excellent idea, but it groups by word count rather than by length, so the number of groups has to be found by trial and error against the sub-variable length to get accurate results. I therefore think it is not useful for automation.
Please go through my program below and give me some ideas to improve it.
***Input sentence***;
%let sentence=We mostly use the default behavior of the DATA STEP to create working code. However, certain common tasks are made easier by overriding the default behavior. In the Pharmaceutical Industry one such common task is LOCF.;
***Counting the words***;
%let nn=%sysfunc(countw("&sentence"));
***Macro for splitting the sentence into separate variable***;
%macro word_split ( sub_var_name=txt, /*sub variable name*/
sub_var_len=50, /*length of sub variable*/
No_sub_var=4); /*No of sub variable to be created*/
data new;
length s $200 &sub_var_name.1-&sub_var_name.&No_sub_var $&sub_var_len;
Sentence="&sentence";
array a(*) $20 w1-w&nn; /*Splitting the sentence into individual words*/
do i= 1 to &nn;
a(i)=scan("&sentence",i);
end;
do i= 1 to &nn;
retain s;
a1=s;
s=catx (" ",trim(a1),trim(a(i))); /*Concatenating the words one by one and*/
nlen=length(s); /*calculating its length*/
if nlen gt %sysevalf((&sub_var_len)-10) then do;
%macro split_var (); /*Macro for creating sub variable with values*/
%do i= 1 %to &No_sub_var;
if (length(trim(&sub_var_name.&i)) le 1) then
do;
&sub_var_name.&i = s;
s=" ";
nlen=.;
end;
%end;
%mend;
%split_var ();
end;
if nlen ne . then &sub_var_name.&No_sub_var = s;
end;
keep Sentence &sub_var_name.1-&sub_var_name.&No_sub_var;
run;
%mend;
***end of Macro***;
%word_split (sub_var_name=var,
sub_var_len=50,
No_sub_var=5)
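For comparison, the same word-by-word accumulation can be sketched as a single data step with an array, without the nested macro definition or the fixed margin of 10. This is only a sketch under assumptions: the macro name wrap_words, its parameters and the data set names demo and demo_wrapped are hypothetical, words are taken to be blank-delimited, and a word longer than the target width is silently truncated. The demo text is a shortened version of the sentence above so that it fits in 200 bytes.

```sas
/* Sketch only: greedy word wrap of one character variable into n fixed-width variables. */
/* The macro name wrap_words and all of its parameters are hypothetical.                 */
%macro wrap_words(inds=, outds=, invar=, prefix=part, width=50, n=4);
  data &outds.;
    set &inds.;
    length &prefix.1-&prefix.&n. $&width. _w $&width.;
    array _p{&n.} $ &prefix.1-&prefix.&n.;
    _k = 1;                                   /* index of the variable being filled */
    do _i = 1 to countw(&invar., ' ');
      _w = scan(&invar., _i, ' ');
      /* Start the next variable when this word would overflow the current one */
      if lengthn(_p{_k}) > 0 and lengthn(_p{_k}) + 1 + lengthn(_w) > &width. then _k = _k + 1;
      if _k > &n. then do;
        put "WARNING: text does not fit in &n. variables of width &width..";
        leave;
      end;
      _p{_k} = catx(' ', _p{_k}, _w);
    end;
    drop _i _k _w;
  run;
%mend wrap_words;

data demo;
  length sentence $200;
  sentence = "We mostly use the default behavior of the DATA STEP to create working code. However, certain common tasks are made easier by overriding the default behavior.";
run;

%wrap_words(inds=demo, outds=demo_wrapped, invar=sentence, prefix=var, width=50, n=4)
```

Because the step compares the actual length of each candidate value with the target width, no trial and error on the number of groups is needed; if the text needs more variables than requested, the step writes a warning and stops filling.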
You can get everything you want by adapting my code, if you understand it. Creating the new variables and setting their length is not a big deal: you can work those values out before the data step with macro functions. I believe PROC RANK can also handle all of this.
Pre-process the string to insert a split character at each break point, then split the string out in a second data step:
%macro Chop (inds=,invar=,num_output_cols=,max_length=40);
data &inds. (drop=text_to_process);
set &inds.;
attrib &invar._new_text format=$2000.;
attrib text_to_process format=$2000.;
text_to_process=&invar.;
do while (length(trim(text_to_process)) > &max_length.);
&invar._new_text=trim(&invar._new_text)||'$'||substr(text_to_process,1,prxmatch("/\/[^\/]*$/",substr(tranwrd(text_to_process,' ','/'),1,&max_length.)));
text_to_process=substr(text_to_process,prxmatch("/\/[^\/]*$/",substr(tranwrd(text_to_process,' ','/'),1,&max_length.)));
end;
&invar._new_text=trim(&invar._new_text)||'$'||trim(text_to_process);
run;
data &inds.;
set &inds.;
attrib
%do i=1 %to &num_output_cols.;
split_&i.
%end;
format=$&max_length..;
%do i=1 %to &num_output_cols.;
split_&i.=scan(&invar._new_text,&i.+1,"$");
%end;
run;
%mend Chop;
data my_test;
attrib really_long_string format=$200.;
really_long_string="This is a long sentence which will be trimmed at this word as it splits over the row and also at this record as that splits as well.";
output;
run;
%Chop (inds=work.my_test,num_output_cols=4,invar=really_long_string);
You could use a regular expression as shown below.
data have;
string="Hi i want to split a sentence into variables with same length, but i dont want the words to be truncated in the new variable,if it is going to be truncated then it must move on to the next variable.For eg: If i have sentence of length";
output;
string="For eg: If i have sentence of length 200 bytes, i want to split the sentence into 4 variables each of 50 bytes and they should not have truncated words.I have done a complex macro to do this but i am looking for a much simpler method";
output;
run;
data prep /*(drop=_:)*/;
length sub_string $50;
retain _re;
length _start _stop _pos _len 8;
if _N_ = 1 then
_re = prxparse('/(?<=\b)[\w].{0,48}([^\w]|$)/');
set have;
_start=1;
_stop = lengthn(string);
do until (_start>_stop);
call prxnext(_re, _start, _stop, string, _pos, _len);
sub_string = substrn(string, _pos, _len);
output;
if missing(sub_string) then leave;
end;
run;
proc transpose data=prep out=want(drop=_:) prefix=var;
by string notsorted;
var sub_string;
run;
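Whichever approach is used, the result can be checked against the original requirement that no word is cut. The sketch below assumes a wide layout like the want data set above (the source text in string, the pieces in variables starting with var) and assumes words are separated by blanks. The data sets want_demo and check and the variables rebuilt and pieces_ok are hypothetical and exist only for the illustration.

```sas
/* Sketch only: two hand-made rows, one split cleanly and one with a word cut in two. */
data want_demo;
  length string $40 var1-var2 $20;
  string = "alpha beta gamma delta";
  var1 = "alpha beta";     var2 = "gamma delta"; output;  /* clean split      */
  var1 = "alpha beta gam"; var2 = "ma delta";    output;  /* word cut in two  */
run;

data check;
  set want_demo;
  length rebuilt $200;
  /* Glue the pieces back together with single blanks between them */
  rebuilt   = catx(' ', of var:);
  /* 1 when the glued text equals the source with repeated blanks collapsed, else 0 */
  pieces_ok = (rebuilt = compbl(strip(string)));
run;
```

Here the first row gets pieces_ok=1 and the second gets pieces_ok=0, because gluing the pieces puts a blank inside the cut word. Note that a break at punctuation with no following blank in the source would be flagged in the same way, so a 0 is a prompt to look at the row rather than proof of an error.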
Here it is.
%let sentence=We mostly use the default behavior of the DATA STEP to create working code. However, certain common tasks are made easier by overriding the default behavior. In the Pharmaceutical Industry one such common task is LOCF.;

%macro word_split ( sub_var_name=txt, /*sub variable name*/
 sub_var_len=50, /*length of sub variable*/
 No_sub_var=4); /*No of sub variable to be created*/

data x;
 length token $ 100;
 x="&sentence";
 do i=1 to countw(x,' ');
  token=scan(x,i,' ');
  length=1+length(token);
  output;
 end;
 drop x;
run;

proc rank data=x out=x1 groups=&No_sub_var ;
 var i;
 ranks r;
run;

proc sql;
 select max(range) into : max_len
  from ( select sum(length) as range from x1 group by r );
quit;

%if &max_len lt &sub_var_len %then %do;
data x2;
 set x1;
 by r;
 length x $ &sub_var_len ;
 retain x;
 x=catx(' ',x,token);
 if last.r then do;
  output;
  call missing(x);
 end;
 keep r x;
run;

proc transpose data=x2 out=want prefix=&sub_var_name ;
 var x;
 id r;
run;
%end;
%else %put ERROR: variable length is too short;

%mend word_split;

options mprint mlogic symbolgen;

%word_split (sub_var_name=var,
 sub_var_len=50,
 No_sub_var=5)
Xia Keshan
Message was edited by: xia keshan