Hi,
I am trying to split a character variable (DVTERM) dynamically, so that it goes up to 200 characters for the first term, and then up to 199 for each succeeding term. For example if a character variable is 1000 characters long, then it would get split 5 times.
Note: i cannot split words, each new term has to start with a new word and it can't be a truncated word
So basically, the character limits are:
DVT1=200
DVT2=199
DVT3=199
DVT4=199
DVT5=199
...
I have found some sources online that can help me in regards to this, but bc of i'm not a savvy programmer, i'm having difficulty adapting the code so that it extends out more terms than just the 2 listed in the example.
data have;
set source.dv;
describe_deviation=strip(compbl(dvterm));
length=length(dvterm);
c=substr(dvterm,201,1);
if .<length<=200 and missing(c) then dvterm=dvt1;
else if 200<length<=400 or not missing(c) then do;
if substr(dvterm,200,1)="" or substr(dvt1,201,1)="" then do;
dvt1=substr(dvterm,1,200);
dvt2=strip(substr(dvterm,201));
end;
else do;
length1=200-length(scan(substr(dvterm,1,200),-1,""));
dvt1=substr(dvterm,1,length1);
dvt2=strip(substr(dvterm,length1+1,200));
end;
end;
run;
source: https://www.lexjansen.com/phuse-us/2021/ct/PAP_CT07.pdf
Here is an example data set about random rainbow facts. For the sake of simplicity, please adapt the code so that instead of 200, it's 20. And instead of 199 for the succeeding split off terms, please make it 19.
DATA rainbow_facts; infile datalines dsd dlm=","; LENGTH Fact $80.; INPUT Fact $; length=length(fact); DATALINES; Rainbows, Rainbows are a mix of light refraction and dispersion., Rainbows have inspired countless myths and legends., The longest observed rainbow lasted for nine hours in Taiwan., Each rainbow color emerges due to different wavelengths of light., Rainbows are not physical objects- they cannot be approached or touched. ; RUN;
Thank you
You can always rebuild the single observation result from the multiple observation result using PROC TRANSPOSE or other methods to convert observations into variables.
Try something like this:
data want;
set rainbow_facts;
index=1;
length next $20;
copy=left(fact);
array x $20 factoid1-factoid5 ;
do until(copy=' ');
next=scan(copy,1,' ');
if length(catx(' ',x[index],next))<= 20 then do;
x[index]=catx(' ',x[index],next);
end;
else do;
index+1;
x[index]=next;
end;
copy=left(substr(copy,length(next)+1));
end;
run;
Try it with some really long words.
The tricky part is deciding how many variables you need to make (the ARRAY statement). If you make it too few you might need to have logic to prevent indexing past the end of the ARRAY. Even though 80/20 is 4 I had to make 5 variables to handle all of your examples.
Why not just split it into exact chunks?
data want;
set rainbow_facts;
length loc 8 factoid $20 ;
do loc=1 to length by 20;
factoid=substr(fact,loc,20);
output;
end;
run;
Result
Obs length loc factoid fact 1 8 1 Rainbows Rainbows 2 54 1 Rainbows are a mix o Rainbows are a mix of light refraction and dispersion. 3 54 21 f light refraction a Rainbows are a mix of light refraction and dispersion. 4 54 41 nd dispersion. Rainbows are a mix of light refraction and dispersion. 5 51 1 Rainbows have inspir Rainbows have inspired countless myths and legends. 6 51 21 ed countless myths a Rainbows have inspired countless myths and legends. 7 51 41 nd legends. Rainbows have inspired countless myths and legends. 8 61 1 The longest observed The longest observed rainbow lasted for nine hours in Taiwan. 9 61 21 rainbow lasted for The longest observed rainbow lasted for nine hours in Taiwan. 10 61 41 nine hours in Taiwan The longest observed rainbow lasted for nine hours in Taiwan. 11 61 61 . The longest observed rainbow lasted for nine hours in Taiwan. 12 65 1 Each rainbow color e Each rainbow color emerges due to different wavelengths of light. 13 65 21 merges due to differ Each rainbow color emerges due to different wavelengths of light. 14 65 41 ent wavelengths of l Each rainbow color emerges due to different wavelengths of light. 15 65 61 ight. Each rainbow color emerges due to different wavelengths of light. 16 72 1 Rainbows are not phy Rainbows are not physical objects- they cannot be approached or touched. 17 72 21 sical objects- they Rainbows are not physical objects- they cannot be approached or touched. 18 72 41 cannot be approached Rainbows are not physical objects- they cannot be approached or touched. 19 72 61 or touched. Rainbows are not physical objects- they cannot be approached or touched.
Or are you trying to not split "words"?
You can always rebuild the single observation result from the multiple observation result using PROC TRANSPOSE or other methods to convert observations into variables.
Try something like this:
data want;
set rainbow_facts;
index=1;
length next $20;
copy=left(fact);
array x $20 factoid1-factoid5 ;
do until(copy=' ');
next=scan(copy,1,' ');
if length(catx(' ',x[index],next))<= 20 then do;
x[index]=catx(' ',x[index],next);
end;
else do;
index+1;
x[index]=next;
end;
copy=left(substr(copy,length(next)+1));
end;
run;
Try it with some really long words.
The tricky part is deciding how many variables you need to make (the ARRAY statement). If you make it too few you might need to have logic to prevent indexing past the end of the ARRAY. Even though 80/20 is 4 I had to make 5 variables to handle all of your examples.
You want the first substring of 5 to have its last non-blank character at position 200 or less. And at position 199 or less for the 2nd through 5th. Because you want strings separated at blanks, you'll need a 6th substring to accommodate the fact that substrings will not always be "full". For instance, if the longest "word" is 20, you could have up to 20 blanks at the end of each of the first 5 substrings, necessitating a 6th of up to 104 characters (5 20-character words plus 4 blanks).
This will satisfy that requirement:
data want (drop=_:);
set have;
array dvt {6} $200 dvt1-dvt6 ;
_temp=txt;
do i =1 to 6 while (_temp^=' ');
/*Find rightmost blank for a 200 (199) character substr*/
do p=ifn(i=1,201,200) by -1 while (char(_temp,p)^=' ');
end;
dv{i}=substr(_temp,1,p-1); /*Copy leftmost substring */
_temp=substr(_temp,p+1); /*Left justify the rest of the original string*/
end;
run;
options validvarname=any mrecall nofmterr nonumber nodate dkricond=nowarn dkrocond=nowarn compress=yes;
/*the path for raw sas datasets <--- input path*/
libname in v9 'C:\temp\a\1.处理原始SAS\SAS' access=readonly;
/*the path for processed sas datasets <--- output path*/
libname out v9 'C:\temp\a\1.处理原始SAS\跑完后SAS' ;
options mprint mlogic symbolgen;
/*清空WORK库的 数据集*/
proc datasets library=work kill nolist nodetails;
quit;
/*清空导出库的 数据集*/
proc datasets library=out kill nolist nodetails;
quit;
/*将 数据集名 统一为 小写并copy到OUT库中*/
proc copy in=in out=out memtype=data;
run;
/*计算所有数据集中字符变量的长度*/
data _null_;
set sashelp.vcolumn(keep=libname memname name type where=(libname='OUT' and upcase(type)='CHAR')) end=last;
by memname;
if _n_=1 then call execute('proc sql;');
if first.memname then call execute(catt('create table _',memname,' as select '));
call execute(cat('max(length(',strip(name),')) as ',name));
if not last.memname then call execute(',');
else call execute(catt('from out.',memname,';'));
if last then call execute('quit;');
run;
/*将上面产生的数据集转置 wide -> long*/
data _null_;
set sashelp.vtable(keep=libname memname where=(libname='WORK' and memname =: '_'));
call execute(catt('proc transpose data=',memname,' out=_',memname,';run;'));
run;
/*合并转置的数据集*/
data change_var_len;
length _name_ $ 40;
set __: indsname=indsname;
dsn=indsname;
run;
/*长度大于200 的变量*/
proc sql;
create table change_var_len_2 as
select *,substr(dsn,8) as new_dsn length=40
from change_var_len
where col1 >200
order by new_dsn,_name_;
quit;
%macro add_id;
%let dsid=%sysfunc(open(change_var_len_2));
%let nobs=%sysfunc(attrn(&dsid,nlobs));
%let dsid=%sysfunc(close(&dsid));
%if &nobs ne 0 %then %do;
proc sql;
create table change_var_len_3 as
select distinct new_dsn from change_var_len_2;
quit;
data _null_;
set change_var_len_2;
call execute(catt('data out.',new_dsn,';set out.',new_dsn,';_id_+1;run;'));
run;
%end;
%mend;
%add_id
/********************变量存储长度大于200 的进行分割**********************************/
%macro split_var_len(dsn= ,vname=);
proc datasets library=work nolist nodetails;
save change_var_len_2;
quit;
data _&dsn.;
set out.&dsn.;
if length(&vname.)>200;
keep _id_ &vname.;
run;
data __&dsn.;
set _&dsn.;
length temp _lag_temp $ 300 _temp $ 8;
do i=1 to klength(&vname.);
_lag_temp=temp;
_temp=ksubstr(&vname.,i,1);
temp=cats(temp,_temp);
if length(temp)>200 then do;
temp=_lag_temp;output;
temp=_temp;
end;
end;
if length(temp)<=200 then output;
keep _id_ temp;
run;
proc sql;
alter table __&dsn.
modify temp char(200) format=$200.;
quit;
proc transpose data=__&dsn. out=___&dsn.(drop=_name_ _type_) prefix=&vname._ ;
by _id_;
var temp;
run;
data out.&dsn.;
merge out.&dsn. ___&dsn.;
by _id_;
if missing(&vname._1) then &vname._1= &vname.;
drop &vname.;
run;
%mend;
%macro do_split;
%let dsid=%sysfunc(open(change_var_len_2));
%let nobs=%sysfunc(attrn(&dsid,nlobs));
%let dsid=%sysfunc(close(&dsid));
%if &nobs ne 0 %then %do;
data _null_;
set change_var_len_2;
call execute(catt('%split_var_len(dsn=',new_dsn,',vname=',_name_,')'));
run;
%end;
%mend;
%do_split
%macro drop_id;
%let dsid=%sysfunc(open(change_var_len_2));
%let nobs=%sysfunc(attrn(&dsid,nlobs));
%let dsid=%sysfunc(close(&dsid));
%if &nobs ne 0 %then %do;
proc sql;
create table change_var_len_3 as
select distinct new_dsn from change_var_len_2;
quit;
data _null_;
set change_var_len_2;
call execute(catt('data out.',new_dsn,';set out.',new_dsn,';drop _id_;run;'));
run;
%end;
%mend;
%drop_id
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.