Solved: Re: How do i adapt this code to split a character variable dynamically...

Hello_there · Posted 01-22-2024 01:23 PM

Hi,

I am trying to split a character variable (DVTERM) dynamically, so that it goes up to 200 characters for the first term, and then up to 199 for each succeeding term. For example if a character variable is 1000 characters long, then it would get split 5 times.

Note: i cannot split words, each new term has to start with a new word and it can't be a truncated word

So basically, the character limits are:

DVT1=200

DVT2=199

DVT3=199

DVT4=199

DVT5=199

...

I have found some sources online that can help me in regards to this, but bc of i'm not a savvy programmer, i'm having difficulty adapting the code so that it extends out more terms than just the 2 listed in the example.

data have;
set source.dv;
describe_deviation=strip(compbl(dvterm));
length=length(dvterm);
c=substr(dvterm,201,1);

if .<length<=200 and missing(c) then dvterm=dvt1;
else if 200<length<=400 or not missing(c) then do;
	if substr(dvterm,200,1)="" or substr(dvt1,201,1)="" then do;
		dvt1=substr(dvterm,1,200);
		dvt2=strip(substr(dvterm,201));
	end;
	else do;
	length1=200-length(scan(substr(dvterm,1,200),-1,""));
	dvt1=substr(dvterm,1,length1);
	dvt2=strip(substr(dvterm,length1+1,200));
	end;
end;
run;

source: https://www.lexjansen.com/phuse-us/2021/ct/PAP_CT07.pdf

Here is an example data set about random rainbow facts. For the sake of simplicity, please adapt the code so that instead of 200, it's 20. And instead of 199 for the succeeding split off terms, please make it 19.

DATA rainbow_facts;
infile datalines dsd dlm=",";
    LENGTH Fact $80.;
    INPUT Fact $;
    length=length(fact);
DATALINES;
Rainbows,
Rainbows are a mix of light refraction and dispersion.,
Rainbows have inspired countless myths and legends.,
The longest observed rainbow lasted for nine hours in Taiwan.,
Each rainbow color emerges due to different wavelengths of light.,
Rainbows are not physical objects- they cannot be approached or touched.
;
RUN;

Thank you

Tom · Posted 01-22-2024 02:34 PM

You can always rebuild the single observation result from the multiple observation result using PROC TRANSPOSE or other methods to convert observations into variables.

Try something like this:

data want;
  set rainbow_facts;
  index=1;
  length next $20;
  copy=left(fact);
  array x $20 factoid1-factoid5 ;
  do until(copy=' ');
    next=scan(copy,1,' ');
    if length(catx(' ',x[index],next))<= 20 then do;
      x[index]=catx(' ',x[index],next);
    end;
    else do;
      index+1;
      x[index]=next;
    end;
    copy=left(substr(copy,length(next)+1));
  end;  
run;

Try it with some really long words.

The tricky part is deciding how many variables you need to make (the ARRAY statement). If you make it too few you might need to have logic to prevent indexing past the end of the ARRAY. Even though 80/20 is 4 I had to make 5 variables to handle all of your examples.

View solution in original post

Tom · Posted 01-22-2024 01:54 PM

Why not just split it into exact chunks?

data want;
  set rainbow_facts;
  length loc 8 factoid $20 ;
  do loc=1 to length by 20;
    factoid=substr(fact,loc,20);
    output;
  end;
run;

Result

Obs length loc factoid fact

1 8 1 Rainbows Rainbows
2 54 1 Rainbows are a mix o Rainbows are a mix of light refraction and dispersion.
3 54 21 f light refraction a Rainbows are a mix of light refraction and dispersion.
4 54 41 nd dispersion. Rainbows are a mix of light refraction and dispersion.
5 51 1 Rainbows have inspir Rainbows have inspired countless myths and legends.
6 51 21 ed countless myths a Rainbows have inspired countless myths and legends.
7 51 41 nd legends. Rainbows have inspired countless myths and legends.
8 61 1 The longest observed The longest observed rainbow lasted for nine hours in Taiwan.
9 61 21 rainbow lasted for The longest observed rainbow lasted for nine hours in Taiwan.
10 61 41 nine hours in Taiwan The longest observed rainbow lasted for nine hours in Taiwan.
11 61 61 . The longest observed rainbow lasted for nine hours in Taiwan.
12 65 1 Each rainbow color e Each rainbow color emerges due to different wavelengths of light.
13 65 21 merges due to differ Each rainbow color emerges due to different wavelengths of light.
14 65 41 ent wavelengths of l Each rainbow color emerges due to different wavelengths of light.
15 65 61 ight. Each rainbow color emerges due to different wavelengths of light.
16 72 1 Rainbows are not phy Rainbows are not physical objects- they cannot be approached or touched.
17 72 21 sical objects- they Rainbows are not physical objects- they cannot be approached or touched.
18 72 41 cannot be approached Rainbows are not physical objects- they cannot be approached or touched.
19 72 61 or touched. Rainbows are not physical objects- they cannot be approached or touched.

Or are you trying to not split "words"?

Hello_there · Posted 01-22-2024 01:55 PM

yes i forgot to mention, i'm not supposed to split words. I'll put it in the OP

Hello_there · Posted 01-22-2024 01:57 PM

additionally the resultant data set should have only 6 observations, the same as the parent data set

Tom · Posted 01-22-2024 02:34 PM

You can always rebuild the single observation result from the multiple observation result using PROC TRANSPOSE or other methods to convert observations into variables.

Try something like this:

data want;
  set rainbow_facts;
  index=1;
  length next $20;
  copy=left(fact);
  array x $20 factoid1-factoid5 ;
  do until(copy=' ');
    next=scan(copy,1,' ');
    if length(catx(' ',x[index],next))<= 20 then do;
      x[index]=catx(' ',x[index],next);
    end;
    else do;
      index+1;
      x[index]=next;
    end;
    copy=left(substr(copy,length(next)+1));
  end;  
run;

Try it with some really long words.

The tricky part is deciding how many variables you need to make (the ARRAY statement). If you make it too few you might need to have logic to prevent indexing past the end of the ARRAY. Even though 80/20 is 4 I had to make 5 variables to handle all of your examples.

Hello_there · Posted 01-22-2024 02:37 PM

Thanks, Tom. I will experiment with this.

mkeintz · Posted 01-22-2024 04:56 PM

You want the first substring of 5 to have its last non-blank character at position 200 or less. And at position 199 or less for the 2nd through 5th. Because you want strings separated at blanks, you'll need a 6th substring to accommodate the fact that substrings will not always be "full". For instance, if the longest "word" is 20, you could have up to 20 blanks at the end of each of the first 5 substrings, necessitating a 6th of up to 104 characters (5 20-character words plus 4 blanks).

This will satisfy that requirement:

data want (drop=_:);
  set have;

  array dvt {6} $200 dvt1-dvt6 ;
  _temp=txt;

  do i =1 to 6 while (_temp^=' ');
    /*Find rightmost blank for a 200 (199) character substr*/
    do p=ifn(i=1,201,200) by -1 while (char(_temp,p)^=' ');  
    end;

    dv{i}=substr(_temp,1,p-1);  /*Copy leftmost substring */
    _temp=substr(_temp,p+1);    /*Left justify the rest of the original string*/
  end;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Ksharp · Posted 01-22-2024 09:31 PM

options validvarname=any mrecall nofmterr nonumber nodate dkricond=nowarn dkrocond=nowarn compress=yes;
/*the path for raw sas datasets <--- input path*/
libname in v9 'C:\temp\a\1.处理原始SAS\SAS'    access=readonly;

/*the path for processed sas datasets <--- output path*/
libname out v9 'C:\temp\a\1.处理原始SAS\跑完后SAS' ;




options mprint mlogic symbolgen;




/*清空WORK库的 数据集*/
proc datasets library=work kill nolist nodetails;
quit;
/*清空导出库的 数据集*/
proc datasets library=out kill nolist nodetails;
quit;
/*将 数据集名 统一为 小写并copy到OUT库中*/
proc copy in=in out=out memtype=data;
run;
/*计算所有数据集中字符变量的长度*/
data _null_;
 set sashelp.vcolumn(keep=libname memname name type where=(libname='OUT' and upcase(type)='CHAR')) end=last;
 by memname;
 if _n_=1 then call execute('proc sql;');
 if first.memname then call execute(catt('create table _',memname,' as select '));
 call execute(cat('max(length(',strip(name),')) as ',name));
 if not last.memname then call execute(',');
  else call execute(catt('from out.',memname,';'));
 if last then call execute('quit;');
run;
/*将上面产生的数据集转置 wide -> long*/
data _null_;
 set sashelp.vtable(keep=libname memname where=(libname='WORK' and memname =: '_'));
 call execute(catt('proc transpose data=',memname,' out=_',memname,';run;'));
run;
/*合并转置的数据集*/
data change_var_len;
 length _name_ $ 40;
 set __: indsname=indsname;
 dsn=indsname;
run;

/*长度大于200 的变量*/
proc sql;
create table change_var_len_2 as
 select *,substr(dsn,8) as new_dsn length=40
  from change_var_len
   where col1 >200
     order by new_dsn,_name_;
quit;













%macro add_id;
%let dsid=%sysfunc(open(change_var_len_2));
%let nobs=%sysfunc(attrn(&dsid,nlobs));
%let dsid=%sysfunc(close(&dsid));
%if &nobs ne 0 %then %do;
proc sql;
create table change_var_len_3 as
 select distinct new_dsn from change_var_len_2;
quit;
data _null_;
 set change_var_len_2;
 call execute(catt('data out.',new_dsn,';set out.',new_dsn,';_id_+1;run;'));
run;
%end;
%mend;
%add_id 



/********************变量存储长度大于200 的进行分割**********************************/
%macro split_var_len(dsn= ,vname=);
proc datasets library=work  nolist nodetails;
save change_var_len_2;
quit;

data _&dsn.;
 set out.&dsn.;
 if length(&vname.)>200;
 keep _id_ &vname.;
run;
data __&dsn.;
 set _&dsn.;
 length  temp  _lag_temp $ 300 _temp $ 8;
 do i=1 to klength(&vname.);
   _lag_temp=temp;
   _temp=ksubstr(&vname.,i,1);
   temp=cats(temp,_temp);
   if length(temp)>200 then do;
     temp=_lag_temp;output;
	 temp=_temp;
   end;
 end;
 if length(temp)<=200 then output;
keep _id_ temp;
run;
proc sql;
alter table __&dsn.
modify temp char(200) format=$200.;
quit;

proc transpose data=__&dsn. out=___&dsn.(drop=_name_ _type_) prefix=&vname._ ;
by _id_;
var temp;
run;
data out.&dsn.;
 merge out.&dsn.  ___&dsn.;
 by _id_;
 if missing(&vname._1) then  &vname._1= &vname.;
 drop &vname.;
run;
%mend;
%macro do_split;
%let dsid=%sysfunc(open(change_var_len_2));
%let nobs=%sysfunc(attrn(&dsid,nlobs));
%let dsid=%sysfunc(close(&dsid));
%if &nobs ne 0 %then %do;
data _null_;
 set change_var_len_2;
 call execute(catt('%split_var_len(dsn=',new_dsn,',vname=',_name_,')'));
run;
%end;
%mend;
%do_split





%macro drop_id;
%let dsid=%sysfunc(open(change_var_len_2));
%let nobs=%sysfunc(attrn(&dsid,nlobs));
%let dsid=%sysfunc(close(&dsid));
%if &nobs ne 0 %then %do;
proc sql;
create table change_var_len_3 as
 select distinct new_dsn from change_var_len_2;
quit;
data _null_;
 set change_var_len_2;
 call execute(catt('data out.',new_dsn,';set out.',new_dsn,';drop _id_;run;'));
run;
%end;
%mend;
%drop_id

Hello_there · Posted 01-22-2024 10:34 PM

Thanks!

How do i adapt this code to split a character variable dynamically in lengths of 200?

Re: How do i adapt this code to split a character variable dynamically in lengths of 200?

Re: How do i adapt this code to split a character variable dynamically in lengths of 200?

Re: How do i adapt this code to split a character variable dynamically in lengths of 200?

Re: How do i adapt this code to split a character variable dynamically in lengths of 200?

Re: How do i adapt this code to split a character variable dynamically in lengths of 200?

Re: How do i adapt this code to split a character variable dynamically in lengths of 200?

Re: How do i adapt this code to split a character variable dynamically in lengths of 200?

Re: How do i adapt this code to split a character variable dynamically in lengths of 200?

Re: How do i adapt this code to split a character variable dynamically in lengths of 200?

Classroom Training Available!