BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Hello_there
Lapis Lazuli | Level 10

Hi,

 

 

I am trying to split a character variable (DVTERM) dynamically, so that it goes up to 200 characters for the first term, and then up to 199 for each succeeding term. For example if a character variable is 1000 characters long, then it would get split 5 times.

 

Note: i cannot split words, each new term has to start with a new word and it can't be a truncated word

 

So basically, the character limits are:

DVT1=200

DVT2=199

DVT3=199

DVT4=199

DVT5=199

...

 

I have found some sources online that can help me in regards to this, but bc of i'm not a savvy programmer, i'm having difficulty adapting the code so that it extends out more terms than just the 2 listed in the example.

data have;
set source.dv;
describe_deviation=strip(compbl(dvterm));
length=length(dvterm);
c=substr(dvterm,201,1);
if .<length<=200 and missing(c) then dvterm=dvt1; else if 200<length<=400 or not missing(c) then do; if substr(dvterm,200,1)="" or substr(dvt1,201,1)="" then do; dvt1=substr(dvterm,1,200); dvt2=strip(substr(dvterm,201)); end; else do; length1=200-length(scan(substr(dvterm,1,200),-1,"")); dvt1=substr(dvterm,1,length1); dvt2=strip(substr(dvterm,length1+1,200)); end; end; run;

source: https://www.lexjansen.com/phuse-us/2021/ct/PAP_CT07.pdf

 

Here is an example data set about random rainbow facts. For the sake of simplicity, please adapt the code so that instead of 200, it's 20. And instead of 199 for the succeeding split off terms, please make it 19.

DATA rainbow_facts;
infile datalines dsd dlm=",";
    LENGTH Fact $80.;
    INPUT Fact $;
    length=length(fact);
DATALINES;
Rainbows,
Rainbows are a mix of light refraction and dispersion.,
Rainbows have inspired countless myths and legends.,
The longest observed rainbow lasted for nine hours in Taiwan.,
Each rainbow color emerges due to different wavelengths of light.,
Rainbows are not physical objects- they cannot be approached or touched.
;
RUN;

Thank you

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

You can always rebuild the single observation result from the multiple observation result using PROC TRANSPOSE or other methods to convert observations into variables.

 

Try something like this:

data want;
  set rainbow_facts;
  index=1;
  length next $20;
  copy=left(fact);
  array x $20 factoid1-factoid5 ;
  do until(copy=' ');
    next=scan(copy,1,' ');
    if length(catx(' ',x[index],next))<= 20 then do;
      x[index]=catx(' ',x[index],next);
    end;
    else do;
      index+1;
      x[index]=next;
    end;
    copy=left(substr(copy,length(next)+1));
  end;  
run;

Try it with some really long words.

The tricky part is deciding how many variables you need to make (the ARRAY statement).  If you make it too few you might need to have logic to prevent indexing past the end of the ARRAY.   Even though 80/20 is 4 I had to make 5 variables to handle all of your examples.

Tom_0-1705952056546.png

 

View solution in original post

8 REPLIES 8
Tom
Super User Tom
Super User

Why not just split it into exact chunks?

data want;
  set rainbow_facts;
  length loc 8 factoid $20 ;
  do loc=1 to length by 20;
    factoid=substr(fact,loc,20);
    output;
  end;
run;

Result

Obs    length    loc          factoid           fact

  1       8        1    Rainbows                Rainbows
  2      54        1    Rainbows are a mix o    Rainbows are a mix of light refraction and dispersion.
  3      54       21    f light refraction a    Rainbows are a mix of light refraction and dispersion.
  4      54       41    nd dispersion.          Rainbows are a mix of light refraction and dispersion.
  5      51        1    Rainbows have inspir    Rainbows have inspired countless myths and legends.
  6      51       21    ed countless myths a    Rainbows have inspired countless myths and legends.
  7      51       41    nd legends.             Rainbows have inspired countless myths and legends.
  8      61        1    The longest observed    The longest observed rainbow lasted for nine hours in Taiwan.
  9      61       21     rainbow lasted for     The longest observed rainbow lasted for nine hours in Taiwan.
 10      61       41    nine hours in Taiwan    The longest observed rainbow lasted for nine hours in Taiwan.
 11      61       61    .                       The longest observed rainbow lasted for nine hours in Taiwan.
 12      65        1    Each rainbow color e    Each rainbow color emerges due to different wavelengths of light.
 13      65       21    merges due to differ    Each rainbow color emerges due to different wavelengths of light.
 14      65       41    ent wavelengths of l    Each rainbow color emerges due to different wavelengths of light.
 15      65       61    ight.                   Each rainbow color emerges due to different wavelengths of light.
 16      72        1    Rainbows are not phy    Rainbows are not physical objects- they cannot be approached or touched.
 17      72       21    sical objects- they     Rainbows are not physical objects- they cannot be approached or touched.
 18      72       41    cannot be approached    Rainbows are not physical objects- they cannot be approached or touched.
 19      72       61     or touched.            Rainbows are not physical objects- they cannot be approached or touched.

Or are you trying to not split "words"?

Hello_there
Lapis Lazuli | Level 10
yes i forgot to mention, i'm not supposed to split words. I'll put it in the OP
Hello_there
Lapis Lazuli | Level 10
additionally the resultant data set should have only 6 observations, the same as the parent data set
Tom
Super User Tom
Super User

You can always rebuild the single observation result from the multiple observation result using PROC TRANSPOSE or other methods to convert observations into variables.

 

Try something like this:

data want;
  set rainbow_facts;
  index=1;
  length next $20;
  copy=left(fact);
  array x $20 factoid1-factoid5 ;
  do until(copy=' ');
    next=scan(copy,1,' ');
    if length(catx(' ',x[index],next))<= 20 then do;
      x[index]=catx(' ',x[index],next);
    end;
    else do;
      index+1;
      x[index]=next;
    end;
    copy=left(substr(copy,length(next)+1));
  end;  
run;

Try it with some really long words.

The tricky part is deciding how many variables you need to make (the ARRAY statement).  If you make it too few you might need to have logic to prevent indexing past the end of the ARRAY.   Even though 80/20 is 4 I had to make 5 variables to handle all of your examples.

Tom_0-1705952056546.png

 

Hello_there
Lapis Lazuli | Level 10
Thanks, Tom. I will experiment with this.
mkeintz
PROC Star

You want the first substring of 5 to have its last non-blank character at position 200 or less.  And at position 199 or less for the 2nd through 5th.  Because you want strings separated at blanks, you'll need a 6th substring to accommodate the fact that substrings will not always be "full".  For instance, if the longest "word" is 20, you could have up to 20 blanks at the end of each of the first 5 substrings, necessitating a 6th of up to 104 characters (5  20-character words plus 4 blanks).

 

This will satisfy that requirement:

 

data want (drop=_:);
  set have;

  array dvt {6} $200 dvt1-dvt6 ;
  _temp=txt;

  do i =1 to 6 while (_temp^=' ');
    /*Find rightmost blank for a 200 (199) character substr*/
    do p=ifn(i=1,201,200) by -1 while (char(_temp,p)^=' ');  
    end;

    dv{i}=substr(_temp,1,p-1);  /*Copy leftmost substring */
    _temp=substr(_temp,p+1);    /*Left justify the rest of the original string*/
  end;
run;

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Ksharp
Super User
options validvarname=any mrecall nofmterr nonumber nodate dkricond=nowarn dkrocond=nowarn compress=yes;
/*the path for raw sas datasets <--- input path*/
libname in v9 'C:\temp\a\1.处理原始SAS\SAS'    access=readonly;

/*the path for processed sas datasets <--- output path*/
libname out v9 'C:\temp\a\1.处理原始SAS\跑完后SAS' ;




options mprint mlogic symbolgen;




/*清空WORK库的 数据集*/
proc datasets library=work kill nolist nodetails;
quit;
/*清空导出库的 数据集*/
proc datasets library=out kill nolist nodetails;
quit;
/*将 数据集名 统一为 小写并copy到OUT库中*/
proc copy in=in out=out memtype=data;
run;
/*计算所有数据集中字符变量的长度*/
data _null_;
 set sashelp.vcolumn(keep=libname memname name type where=(libname='OUT' and upcase(type)='CHAR')) end=last;
 by memname;
 if _n_=1 then call execute('proc sql;');
 if first.memname then call execute(catt('create table _',memname,' as select '));
 call execute(cat('max(length(',strip(name),')) as ',name));
 if not last.memname then call execute(',');
  else call execute(catt('from out.',memname,';'));
 if last then call execute('quit;');
run;
/*将上面产生的数据集转置 wide -> long*/
data _null_;
 set sashelp.vtable(keep=libname memname where=(libname='WORK' and memname =: '_'));
 call execute(catt('proc transpose data=',memname,' out=_',memname,';run;'));
run;
/*合并转置的数据集*/
data change_var_len;
 length _name_ $ 40;
 set __: indsname=indsname;
 dsn=indsname;
run;

/*长度大于200 的变量*/
proc sql;
create table change_var_len_2 as
 select *,substr(dsn,8) as new_dsn length=40
  from change_var_len
   where col1 >200
     order by new_dsn,_name_;
quit;













%macro add_id;
%let dsid=%sysfunc(open(change_var_len_2));
%let nobs=%sysfunc(attrn(&dsid,nlobs));
%let dsid=%sysfunc(close(&dsid));
%if &nobs ne 0 %then %do;
proc sql;
create table change_var_len_3 as
 select distinct new_dsn from change_var_len_2;
quit;
data _null_;
 set change_var_len_2;
 call execute(catt('data out.',new_dsn,';set out.',new_dsn,';_id_+1;run;'));
run;
%end;
%mend;
%add_id 



/********************变量存储长度大于200 的进行分割**********************************/
%macro split_var_len(dsn= ,vname=);
proc datasets library=work  nolist nodetails;
save change_var_len_2;
quit;

data _&dsn.;
 set out.&dsn.;
 if length(&vname.)>200;
 keep _id_ &vname.;
run;
data __&dsn.;
 set _&dsn.;
 length  temp  _lag_temp $ 300 _temp $ 8;
 do i=1 to klength(&vname.);
   _lag_temp=temp;
   _temp=ksubstr(&vname.,i,1);
   temp=cats(temp,_temp);
   if length(temp)>200 then do;
     temp=_lag_temp;output;
	 temp=_temp;
   end;
 end;
 if length(temp)<=200 then output;
keep _id_ temp;
run;
proc sql;
alter table __&dsn.
modify temp char(200) format=$200.;
quit;

proc transpose data=__&dsn. out=___&dsn.(drop=_name_ _type_) prefix=&vname._ ;
by _id_;
var temp;
run;
data out.&dsn.;
 merge out.&dsn.  ___&dsn.;
 by _id_;
 if missing(&vname._1) then  &vname._1= &vname.;
 drop &vname.;
run;
%mend;
%macro do_split;
%let dsid=%sysfunc(open(change_var_len_2));
%let nobs=%sysfunc(attrn(&dsid,nlobs));
%let dsid=%sysfunc(close(&dsid));
%if &nobs ne 0 %then %do;
data _null_;
 set change_var_len_2;
 call execute(catt('%split_var_len(dsn=',new_dsn,',vname=',_name_,')'));
run;
%end;
%mend;
%do_split





%macro drop_id;
%let dsid=%sysfunc(open(change_var_len_2));
%let nobs=%sysfunc(attrn(&dsid,nlobs));
%let dsid=%sysfunc(close(&dsid));
%if &nobs ne 0 %then %do;
proc sql;
create table change_var_len_3 as
 select distinct new_dsn from change_var_len_2;
quit;
data _null_;
 set change_var_len_2;
 call execute(catt('data out.',new_dsn,';set out.',new_dsn,';drop _id_;run;'));
run;
%end;
%mend;
%drop_id 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 897 views
  • 3 likes
  • 4 in conversation