Split a variable to multiple variables

bharath86 · Posted 10-03-2023 08:30 AM

Hi

I am trying to split a string,

comment='Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum' .

using a simple code line,

if length(comment) >0 then a=strip(substr(comment,1,200));

if length(comment) >200 then b=strip(substr(comment,201,200));

so on, on a basic code, can we upgrade this so it doesn't allow word chopping, for example, the word 'tempor' gets pushed to a new var or be kept in the same var if the length of 200 permits otherwise the entire word gets pushed to new var from first letter without chopping the word as 'tem' 'por' to two different variables.

Please advise.

Thanks

JosvanderVelden · Posted 10-03-2023 09:04 AM

Please look at the thread: https://communities.sas.com/t5/SAS-Programming/splitting-text-which-is-more-than-200char-to-multiple...

yabwon · Posted 10-03-2023 09:28 AM

Try like that:

data have;
comment='Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.';
run;
proc print;
run;


%macro cutMeIntoParts(
 have
,variable
,want=want
,L=100
,keep=part
);

data &want.;
  set &have.;

  length part $ &L.;

  do i=1 by 1;
    word = scan(&variable., i, " ");

    if word = " " then 
      do;
        output;
        leave;
      end;
    /*
    put word= / part =;
    */

    if length(catx(" ",part, word))>&L. then
      do;
        output;
        part = word;
      end;
    else 
      part = catx(" ",part, word);
   
  end;

  keep &keep.;
run;

%mend cutMeIntoParts;

%cutMeIntoParts(have,comment,L=50)


proc print data=want(keep=part);
run;

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

yabwon · Posted 10-05-2023 02:58 AM

One more (just for fun) approach with "funcy filename" and "fast _infile_ read" inspired by:

https://communities.sas.com/t5/SAS-Programming/How-to-delimit-large-dataset-28-Million-rows-into-700...

Data:

data have;
length comment $ 32767;
comment='Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.';
output;
comment="Split a variable to multiple variables. 
Hi, I am trying to split a string, comment='....'. using a simple code line.
So on, on a basic code, can we upgrade this so it doesn't allow word chopping, for example, the word 'tempor' gets pushed to a new var or be kept in the same var if the length of 200 permits otherwise the entire word gets pushed to new var from first letter without chopping the word as 'tem' 'por' to two different variables. Please advise. Thanks";
output;
run;
proc print;
run;

Code:

%macro cutMeIntoParts2(
 have
,variable
,want=want
,L = 100
,WN = 200  /* max. expected number of words */
,keep=part
);
filename f "!SASROOT/*.cfg";
data &want.;
  set &have.; 
  infile f truncover dlm = " ";
  input @1 @;
  _infile_ = &variable.;
  input @1 (m1-m&WN.) (: $128.) @@;

  mmX=" ";
  length part $ &L.;

  array mm $ m1--m&WN. mmX;
  do over mm;
    L = lengthn(mm);
    
    if P + L > &L.-1 then 
      do; 
        output; 
        part = " "; 
      end;
    if L = 0 then 
      do; 
        output; 
        leave; 
      end;

    part = catx(" ", part, mm);
    P = lengthn(part);
  end;
  keep &keep.; 
run;
filename f clear;


%mend cutMeIntoParts2;

%cutMeIntoParts2(have,comment,L=50)


proc print data=want(keep=part);
run;

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

Patrick · Posted 10-03-2023 10:35 PM

Based on the sample code found in the docu for the CALL PRXNEXT Routine below should work.

%let l_source=1000;
%let l_part  =200;
%let n_parts =%eval(&l_source/&l_part +1);

data demo(drop=_:);
  length text $&l_source;
  array text_parts_ {&n_parts} $&l_part;
  _prxid = prxparse("/\b(.{1,%eval(&l_part-1)}\b)/");
  text='Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum';
  _start = 1;
  _stop = length(text);

  /* Use PRXNEXT to find the first instance of the pattern, */
  /* then use DO WHILE to find all further instances.       */
  /* PRXNEXT changes the _start parameter so that searching  */
  /* begins again after the last match.                     */
  call prxnext(_prxid, _start, _stop, text, _pos, _len);

  do while (_pos > 0);
    _i=sum(_i,1);
    text_parts_[_i] = substr(text, _pos, _len);
    call prxnext(_prxid, _start, _stop, text, _pos, _len);
  end;
run;

proc print data=demo;
  var text_parts_:;
run;

Ksharp · Posted 10-03-2023 11:20 PM

Just for having some fun.

data have;
id+1;
length comment $ 20000;
comment='Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum' ;
run;
data temp;
 set have;
 length token temp want $ 400 want2 $ 200;
 if length(comment)>200 then do;
 do i=1 to countw(comment,' ');
  token=scan(comment,i,' ');
  temp=want;
  want=catx(' ',want,token);
    if length(want)>200 then do;
      want=temp;want2=want;output;want=token;
    end;
  end;
  if not missing(want) then do;want2=want; output;end;
  end;
  else do;want2=comment; output;end;
  keep id want2;
run;
proc transpose data=temp out=want;
by id;
var want2;
run;

mkeintz · Posted 10-04-2023 09:24 PM

Two of the responses you received build each split word-by-word, from left to right. The other, using regular expressions, I suspect, counts columns under the hood from the left up to the split length, making sure to not truncate the rightmost word.

The code below takes a different tack. It starts each split by looking at the rightmost blank candidate column and stepping backward until a blank is located. Then the entire resulting split is copied, and the next leftmost column and rightmost blank candidate are established, and the same approach is used for the next split. I use the term "rightmost blank candidate" because for a split of, say 100 bytes length, the rightmost valid blank candidate in the source would be byte number 101, just after the maximum split length. It's only a candidate, however, until a word separator is encountered.

The thought here is that there is likely a lot fewer comparisons taking place searching for a blank in this way, than there is in assembling each split from the left. Especially for wide splits.

data have;
comment='Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.';
run;


%let L_source=1000;
%let L_split=100;
%let nsplits=%eval( (&L_source + &L_split-1)/&L_split );

data want (drop=_: i);
  set have ;

  array split {&nsplits} $&L_split;

  _left_col=1;
  do i=1 to &nsplits until (substr(comment,_right_col)=' ' or _left_col>length(comment));
    do _right_col=min(length(comment),_left_col+&L_split+1) by -1 until (char(comment,_right_col)=' ' or _right_col=length(comment));
    end;
    split{i}=substr(comment,_left_col,_right_col+1-_left_col);
    _left_col=min(_right_col+1,length(comment)+1);
  end;
run;

Notes on the code:

The "do i=1 to &nsplits ..." keeps going until either (a) only trailing blanks remain, or (b) the left column for a new split is beyond the length of the source variable. This can happen if you have a source variable storage length that is not an exact multiple of the split length.
The second DO loop finds the appropriate value for the rightmost candidate for a blank column (or in the case of the last split, it will be the last character of the source variable, which might not be blank).
The next two lines simply copy the current split and updates the left column for use in determining the next split.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Split a variable to multiple variables

Re: Split a variable to multiple variables

Re: Split a variable to multiple variables

Re: Split a variable to multiple variables

Re: Split a variable to multiple variables

Re: Split a variable to multiple variables

Re: Split a variable to multiple variables

Split a variable to multiple variables

Re: Split a variable to multiple variables

Re: Split a variable to multiple variables

Re: Split a variable to multiple variables

Re: Split a variable to multiple variables

Re: Split a variable to multiple variables

Re: Split a variable to multiple variables

Register Today!

SAS Training: Just a Click Away