SAS Programming

Rakesh93 · Posted 01-01-2020 02:12 AM

I have a dataset of IE in that a variable TEST having greater than 200 Characters in some of its observation. I want those observation split into different variables having Length of 200 char.

ErikLund_Jensen · Posted 01-01-2020 02:13 PM

Hi @Rakesh93

Here is a slightly different way of tackling the problem, where it is not necessary to know the number of new variables. The idea is:

First extra spaces are compressed.
Then the input string is read word for word and added to the output string until there isn't room for any more, in this case the string is written to output, and building of a new string starts etc.
The temp data set has a varying number of records pr. ID, each with a string of max. 20 chars, and no words chopped.
Finally Proc Transpose is used to "reassemble" the string as a list of variables

Length is set to 20 for test purposes and should be changed in 2 places.

data have; 
	length ID 8.;
	infile cards truncover;
	input str $char80.; 
	ID = _N_;
cards;
New Year
Copenhagen is the Capital of Denmark
Jensen is a´defunct British manufacturer of sports cars
Another line
;
run;

data tmp (drop=str i word); set have;
	length nystr $20;
	str = compbl(str);
	do i = 1 to countw(str,' ');
		word = scan(str,i,' ');
		if length(nystr) + length(word) + 1 > 20 then do;
			output;
			nystr = word;
		end;
		else nystr = catx(' ',nystr,scan(str,i,' '));
	end;
	if nystr ne '' then output;
run;

proc transpose data=tmp out=want (drop=_name_); 
	by ID;
	var _character_;
run;

Output:

View solution in original post

unison · Posted 01-01-2020 03:02 AM

Break your problem into steps.

First you need to determine the length of the longest observation in TEST variable.

Once you have that, you can then scan across each observation (by 200) checking if the character sitting in the 200n+1 position is a space (if not, find the closest space in an earlier position). I’d recommend looking into array processing for this step.

-unison

KachiM · Posted 01-01-2020 04:29 AM

Hi @Rakesh93,

Here is an illustration for you. You can alter the lengths to your need.

data aa;
input list & $36.;
datalines;
Pack my box with twelve dozen balls
This is another sentence to split
;
run;


data _null_;
length str1 $20 str2 $20 s $1;
   set aa;
   do i = 20 by - 1 until(s = ' ');
      s = char(list,i);
   end;
   str1 = trim(substr(list, 1, i-1));
   put '---->' str1=;
   str2 = strip(substr(list, i+1));
   put '---->' str2 =;
run;

The TRIM( ) and STRIP( ) are used to take care of more than one SPACE character between words.

ErikLund_Jensen · Posted 01-01-2020 02:13 PM

Hi @Rakesh93

Here is a slightly different way of tackling the problem, where it is not necessary to know the number of new variables. The idea is:

First extra spaces are compressed.
Then the input string is read word for word and added to the output string until there isn't room for any more, in this case the string is written to output, and building of a new string starts etc.
The temp data set has a varying number of records pr. ID, each with a string of max. 20 chars, and no words chopped.
Finally Proc Transpose is used to "reassemble" the string as a list of variables

Length is set to 20 for test purposes and should be changed in 2 places.

data have; 
	length ID 8.;
	infile cards truncover;
	input str $char80.; 
	ID = _N_;
cards;
New Year
Copenhagen is the Capital of Denmark
Jensen is a´defunct British manufacturer of sports cars
Another line
;
run;

data tmp (drop=str i word); set have;
	length nystr $20;
	str = compbl(str);
	do i = 1 to countw(str,' ');
		word = scan(str,i,' ');
		if length(nystr) + length(word) + 1 > 20 then do;
			output;
			nystr = word;
		end;
		else nystr = catx(' ',nystr,scan(str,i,' '));
	end;
	if nystr ne '' then output;
run;

proc transpose data=tmp out=want (drop=_name_); 
	by ID;
	var _character_;
run;

Output:

hashman · Posted 01-01-2020 08:15 PM

@Rakesh93:

If you don't mind guessing the number of needed columns beforehand, the algorithm is simple:

data have ;                                                                                                                             
  input txt $80. ;                                                                                                                      
  cards ;                                                                                                                               
The time has come, the Walrus said, to talk of many things:                                                                             
Of shoes and ships and sealing-wax, of cabbages and kings                                                                               
And why the sea is boiling hot and whether pigs have wings                                                                              
;                                                                                                                                       
run ;                                                                                                                                   
                                                                                                                                        
%let w = 20 ; * change to 200 in your case ;                                                                                            
                                                                                                                                        
data want (drop = _:) ;                                                                                                                 
  set have ;                                                                                                                            
  array c $ &w col1-col5 ;                                                                                                              
  _i_ = 1 ;                                                                                                                             
  do _x = 1 to countw (txt, "") ;                                                                                                       
    c = catx (" ", c, scan (txt, _x, "")) ;                                                                                             
    if length (c) <= &w and length (c) + length (scan (txt, _x + 1)) > &w or _x = countw (txt, "") then _i_ + 1 ;                       
  end ;                                                                                                                                 
run ;

If you want it to be truly dynamic, run the same exact algorithm twice: (1) to determine the actual dimension of array C and (2) to produce the final output, e.g.:

data _null_ ;                                                                                                                           
  set have end = z ;                                                                                                                    
  length c $ &w ;                                                                                                                       
  do _x = 1 to countw (txt, "") ;                                                                                                       
    c = catx (" ", c, scan (txt, _x, "")) ;                                                                                             
    if length (c) <= &w and length (c) + length (scan (txt, _x + 1)) > &w or _x = countw (txt, "") then do ;                            
      _i_ = sum (_i_, 1) ;                                                                                                              
      call missing (c) ;                                                                                                                
    end ;                                                                                                                               
  end ;                                                                                                                                 
  last._i_ = max (last._i_, _i_) ;                                                                                                      
  if z then call symputx ("d", last._i_) ;                                                                                              
run ;                                                                                                                                   
                                                                                                                                        
data want (drop = _:) ;                                                                                                                 
  set have ;                                                                                                                            
  array c $ &w col1-col&d ;                                                                                                             
  _i_ = 1 ;                                                                                                                             
  do _x = 1 to countw (txt, "") ;                                                                                                       
    c = catx (" ", c, scan (txt, _x, "")) ;                                                                                             
    if length (c) <= &w and length (c) + length (scan (txt, _x + 1)) > &w or _x = countw (txt, "") then _i_ + 1 ;                       
  end ;                                                                                                                                 
run ;

You can massage it if you wish to encapsulate the repetitive code.

Kind regards

Paul D.

KachiM · Posted 01-02-2020 05:23 AM

@Rakesh93

@hashman has given a solution. I improve my previous program to do the job dynamically. You have now a choice to choose.

As before text line is split into 20 characters with word-boundary. The split sub-list is saved as a big list each separated by a pipe character (|). The maximum number of such splits are passed as a macro variable for the next data step.

In that step, sub-lists are scanned as variables for output.

data have;
input txt $80.;
datalines;
I love hash tables. You will, too, after you see what they can do. 
This paper will start with the classic 2 table join.
Nothing new, you say? But this version does it in one step without 
sorting. And, it's the fastest method around.
Next, you will see the use of hash tables to join multiple 
datasets on different keys in one pass, again with no
sorting. How about unlimited recursive chained lookups 
(like when you need to update an ID through as many
incarnations as it may have had)? Hash tables solve that one, 
too. And before you catch your breath, you'll see
how to create n datasets from one (or more), where n is 
determined (at runtime) by the number of values
encountered in a variable. No muss, no fuss. One step 
(well, OK, this one does require a prior sort) and you're
done. We’ll wind up with an outer join, just to show that hash 
tables can do that, too. Take a look at what they can
deliver, and you'll be counting the ways you love hash tables 
before the year is out. 
;
run;

The program:

%let w = 20;
data temp;
   set have end = eof;
   length w $ &w;
   length str $80 strout $80;
   str = compbl(txt); /* remove more than one SPACE */
   split = 0;
   retain maxsplit;
   do until(str=' ');
   /* Check the Word boundary ends at the W-th position. 
      Confirm it based on the next position */
      if char(str,&w) ne ' ' and char(str, &w+1)=' ' then i=&w+1;  
      else 
      do i = &w by - 1 until(s = ' ');
         s = char(str,i);
      end;
      w = substr(str,1,i-1);
      call catx('|', strout, w);
      split + 1;
      str = substr(str, i+1);
   end;
   maxsplit = max(maxsplit, split);
   if eof then call symputx('maxsplit', maxsplit);
keep strout; 
run;

data want;
   set temp;
   array s[&maxsplit] $ &w;
   do i = 1 to &maxsplit;
      s[i] = scan(strout,i,'|');
   end;
drop i strout;
run;

SAS Programming

How to split a variable into 200 Character without chopping a word dynamically.

Re: How to split a variable into 200 Character without chopping a word dynamically.

Re: How to split a variable into 200 Character without chopping a word dynamically.

Re: How to split a variable into 200 Character without chopping a word dynamically.

Re: How to split a variable into 200 Character without chopping a word dynamically.

Re: How to split a variable into 200 Character without chopping a word dynamically.

Re: How to split a variable into 200 Character without chopping a word dynamically.

Split a long text string, 200 or less using a ';' delimiter- retaining...

Character Variablen in DoubleQuotes einschließen

Extract a specific word in a character string

Need help generating output in Word

cannot split variable into character and numerical variables

Follow Us

What is...

SAS Programming

Special offer for SAS Communities members

SAS Training: Just a Click Away

Follow Us

What is...