BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Rakesh93
Calcite | Level 5
I have a dataset of IE in that a variable TEST having greater than 200 Characters in some of its observation. I want those observation split into different variables having Length of 200 char.
1 ACCEPTED SOLUTION

Accepted Solutions
ErikLund_Jensen
Rhodochrosite | Level 12

Hi @Rakesh93 

 

Here is a slightly different way of tackling the problem, where it is not necessary to know the number of new variables. The idea is:

 

  1. First extra spaces are compressed.
  2. Then the input string is read word for word and added to the output string until there isn't room for any more, in this case the string is written to output, and building of a new string starts etc.
  3. The temp data set has a varying number of records pr. ID, each with a string of max. 20 chars, and no words chopped.
  4. Finally Proc Transpose is used to "reassemble" the string as a list of variables

Length is set to 20 for test purposes and should be changed in 2 places.

 

data have; 
	length ID 8.;
	infile cards truncover;
	input str $char80.; 
	ID = _N_;
cards;
New Year
Copenhagen is the Capital of Denmark
Jensen is a´defunct British manufacturer of sports cars
Another line
;
run;

data tmp (drop=str i word); set have;
	length nystr $20;
	str = compbl(str);
	do i = 1 to countw(str,' ');
		word = scan(str,i,' ');
		if length(nystr) + length(word) + 1 > 20 then do;
			output;
			nystr = word;
		end;
		else nystr = catx(' ',nystr,scan(str,i,' '));
	end;
	if nystr ne '' then output;
run;

proc transpose data=tmp out=want (drop=_name_); 
	by ID;
	var _character_;
run;


Output:

 

 

splitxpose.gif

View solution in original post

5 REPLIES 5
unison
Lapis Lazuli | Level 10

Break your problem into steps.

 

First you need to determine the length of the longest observation in TEST variable. 

Once you have that, you can then scan across each observation (by 200) checking if the character sitting in the 200n+1 position is a space (if not, find the closest space in an earlier position). I’d recommend looking into array processing for this step. 

-unison
KachiM
Rhodochrosite | Level 12

Hi @Rakesh93,

 

 

Here is an illustration for you. You can alter the lengths to your need.

 

data aa;
input list & $36.;
datalines;
Pack my box with twelve dozen balls
This is another sentence to split
;
run;


data _null_;
length str1 $20 str2 $20 s $1;
   set aa;
   do i = 20 by - 1 until(s = ' ');
      s = char(list,i);
   end;
   str1 = trim(substr(list, 1, i-1));
   put '---->' str1=;
   str2 = strip(substr(list, i+1));
   put '---->' str2 =;
run;

The TRIM( ) and STRIP( ) are used to take care of more than one SPACE character between words.

ErikLund_Jensen
Rhodochrosite | Level 12

Hi @Rakesh93 

 

Here is a slightly different way of tackling the problem, where it is not necessary to know the number of new variables. The idea is:

 

  1. First extra spaces are compressed.
  2. Then the input string is read word for word and added to the output string until there isn't room for any more, in this case the string is written to output, and building of a new string starts etc.
  3. The temp data set has a varying number of records pr. ID, each with a string of max. 20 chars, and no words chopped.
  4. Finally Proc Transpose is used to "reassemble" the string as a list of variables

Length is set to 20 for test purposes and should be changed in 2 places.

 

data have; 
	length ID 8.;
	infile cards truncover;
	input str $char80.; 
	ID = _N_;
cards;
New Year
Copenhagen is the Capital of Denmark
Jensen is a´defunct British manufacturer of sports cars
Another line
;
run;

data tmp (drop=str i word); set have;
	length nystr $20;
	str = compbl(str);
	do i = 1 to countw(str,' ');
		word = scan(str,i,' ');
		if length(nystr) + length(word) + 1 > 20 then do;
			output;
			nystr = word;
		end;
		else nystr = catx(' ',nystr,scan(str,i,' '));
	end;
	if nystr ne '' then output;
run;

proc transpose data=tmp out=want (drop=_name_); 
	by ID;
	var _character_;
run;


Output:

 

 

splitxpose.gif

hashman
Ammonite | Level 13

@Rakesh93:

If you don't mind guessing the number of needed columns beforehand, the algorithm is simple:

data have ;                                                                                                                             
  input txt $80. ;                                                                                                                      
  cards ;                                                                                                                               
The time has come, the Walrus said, to talk of many things:                                                                             
Of shoes and ships and sealing-wax, of cabbages and kings                                                                               
And why the sea is boiling hot and whether pigs have wings                                                                              
;                                                                                                                                       
run ;                                                                                                                                   
                                                                                                                                        
%let w = 20 ; * change to 200 in your case ;                                                                                            
                                                                                                                                        
data want (drop = _:) ;                                                                                                                 
  set have ;                                                                                                                            
  array c $ &w col1-col5 ;                                                                                                              
  _i_ = 1 ;                                                                                                                             
  do _x = 1 to countw (txt, "") ;                                                                                                       
    c = catx (" ", c, scan (txt, _x, "")) ;                                                                                             
    if length (c) <= &w and length (c) + length (scan (txt, _x + 1)) > &w or _x = countw (txt, "") then _i_ + 1 ;                       
  end ;                                                                                                                                 
run ;       

If you want it to be truly dynamic, run the same exact algorithm twice: (1) to determine the actual dimension of array C and (2) to produce the final output, e.g.:

data _null_ ;                                                                                                                           
  set have end = z ;                                                                                                                    
  length c $ &w ;                                                                                                                       
  do _x = 1 to countw (txt, "") ;                                                                                                       
    c = catx (" ", c, scan (txt, _x, "")) ;                                                                                             
    if length (c) <= &w and length (c) + length (scan (txt, _x + 1)) > &w or _x = countw (txt, "") then do ;                            
      _i_ = sum (_i_, 1) ;                                                                                                              
      call missing (c) ;                                                                                                                
    end ;                                                                                                                               
  end ;                                                                                                                                 
  last._i_ = max (last._i_, _i_) ;                                                                                                      
  if z then call symputx ("d", last._i_) ;                                                                                              
run ;                                                                                                                                   
                                                                                                                                        
data want (drop = _:) ;                                                                                                                 
  set have ;                                                                                                                            
  array c $ &w col1-col&d ;                                                                                                             
  _i_ = 1 ;                                                                                                                             
  do _x = 1 to countw (txt, "") ;                                                                                                       
    c = catx (" ", c, scan (txt, _x, "")) ;                                                                                             
    if length (c) <= &w and length (c) + length (scan (txt, _x + 1)) > &w or _x = countw (txt, "") then _i_ + 1 ;                       
  end ;                                                                                                                                 
run ;       

You can massage it if you wish to encapsulate the repetitive code. 

 

Kind regards

Paul D.

 

KachiM
Rhodochrosite | Level 12

@Rakesh93 

 

 @hashman has given a solution. I improve my previous program to do the job dynamically. You have now a choice to choose.

 

As before text line is split into 20 characters with word-boundary. The split sub-list is saved as a big list each separated by a pipe character (|). The maximum number of such splits are passed as a macro variable for the next data step.

In that step, sub-lists are scanned as variables for output.

 

data have;
input txt $80.;
datalines;
I love hash tables. You will, too, after you see what they can do. 
This paper will start with the classic 2 table join.
Nothing new, you say? But this version does it in one step without 
sorting. And, it's the fastest method around.
Next, you will see the use of hash tables to join multiple 
datasets on different keys in one pass, again with no
sorting. How about unlimited recursive chained lookups 
(like when you need to update an ID through as many
incarnations as it may have had)? Hash tables solve that one, 
too. And before you catch your breath, you'll see
how to create n datasets from one (or more), where n is 
determined (at runtime) by the number of values
encountered in a variable. No muss, no fuss. One step 
(well, OK, this one does require a prior sort) and you're
done. We’ll wind up with an outer join, just to show that hash 
tables can do that, too. Take a look at what they can
deliver, and you'll be counting the ways you love hash tables 
before the year is out. 
;
run;

The program:

 

%let w = 20;
data temp;
   set have end = eof;
   length w $ &w;
   length str $80 strout $80;
   str = compbl(txt); /* remove more than one SPACE */
   split = 0;
   retain maxsplit;
   do until(str=' ');
   /* Check the Word boundary ends at the W-th position. 
      Confirm it based on the next position */
      if char(str,&w) ne ' ' and char(str, &w+1)=' ' then i=&w+1;  
      else 
      do i = &w by - 1 until(s = ' ');
         s = char(str,i);
      end;
      w = substr(str,1,i-1);
      call catx('|', strout, w);
      split + 1;
      str = substr(str, i+1);
   end;
   maxsplit = max(maxsplit, split);
   if eof then call symputx('maxsplit', maxsplit);
keep strout; 
run;

data want;
   set temp;
   array s[&maxsplit] $ &w;
   do i = 1 to &maxsplit;
      s[i] = scan(strout,i,'|');
   end;
drop i strout;
run;

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 18004 views
  • 5 likes
  • 5 in conversation