BookmarkSubscribeRSS Feed
DJ09
Calcite | Level 5

How to split a string at less than 200 character in order to derive SDTM variable like DVTERMxx? My variable value has a length of 650 character so my assumption is it ends up in 4 variables. I am not sure. Any help will be appreciated.

15 REPLIES 15
ballardw
Super User

An example of what you have and what the output should be would help. Differe types of string values may be manipulated easiest with different approaches.

DJ09
Calcite | Level 5

Here is an example:

Var1= I am going to market to get some groceries. I will not return till the evening. If there is anyway possible for you to cook the dinner. I am sure the grocery stores may be busy due to the forecast of having the storm this weekend.I hate going to the groceries.I am going to market to get some groceries. I will not return till the evening. If there is anyway possible for you to cook the dinner. I am sure the grocery stores may be busy due to the forecast of having the storm this weekend.I hate going to the groceries.I am going to market to get some groceries. I will not return till the evening. If there is anyway possible for you to cook the dinner. I am sure the grocery stores may be busy due to the forecast of having the storm this weekend.I hate going to the groceries.

I want to split this var at 200 character without chopping any word but get close to 200 and make the first 200 as my first var like xxTERM and the rest of the string will go to the next variable xxTERM01 , xxTERM02 and so on. I hope this make sense?

Haikuo
Onyx | Level 15

data have;

var1="I am going to market to get some groceries. I will not return till the evening. If there is anyway possible for you to cook the dinner. I am sure the grocery stores may be busy due to the forecast of having the storm this weekend.I hate going to the groceries.I am going to market to get some groceries. I will not return till the evening. If there is anyway possible for you to cook the dinner. I am sure the grocery stores may be busy due to the forecast of having the storm this weekend.I hate going to the groceries.I am going to market to get some groceries. I will not return till the evening. If there is anyway possible for you to cook the dinner. I am sure the grocery stores may be busy due to the forecast of having the storm this weekend.I hate going to the groceries.";

run;

proc sql;

  select cats(ceil(max(length(var1))/200)) into :dim from have;

  quit;

data want;

   set have;

   array split(&dim) $200. xxTERM1-xxTERM&dim;j=1;

   do i=1 to &dim WHILE (not missing(scan(var1,j)));

    len=0;

     do j=j by 1 while (not missing(scan(var1,j)));

       if length(scan(var1,j))+len>200 then leave;

split(i)=catx(' ',split(i),scan(var1,j));

len=length(split(i));

       end;

   end;

   keep xx:;

   run;


Haikuo

andreas_lds
Jade | Level 19

If you don't want to keep var1 in the output dataset then try this. The proc sql setting the variable dim by Haikuo needs further testing. As long as your text contains short words, using ceil will most likely provide a number large enough to create the required number of substring-variables.


proc sql noprint;

  select ceil(max(length(var1)) / 200)

            into :dim separated by '' /* strips leading blanks */

            from have

      ;

quit;

data want;

      set have;

    

      length

            Part01-Part&dim $ 200

            i 8

      ;

      array parts Part:;

    i = 1;

      do while (length(Var1) > 200);

            blankPos = find(var1, ' ', -200);

            parts = substr(Var1, 1, blankPos -1);

            i = i + 1;

            Var1 = substr(Var1, blankPos);

      end;

      if length(var1) > 0 then do;

            parts = var1;

      end;

      drop var1 i blankPos /*Buffer*/;

run;

Ksharp
Super User

Or transpose it into vertical.

data have;
length var1 $ 4000 ;
var1='I am going to market to get some groceries. I will not return till the evening. If there is anyway possible for you to cook the dinner. I am sure the grocery stores may be busy due to the forecast of having the storm this weekend.I hate going to the groceries.I am going to market to get some groceries. I will not return till the evening. If there is anyway possible for you to cook the dinner. I am sure the grocery stores may be busy due to the forecast of having the storm this weekend.I hate going to the groceries.I am going to market to get some groceries. I will not return till the evening. If there is anyway possible for you to cook the dinner. I am sure the grocery stores may be busy due to the forecast of having the storm this weekend.I hate going to the groceries.';
output;
var1='I am going to market to get some groceries.I will not return till the evening. If there is anyway possible for you to cook the dinner.I will not return till the evening. If there is anyway possible for you to cook the dinner.I will not return till the evening. If there is anyway possible for you to cook the dinner.I will not return till the evening. If there is anyway possible for you to cook the dinner.I will not return till the evening. If there is anyway possible for you to cook the dinner.';
output;
run;
data temp(keep=id x);
 set have;
 id=_n_;
 do i=1 to countw(var1,' ');
  x=scan(var1,i,' ');
  output;
 end;
run;
data temp;
 set temp;
 length xx $ 400;
 retain xx;
 retain n 1;
 xx=catx(' ',xx,x);
 if  id ne lag(id) then do;n=1;     xx=x;end;
 if length(xx) gt 200  then do;n+1;xx=x;end;
 run;
 data x;
  set temp;
  by n notsorted;
  if last.n;
run;
 
 proc transpose data=x out=want prefix=temp;
 by id;
 id n;
 var xx;
run;

Ksharp

robertrao
Quartz | Level 8

Hi,

I have a couple of questions in the code.

data temp;

set temp;

length xx $ 400;       <<<<<----length of var1 was 40000 and why are we giving a length of only 400 to xx??????

retain xx;

retain n 1; <<<<<<<<<<<<<<<<----------------------------does this mean n variable to have a length of 1 while it retains???

xx=catx(' ',xx,x);

if  id ne lag(id) then do;n=1;     xx=x;end;

if length(xx) gt 200  then do;n+1;xx=x;end;   /*i also have difficulty in understanding what these 2 lines of code does???*/

run;

Please help

Thanks

Ksharp
Super User

"

length xx $ 400; <<<<<----length of var1 was 40000 and why are we giving a length of only 400 to xx??????

"

It is used to hold a variable which has $200. you need . I make it as $400. is to make sure greater than $200, and we can judge if its length is greater than $200 or not. (if length(xx) gt 200 )

"retain n 1; <<<<<<<<<<<<<<<<----------------------------does this mean n variable to have a length of 1 while it retains???"

no, it means n is numeric variable and its initiative value is 1 .

"

if  id ne lag(id) then do;n=1;     xx=x;end;"

at the beginning of every group, I initiative xx as x (xx has the same value with x).and reinstate group id (variable n) as 1.



"if length(xx) gt 200  then do;n+1;xx=x;end;   /*i also have difficulty in understanding what these 2 lines of code does???*/"

I make several groups (variable n), each group 's length ( variable xx 's length) is not greater than 200 which you need to split.

Actually here xx variable is what we need, then proc transpose to make it what we finally want.

Ksharp

Message was edited by: xia keshan

robertrao
Quartz | Level 8

Hi,

I ran through the code again.

I was wondering about the following statement:

if  id ne lag(id) then do;n=1;     xx=x;end;"

There are 151 words in the first statement

there are 94 words in the second statement

lag(id) is the same for the first 151 records except for the 1st record so n  should not equls 1.But to my surprise I find n=1 after running this code

Could some one explain to me please...Same with the other 94 records

Thanks

Ksharp
Super User

Actually it equals   if first.id then .............

when it is the first obs of group id (first.id) then assign the first word of 151 words to variable xx and n=1;

at this time, if length(xx) gt 200  will evaluate the first word 's length ,it should not be great than 200 so n will not +1 .

Ksharp

Tom
Super User Tom
Super User

If you just want to chop into 200 character pieces then just do it.

Let's say you have variable that is 650.  So that will take 4 pieces, 1-200, 201-400, 401-600 and 601 on.

data want ;

  set have ;

  array _x $200 DVTERM01 - DVTERM04  ;

  do i=1 to 650 by 200 ;

    _x(i) = substr(dvterm,i);

end;

run;

CTorres
Quartz | Level 8

I like better the solution given by andreas_Ids than the first one presented by Ksharp. It's more clear, more simple and did not lose the dots that var1 contais to separate the sentences.

art297
Opal | Level 21

@CTorres: But that solution will fail if a sufficiently long word exists in the last split that would cause 5 parts rather than 4 even though the string length is 800.  Below is such a case where the code would fail, along with code that wouldn't fail given a string like that:

data have (drop=i);

  var1="I am going to market to get some groceries. I will not return till the evening. If there is anyway possible for you to cook the dinner. I am sure the grocery stores may be busy due to the forecast ofhaving the storm this weekend.I hate going to the groceries.I am going to market to get some groceries. I will not return till the evening. If there is anyway possible for you to cook the dinner. I am sure the grocery stores may be busy due to the forecast of having the storm this weekend.I hate going to the groceries.I am going to market to get some groceries. I will not return till the evening. If there is anyway possible for you to cook the dinner. I am sure the grocery stores may be busy due to the forecast of having the storm this weekend.I hate going to those open allnight grocery stores.";

  do i=1 to 1000;output;end;

run;

data need (keep=_line _partnum part);

  set have;

  length part $200;

  _line+1;

  _partnum=1;

  j=1;

  do WHILE (not missing(scan(var1,j,' ')));

    if length(catx(' ',part,scan(var1,j,' '))) gt 200 or

     missing(scan(var1,j+1,' ')) then do;

      output;

      _partnum+1;

      part=scan(var1,j,' ');

    end;

    else part=catx(' ',part,scan(var1,j,' '));

    j+1;

  end;

run;

proc transpose data=need out=want (drop=_:) prefix=part;

  var part;

  id _partnum;

  by _line;

run;

Ksharp
Super User

BTW. my code will still keep the dots variable contains . Smiley Wink

Scott_Mitchell
Quartz | Level 8

I find the following quite useful.

%LET DIM  = ;

%LET DSID = %SYSFUNC(OPEN(HAVE,I));

%LET DIM  = %SYSFUNC(CEIL(%EVAL(%SYSFUNC(VARLEN(&DSID,%SYSFUNC(VARNUM(&DSID,STDM))))/200)));

%LET RC   = %SYSFUNC(CLOSE(&DSID));

DATA WANT ;

  SET HAVE;

  ARRAY SPLITIT $200 DVTERM1-DVTERM&DIM. _CHARACTER_;

  DO I = 1 TO &DIM.;

    SPLITIT{I} = SUBSTR(STDM,(I-1)*200+1);

  END;

RUN;

%PUT &DIM.;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 15 replies
  • 10606 views
  • 3 likes
  • 11 in conversation