Desktop productivity for business analysts and programmers

Extracting words from the substring

Reply
Frequent Contributor
Posts: 89

Extracting words from the substring

Hello, I have a very long varying character field named comment. I want break that field into 7 fields.

and I use

comment_one = substr(comment,1,240);

    comment_extraone = substr(comment,241,480);

    comment_extratwo = substr(comment,481,720);

   comment_extrathree = substr(comment,721,960);

   comment_extrafour = substr(comment,961,1200);

   comment_extrafive=substr(comment,1201,1440);

   comment_extrasix = substr(comment,1441,1680);

but extracted substring looked like "I went to football game ..............then I reached ho"

since the length of comment_one is 240 it showed "ho" as last word but it is actually "home".

I am wondering if there is a way to fix this. Index function does not work since each comment

is of different length.

Thanks

Super Contributor
Posts: 307

Re: Extracting words from the substring

Is there a reason why each extracted comment has to be a specified number of characters? Unless your original comment field is neatly separated by number of characters (e.g. the first sub-comment always ends at 240, the second always ends at 480, etc), the results are going to be less than desireable -- as it seems you've discovered.

If you are trying to break a large comment field into smaller fields, it might make more sense to break it up based on natural sentence endings (e.g. period, question mark, exclamation mark).

Super Contributor
Posts: 333

Re: Extracting words from the substring

So would it be fair to say you want to split the comment by punctuation? Your could do something like the following:

data want;

  length comment_1 - comment_7 $ 100;

  set have;

  comment_1 = scan(comment,1,".","RP");

  comment_2 = scan(comment,2,".","RP");

  comment_3 = scan(comment,3,".","RP");

  comment_4 = scan(comment,4,".","RP");

  comment_5 = scan(comment,5,".","RP");

  comment_6 = scan(comment,6,".","RP");

  comment_7 = scan(comment,7,".","RP");

run;

Code could be condensed with an arrary and do loop but I left it alone as to not confuse matters. The R and P are optional modifiers that trim leading and trailing spaces from the result (R) and use all punctuation as delimiters (P).

EJ

Frequent Contributor
Posts: 89

Re: Extracting words from the substring

I am transferring this extracted sub strings into Crystal cell and it has 256 as maximum character length. I am trying to fit as much as possible.

Respected Advisor
Posts: 4,997

Re: Extracting words from the substring

This reminds me of the good old days before PROC REPORT performed all the text-wrapping for you.  Here's an approach.

array extra {6} $240 comment_extra1 - comment_extra6;

dum = comment;

drop dum;

_i_=0;

if dum > ' ' then do until (dum=' ');
   do _k_=241 to 1 by -1;

      if substr(dum, _k_, 1) = ' ' then do;

         _i_ + 1;

         extra{_i_} = substr(dum, 1, _k_-1);

         dum = left(substr(dum, k));

      end;

   end;

end;

It's up to you to make sure that 6 variables is enough.  Sometimes an individual word can be quite lengthy.  In one medical dictionary that I examined, several words were each over 40 characters long.

Good luck.

Trusted Advisor
Posts: 1,060

Re: Extracting words from the substring

Here's another approach. It hasn't been exhaustively tested.

Tom

data have;
input;
LongDataRecord = _infile_;
cards;
Long text string 1
Long text string 2
Long text string 3
Long text string 4
run;

data want(keep=ShortDataRecord);
length ShortDataRecord $256;
retain prxid;

if _n_=1 then
  prxid=prxparse('/\s+\S*$/o');
set have;
length TestPart $257;

do while(^missing(LongDataRecord));
  TestPart = substr(LongDataRecord, 1, 257);
  FoundPosn = prxmatch(prxid, TestPart);

  if FoundPosn > 0 then
   do;
    LastChar = FoundPosn - 1;
    FirstChar = FoundPosn + 1;
    ShortDataRecord = substr(LongDataRecord, 1, LastChar);
    LongDataRecord = left(substrn(LongDataRecord, FirstChar));
    output;
   end;
  else
   do;
    ShortDataRecord = substr(LongDataRecord, 1, 256);
    LongDataRecord = left(substrn(257, 1));
    output;
   end;
end;
run;

Ask a Question
Discussion stats
  • 5 replies
  • 265 views
  • 0 likes
  • 5 in conversation