topic Re: Extracting words from the substring in SAS Enterprise Guide

Extracting words from the substring

learner_sas — Thu, 05 Sep 2013 17:38:54 GMT

Hello, I have a very long varying character field named comment. I want break that field into 7 fields.

and I use

comment_one = substr(comment,1,240);

comment_extraone = substr(comment,241,480);

comment_extratwo = substr(comment,481,720);

comment_extrathree = substr(comment,721,960);

comment_extrafour = substr(comment,961,1200);

comment_extrafive=substr(comment,1201,1440);

comment_extrasix = substr(comment,1441,1680);

but extracted substring looked like "I went to football game ..............then I reached ho"

since the length of comment_one is 240 it showed "ho" as last word but it is actually "home".

I am wondering if there is a way to fix this. Index function does not work since each comment

is of different length.

Thanks

Re: Extracting words from the substring

Fugue — Thu, 05 Sep 2013 18:41:11 GMT

Is there a reason why each extracted comment has to be a specified number of characters? Unless your original comment field is neatly separated by number of characters (e.g. the first sub-comment always ends at 240, the second always ends at 480, etc), the results are going to be less than desireable -- as it seems you've discovered.

If you are trying to break a large comment field into smaller fields, it might make more sense to break it up based on natural sentence endings (e.g. period, question mark, exclamation mark).

Re: Extracting words from the substring

esjackso — Thu, 05 Sep 2013 18:56:33 GMT

So would it be fair to say you want to split the comment by punctuation? Your could do something like the following:

data want;

length comment_1 - comment_7 $ 100;

set have;

comment_1 = scan(comment,1,".","RP");

comment_2 = scan(comment,2,".","RP");

comment_3 = scan(comment,3,".","RP");

comment_4 = scan(comment,4,".","RP");

comment_5 = scan(comment,5,".","RP");

comment_6 = scan(comment,6,".","RP");

comment_7 = scan(comment,7,".","RP");

run;

Code could be condensed with an arrary and do loop but I left it alone as to not confuse matters. The R and P are optional modifiers that trim leading and trailing spaces from the result (R) and use all punctuation as delimiters (P).

Re: Extracting words from the substring

learner_sas — Thu, 05 Sep 2013 18:58:46 GMT

I am transferring this extracted sub strings into Crystal cell and it has 256 as maximum character length. I am trying to fit as much as possible.

Re: Extracting words from the substring

Astounding — Thu, 05 Sep 2013 19:28:09 GMT

This reminds me of the good old days before PROC REPORT performed all the text-wrapping for you. Here's an approach.

array extra {6} $240 comment_extra1 - comment_extra6;

dum = comment;

drop dum;

_i_=0;

if dum > ' ' then do until (dum=' ');
do _k_=241 to 1 by -1;

if substr(dum, _k_, 1) = ' ' then do;

_i_ + 1;

extra{_i_} = substr(dum, 1, _k_-1);

dum = left(substr(dum, k));

end;

It's up to you to make sure that 6 variables is enough. Sometimes an individual word can be quite lengthy. In one medical dictionary that I examined, several words were each over 40 characters long.

Good luck.

Re: Extracting words from the substring

TomKari — Thu, 05 Sep 2013 20:06:08 GMT

Here's another approach. It hasn't been exhaustively tested.

Tom

data have;
input;
LongDataRecord = _infile_;
cards;
Long text string 1
Long text string 2
Long text string 3
Long text string 4
run;

data want(keep=ShortDataRecord);
length ShortDataRecord $256;
retain prxid;

if _n_=1 then
prxid=prxparse('/\s+\S*$/o');
set have;
length TestPart $257;

do while(^missing(LongDataRecord));
TestPart = substr(LongDataRecord, 1, 257);
FoundPosn = prxmatch(prxid, TestPart);

if FoundPosn > 0 then
   do;
    LastChar = FoundPosn - 1;
    FirstChar = FoundPosn + 1;
    ShortDataRecord = substr(LongDataRecord, 1, LastChar);
    LongDataRecord = left(substrn(LongDataRecord, FirstChar));
    output;
   end;
else
   do;
    ShortDataRecord = substr(LongDataRecord, 1, 256);
    LongDataRecord = left(substrn(257, 1));
    output;
   end;
end;
run;