Solved: Re: Remove leading and trailing delimiters (~)

shanmukh2 · Posted 11-13-2019 01:21 PM

Hi,

I have a dataset with the following data

id comment

123 ~~%abc~asd~

234 abc~def~

456 ~~ .vbr~123~567

678 ~*bvs~

The Required result I need is to remove leading and trailing ~ but need the ~ in between.

id comment

123 %abc~asd

234 abc~def

456 .vbr~123~567

678 *bvs

I am guessing I should use regex here, but not sure how, can some one help thanks.

Shmuel · Posted 11-13-2019 02:19 PM

Try next code:

do i=1 to length(var);
    if substr(var,i,1) ne '~' then do;
       var = substr(var,i);
       leave;
    end;
end;
do j=length(var) to 1 by -1;
    if substr(var,i,1) ne '~' then do;
       var = substr(var,1,j);
       leave;
    end;
end;

check the code what happens, and what do you want to do, in case var = '~~~~~~~' (only delimiters: put i= j=).

View solution in original post

Shmuel · Posted 11-13-2019 01:44 PM

If there are no spaces in the text then:

var = tran(compbl(tran(var,' ','~')),'~',' ');

that will replace multi '~' into one only.

shanmukh2 · Posted 11-13-2019 01:56 PM

It's not working , there are spaces in the comments

Shmuel · Posted 11-13-2019 02:19 PM

Try next code:

do i=1 to length(var);
    if substr(var,i,1) ne '~' then do;
       var = substr(var,i);
       leave;
    end;
end;
do j=length(var) to 1 by -1;
    if substr(var,i,1) ne '~' then do;
       var = substr(var,1,j);
       leave;
    end;
end;

check the code what happens, and what do you want to do, in case var = '~~~~~~~' (only delimiters: put i= j=).

Tom · Posted 11-13-2019 02:20 PM

You could probably figure out how to write a regular expression pretty easily.

But you could also just figure out how to find the part of the string you want by calculating some pointers into the string.

----+---10----+---20
~~  .vbr~123~567
  ^------- first_nondlm = 3
            ^------- last_dlm = 13
               ^----- str_length = 16
  ^---------^   substr_length = 10

Here is some code. You might need a little more logic to handle cases without trailing delimiters.

data have;
 input id comment $40.;
cards;
123 ~~%abc~asd~
234 abc~def~
456 ~~  .vbr~123~567
678 ~*bvs~
999
;

data want;
  set have;
  first_nondlm=verify(comment,'~');
  last_dlm=findc(comment,'~','b');
  str_length=lengthn(comment);
  substr_length=min(str_length,last_dlm) - first_nondlm;
  want=substrn(comment,first_nondlm,substr_length);
run;
proc print;
run;

Result

                                  first_                 str_     substr_
Obs     id    comment             nondlm    last_dlm    length     length    want

 1     123    ~~%abc~asd~            3         11         11          8      %abc~asd
 2     234    abc~def~               1          8          8          7      abc~def
 3     456    ~~  .vbr~123~567       3         13         16         10        .vbr~123
 4     678    ~*bvs~                 2          6          6          4      *bvs
 5     999                           1          0          0         -1

shanmukh2 · Posted 11-13-2019 04:40 PM

Thank you, this solution works too.

r_behata · Posted 11-13-2019 02:50 PM


data want;
	IF _N_ = 1 THEN
		
			PATTERN = prxparse('/[^~+].+[^~+]/');

		

	RETAIN PATTERN;
	LENGTH comment_ $20 ;
	INPUT Id comment $5-20;
	CALL PRXSUBSTR(PATTERN,trim(comment),START,LENGTH);


			comment_ = SUBSTR(comment,START,LENGTH);

	KEEP Id comment comment_;
	DATALINES;
123 ~~%abc~asd~
234 abc~def~
456 ~~  .vbr~123~567
678 ~*bvs~
;
run;

mkeintz · Posted 11-13-2019 04:53 PM

I wouldn't bother with regex, which I believe can be slow. Instead, consider taking advantage of the CHAR, LENGTH, and SUBSTR functions:

data have;
  input id comment $20.;
  put (_all_) (=);
datalines;
123 ~~%abc~asd~
234 abc~def~
456 ~~  .vbr~123~567
678 ~*bvs~
run;


data want (drop=_:);
  set have;
  original=comment;
  do _c=1 by 1 until (char(comment,_c)^='~');;
  end;
  comment=substr(comment,_c);
  do _c=length(comment) by -1 until(char(comment,_c)^='~');
  end;
  comment=substr(comment,1,_c);
run;

This leads leading blanks in the ID=456 comment. If you want to remove all leading and trailing blanks and ~'s, then consider NOT IN instead of ^=, as in:

data want (drop=_:);
  set have;
  original=comment;
  do _c=1 by 1 until (char(comment,_c) not in ('~',' '));
  end;
  comment=substr(comment,_c);
  do _c=length(comment) by -1 until(char(comment,_c) not in ('~',' '));
  end;
  comment=substr(comment,1,_c);
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

novinosrin · Posted 11-13-2019 06:55 PM

SCAN might be very fast IMHO


data have;
 input id comment $40.;
cards;
123 ~~%abc~asd~
234 abc~def~
456 ~~  .vbr~123~567
678 ~*bvs~
;

data want;
set have;
length want $50;
do _n_=1 to countw(strip(comment),'~');
 want=catx('~',want,scan(strip(comment),_n_,'~'));
end;
run;

proc print noobs;run;

id	comment	want
123	~~%abc~asd~	%abc~asd
234	abc~def~	abc~def
456	~~ .vbr~123~567	.vbr~123~567
678	~*bvs~	*bvs

Ksharp · Posted 11-14-2019 07:23 AM

data have;
 input id comment $40.;
cards;
123 ~~%abc~asd~
234 abc~def~
456 ~~  .vbr~123~567
678 ~*bvs~
999
;
data want;
 set have;
 want=prxchange('s/^~+|~+$//',-1,strip(comment));
run;

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away