Solved: Re: Extracting text between symbols

J_J_J · Posted 10-25-2019 07:28 AM

Hi

I want to extract characters between two symbols.

Number of characters is always different, it can be 3, 4 or more: #ABC# or #ABCD# etc.

Example:

blablablabla #ABC# blablabla

the output i want to get is ABC

I tried to use:

scan(t1.COMMENT_TXT, 2, "##"), but output was blablablabla blablalba

Maybe it should be combination on FIND and SUBSTRING, but I don't know how to use unknown number of characters in SUBSTRING function.

Can anybody help me please?

PeterClemmensen · Posted 10-25-2019 07:48 AM

Alternatively..

data _null_;
    string='blablablabla #ABC# blablabla';
    newstring=prxchange('s/.*#(.+)#.*/\1/', -1, string);
    put newstring=;
run;

Result:

newstring=ABC

The DATA to DATA Step Macro
Blog: SASnrd

View solution in original post

Jagadishkatam · Posted 10-25-2019 07:36 AM

you are almost right, but a small modification to scan(t1.COMMENT_TXT, 2, "#")

data have;
input text&$100.;
news=scan(text,2,'#');
cards;
blablablabla #ABC# blablabla
;
run;

Thanks,
Jag

J_J_J · Posted 10-25-2019 07:40 AM

Thanks, Jag!

scan(t1.COMMENT_TXT, 2, "#") works, but the result is ABC blablabla, all characters after last # stay too 😞

Jagadishkatam · Posted 10-25-2019 07:42 AM

they will not , see the output i am getting below

Thanks,
Jag

J_J_J · Posted 10-25-2019 07:48 AM

I don't use cards, because there are thousands of row with different data

data have;
input text&$100.;
news=scan(text,2,'#');
cards;blablablabla #ABC# blablabla;
run;

I use ode:

PROC SQL;
CREATE TABLE WANT AS
SELECT DISTINCT

t1.ID,
t1.CREATED,
t1.COMMENT_TXT,
(scan(t1.COMMENT_TXT, 2, "#")) AS ABC
FROM data.COMMENT t1
;QUIT;

ballardw · Posted 10-25-2019 10:27 AM

@J_J_J wrote:

I don't use cards, because there are thousands of row with different data
data have;
input text&$100.;
news=scan(text,2,'#');
cards;blablablabla #ABC# blablabla;
run;
I use ode:

PROC SQL;
CREATE TABLE WANT AS
SELECT DISTINCT

t1.ID,
t1.CREATED,
t1.COMMENT_TXT,
(scan(t1.COMMENT_TXT, 2, "#")) AS ABC
FROM data.COMMENT t1
;QUIT;

A cards statement was used to provide data as an example. There is no implication that you need to use cards.

BTW, you need to pay very close attention when using cards or datalines in line data. The data step you posted in response:

data have;
input text&$100.;
news=scan(text,2,'#');
cards;blablablabla #ABC# blablabla;
run;

Does not run for several reasons. The only thing that can be on a line with CARDS; or DATALINES; or the Cards4; version is the cards and dataline and ending semicolon.

Also the semicolon to end the data MUST be on a separate line or the last line of data is not read.

1    data have;
2    input text&$100.;
3    news=scan(text,2,'#');
4    cards;

NOTE: The data set USER.HAVE has 0 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.02 seconds
      cpu time            0.00 seconds


4  !       blablablabla #ABC# blablabla;
           ------------
           180

ERROR 180-322: Statement is not valid or it is used out of proper order.

5    run;

So that data step really needs to be

data have;
input text&$100.;
news=scan(text,2,'#');
cards;
blablablabla #ABC# blablabla
;

the Run; is actually optional as the semicolon at the end of the data lines ends the data step. I include "run" only to make the editor show the code folding and start/end of procedures properly.

You also if using scan in a data step should set the length of the new variable to be the longest expected value. Otherwise you may have either truncated data for very long values or lots of wasted storage space with the default length of the new variable matching that of the scanned text.

Tom · Posted 10-25-2019 10:52 AM

the Run; is actually optional as the semicolon at the end of the data lines ends the data step. I include "run" only to make the editor show the code folding and start/end of procedures properly.

That sounds like a bug in the Editor. Perhaps you should raise it as an SAS ballot option.

PeterClemmensen · Posted 10-25-2019 07:48 AM

Alternatively..

data _null_;
    string='blablablabla #ABC# blablabla';
    newstring=prxchange('s/.*#(.+)#.*/\1/', -1, string);
    put newstring=;
run;

Result:

newstring=ABC

The DATA to DATA Step Macro
Blog: SASnrd

J_J_J · Posted 10-25-2019 07:50 AM

Draycut, it works! Thank you!!!

J_J_J · Posted 10-25-2019 08:23 AM

Could you please explain metecharacters you've used: prxchange('s/.*#(.+)#.*/\1/', -1, string)

Tom · Posted 10-25-2019 10:50 AM

@J_J_J wrote:

Hi

I want to extract characters between two symbols.

Number of characters is always different, it can be 3, 4 or more: #ABC# or #ABCD# etc.

Example:

blablablabla #ABC# blablabla

the output i want to get is ABC

I tried to use:

scan(t1.COMMENT_TXT, 2, "##"), but output was blablablabla blablalba

Maybe it should be combination on FIND and SUBSTRING, but I don't know how to use unknown number of characters in SUBSTRING function.

Can anybody help me please?

You have listed the same delimiter character twice in your list of delimiter characters. The extra # in the list of delimiters does nothing. SCAN() will treat any occurrence of any of the delimiter characters the same.

You cannot get that result from that function call on your posted input. To get that result the input string would need to look like:

some other text#blablablabla blablalba

Or

some other text#blablablabla blablalba#more other text

Do all of your examples have # characters?

blablablabla blablalba

Do they have 2 and only 2 # characters? Could there be just one?

some other text#blablablabla blablalba

Could there be three?

some other text#ABC#CDE#more text

Could there be four or more?

some other text#ABC#more text#CDE#and even more text

Is the # character ever the first character?

#AB#some other text

J_J_J · Posted 10-28-2019 01:44 AM

Do all of your examples have # characters?

blablablabla blablalba

Not all examples have # characters.

Do they have 2 and only 2 # characters? Could there be just one?

some other text#blablablabla blablalba

No, some lines have more than 2 # characters, but not less than 2

Could there be three?

some other text#ABC#CDE#more text

It could be text #abc# text #abs# text

Could there be four or more?

some other text#ABC#more text#CDE#and even more text

Yes

Is the # character ever the first character?

#AB#some other text

# character can be first, in the middle of the text or last in the sentence:

#abc# text

text #abc# text

Text #abc#

text #abc# text #abc# text

Classroom Training Available!