Solved: Re: Extract information between strings

may0423 · Posted 12-26-2016 11:42 PM

Hi all,

I have no idea how to extract information between strings.... for example I need the information Maple Grove, MN between <dt>Location</dt> and </dd>. (original info: <dt>Location</dt> <dd>Maple Grove, MN</dd>)

Thank you!

Shmuel · Posted 12-27-2016 12:24 AM

You gave sample data as:

<dt>Location</dt> <dd>Maple Grove, MN</dd>

assuming your data is in format:

<dt>xxxxx</dt> <dd>yyyyyyyyy</dd>

then you can extract your info by:

data want;

set have;

pos1 = index(string, '<dd>') + 4;

pos2 = index(string, '</dd>');

info = substr(string,pos1, pos2-pos1 +1);

run;

View solution in original post

Shmuel · Posted 12-27-2016 12:24 AM

You gave sample data as:

<dt>Location</dt> <dd>Maple Grove, MN</dd>

assuming your data is in format:

<dt>xxxxx</dt> <dd>yyyyyyyyy</dd>

then you can extract your info by:

data want;

set have;

pos1 = index(string, '<dd>') + 4;

pos2 = index(string, '</dd>');

info = substr(string,pos1, pos2-pos1 +1);

run;

may0423 · Posted 12-27-2016 12:55 AM

thank you Shmuel for the immediate response! but the location format is not fixed. here are other examples:

<dt>Location</dt> <dd>Austin, TX</dd>

<dt>Location</dt> <dd>MD</dd>

Kurt_Bremser · Posted 12-27-2016 03:02 AM

@may0423 wrote:

thank you @Shmuel for the immediate response! but the location format is not fixed. here are other examples:

<dt>Location</dt> <dd>Austin, TX</dd>

<dt>Location</dt> <dd>MD</dd>

@Shmuels code will work in both examples, it does not use any fixed positions. Just try it.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

may0423 · Posted 12-27-2016 01:44 PM

Hi @Kurt_Bremser,

@Shmuel's code works! I might type in something wrong last night while adjusting the code........

Thank you both 🙂

Patrick · Posted 12-27-2016 05:41 AM

@may0423

Perl Regular Expressions offer a powerfull syntax when it comes to searching for patterns in a string.

It needs a bit of skill-up time before you can make use of it - but it's very worthwhile learning.

data have;
	input locat $50.;
	datalines;
<dt>location</dt><dd>Mapple grove,MN</dd> 
<dt>location</dt><dd>New york,NY</dd> 
<dt>location</dt><dd>New york,NY</dd> 
<dt>location</dt><dd>Austin, TX</dd> 
;
run;

data want(drop=_:);
  set have;
  length address $40;
  retain _prxid;
  if _n_=1 then _prxid=prxparse('/<dt>location<\/dt><dd>(.*)<\/dd>/oi');
  if prxmatch(_prxid,locat) then 
    address=prxposn(_prxid, 1, locat);
run;

http://support.sas.com/documentation/cdl/en/lefunctionsref/69762/HTML/default/viewer.htm#n1lru1b4uoo...

http://support.sas.com/documentation/cdl/en/lefunctionsref/63354/HTML/default/viewer.htm#p0s9ilagexm...

may0423 · Posted 12-27-2016 01:48 PM

Hi @Patrick,

I am very new to SAS. Thank you for the notes! It's getting interesting

mnjtrana · Posted 12-27-2016 04:03 AM

You could have used the SCAN function to extract the data based on delimiters, But SCAN function doesn't support 'word' delimiters. So you can try changing the delimitersor the <tags> and convert them to some symbols using tranwrd function and then can extract the information required.

Try running the below code:

data old;
	input locat $50.;
	datalines;
<dt>location</dt><dd>Mapple grove,MN</dd> 
<dt>location</dt><dd>New york,NY</dd> 
<dt>location</dt><dd>New york,NY</dd> 
<dt>location</dt><dd>Austin, TX</dd> 
;
run;

Data new;
	set old;
	length column1 $50. column2 $50.;
	column1=scanq(tranwrd(tranwrd(locat,'<dt>','^'),'</dt>','^'),1,'^');
	column2=scanq(tranwrd(tranwrd(locat,'<dd>','#'),'</dd>','#'),2,'#');
run;

output:

Cheers from India!

Manjeet

rogerjdeangelis · Posted 12-27-2016 10:26 AM

/* T100910 Extract or remove all text between two strings  */

inspired by
http://stackoverflow.com/questions/9721052/r-extract-part-of-string

and

https://goo.gl/KYV45R
https://communities.sas.com/t5/Base-SAS-Programming/Extract-information-between-strings/m-p/321238

also posted to SAS-L a couple of times

HAVE
====

ROGER DEANGELIS L15 NOVEMBER 21, 1963 49 SPACKENKILL ROAD

WANT
====

ROGER DEANGELIS 49 SPACKENKILL ROAD

OR

NOVEMBER 21, 1963


DETAILS
                    REMOVE ONLY
               ---------------------
ROGER DEANGELIS L15 NOVEMBER 21, 1963 49 SPACKENKILL ROAD
               ----------------------

                     KEEP ONLY
                    ------------------
ROGER DEANGELIS L15 NOVEMBER 21, 1963 49 SPACKENKILL ROAD
                    ------------------

SOLUTIONS


 ____
|  _ \ ___ _ __ ___   _____   _____
| |_) / _ \ '_ ` _ \ / _ \ \ / / _ \
|  _ <  __/ | | | | | (_) \ V /  __/
|_| \_\___|_| |_| |_|\___/ \_/ \___|


data _null_;
     retain compiled_pattern;
     length cleantxt $96;
     input txt & $100.;
     if _n_ =1 then do;
         compiled_pattern = prxparse('s/L15(.+?)49//');
     end;
     call prxchange(compiled_pattern,-1,txt,cleantxt);
     put cleantxt;
cards4;
ROGER DEANGELIS L15 NOVEMBER 21, 1963 49 SPACKENKILL ROAD
GREGG DEANGELIS L15 NOVEMBER 18, 1963 49 PENSACOLA L ROAD
;;;;
run;quit;
 _  __
| |/ /___  ___ _ __
| ' // _ \/ _ \ '_ \
| . \  __/  __/ |_) |
|_|\_\___|\___| .__/
              |_|

data _null_;
     retain compiled_pattern;
     length cleantext $96;
     input txt & $100.;
     if _n_ =1 then do;
         compiled_pattern = prxparse('s/L15(.+?)49//');
     end;
     if prxmatch(compiled_pattern, txt) then cleantext = prxposn(compiled_pattern, 1, txt);
     put cleantext;
cards4;
ROGER DEANGELIS L15 NOVEMBER 21, 1963 49 SPACKENKILL ROAD
GREGG DEANGELIS L15 NOVEMBER 18, 1963 49 PENSACOLA L ROAD
;;;;
run;quit;

may0423 · Posted 12-27-2016 01:50 PM

Hi @mnjtrana,

Thank you very much for the code!

Registration is open

SAS Training: Just a Click Away