Hi all,
I have no idea how to extract information between strings.... for example I need the information Maple Grove, MN between <dt>Location</dt> and </dd>. (original info: <dt>Location</dt> <dd>Maple Grove, MN</dd>)
Thank you!
You gave sample data as:
<dt>Location</dt> <dd>Maple Grove, MN</dd>
assuming your data is in format:
<dt>xxxxx</dt> <dd>yyyyyyyyy</dd>
then you can extract your info by:
data want;
set have;
pos1 = index(string, '<dd>') + 4;
pos2 = index(string, '</dd>');
info = substr(string,pos1, pos2-pos1 +1);
run;
You gave sample data as:
<dt>Location</dt> <dd>Maple Grove, MN</dd>
assuming your data is in format:
<dt>xxxxx</dt> <dd>yyyyyyyyy</dd>
then you can extract your info by:
data want;
set have;
pos1 = index(string, '<dd>') + 4;
pos2 = index(string, '</dd>');
info = substr(string,pos1, pos2-pos1 +1);
run;
thank you Shmuel for the immediate response! but the location format is not fixed. here are other examples:
<dt>Location</dt> <dd>Austin, TX</dd>
<dt>Location</dt> <dd>MD</dd>
Hi @Kurt_Bremser,
@Shmuel's code works! I might type in something wrong last night while adjusting the code........
Thank you both 🙂
Perl Regular Expressions offer a powerfull syntax when it comes to searching for patterns in a string.
It needs a bit of skill-up time before you can make use of it - but it's very worthwhile learning.
data have;
input locat $50.;
datalines;
<dt>location</dt><dd>Mapple grove,MN</dd>
<dt>location</dt><dd>New york,NY</dd>
<dt>location</dt><dd>New york,NY</dd>
<dt>location</dt><dd>Austin, TX</dd>
;
run;
data want(drop=_:);
set have;
length address $40;
retain _prxid;
if _n_=1 then _prxid=prxparse('/<dt>location<\/dt><dd>(.*)<\/dd>/oi');
if prxmatch(_prxid,locat) then
address=prxposn(_prxid, 1, locat);
run;
You could have used the SCAN function to extract the data based on delimiters, But SCAN function doesn't support 'word' delimiters. So you can try changing the delimitersor the <tags> and convert them to some symbols using tranwrd function and then can extract the information required.
Try running the below code:
data old;
input locat $50.;
datalines;
<dt>location</dt><dd>Mapple grove,MN</dd>
<dt>location</dt><dd>New york,NY</dd>
<dt>location</dt><dd>New york,NY</dd>
<dt>location</dt><dd>Austin, TX</dd>
;
run;
Data new;
set old;
length column1 $50. column2 $50.;
column1=scanq(tranwrd(tranwrd(locat,'<dt>','^'),'</dt>','^'),1,'^');
column2=scanq(tranwrd(tranwrd(locat,'<dd>','#'),'</dd>','#'),2,'#');
run;
output:
/* T100910 Extract or remove all text between two strings */
inspired by
http://stackoverflow.com/questions/9721052/r-extract-part-of-string
and
https://goo.gl/KYV45R
https://communities.sas.com/t5/Base-SAS-Programming/Extract-information-between-strings/m-p/321238
also posted to SAS-L a couple of times
HAVE
====
ROGER DEANGELIS L15 NOVEMBER 21, 1963 49 SPACKENKILL ROAD
WANT
====
ROGER DEANGELIS 49 SPACKENKILL ROAD
OR
NOVEMBER 21, 1963
DETAILS
REMOVE ONLY
---------------------
ROGER DEANGELIS L15 NOVEMBER 21, 1963 49 SPACKENKILL ROAD
----------------------
KEEP ONLY
------------------
ROGER DEANGELIS L15 NOVEMBER 21, 1963 49 SPACKENKILL ROAD
------------------
SOLUTIONS
____
| _ \ ___ _ __ ___ _____ _____
| |_) / _ \ '_ ` _ \ / _ \ \ / / _ \
| _ < __/ | | | | | (_) \ V / __/
|_| \_\___|_| |_| |_|\___/ \_/ \___|
data _null_;
retain compiled_pattern;
length cleantxt $96;
input txt & $100.;
if _n_ =1 then do;
compiled_pattern = prxparse('s/L15(.+?)49//');
end;
call prxchange(compiled_pattern,-1,txt,cleantxt);
put cleantxt;
cards4;
ROGER DEANGELIS L15 NOVEMBER 21, 1963 49 SPACKENKILL ROAD
GREGG DEANGELIS L15 NOVEMBER 18, 1963 49 PENSACOLA L ROAD
;;;;
run;quit;
_ __
| |/ /___ ___ _ __
| ' // _ \/ _ \ '_ \
| . \ __/ __/ |_) |
|_|\_\___|\___| .__/
|_|
data _null_;
retain compiled_pattern;
length cleantext $96;
input txt & $100.;
if _n_ =1 then do;
compiled_pattern = prxparse('s/L15(.+?)49//');
end;
if prxmatch(compiled_pattern, txt) then cleantext = prxposn(compiled_pattern, 1, txt);
put cleantext;
cards4;
ROGER DEANGELIS L15 NOVEMBER 21, 1963 49 SPACKENKILL ROAD
GREGG DEANGELIS L15 NOVEMBER 18, 1963 49 PENSACOLA L ROAD
;;;;
run;quit;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.