Hi all,
I have no idea how to extract information between strings.... for example I need the information Maple Grove, MN between <dt>Location</dt> and </dd>. (original info: <dt>Location</dt> <dd>Maple Grove, MN</dd>)
Thank you!
You gave sample data as:
<dt>Location</dt> <dd>Maple Grove, MN</dd>
assuming your data is in format:
<dt>xxxxx</dt> <dd>yyyyyyyyy</dd>
then you can extract your info by:
data want;
set have;
pos1 = index(string, '<dd>') + 4;
pos2 = index(string, '</dd>');
info = substr(string,pos1, pos2-pos1 +1);
run;
You gave sample data as:
<dt>Location</dt> <dd>Maple Grove, MN</dd>
assuming your data is in format:
<dt>xxxxx</dt> <dd>yyyyyyyyy</dd>
then you can extract your info by:
data want;
set have;
pos1 = index(string, '<dd>') + 4;
pos2 = index(string, '</dd>');
info = substr(string,pos1, pos2-pos1 +1);
run;
thank you Shmuel for the immediate response! but the location format is not fixed. here are other examples:
<dt>Location</dt> <dd>Austin, TX</dd>
<dt>Location</dt> <dd>MD</dd>
Hi @Kurt_Bremser,
@Shmuel's code works! I might type in something wrong last night while adjusting the code........
Thank you both 🙂
Perl Regular Expressions offer a powerfull syntax when it comes to searching for patterns in a string.
It needs a bit of skill-up time before you can make use of it - but it's very worthwhile learning.
data have;
input locat $50.;
datalines;
<dt>location</dt><dd>Mapple grove,MN</dd>
<dt>location</dt><dd>New york,NY</dd>
<dt>location</dt><dd>New york,NY</dd>
<dt>location</dt><dd>Austin, TX</dd>
;
run;
data want(drop=_:);
set have;
length address $40;
retain _prxid;
if _n_=1 then _prxid=prxparse('/<dt>location<\/dt><dd>(.*)<\/dd>/oi');
if prxmatch(_prxid,locat) then
address=prxposn(_prxid, 1, locat);
run;
You could have used the SCAN function to extract the data based on delimiters, But SCAN function doesn't support 'word' delimiters. So you can try changing the delimitersor the <tags> and convert them to some symbols using tranwrd function and then can extract the information required.
Try running the below code:
data old;
input locat $50.;
datalines;
<dt>location</dt><dd>Mapple grove,MN</dd>
<dt>location</dt><dd>New york,NY</dd>
<dt>location</dt><dd>New york,NY</dd>
<dt>location</dt><dd>Austin, TX</dd>
;
run;
Data new;
set old;
length column1 $50. column2 $50.;
column1=scanq(tranwrd(tranwrd(locat,'<dt>','^'),'</dt>','^'),1,'^');
column2=scanq(tranwrd(tranwrd(locat,'<dd>','#'),'</dd>','#'),2,'#');
run;
output:
/* T100910 Extract or remove all text between two strings */
inspired by
http://stackoverflow.com/questions/9721052/r-extract-part-of-string
and
https://goo.gl/KYV45R
https://communities.sas.com/t5/Base-SAS-Programming/Extract-information-between-strings/m-p/321238
also posted to SAS-L a couple of times
HAVE
====
ROGER DEANGELIS L15 NOVEMBER 21, 1963 49 SPACKENKILL ROAD
WANT
====
ROGER DEANGELIS 49 SPACKENKILL ROAD
OR
NOVEMBER 21, 1963
DETAILS
REMOVE ONLY
---------------------
ROGER DEANGELIS L15 NOVEMBER 21, 1963 49 SPACKENKILL ROAD
----------------------
KEEP ONLY
------------------
ROGER DEANGELIS L15 NOVEMBER 21, 1963 49 SPACKENKILL ROAD
------------------
SOLUTIONS
____
| _ \ ___ _ __ ___ _____ _____
| |_) / _ \ '_ ` _ \ / _ \ \ / / _ \
| _ < __/ | | | | | (_) \ V / __/
|_| \_\___|_| |_| |_|\___/ \_/ \___|
data _null_;
retain compiled_pattern;
length cleantxt $96;
input txt & $100.;
if _n_ =1 then do;
compiled_pattern = prxparse('s/L15(.+?)49//');
end;
call prxchange(compiled_pattern,-1,txt,cleantxt);
put cleantxt;
cards4;
ROGER DEANGELIS L15 NOVEMBER 21, 1963 49 SPACKENKILL ROAD
GREGG DEANGELIS L15 NOVEMBER 18, 1963 49 PENSACOLA L ROAD
;;;;
run;quit;
_ __
| |/ /___ ___ _ __
| ' // _ \/ _ \ '_ \
| . \ __/ __/ |_) |
|_|\_\___|\___| .__/
|_|
data _null_;
retain compiled_pattern;
length cleantext $96;
input txt & $100.;
if _n_ =1 then do;
compiled_pattern = prxparse('s/L15(.+?)49//');
end;
if prxmatch(compiled_pattern, txt) then cleantext = prxposn(compiled_pattern, 1, txt);
put cleantext;
cards4;
ROGER DEANGELIS L15 NOVEMBER 21, 1963 49 SPACKENKILL ROAD
GREGG DEANGELIS L15 NOVEMBER 18, 1963 49 PENSACOLA L ROAD
;;;;
run;quit;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.