BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
may0423
Obsidian | Level 7

Hi all, 

 

I have no idea how to extract information between strings.... for example I need the information Maple Grove, MN between <dt>Location</dt> and </dd>. (original info: <dt>Location</dt> <dd>Maple Grove, MN</dd>)

 

Thank you!

1 ACCEPTED SOLUTION

Accepted Solutions
Shmuel
Garnet | Level 18

You gave sample data as:

     <dt>Location</dt> <dd>Maple Grove, MN</dd>

 

assuming your data is in format:

   <dt>xxxxx</dt> <dd>yyyyyyyyy</dd>

then you can extract your info by:

 

data want;

  set have;

        pos1 = index(string, '<dd>') + 4;

        pos2 = index(string, '</dd>');

        info = substr(string,pos1, pos2-pos1 +1);

run;

 

View solution in original post

9 REPLIES 9
Shmuel
Garnet | Level 18

You gave sample data as:

     <dt>Location</dt> <dd>Maple Grove, MN</dd>

 

assuming your data is in format:

   <dt>xxxxx</dt> <dd>yyyyyyyyy</dd>

then you can extract your info by:

 

data want;

  set have;

        pos1 = index(string, '<dd>') + 4;

        pos2 = index(string, '</dd>');

        info = substr(string,pos1, pos2-pos1 +1);

run;

 

may0423
Obsidian | Level 7

thank you  for the immediate response! but the location format is not fixed. here are other examples:

 

<dt>Location</dt> <dd>Austin, TX</dd>

 

<dt>Location</dt> <dd>MD</dd>

Kurt_Bremser
Super User

@may0423 wrote:

thank you @Shmuel for the immediate response! but the location format is not fixed. here are other examples:

 

<dt>Location</dt> <dd>Austin, TX</dd>

 

<dt>Location</dt> <dd>MD</dd>


@Shmuels code will work in both examples, it does not use any fixed positions. Just try it.

may0423
Obsidian | Level 7

Hi @Kurt_Bremser,

 

@Shmuel's code works! I might type in something wrong last night while adjusting the code........

 

Thank you both 🙂

Patrick
Opal | Level 21

@may0423

Perl Regular Expressions offer a powerfull syntax when it comes to searching for patterns in a string.

It needs a bit of skill-up time before you can make use of it - but it's very worthwhile learning.

data have;
	input locat $50.;
	datalines;
<dt>location</dt><dd>Mapple grove,MN</dd> 
<dt>location</dt><dd>New york,NY</dd> 
<dt>location</dt><dd>New york,NY</dd> 
<dt>location</dt><dd>Austin, TX</dd> 
;
run;

data want(drop=_:);
  set have;
  length address $40;
  retain _prxid;
  if _n_=1 then _prxid=prxparse('/<dt>location<\/dt><dd>(.*)<\/dd>/oi');
  if prxmatch(_prxid,locat) then 
    address=prxposn(_prxid, 1, locat);
run;

http://support.sas.com/documentation/cdl/en/lefunctionsref/69762/HTML/default/viewer.htm#n1lru1b4uoo...

http://support.sas.com/documentation/cdl/en/lefunctionsref/63354/HTML/default/viewer.htm#p0s9ilagexm...

 

 

may0423
Obsidian | Level 7

Hi @Patrick,

 

I am very new to SAS. Thank you for the notes! It's getting interesting Smiley Happy

mnjtrana
Pyrite | Level 9

You could have used the SCAN function to extract the data based on delimiters, But SCAN function doesn't support 'word' delimiters. So you can try changing the delimitersor the <tags> and convert them to some symbols using tranwrd function and then can extract the information required.

Try running the below code:

data old;
	input locat $50.;
	datalines;
<dt>location</dt><dd>Mapple grove,MN</dd> 
<dt>location</dt><dd>New york,NY</dd> 
<dt>location</dt><dd>New york,NY</dd> 
<dt>location</dt><dd>Austin, TX</dd> 
;
run;

Data new;
	set old;
	length column1 $50. column2 $50.;
	column1=scanq(tranwrd(tranwrd(locat,'<dt>','^'),'</dt>','^'),1,'^');
	column2=scanq(tranwrd(tranwrd(locat,'<dd>','#'),'</dd>','#'),2,'#');
run;

output:

 

output


Cheers from India!

Manjeet
rogerjdeangelis
Barite | Level 11
/* T100910 Extract or remove all text between two strings  */

inspired by
http://stackoverflow.com/questions/9721052/r-extract-part-of-string

and

https://goo.gl/KYV45R
https://communities.sas.com/t5/Base-SAS-Programming/Extract-information-between-strings/m-p/321238

also posted to SAS-L a couple of times
HAVE ==== ROGER DEANGELIS L15 NOVEMBER 21, 1963 49 SPACKENKILL ROAD WANT ==== ROGER DEANGELIS 49 SPACKENKILL ROAD OR NOVEMBER 21, 1963 DETAILS REMOVE ONLY --------------------- ROGER DEANGELIS L15 NOVEMBER 21, 1963 49 SPACKENKILL ROAD ---------------------- KEEP ONLY ------------------ ROGER DEANGELIS L15 NOVEMBER 21, 1963 49 SPACKENKILL ROAD ------------------ SOLUTIONS ____ | _ \ ___ _ __ ___ _____ _____ | |_) / _ \ '_ ` _ \ / _ \ \ / / _ \ | _ < __/ | | | | | (_) \ V / __/ |_| \_\___|_| |_| |_|\___/ \_/ \___| data _null_; retain compiled_pattern; length cleantxt $96; input txt & $100.; if _n_ =1 then do; compiled_pattern = prxparse('s/L15(.+?)49//'); end; call prxchange(compiled_pattern,-1,txt,cleantxt); put cleantxt; cards4; ROGER DEANGELIS L15 NOVEMBER 21, 1963 49 SPACKENKILL ROAD GREGG DEANGELIS L15 NOVEMBER 18, 1963 49 PENSACOLA L ROAD ;;;; run;quit; _ __ | |/ /___ ___ _ __ | ' // _ \/ _ \ '_ \ | . \ __/ __/ |_) | |_|\_\___|\___| .__/ |_| data _null_; retain compiled_pattern; length cleantext $96; input txt & $100.; if _n_ =1 then do; compiled_pattern = prxparse('s/L15(.+?)49//'); end; if prxmatch(compiled_pattern, txt) then cleantext = prxposn(compiled_pattern, 1, txt); put cleantext; cards4; ROGER DEANGELIS L15 NOVEMBER 21, 1963 49 SPACKENKILL ROAD GREGG DEANGELIS L15 NOVEMBER 18, 1963 49 PENSACOLA L ROAD ;;;; run;quit;
may0423
Obsidian | Level 7

Hi @mnjtrana,

 

Thank you very much for the code!

hackathon24-white-horiz.png

2025 SAS Hackathon: There is still time!

Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!

Register Now

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 9 replies
  • 2935 views
  • 6 likes
  • 6 in conversation