BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
may0423
Obsidian | Level 7

Hi all, 

 

I have no idea how to extract information between strings.... for example I need the information Maple Grove, MN between <dt>Location</dt> and </dd>. (original info: <dt>Location</dt> <dd>Maple Grove, MN</dd>)

 

Thank you!

1 ACCEPTED SOLUTION

Accepted Solutions
Shmuel
Garnet | Level 18

You gave sample data as:

     <dt>Location</dt> <dd>Maple Grove, MN</dd>

 

assuming your data is in format:

   <dt>xxxxx</dt> <dd>yyyyyyyyy</dd>

then you can extract your info by:

 

data want;

  set have;

        pos1 = index(string, '<dd>') + 4;

        pos2 = index(string, '</dd>');

        info = substr(string,pos1, pos2-pos1 +1);

run;

 

View solution in original post

9 REPLIES 9
Shmuel
Garnet | Level 18

You gave sample data as:

     <dt>Location</dt> <dd>Maple Grove, MN</dd>

 

assuming your data is in format:

   <dt>xxxxx</dt> <dd>yyyyyyyyy</dd>

then you can extract your info by:

 

data want;

  set have;

        pos1 = index(string, '<dd>') + 4;

        pos2 = index(string, '</dd>');

        info = substr(string,pos1, pos2-pos1 +1);

run;

 

may0423
Obsidian | Level 7

thank you  for the immediate response! but the location format is not fixed. here are other examples:

 

<dt>Location</dt> <dd>Austin, TX</dd>

 

<dt>Location</dt> <dd>MD</dd>

Kurt_Bremser
Super User

@may0423 wrote:

thank you @Shmuel for the immediate response! but the location format is not fixed. here are other examples:

 

<dt>Location</dt> <dd>Austin, TX</dd>

 

<dt>Location</dt> <dd>MD</dd>


@Shmuels code will work in both examples, it does not use any fixed positions. Just try it.

may0423
Obsidian | Level 7

Hi @Kurt_Bremser,

 

@Shmuel's code works! I might type in something wrong last night while adjusting the code........

 

Thank you both 🙂

Patrick
Opal | Level 21

@may0423

Perl Regular Expressions offer a powerfull syntax when it comes to searching for patterns in a string.

It needs a bit of skill-up time before you can make use of it - but it's very worthwhile learning.

data have;
	input locat $50.;
	datalines;
<dt>location</dt><dd>Mapple grove,MN</dd> 
<dt>location</dt><dd>New york,NY</dd> 
<dt>location</dt><dd>New york,NY</dd> 
<dt>location</dt><dd>Austin, TX</dd> 
;
run;

data want(drop=_:);
  set have;
  length address $40;
  retain _prxid;
  if _n_=1 then _prxid=prxparse('/<dt>location<\/dt><dd>(.*)<\/dd>/oi');
  if prxmatch(_prxid,locat) then 
    address=prxposn(_prxid, 1, locat);
run;

http://support.sas.com/documentation/cdl/en/lefunctionsref/69762/HTML/default/viewer.htm#n1lru1b4uoo...

http://support.sas.com/documentation/cdl/en/lefunctionsref/63354/HTML/default/viewer.htm#p0s9ilagexm...

 

 

may0423
Obsidian | Level 7

Hi @Patrick,

 

I am very new to SAS. Thank you for the notes! It's getting interesting Smiley Happy

mnjtrana
Pyrite | Level 9

You could have used the SCAN function to extract the data based on delimiters, But SCAN function doesn't support 'word' delimiters. So you can try changing the delimitersor the <tags> and convert them to some symbols using tranwrd function and then can extract the information required.

Try running the below code:

data old;
	input locat $50.;
	datalines;
<dt>location</dt><dd>Mapple grove,MN</dd> 
<dt>location</dt><dd>New york,NY</dd> 
<dt>location</dt><dd>New york,NY</dd> 
<dt>location</dt><dd>Austin, TX</dd> 
;
run;

Data new;
	set old;
	length column1 $50. column2 $50.;
	column1=scanq(tranwrd(tranwrd(locat,'<dt>','^'),'</dt>','^'),1,'^');
	column2=scanq(tranwrd(tranwrd(locat,'<dd>','#'),'</dd>','#'),2,'#');
run;

output:

 

output


Cheers from India!

Manjeet
rogerjdeangelis
Barite | Level 11
/* T100910 Extract or remove all text between two strings  */

inspired by
http://stackoverflow.com/questions/9721052/r-extract-part-of-string

and

https://goo.gl/KYV45R
https://communities.sas.com/t5/Base-SAS-Programming/Extract-information-between-strings/m-p/321238

also posted to SAS-L a couple of times
HAVE ==== ROGER DEANGELIS L15 NOVEMBER 21, 1963 49 SPACKENKILL ROAD WANT ==== ROGER DEANGELIS 49 SPACKENKILL ROAD OR NOVEMBER 21, 1963 DETAILS REMOVE ONLY --------------------- ROGER DEANGELIS L15 NOVEMBER 21, 1963 49 SPACKENKILL ROAD ---------------------- KEEP ONLY ------------------ ROGER DEANGELIS L15 NOVEMBER 21, 1963 49 SPACKENKILL ROAD ------------------ SOLUTIONS ____ | _ \ ___ _ __ ___ _____ _____ | |_) / _ \ '_ ` _ \ / _ \ \ / / _ \ | _ < __/ | | | | | (_) \ V / __/ |_| \_\___|_| |_| |_|\___/ \_/ \___| data _null_; retain compiled_pattern; length cleantxt $96; input txt & $100.; if _n_ =1 then do; compiled_pattern = prxparse('s/L15(.+?)49//'); end; call prxchange(compiled_pattern,-1,txt,cleantxt); put cleantxt; cards4; ROGER DEANGELIS L15 NOVEMBER 21, 1963 49 SPACKENKILL ROAD GREGG DEANGELIS L15 NOVEMBER 18, 1963 49 PENSACOLA L ROAD ;;;; run;quit; _ __ | |/ /___ ___ _ __ | ' // _ \/ _ \ '_ \ | . \ __/ __/ |_) | |_|\_\___|\___| .__/ |_| data _null_; retain compiled_pattern; length cleantext $96; input txt & $100.; if _n_ =1 then do; compiled_pattern = prxparse('s/L15(.+?)49//'); end; if prxmatch(compiled_pattern, txt) then cleantext = prxposn(compiled_pattern, 1, txt); put cleantext; cards4; ROGER DEANGELIS L15 NOVEMBER 21, 1963 49 SPACKENKILL ROAD GREGG DEANGELIS L15 NOVEMBER 18, 1963 49 PENSACOLA L ROAD ;;;; run;quit;
may0423
Obsidian | Level 7

Hi @mnjtrana,

 

Thank you very much for the code!

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 9 replies
  • 2188 views
  • 6 likes
  • 6 in conversation