DATA Step, Macro, Functions and more

Extract information between strings

Accepted Solution Solved
Reply
Contributor
Posts: 24
Accepted Solution

Extract information between strings

Hi all, 

 

I have no idea how to extract information between strings.... for example I need the information Maple Grove, MN between <dt>Location</dt> and </dd>. (original info: <dt>Location</dt> <dd>Maple Grove, MN</dd>)

 

Thank you!


Accepted Solutions
Solution
‎12-27-2016 01:52 PM
Trusted Advisor
Posts: 1,410

Re: Extract information between strings

You gave sample data as:

     <dt>Location</dt> <dd>Maple Grove, MN</dd>

 

assuming your data is in format:

   <dt>xxxxx</dt> <dd>yyyyyyyyy</dd>

then you can extract your info by:

 

data want;

  set have;

        pos1 = index(string, '<dd>') + 4;

        pos2 = index(string, '</dd>');

        info = substr(string,pos1, pos2-pos1 +1);

run;

 

View solution in original post


All Replies
Solution
‎12-27-2016 01:52 PM
Trusted Advisor
Posts: 1,410

Re: Extract information between strings

You gave sample data as:

     <dt>Location</dt> <dd>Maple Grove, MN</dd>

 

assuming your data is in format:

   <dt>xxxxx</dt> <dd>yyyyyyyyy</dd>

then you can extract your info by:

 

data want;

  set have;

        pos1 = index(string, '<dd>') + 4;

        pos2 = index(string, '</dd>');

        info = substr(string,pos1, pos2-pos1 +1);

run;

 

Contributor
Posts: 24

Re: Extract information between strings

thank you  for the immediate response! but the location format is not fixed. here are other examples:

 

<dt>Location</dt> <dd>Austin, TX</dd>

 

<dt>Location</dt> <dd>MD</dd>

Super User
Posts: 6,982

Re: Extract information between strings


may0423 wrote:

thank you  for the immediate response! but the location format is not fixed. here are other examples:

 

<dt>Location</dt> <dd>Austin, TX</dd>

 

<dt>Location</dt> <dd>MD</dd>


@Shmuels code will work in both examples, it does not use any fixed positions. Just try it.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Contributor
Posts: 24

Re: Extract information between strings

Hi @KurtBremser,

 

@Shmuel's code works! I might type in something wrong last night while adjusting the code........

 

Thank you both Smiley Happy

Respected Advisor
Posts: 3,908

Re: Extract information between strings

@may0423

Perl Regular Expressions offer a powerfull syntax when it comes to searching for patterns in a string.

It needs a bit of skill-up time before you can make use of it - but it's very worthwhile learning.

data have;
	input locat $50.;
	datalines;
<dt>location</dt><dd>Mapple grove,MN</dd> 
<dt>location</dt><dd>New york,NY</dd> 
<dt>location</dt><dd>New york,NY</dd> 
<dt>location</dt><dd>Austin, TX</dd> 
;
run;

data want(drop=_:);
  set have;
  length address $40;
  retain _prxid;
  if _n_=1 then _prxid=prxparse('/<dt>location<\/dt><dd>(.*)<\/dd>/oi');
  if prxmatch(_prxid,locat) then 
    address=prxposn(_prxid, 1, locat);
run;

http://support.sas.com/documentation/cdl/en/lefunctionsref/69762/HTML/default/viewer.htm#n1lru1b4uoo...

http://support.sas.com/documentation/cdl/en/lefunctionsref/63354/HTML/default/viewer.htm#p0s9ilagexm...

 

 

Contributor
Posts: 24

Re: Extract information between strings

Hi @Patrick,

 

I am very new to SAS. Thank you for the notes! It's getting interesting Smiley Happy

Contributor
Posts: 52

Re: Extract information between strings

You could have used the SCAN function to extract the data based on delimiters, But SCAN function doesn't support 'word' delimiters. So you can try changing the delimitersor the <tags> and convert them to some symbols using tranwrd function and then can extract the information required.

Try running the below code:

data old;
	input locat $50.;
	datalines;
<dt>location</dt><dd>Mapple grove,MN</dd> 
<dt>location</dt><dd>New york,NY</dd> 
<dt>location</dt><dd>New york,NY</dd> 
<dt>location</dt><dd>Austin, TX</dd> 
;
run;

Data new;
	set old;
	length column1 $50. column2 $50.;
	column1=scanq(tranwrd(tranwrd(locat,'<dt>','^'),'</dt>','^'),1,'^');
	column2=scanq(tranwrd(tranwrd(locat,'<dd>','#'),'</dd>','#'),2,'#');
run;

output:

 

output


Cheers from India!

Manjeet
Valued Guide
Posts: 505

Re: Extract information between strings

/* T100910 Extract or remove all text between two strings  */

inspired by
http://stackoverflow.com/questions/9721052/r-extract-part-of-string

and

https://goo.gl/KYV45R
https://communities.sas.com/t5/Base-SAS-Programming/Extract-information-between-strings/m-p/321238

also posted to SAS-L a couple of times
HAVE ==== ROGER DEANGELIS L15 NOVEMBER 21, 1963 49 SPACKENKILL ROAD WANT ==== ROGER DEANGELIS 49 SPACKENKILL ROAD OR NOVEMBER 21, 1963 DETAILS REMOVE ONLY --------------------- ROGER DEANGELIS L15 NOVEMBER 21, 1963 49 SPACKENKILL ROAD ---------------------- KEEP ONLY ------------------ ROGER DEANGELIS L15 NOVEMBER 21, 1963 49 SPACKENKILL ROAD ------------------ SOLUTIONS ____ | _ \ ___ _ __ ___ _____ _____ | |_) / _ \ '_ ` _ \ / _ \ \ / / _ \ | _ < __/ | | | | | (_) \ V / __/ |_| \_\___|_| |_| |_|\___/ \_/ \___| data _null_; retain compiled_pattern; length cleantxt $96; input txt & $100.; if _n_ =1 then do; compiled_pattern = prxparse('s/L15(.+?)49//'); end; call prxchange(compiled_pattern,-1,txt,cleantxt); put cleantxt; cards4; ROGER DEANGELIS L15 NOVEMBER 21, 1963 49 SPACKENKILL ROAD GREGG DEANGELIS L15 NOVEMBER 18, 1963 49 PENSACOLA L ROAD ;;;; run;quit; _ __ | |/ /___ ___ _ __ | ' // _ \/ _ \ '_ \ | . \ __/ __/ |_) | |_|\_\___|\___| .__/ |_| data _null_; retain compiled_pattern; length cleantext $96; input txt & $100.; if _n_ =1 then do; compiled_pattern = prxparse('s/L15(.+?)49//'); end; if prxmatch(compiled_pattern, txt) then cleantext = prxposn(compiled_pattern, 1, txt); put cleantext; cards4; ROGER DEANGELIS L15 NOVEMBER 21, 1963 49 SPACKENKILL ROAD GREGG DEANGELIS L15 NOVEMBER 18, 1963 49 PENSACOLA L ROAD ;;;; run;quit;
Contributor
Posts: 24

Re: Extract information between strings

Hi @mnjtrana,

 

Thank you very much for the code!

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 9 replies
  • 362 views
  • 6 likes
  • 6 in conversation