Solved: Re: How to get substring from a string of characters in clumsy data

Siva_Harish · Posted 08-30-2023 03:07 PM

Hello Folks,

I have three questions :

1) How to pull digits along with IC_(underscore) by ignoring all other characters/words below:

I have used prxchange function to extract below but I was able to pick only up to underscore.

extract=prxchange('s/.*(IC.+_).*/\1/', -1, Str)

STR:
IC_1599- CANCELED 
IC_115_Unkown+ sd
IC_151-Postponed from Aug 8
IC_3549_NC3438921-NWA
IC_83-IC-Spo.,
IC_1901 CEO/Non
IC_1567_ET and Cha
IC_159832-CANCELED
IC_1217 C&S IC -17
IC_153852176_C p+ A
IC_164// SPH
IC_1552_S&c
IC_1196-S And C
MSA: IC_3581
SOW: IC_1044
ID 9401_SOW_OREN SMALETZ   IC_108

2) How to exactly match keyword when we have mutliple words as below

Mapping column has values below: 
1) closed_noc, reclined
2)  reclined,closed_noc
Source has only 'Closed' value
str=index(upcase(cats(mapping)), upcase(cats(source))). While using this, closed value is not present in Mapping column but it is saying it has value as index is not looking for exact word. Is there any function which will pick exact keyword when we have comma delimted values?

3) While importing text files using proc import I am seeing datastep details in the log. How can we avoid printing this?

Please help us on this

Patrick · Posted 11-05-2023 05:41 PM

The \b metacharacter matches a word boundary. Based on your sample data it looked to me like the desired substring always starts at the beginning of the string or then is preceded by such a word boundary.

Here the docu for metacharacters.

If you remove the \b then IC can be anywhere in the string and you would also extract substring "ICTORY ROAD _99" from string "VICTORY ROAD _99". If your data doesn't contain strings where the RegEx without the \b would wrongly match then you're good to go.

If that's not the case then please provide sample data with desired result as exhaustive as possible for all the different text patterns.

Let's say the character right before IC is either a word boundary or a dash or an underscore then the RegEx could look as follows.

_prxid=prxparse('/(\b|(?<=[_-]))IC[^_]*_\s*\d+/oi');

The pipe is an OR condition, the highlighted bit is a positive look behind. You find more detail under the referenced docu link.

View solution in original post

ballardw · Posted 08-30-2023 05:52 PM

Last one first:

@Siva_Harish wrote:

3) While importing text files using proc import I am seeing datastep details in the log. How can we avoid printing this?

Write a data step of your own. Especially if you have multiple files that should have the same structure. Proc Import will make different guesses for each file read. So sometimes the result is such that when you need to combine resulting data sets you get errors because of different variable types or loss of data when variables have been defined to have different lengths. Best is to use documentation of the file contents to make sure that each column is read properly. Or save the code generated by proc import, clean it up as needed and then use that to read the file.

For answers to question 1 it would be a good idea to provide what you expect for the result. Showing code that does not get you the result you need isn't of great help if we don't know what you want.

Question 2 you may be looking for INDEXW or FINDW function. They find complete "words" matching the search term. However you need to look at the function documentation for the complete definition of "word" as it can be a bit flexible. Findw has options to do case insensitive searches as well.

Patrick · Posted 08-30-2023 08:22 PM

I suggest you only ask one question at a time.

Below one way for extraction of the source string pattern you're after.

data have;
  infile datalines truncover;
  input str $100.;
datalines;
IC_1599- CANCELED 
IC_115_Unkown+ sd
IC_151-Postponed from Aug 8
IC_3549_NC3438921-NWA
IC_83-IC-Spo.,
IC_1901 CEO/Non
IC_1567_ET and Cha
IC_159832-CANCELED
IC_1217 C&S IC -17
IC_153852176_C p+ A
IC_164// SPH
IC_1552_S&c
IC_1196-S And C
MSA: IC_3581
SOW: IC_1044
ID 9401_SOW_OREN SMALETZ   IC_108
ID 9401_SOW_OREN SMALETZ   xx_108
;

data want;
  set have;
  length extract $40;
  _prxid=prxparse('/IC_\d+/oi');
  call prxsubstr(_prxid,str,_pos,_len);
  if _pos>0 then extract=substr(str,_pos,_len);
  drop _:;
run;

Look into findw() for your 2nd question.

I consider it a good thing that Proc Import prints the data step to the log. Not sure if there is an option to only suppress this part of logging.

Siva_Harish · Posted 09-05-2023 11:21 AM

Thanks for your response Patrick!

While implementing the below code on original data, it is not identifying position/length of the 'IC_' word even though it is present is data and it is case sensitive. Can please what could be the reasons?

data want;
  set have;
  length extract $40;
  _prxid=prxparse('/IC_\d+/oi');
  call prxsubstr(_prxid,str,_pos,_len);
  if _pos>0 then extract=substr(str,_pos,_len);
  drop _:;
run;

re:Look into findw() for your 2nd question, Findw() is not working when we have two comma seperated values hence I took another approach by making values into single quptes and using for where cluase.Thanks

Patrick · Posted 09-05-2023 08:16 PM

Please provide representative sample data via a "data have;" step (same as I've done).

Make sure you provide samples where the current RegEx doesn't work.

Siva_Harish · Posted 09-06-2023 08:53 AM

Our data starts with IC<contains n number of different characters><_different digits> <some special/alphabets characters may or may not exist>. I have given 3,4,5th row as an example. I thought this will work for all the characters which starts with IC even though there are some alphabets after IC but is not. I want to extract starting from IC and picking upto last digit after underscore.


data have;
  infile datalines truncover;
  input str $100.;
datalines;
IC_1599- CANCELED 
IC_115_Unkown+ sd
ICNEWdocumentloaded_150-Postponed from Aug 8
ICOLDagreementstated_3549_CW3438921-MSA
ICPRPaprovalreq_24848 - IC_3549_NC3438921-NWA
IC_83-IC-Spo.,
IC_1901 CEO/Non
IC_1567_ET and Cha
IC_159832-CANCELED
IC_1217 C&S IC -17
IC_153852176_C p+ A
IC_164// SPH
IC_1552_S&c
IC_1196-S And C
MSA: ICREQpaper_3581
SOW: ICDigitalsign_ 1044  signature not in sync 1044
ID 9401_SOW_OREN SMALETZ   IC_108
ID 9401_SOW_OREN SMALETZ   xx_108
;

data want;
  set have;
  length extract $40;
  _prxid=prxparse('/IC_\d+/oi');
  call prxsubstr(_prxid,str,_pos,_len);
  if _pos>0 then extract=substr(str,_pos,_len);
  drop _:;
run;

Patrick · Posted 09-06-2023 08:22 PM

I'm not 100% sure that I've fully understood which pattern fits the requirement. If below code doesn't return what you need then please share for your have data also the desired result.

data want;
  set have;
  length extract $40;
  _prxid=prxparse('/\bIC[^_]*_\s*\d+/oi');
  call prxsubstr(_prxid,str,_pos,_len);
  if _pos>0 then extract=substr(str,_pos,_len);
  drop _:;
run;

Siva_Harish · Posted 11-03-2023 07:01 AM

Hi @Patrick ,

Thanks for helping with regex!

While I was looking the code, I found one issue which is not being handled by regex. If the value is preceded with <special characters like (_)(..etc> (EX: _IC_2582)then result is becoming null due to keeping of 'b' in the below prxparse function . When I have removed 'b' then results are coming up as expected. Can you help me to understand the significance of 'b' in below piece of code that you had shared earlier?

_prxid=prxparse('/\bIC[^_]*_\s*\d+/oi');

Patrick · Posted 11-05-2023 05:41 PM

The \b metacharacter matches a word boundary. Based on your sample data it looked to me like the desired substring always starts at the beginning of the string or then is preceded by such a word boundary.

Here the docu for metacharacters.

If you remove the \b then IC can be anywhere in the string and you would also extract substring "ICTORY ROAD _99" from string "VICTORY ROAD _99". If your data doesn't contain strings where the RegEx without the \b would wrongly match then you're good to go.

If that's not the case then please provide sample data with desired result as exhaustive as possible for all the different text patterns.

Let's say the character right before IC is either a word boundary or a dash or an underscore then the RegEx could look as follows.

_prxid=prxparse('/(\b|(?<=[_-]))IC[^_]*_\s*\d+/oi');

The pipe is an OR condition, the highlighted bit is a positive look behind. You find more detail under the referenced docu link.

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!