BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Siva_Harish
Obsidian | Level 7

Hello Folks,

 

I have three questions :

 

1) How to pull digits along with IC_(underscore)  by ignoring all other characters/words below:

I have used prxchange function to extract below but I was able to pick only up to underscore.

extract=prxchange('s/.*(IC.+_).*/\1/', -1, Str)

 

STR:
IC_1599- CANCELED IC_115_Unkown+ sd IC_151-Postponed from Aug 8 IC_3549_NC3438921-NWA IC_83-IC-Spo., IC_1901 CEO/Non IC_1567_ET and Cha IC_159832-CANCELED IC_1217 C&S IC -17 IC_153852176_C p+ A IC_164// SPH IC_1552_S&c IC_1196-S And C MSA: IC_3581 SOW: IC_1044 ID 9401_SOW_OREN SMALETZ  IC_108

 

2) How to exactly match keyword when we have mutliple words as below 

Mapping column has values below: 
1) closed_noc, reclined
2)  reclined,closed_noc
Source has only 'Closed' value
str=index(upcase(cats(mapping)), upcase(cats(source))). While using this, closed value is not present in Mapping column but it is saying it has value as index is not looking for exact word. Is there any function which will pick exact keyword when we have comma delimted values?

3) While importing text files using proc import I am seeing datastep details in the log. How can we avoid printing this?

 

Please help us on this 

1 ACCEPTED SOLUTION

Accepted Solutions
Patrick
Opal | Level 21

The \b metacharacter matches a word boundary. Based on your sample data it looked to me like the desired substring always starts at the beginning of the string or then is preceded by such a word boundary.

Here the docu for metacharacters.

Patrick_0-1699223206476.png

 

If you remove the \b then IC can be anywhere in the string and you would also extract substring "ICTORY ROAD _99" from string "VICTORY ROAD _99". If your data doesn't contain strings where the RegEx without the \b would wrongly match then you're good to go. 

 

If that's not the case then please provide sample data with desired result as exhaustive as possible for all the different text patterns.

 

Let's say the character right before IC is either a word boundary or a dash or an underscore then the RegEx could look as follows.

_prxid=prxparse('/(\b|(?<=[_-]))IC[^_]*_\s*\d+/oi');

The pipe is an OR condition, the highlighted bit is a positive look behind. You find more detail under the referenced docu link.

Patrick_0-1699224996818.png

 

View solution in original post

8 REPLIES 8
ballardw
Super User

Last one first:


@Siva_Harish wrote:

3) While importing text files using proc import I am seeing datastep details in the log. How can we avoid printing this?


Write a data step of your own. Especially if you have multiple files that should have the same structure. Proc Import will make different guesses for each file read. So sometimes the result is such that when you need to combine resulting data sets you get errors because of different variable types or loss of data when variables have been defined to have different lengths. Best is to use documentation of the file contents to make sure that each column is read properly. Or save the code generated by proc import, clean it up as needed and then use that to read the file.

 

For answers to question 1 it would be a good idea to provide what you expect for the result. Showing code that does not get you the result you need isn't of great help if we don't know what you want.

 

Question 2 you may be looking for INDEXW or FINDW function. They find complete "words" matching the search term. However you need to look at the function documentation for the complete definition of "word" as it can be a bit flexible. Findw has options to do case insensitive searches as well.

Patrick
Opal | Level 21

I suggest you only ask one question at a time. 

Below one way for extraction of the source string pattern you're after.

data have;
  infile datalines truncover;
  input str $100.;
datalines;
IC_1599- CANCELED 
IC_115_Unkown+ sd
IC_151-Postponed from Aug 8
IC_3549_NC3438921-NWA
IC_83-IC-Spo.,
IC_1901 CEO/Non
IC_1567_ET and Cha
IC_159832-CANCELED
IC_1217 C&S IC -17
IC_153852176_C p+ A
IC_164// SPH
IC_1552_S&c
IC_1196-S And C
MSA: IC_3581
SOW: IC_1044
ID 9401_SOW_OREN SMALETZ   IC_108
ID 9401_SOW_OREN SMALETZ   xx_108
;

data want;
  set have;
  length extract $40;
  _prxid=prxparse('/IC_\d+/oi');
  call prxsubstr(_prxid,str,_pos,_len);
  if _pos>0 then extract=substr(str,_pos,_len);
  drop _:;
run;

 

Look into findw() for your 2nd question.

 

I consider it a good thing that Proc Import prints the data step to the log. Not sure if there is an option to only suppress this part of logging.

Siva_Harish
Obsidian | Level 7

Thanks for your response Patrick!

While implementing the below code on original data, it is not identifying position/length of the 'IC_' word even though it is present is data and it is case sensitive. Can please what could be the reasons?

data want;
  set have;
  length extract $40;
  _prxid=prxparse('/IC_\d+/oi');
  call prxsubstr(_prxid,str,_pos,_len);
  if _pos>0 then extract=substr(str,_pos,_len);
  drop _:;
run;

 

re:Look into findw() for your 2nd question, Findw() is not working when we have two comma seperated values hence I took another approach by making values into single quptes and using for where cluase.Thanks

Patrick
Opal | Level 21

Please provide representative sample data via a "data have;" step (same as I've done).

Make sure you provide samples where the current RegEx doesn't work.

Siva_Harish
Obsidian | Level 7

Our data starts with IC<contains n number of different characters><_different digits> <some special/alphabets characters may or may not exist>. I have given 3,4,5th row as an example. I thought this will work for all the characters which starts with IC even though there are some alphabets after IC but is not. I want to extract starting from IC and picking upto last digit after underscore. 

 

 


data have;
  infile datalines truncover;
  input str $100.;
datalines;
IC_1599- CANCELED 
IC_115_Unkown+ sd
ICNEWdocumentloaded_150-Postponed from Aug 8
ICOLDagreementstated_3549_CW3438921-MSA
ICPRPaprovalreq_24848 - IC_3549_NC3438921-NWA
IC_83-IC-Spo.,
IC_1901 CEO/Non
IC_1567_ET and Cha
IC_159832-CANCELED
IC_1217 C&S IC -17
IC_153852176_C p+ A
IC_164// SPH
IC_1552_S&c
IC_1196-S And C
MSA: ICREQpaper_3581
SOW: ICDigitalsign_ 1044  signature not in sync 1044
ID 9401_SOW_OREN SMALETZ   IC_108
ID 9401_SOW_OREN SMALETZ   xx_108
;

data want;
  set have;
  length extract $40;
  _prxid=prxparse('/IC_\d+/oi');
  call prxsubstr(_prxid,str,_pos,_len);
  if _pos>0 then extract=substr(str,_pos,_len);
  drop _:;
run;
Patrick
Opal | Level 21

I'm not 100% sure that I've fully understood which pattern fits the requirement. If below code doesn't return what you need then please share for your have data also the desired result.

data want;
  set have;
  length extract $40;
  _prxid=prxparse('/\bIC[^_]*_\s*\d+/oi');
  call prxsubstr(_prxid,str,_pos,_len);
  if _pos>0 then extract=substr(str,_pos,_len);
  drop _:;
run;

Patrick_0-1694046122584.png

 

Siva_Harish
Obsidian | Level 7

Hi @Patrick ,

Thanks for helping with regex!

While I was looking the code, I found one issue which is not being handled by regex. If the value is preceded with <special characters like (_)(..etc> (EX:  _IC_2582)then result is becoming null due to keeping of 'b' in the below prxparse function . When I have removed 'b'  then results are coming up as expected. Can you help me to understand the significance of 'b' in below piece of code that you had shared earlier?

_prxid=prxparse('/\bIC[^_]*_\s*\d+/oi');

 

 

Patrick
Opal | Level 21

The \b metacharacter matches a word boundary. Based on your sample data it looked to me like the desired substring always starts at the beginning of the string or then is preceded by such a word boundary.

Here the docu for metacharacters.

Patrick_0-1699223206476.png

 

If you remove the \b then IC can be anywhere in the string and you would also extract substring "ICTORY ROAD _99" from string "VICTORY ROAD _99". If your data doesn't contain strings where the RegEx without the \b would wrongly match then you're good to go. 

 

If that's not the case then please provide sample data with desired result as exhaustive as possible for all the different text patterns.

 

Let's say the character right before IC is either a word boundary or a dash or an underscore then the RegEx could look as follows.

_prxid=prxparse('/(\b|(?<=[_-]))IC[^_]*_\s*\d+/oi');

The pipe is an OR condition, the highlighted bit is a positive look behind. You find more detail under the referenced docu link.

Patrick_0-1699224996818.png

 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 1110 views
  • 0 likes
  • 3 in conversation