BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Andygray
Quartz | Level 8

How to identify special characters in a character string and assign them to a new variable in the below wanted data structure?

WANT dataset should look like:

STRING_Variable     FLAG_Variable(to have 1 or 0)          Identified_Specialcharacters    RemovedVariable_without_special_characters

1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User

@Patrick said:

As soon as the word "pattern" gets dropped I'm thinking automatically "Regular Expression". What you want is certainly very doable but for us to give you real advice you need to be very clear in what you have and what you want.

 

See more about the PRX* functions in SAS documentation.

 

Code solution for the data provided: 

 
data have;
  input string :$20.;
  if prxmatch('/[^A-Z 0-9]/i',string) then
    do;
      phone_check=1;
      Identified_Specialcharacters = prxchange('s/[A-Z 0-9]//i',-1,string);
      fix_string= prxchange('s/[^A-Z 0-9]//i',-1,string);
    end;
  cards;
AA5D16.6!I
BETA.$$$1O
CAP%^T&12.5T
CONJ.*()625V1
DIGO.06_+=25T
IV.D10W250
;
run;

View solution in original post

14 REPLIES 14
ballardw
Super User

You should provide a more explicit example of the results as different methods of "removal" could create different results. For example, are the order of the removals important?

Also you should define exactly which characters you consider special as one persons "special character" could be someone else's punctuation.

Is it acceptable to replace the removed characters with spaces, under scores or other character or do the results have to collapse across the removed characters?

As one example:

newstring = translate(stringvariable,'                ','*/><?!@#$%^&()_+');

will replace all of the characters inside the last set of ' ' with blanks.

Andygray
Quartz | Level 8

Thank you I found this code below on a random search.  The pattern below simply says "get rid of anything that isn't a letter or a number". Well this kinda removes fine, however I want to flag=1 if special characters exist, extract the special characters to assign them to Identified_Specialcharacters variable and then remove.

But for phone columns, the code removes the valid '-'. and prefix '+' for values like 1-999-999-9999

data have;

   input string :$20.;

phone='+1-999-999-9999';

phone_check= prxchange('s/[^A-Z 0-9]//i',-1,phone);

     address  = prxchange('s/[^A-Z 0-9]//i',-1,string);

cards;

AA5D16.6!I

BETA.$$$1O

CAP%^T&12.5T

CONJ.*()625V1

DIGO.06_+=25T

IV.D10W250

;

proc print;run;


arodriguez
Lapis Lazuli | Level 10

You always could use the compres function with options

*This line keep only letters and numbers;

without_special=compress(var,' abcdefghijklmnopqrstuvwxyz0123456789','KU');

*This line keep only characters that are not letters and numbers;

with_special=compress(var,' abcdefghijklmnopqrstuvwxyz0123456789','U');

*If there is a special character then a flag to one;

if not missing(with_special) then flag=1;

Andygray
Quartz | Level 8

Thank you, your solution works well for address values, however when it comes to phone values like +1-888-888-8888, here the + and - are not supposed to be removed or identified as special characters do to speak in this pattern which was happening when i tested it. Can you please help?

Patrick
Opal | Level 21

As soon as the word "pattern" gets dropped I'm thinking automatically "Regular Expression". What you want is certainly very doable but for us to give you real advice you need to be very clear in what you have and what you want.

Best would be: Give us a data step creating some sample data and then tell/show us how the desired result should look like. This will allow us to provide you some real code as answer.

Please invest some time in creating representative sample data which covers as many of your real cases as you can think of. ....there are a few examples here in this community where someone asked a question similar to yours, then got answers/code but then came back uncounted times saying "yes, that worked but now I need also to have something to cover this additional case".

...and your phone number example just made me think: Is this about data standardization? If yes then for things like phone numbers there is OOTB DataFlux functionality available. Do you have by any chance the Data Quality Server (standard or advanced) licensed and available at your site? SAS(R) 9.4 Data Quality Server: Reference

ballardw
Super User

Note above that I mentioned DEFINE what is a "special character" to be removed.

And if there are different "special characters" for different variables mention that.

Time type data might not consider : to be special

Longitude and latitude might not consider ' and " and the degree symbol to be special

Social security numbers might not consider - as special

Some folks use periods in phone numbers instead of - (and parentheses for extensions)

And that's just a start of "special character" that are data dependent.

Andygray
Quartz | Level 8

I am so sorry. The design technical specification says the following, if this explains better:

Email variable:

Description:

Requires at least one character, @, at least one character, dot, at least one character. Space, comma, and semicolon are not permitted.

Formula:

Is Null OR ((Like "*? @?*.?*") AND (Not Like "*[ ,;]*"))

Phone Variable:

Description:

Validates that the Phone number is in (999) 999-9999 format. This works by using the REGEX function to check that the number has ten digits in the (999) 999-9999 format.

Formula:

NOT (REGEX (Phone, "\\D*? (\\d\\D*?){10}"))

Error Message:

US phone numbers should be in this format: (999) 999-9999.

Error Location:

Phone

Address Variable:

Description:

Validates the street address. It can have 

Formula:

AND ((REGEX (Address1,\d{1, 5}\s\w.\s(\b\w*\b\s){1,2}\w*\.”)))

Error Message:

No special characters in the last name.

Patrick
Opal | Level 21

Well, if the technical spec is that detailed then your job is simply to implement what has been defined already. Do you need to run these rules against a SAS data set or within a data base (which one)?

In the RegEx for phone numbers I can't see anything covering the '+' case. If you believe this has been missed then you need first to raise a defect against the spec before implementing something which hasn't been defined.

Andygray
Quartz | Level 8

I need to run these rules against a SAS data set. Basically, like a quality check before the data loaded into database but no data quality server or anything. We have just base 9.3. Also, I have received the corrective action to include '+' for phone numbers. Can you help?

Sample Data:

Address                                                                       Email Values: test.username@gmail.com

4HAWTHORNEWAY                                       
47 SHUTTLE MEADOW AV
621 S 27TH AVENUE
5539 W. JACKSON BLVD
13011 SW 82ND ST
1770 TIGERTALE AVE
43857 VICKSBURG CT.
Patrick
Opal | Level 21

Using more or less the RegEx as provided in your spec against the address data you've posted could look code wise as below. The problem is that the RegEx provided doesn't match a single time.

Looking a bit into this RegEx I feel it needs tweaking; but then that's what you've got in the spec to be implemented.

You will need to get up-to-speed with RegEx to be able to ask the right questions but in general: You either get a spec which describes in narrative what a valid address is and then you implement this defining your own RegEx - or you get a RegEx to implement and then that's what you're doing and it works or doesn't; and if it doesn't then you raise a defect against the spec.

The good thing about Perl Regular Expressions: The syntax is the same in SAS and other programming languages so you can pretty much use what gets provided in the spec.

data have;

  infile datalines truncover;

  input address $40.;

  addr_valid_flg= prxmatch('/\d{1, 5}\s\w.\s(\b\w*\b\s){1,2}\w*/',address) >0;

  datalines;

4HAWTHORNEWAY                                      

47 SHUTTLE MEADOW AV

621 S 27TH AVENUE

5539 W. JACKSON BLVD

13011 SW 82ND ST

1770 TIGERTALE AVE

43857 VICKSBURG CT.

;

run;

Andygray
Quartz | Level 8

Sorry and Thank you   I will do that. I really do appreciate your high levels of tolerance and still willing to help. I will have a word with the business "what a valid address is" and get a RegEx and come back to you if you don't mind. Besides, to check email is it similar as you demonstrated for address using prxmatch?

Patrick
Opal | Level 21

Yes, you can use the same approach for validation of any string.

Because RegEx is so widely used there is also a lot of stuff out there on the Internet which you can use as a starting point. Just one example: Useful Regular Expressions for Data Validation in Google Forms

What I tried to say with "you need to get up to speed with RegEx": Someone is writing specs for you with a concrete RegEx in it. You as the developer have to implement against spec which in this case doesn't leave you any freedom other than to use the RegEx "as is" in your code. This would make your job really easy if the RegEx is correct, but if not then you have to go back to the responsible for the spec; and that's where you need to understand the RegEx as you need to be able to explain why the RegEx is not working for the data you're dealing with.

Getting into RegEx takes a bit but it's a skill very worth acquiring.

http://support.sas.com/documentation/cdl/en/lefunctionsref/67398/HTML/default/viewer.htm#p1vz3ljudbd...

SAS(R) 9.4 Functions and CALL Routines: Reference, Third Edition

Ksharp
Super User

@Patrick said:

As soon as the word "pattern" gets dropped I'm thinking automatically "Regular Expression". What you want is certainly very doable but for us to give you real advice you need to be very clear in what you have and what you want.

 

See more about the PRX* functions in SAS documentation.

 

Code solution for the data provided: 

 
data have;
  input string :$20.;
  if prxmatch('/[^A-Z 0-9]/i',string) then
    do;
      phone_check=1;
      Identified_Specialcharacters = prxchange('s/[A-Z 0-9]//i',-1,string);
      fix_string= prxchange('s/[^A-Z 0-9]//i',-1,string);
    end;
  cards;
AA5D16.6!I
BETA.$$$1O
CAP%^T&12.5T
CONJ.*()625V1
DIGO.06_+=25T
IV.D10W250
;
run;
RICARDOIS
Calcite | Level 5

Love Pearl Regular expressions!!!

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 14 replies
  • 48120 views
  • 3 likes
  • 6 in conversation