- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
How to identify special characters in a character string and assign them to a new variable in the below wanted data structure?
WANT dataset should look like:
STRING_Variable FLAG_Variable(to have 1 or 0) Identified_Specialcharacters RemovedVariable_without_special_characters
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@Patrick said:
As soon as the word "pattern" gets dropped I'm thinking automatically "Regular Expression". What you want is certainly very doable but for us to give you real advice you need to be very clear in what you have and what you want.
See more about the PRX* functions in SAS documentation.
Code solution for the data provided:
data have;
input string :$20.;
if prxmatch('/[^A-Z 0-9]/i',string) then
do;
phone_check=1;
Identified_Specialcharacters = prxchange('s/[A-Z 0-9]//i',-1,string);
fix_string= prxchange('s/[^A-Z 0-9]//i',-1,string);
end;
cards;
AA5D16.6!I
BETA.$$$1O
CAP%^T&12.5T
CONJ.*()625V1
DIGO.06_+=25T
IV.D10W250
;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You should provide a more explicit example of the results as different methods of "removal" could create different results. For example, are the order of the removals important?
Also you should define exactly which characters you consider special as one persons "special character" could be someone else's punctuation.
Is it acceptable to replace the removed characters with spaces, under scores or other character or do the results have to collapse across the removed characters?
As one example:
newstring = translate(stringvariable,' ','*/><?!@#$%^&()_+');
will replace all of the characters inside the last set of ' ' with blanks.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thank you I found this code below on a random search. The pattern below simply says "get rid of anything that isn't a letter or a number". Well this kinda removes fine, however I want to flag=1 if special characters exist, extract the special characters to assign them to Identified_Specialcharacters variable and then remove.
But for phone columns, the code removes the valid '-'. and prefix '+' for values like 1-999-999-9999
data have;
input string :$20.;
phone='+1-999-999-9999';
phone_check= prxchange('s/[^A-Z 0-9]//i',-1,phone);
address = prxchange('s/[^A-Z 0-9]//i',-1,string);
cards;
AA5D16.6!I
BETA.$$$1O
CAP%^T&12.5T
CONJ.*()625V1
DIGO.06_+=25T
IV.D10W250
;
proc print;run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You always could use the compres function with options
*This line keep only letters and numbers;
without_special=compress(var,' abcdefghijklmnopqrstuvwxyz0123456789','KU');
*This line keep only characters that are not letters and numbers;
with_special=compress(var,' abcdefghijklmnopqrstuvwxyz0123456789','U');
*If there is a special character then a flag to one;
if not missing(with_special) then flag=1;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thank you, your solution works well for address values, however when it comes to phone values like +1-888-888-8888, here the + and - are not supposed to be removed or identified as special characters do to speak in this pattern which was happening when i tested it. Can you please help?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
As soon as the word "pattern" gets dropped I'm thinking automatically "Regular Expression". What you want is certainly very doable but for us to give you real advice you need to be very clear in what you have and what you want.
Best would be: Give us a data step creating some sample data and then tell/show us how the desired result should look like. This will allow us to provide you some real code as answer.
Please invest some time in creating representative sample data which covers as many of your real cases as you can think of. ....there are a few examples here in this community where someone asked a question similar to yours, then got answers/code but then came back uncounted times saying "yes, that worked but now I need also to have something to cover this additional case".
...and your phone number example just made me think: Is this about data standardization? If yes then for things like phone numbers there is OOTB DataFlux functionality available. Do you have by any chance the Data Quality Server (standard or advanced) licensed and available at your site? SAS(R) 9.4 Data Quality Server: Reference
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Note above that I mentioned DEFINE what is a "special character" to be removed.
And if there are different "special characters" for different variables mention that.
Time type data might not consider : to be special
Longitude and latitude might not consider ' and " and the degree symbol to be special
Social security numbers might not consider - as special
Some folks use periods in phone numbers instead of - (and parentheses for extensions)
And that's just a start of "special character" that are data dependent.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I am so sorry. The design technical specification says the following, if this explains better:
Email variable:
Description: | Requires at least one character, @, at least one character, dot, at least one character. Space, comma, and semicolon are not permitted. |
Formula: | Is Null OR ((Like "*? @?*.?*") AND (Not Like "*[ ,;]*")) |
Phone Variable:
Description: | Validates that the Phone number is in (999) 999-9999 format. This works by using the REGEX function to check that the number has ten digits in the (999) 999-9999 format. |
Formula: | NOT (REGEX (Phone, "\\D*? (\\d\\D*?){10}")) |
Error Message: | US phone numbers should be in this format: (999) 999-9999. |
Error Location: | Phone |
Address Variable:
Description: | Validates the street address. It can have |
Formula: | AND ((REGEX (Address1,\d{1, 5}\s\w.\s(\b\w*\b\s){1,2}\w*\.”))) |
Error Message: | No special characters in the last name. |
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Well, if the technical spec is that detailed then your job is simply to implement what has been defined already. Do you need to run these rules against a SAS data set or within a data base (which one)?
In the RegEx for phone numbers I can't see anything covering the '+' case. If you believe this has been missed then you need first to raise a defect against the spec before implementing something which hasn't been defined.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I need to run these rules against a SAS data set. Basically, like a quality check before the data loaded into database but no data quality server or anything. We have just base 9.3. Also, I have received the corrective action to include '+' for phone numbers. Can you help?
Sample Data:
Address Email Values: test.username@gmail.com
4HAWTHORNEWAY |
47 SHUTTLE MEADOW AV |
621 S 27TH AVENUE |
5539 W. JACKSON BLVD |
13011 SW 82ND ST |
1770 TIGERTALE AVE |
43857 VICKSBURG CT. |
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Using more or less the RegEx as provided in your spec against the address data you've posted could look code wise as below. The problem is that the RegEx provided doesn't match a single time.
Looking a bit into this RegEx I feel it needs tweaking; but then that's what you've got in the spec to be implemented.
You will need to get up-to-speed with RegEx to be able to ask the right questions but in general: You either get a spec which describes in narrative what a valid address is and then you implement this defining your own RegEx - or you get a RegEx to implement and then that's what you're doing and it works or doesn't; and if it doesn't then you raise a defect against the spec.
The good thing about Perl Regular Expressions: The syntax is the same in SAS and other programming languages so you can pretty much use what gets provided in the spec.
data have;
infile datalines truncover;
input address $40.;
addr_valid_flg= prxmatch('/\d{1, 5}\s\w.\s(\b\w*\b\s){1,2}\w*/',address) >0;
datalines;
4HAWTHORNEWAY
47 SHUTTLE MEADOW AV
621 S 27TH AVENUE
5539 W. JACKSON BLVD
13011 SW 82ND ST
1770 TIGERTALE AVE
43857 VICKSBURG CT.
;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Sorry and Thank you I will do that. I really do appreciate your high levels of tolerance and still willing to help. I will have a word with the business "what a valid address is" and get a RegEx and come back to you if you don't mind. Besides, to check email is it similar as you demonstrated for address using prxmatch?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Yes, you can use the same approach for validation of any string.
Because RegEx is so widely used there is also a lot of stuff out there on the Internet which you can use as a starting point. Just one example: Useful Regular Expressions for Data Validation in Google Forms
What I tried to say with "you need to get up to speed with RegEx": Someone is writing specs for you with a concrete RegEx in it. You as the developer have to implement against spec which in this case doesn't leave you any freedom other than to use the RegEx "as is" in your code. This would make your job really easy if the RegEx is correct, but if not then you have to go back to the responsible for the spec; and that's where you need to understand the RegEx as you need to be able to explain why the RegEx is not working for the data you're dealing with.
Getting into RegEx takes a bit but it's a skill very worth acquiring.
SAS(R) 9.4 Functions and CALL Routines: Reference, Third Edition
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@Patrick said:
As soon as the word "pattern" gets dropped I'm thinking automatically "Regular Expression". What you want is certainly very doable but for us to give you real advice you need to be very clear in what you have and what you want.
See more about the PRX* functions in SAS documentation.
Code solution for the data provided:
data have;
input string :$20.;
if prxmatch('/[^A-Z 0-9]/i',string) then
do;
phone_check=1;
Identified_Specialcharacters = prxchange('s/[A-Z 0-9]//i',-1,string);
fix_string= prxchange('s/[^A-Z 0-9]//i',-1,string);
end;
cards;
AA5D16.6!I
BETA.$$$1O
CAP%^T&12.5T
CONJ.*()625V1
DIGO.06_+=25T
IV.D10W250
;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Love Pearl Regular expressions!!!