Hi SAS Forum,
/*Hi SAS Forum,
I have this data set. It shows the postal codes of each account.
But the postal_code field has a mix of values. Some are correct, some are strange, some are wrong.
*/
data have;
input account_number postal_code $ 6-12;
cards;
1111 T2H 1B5
2222 T9E 6C8
3333 01000
4444 1097C7
5555 123
7777 A0A 1K0
9999 J0N 1G3
8888 11654-2011
1234 T9W 3H2
;
run;
/*
I wanted to create a new variable called "Validation" with following rules.
fifth is a numeric, sixth is a letter, 7th is a numeric
then validation = 'correct'
*/
/*For clarity, I have indicated below the values of variable "Validation" in front of each record*/
data have;
input account_number postal_code $ 6-12 validation;
cards;
1111 T2H 1B5 /*correct*/
2222 T9E 6C8 /*correct*/
3333 01000 /*strange*/
4444 1097C7 /*strange*/
5555 123 /*strange*/
7777 A0A 1K0 /*wrong*/
9999 J0N 1G3 /*wrong*/
8888 T1654-2011 /*strange*/
1234 T9W 3H2 /*correct*/
;
run;
Q: Could someone help me how to do this.
Thanks for your time.
Miris
Some conditional logic should help here, with the use of the substring function.
I don't have SAS o my home computer so I havnt been able to check for errors.
/* define lists of numbers and letters, although I suspect SAS may have a shortcut for this. */
%let nums = ("1", "2", "3", "4", "5", "6", "7", "8", "9", "0");
%let letters = ("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z");
data want;
set have;
if
upcase(substr(postal_code,1,1)). = "T"
AND
substr(postal_code,2,1) in &nums.
AND
upcase(substr(postal_code,3,1)) in &letters.
AND
substr(postal_code,4,1) = " "
AND
substr(postal_code,5,1) in &nums.
AND
upcase(substr(postal_code,6,1)) in &letters.
AND
substr(postal_code,7,1) in &nums.
then validation = "correct";
/*if the mength of your field is more than 7 characters you may want to check if the remaining ones are blank,
in your example the field post_code has 7 characters so no check needed */
else if
upcase(substr(postal_code,1,1)) in &letters
AND
substr(postal_code,2,1) in &nums.
AND
upcase(substr(postal_code,3,1)) in &letters.
AND
substr(postal_code,4,1) = " "
AND
substr(postal_code,5,1) in &nums.
AND
upcase(substr(postal_code,6,1)) in &letters.
AND
substr(postal_code,7,1) in &nums.
then validation = "wrong "; /*extra spaces needed to avoid truncation */
else validation="strange";
Some conditional logic should help here, with the use of the substring function.
I don't have SAS o my home computer so I havnt been able to check for errors.
/* define lists of numbers and letters, although I suspect SAS may have a shortcut for this. */
%let nums = ("1", "2", "3", "4", "5", "6", "7", "8", "9", "0");
%let letters = ("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z");
data want;
set have;
if
upcase(substr(postal_code,1,1)). = "T"
AND
substr(postal_code,2,1) in &nums.
AND
upcase(substr(postal_code,3,1)) in &letters.
AND
substr(postal_code,4,1) = " "
AND
substr(postal_code,5,1) in &nums.
AND
upcase(substr(postal_code,6,1)) in &letters.
AND
substr(postal_code,7,1) in &nums.
then validation = "correct";
/*if the mength of your field is more than 7 characters you may want to check if the remaining ones are blank,
in your example the field post_code has 7 characters so no check needed */
else if
upcase(substr(postal_code,1,1)) in &letters
AND
substr(postal_code,2,1) in &nums.
AND
upcase(substr(postal_code,3,1)) in &letters.
AND
substr(postal_code,4,1) = " "
AND
substr(postal_code,5,1) in &nums.
AND
upcase(substr(postal_code,6,1)) in &letters.
AND
substr(postal_code,7,1) in &nums.
then validation = "wrong "; /*extra spaces needed to avoid truncation */
else validation="strange";
This code could be simplified by:
Concept:
use a regular expression catching correct and wrong postal codes - strange codes do not match the expression. Use first-function to verify that the code starts with "T".
RegEx: /^[A-Z]\d[A-Z] \d[A-Z]\d$/
Postal codes a generally public information. Couldn't you try to builds a lookup table, and testing using that (i.e. SAS formats). That will give you correct at least.
Since I assume this is some kind of data quality applications, how big is this work? Many rows, how often etc? You might want to look a DQ tool that can automate a great part of this process.
In Canada, postal codes are owned by Canada Post and not generally public :smileyconfused:
However, even if a postal code is in the correct format it may not be a valid postal code, for example, X are rarely used in postal codes.
Depending on who you work for, you may have access to a list of valid postal codes.
A list of Canadian postal codes, crowd sourced (quality not assured) is here:
Hi Murray_Court,
Thank you very much for this code, it worked very well. And I learned a lot how to handle
this kind of coding tasks.
Hi andreas_lds
This is the first time I heard about "Regular expressions". I serached in the google
and now I have some understanding. Thanks a lot for introducing this.
Hi LinusH
I should try your short cut and compare with Murray_Court's resutls. Thanks a lot.
Hi Reeza
I learned the term "crowd sourced", and many thanks for the postal code site as well.
Regards
Mirisage
data have;
input account_number postal_code $ 6-12;
length flag $ 7;
if prxmatch('/^T\d[A-Z] \d[A-Z]\d$/',postal_code) then flag='correct';
else if prxmatch('/^[^T]\d[A-Z] \d[A-Z]\d$/',postal_code) then flag='wrong';
else flag='strange';
cards;
1111 T2H 1B5
2222 T9E 6C8
3333 01000
4444 1097C7
5555 123
7777 A0A 1K0
9999 J0N 1G3
8888 11654-2011
1234 T9W 3H2
;
run;
Hi Hai.kuo,
I ran your code and compared the output with the output produced by Murray_Court's code.
proc compare base=have compare=want; run;
Both code produced identical resutls. Thank you so much.
First, I will read some literature to understand what is happening in your magical code and then come back to you if still I cannot understand this code logic.
Regards
Mirisage
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.