Hi SAS Forum,
/*Hi SAS Forum,
I have this data set. It shows the postal codes of each account.
But the postal_code field has a mix of values. Some are correct, some are strange, some are wrong.
*/
data have;
input account_number postal_code $ 6-12;
cards;
1111 T2H 1B5
2222 T9E 6C8
3333 01000
4444 1097C7
5555 123
7777 A0A 1K0
9999 J0N 1G3
8888 11654-2011
1234 T9W 3H2
;
run;
/*
I wanted to create a new variable called "Validation" with following rules.
fifth is a numeric, sixth is a letter, 7th is a numeric
then validation = 'correct'
*/
/*For clarity, I have indicated below the values of variable "Validation" in front of each record*/
data have;
input account_number postal_code $ 6-12 validation;
cards;
1111 T2H 1B5 /*correct*/
2222 T9E 6C8 /*correct*/
3333 01000 /*strange*/
4444 1097C7 /*strange*/
5555 123 /*strange*/
7777 A0A 1K0 /*wrong*/
9999 J0N 1G3 /*wrong*/
8888 T1654-2011 /*strange*/
1234 T9W 3H2 /*correct*/
;
run;
Q: Could someone help me how to do this.
Thanks for your time.
Miris
Some conditional logic should help here, with the use of the substring function.
I don't have SAS o my home computer so I havnt been able to check for errors.
/* define lists of numbers and letters, although I suspect SAS may have a shortcut for this. */
%let nums = ("1", "2", "3", "4", "5", "6", "7", "8", "9", "0");
%let letters = ("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z");
data want;
set have;
if
upcase(substr(postal_code,1,1)). = "T"
AND
substr(postal_code,2,1) in &nums.
AND
upcase(substr(postal_code,3,1)) in &letters.
AND
substr(postal_code,4,1) = " "
AND
substr(postal_code,5,1) in &nums.
AND
upcase(substr(postal_code,6,1)) in &letters.
AND
substr(postal_code,7,1) in &nums.
then validation = "correct";
/*if the mength of your field is more than 7 characters you may want to check if the remaining ones are blank,
in your example the field post_code has 7 characters so no check needed */
else if
upcase(substr(postal_code,1,1)) in &letters
AND
substr(postal_code,2,1) in &nums.
AND
upcase(substr(postal_code,3,1)) in &letters.
AND
substr(postal_code,4,1) = " "
AND
substr(postal_code,5,1) in &nums.
AND
upcase(substr(postal_code,6,1)) in &letters.
AND
substr(postal_code,7,1) in &nums.
then validation = "wrong "; /*extra spaces needed to avoid truncation */
else validation="strange";
Some conditional logic should help here, with the use of the substring function.
I don't have SAS o my home computer so I havnt been able to check for errors.
/* define lists of numbers and letters, although I suspect SAS may have a shortcut for this. */
%let nums = ("1", "2", "3", "4", "5", "6", "7", "8", "9", "0");
%let letters = ("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z");
data want;
set have;
if
upcase(substr(postal_code,1,1)). = "T"
AND
substr(postal_code,2,1) in &nums.
AND
upcase(substr(postal_code,3,1)) in &letters.
AND
substr(postal_code,4,1) = " "
AND
substr(postal_code,5,1) in &nums.
AND
upcase(substr(postal_code,6,1)) in &letters.
AND
substr(postal_code,7,1) in &nums.
then validation = "correct";
/*if the mength of your field is more than 7 characters you may want to check if the remaining ones are blank,
in your example the field post_code has 7 characters so no check needed */
else if
upcase(substr(postal_code,1,1)) in &letters
AND
substr(postal_code,2,1) in &nums.
AND
upcase(substr(postal_code,3,1)) in &letters.
AND
substr(postal_code,4,1) = " "
AND
substr(postal_code,5,1) in &nums.
AND
upcase(substr(postal_code,6,1)) in &letters.
AND
substr(postal_code,7,1) in &nums.
then validation = "wrong "; /*extra spaces needed to avoid truncation */
else validation="strange";
This code could be simplified by:
Concept:
use a regular expression catching correct and wrong postal codes - strange codes do not match the expression. Use first-function to verify that the code starts with "T".
RegEx: /^[A-Z]\d[A-Z] \d[A-Z]\d$/
Postal codes a generally public information. Couldn't you try to builds a lookup table, and testing using that (i.e. SAS formats). That will give you correct at least.
Since I assume this is some kind of data quality applications, how big is this work? Many rows, how often etc? You might want to look a DQ tool that can automate a great part of this process.
In Canada, postal codes are owned by Canada Post and not generally public :smileyconfused:
However, even if a postal code is in the correct format it may not be a valid postal code, for example, X are rarely used in postal codes.
Depending on who you work for, you may have access to a list of valid postal codes.
A list of Canadian postal codes, crowd sourced (quality not assured) is here:
Hi Murray_Court,
Thank you very much for this code, it worked very well. And I learned a lot how to handle
this kind of coding tasks.
Hi andreas_lds
This is the first time I heard about "Regular expressions". I serached in the google
and now I have some understanding. Thanks a lot for introducing this.
Hi LinusH
I should try your short cut and compare with Murray_Court's resutls. Thanks a lot.
Hi Reeza
I learned the term "crowd sourced", and many thanks for the postal code site as well.
Regards
Mirisage
data have;
input account_number postal_code $ 6-12;
length flag $ 7;
if prxmatch('/^T\d[A-Z] \d[A-Z]\d$/',postal_code) then flag='correct';
else if prxmatch('/^[^T]\d[A-Z] \d[A-Z]\d$/',postal_code) then flag='wrong';
else flag='strange';
cards;
1111 T2H 1B5
2222 T9E 6C8
3333 01000
4444 1097C7
5555 123
7777 A0A 1K0
9999 J0N 1G3
8888 11654-2011
1234 T9W 3H2
;
run;
Hi Hai.kuo,
I ran your code and compared the output with the output produced by Murray_Court's code.
proc compare base=have compare=want; run;
Both code produced identical resutls. Thank you so much.
First, I will read some literature to understand what is happening in your magical code and then come back to you if still I cannot understand this code logic.
Regards
Mirisage
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.