I have two datasets. One with all the countries and respective postal code formats as follows (note: N means a number, A means an alphabet):
dataset1:
Country Format
Albania NNNN
Argentina ANNNNAAA
Canada ANA NAN
United States NNNNN
........and so on
Other dataset with some counties and postal codes as follows:
dataset2:
Country Zip_Code
United States 55344
United States TW18 4BQ
Canada V6C 2X8
France 31360
What I need to accomplish is I need to lookup countries and zip codes in dataset 2 to their respective format in dataset 1 and flag if it is in improper format. Desired output as below:
Country Zip_Code Improper_Flag
United States 55344 N
United States TW18 4BQ Y
Canada V6C 2X8 N
France 31360 N
Any help is greatly appreciated. Thank you
data have;
infile datalines dlm=",";
length country zip format $20;
input country $ zip $;
retain Rx;
if _N_=1 then Rx=rxparse("$U to A,$D to N");
format= rxchange(rx, 1000, zip);
drop Rx;
cards;
Canada,A6C 2X8
Canada,A6C-2X8
France,31360
Switzerland,123
United States,55344
United States,TW18 4BQ
;run;
data codes;
infile cards dlm=",";
input country : $20. code : $20. ;
cards;
Canada,ANA NAN
France,NNNNN
Switzerland,NNNN
United States,NNNNN
;
run;
data want;
merge have codes;
by country;
if trim(format)= trim(code) then Improper_Flag='N';
else Improper_Flag='Y';
run;
Why is the format so important here? It would seem logical to me to ensure the post codes are within the list of Postcodes given rather than format, if they are within that list then they are clean and in the correct format, if they are not within that list they are incorrect. Checking the fomat doesn't check that the post code is valid?
If you have to then a simple routine to create a coded variable and merge that: (Do please note for the future how I have created test data in the form of a datastep):
data codes; infile datalines dlm=","; length country format $20; input country $ format $; datalines; Albania,NNNN Argentina,ANNNNAAA Canada,ANA NAN United States,NNNNN ; run; data have (drop=i chr); infile datalines dlm=","; length country zipcode format $20; input country $ zipcode $; do i=1 to 20; chr=char(zipcode,i); if chr=" " then substr(format,i,1)=chr; if chr ne " " and lengthn(compress(chr," ","a"))=0 then substr(format,i,1)="A"; if chr ne " " and lengthn(compress(chr," ","d"))=0 then substr(format,i,1)="N"; end; datalines; United States,55344 United States,TW18 4BQ Canada,V6C 2X8 France,31360 ; run; proc sql; create table WANT as select A.*, case when not exists(select distinct COUNTRY from CODES where COUNTRY=A.COUNTRY and FORMAT=A.FORMAT) then "N" else "Y" end as IMPROPER_FLAG from HAVE A; quit;
I assume that the OP doens't have access to compete lists of valid zip codes, hence this half way validation.
If there are many other validations in your process, I'm thinking a data equality product, like DataFlux, could be of interest.
Another way that may work (untested) is to use translate() to convert the zip codes to the pattern, and then do the lookup (join/merge) with your pattern data set, and the test for equality.
Here is an example using regular expressions, the code translates the A and N into the corresponding expression and then uses PRXMATCH to test
data ctry_format;
infile cards dlm=",";
input
country : $32.
zip_format : $32.
;
length prx_expression $ 128;
prx_expression = "/";
drop i;
do i = 1 to length(zip_format);
select (char(zip_format, i));
when ("N") do;
prx_expression = cats(prx_expression, "\d");
end;
when ("A") do;
prx_expression = cats(prx_expression, "[A-Z]");
end;
when (" ") do;
prx_expression = cats(prx_expression, "\x20");
end;
end;
end;
prx_expression = cats(prx_expression, "/");
cards;
Canada,ANA NAN
France,NNNNN
Switzerland,NNNN
United States,NNNNN
;
data ctry_have;
infile cards dlm=",";
input
country : $32.
zip : $32.
;
cards;
Canada,A6C 2X8
Canada,A6C-2X8
France,31360
Switzerland,123
United States,55344
United States,TW18 4BQ
;
data want;
merge ctry_format ctry_have;
by country;
prxmatch = ( prxmatch(prx_expression, trim(zip)) > 0 );
run;
Bruno
data have;
infile datalines dlm=",";
length country zip format $20;
input country $ zip $;
retain Rx;
if _N_=1 then Rx=rxparse("$U to A,$D to N");
format= rxchange(rx, 1000, zip);
drop Rx;
cards;
Canada,A6C 2X8
Canada,A6C-2X8
France,31360
Switzerland,123
United States,55344
United States,TW18 4BQ
;run;
data codes;
infile cards dlm=",";
input country : $20. code : $20. ;
cards;
Canada,ANA NAN
France,NNNNN
Switzerland,NNNN
United States,NNNNN
;
run;
data want;
merge have codes;
by country;
if trim(format)= trim(code) then Improper_Flag='N';
else Improper_Flag='Y';
run;
It looks like Perl Regular Expression thing.
data ctry_format;
infile cards dlm=",";
input
country : $32.
zip_format : $32.
;
length prx $ 200;
prx =zip_format ;
prx =tranwrd(strip(prx ),'A','[a-z]');
prx =tranwrd(strip(prx ),'N','\d');
prx =tranwrd(strip(prx ),' ','\s');
cards;
Canada,ANA NAN
France,NNNNN
Switzerland,NNNN
United States,NNNNN
;
run;
data ctry_have;
infile cards dlm=",";
input
country : $32.
zip : $32.
;
cards;
Canada,A6C 2X8
Canada,A6C-2X8
France,31360
Switzerland,123
United States,55344
United States,TW18 4BQ
;
data want;
merge ctry_format ctry_have;
by country;
if not missing(prx) then do;
if prxmatch(cats('/^',prx,'$/i'),strip(zip)) then Improper_Flag='N';
else Improper_Flag='Y';
end;
run;
Thanks a lot Xia! Your approach works too!
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.