BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
sasmaverick
Obsidian | Level 7

I have two datasets. One with all the countries and respective postal code formats as follows (note: N means a number, A means an alphabet):

 

dataset1:

 

Country        Format

Albania         NNNN

Argentina      ANNNNAAA

Canada         ANA NAN

United States NNNNN

........and so on

 

Other dataset with some counties and postal codes as follows:

dataset2:

 

Country                Zip_Code

United States       55344

United States      TW18 4BQ

Canada               V6C 2X8

France                 31360

 

What I need to accomplish is I need to lookup countries and zip codes in dataset 2 to their respective format in dataset 1 and flag if it is in improper format. Desired output as below:

Country                Zip_Code   Improper_Flag

United States       55344           N

United States      TW18 4BQ    Y

Canada               V6C 2X8       N

France                 31360          N

 

Any help is greatly appreciated. Thank you

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
mohamed_zaki
Barite | Level 11
data have;
infile datalines dlm=",";
length country zip format $20;
input country $ zip $;
retain Rx;
if _N_=1 then Rx=rxparse("$U to A,$D to N"); 
format= rxchange(rx, 1000, zip); 
drop Rx;
cards;
Canada,A6C 2X8
Canada,A6C-2X8
France,31360
Switzerland,123
United States,55344
United States,TW18 4BQ
;run;

data codes;
  infile cards dlm=",";
  input country : $20. code : $20. ;
  cards;
Canada,ANA NAN
France,NNNNN
Switzerland,NNNN
United States,NNNNN
;
run;
data want;
  merge have codes;
  by country;
 if trim(format)= trim(code) then Improper_Flag='N';
 else Improper_Flag='Y';
run;

View solution in original post

7 REPLIES 7
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Why is the format so important here?  It would seem logical to me to ensure the post codes are within the list of Postcodes given rather than format, if they are within that list then they are clean and in the correct format, if they are not within that list they are incorrect.  Checking the fomat doesn't check that the post code is valid?

If you have to then a simple routine to create a coded variable and merge that:  (Do please note for the future how I have created test data in the form of a datastep):

data codes;
  infile datalines dlm=",";
  length country format $20;
  input country $ format $;
datalines;
Albania,NNNN
Argentina,ANNNNAAA
Canada,ANA NAN
United States,NNNNN
;
run;

data have (drop=i chr);
  infile datalines dlm=",";
  length country zipcode format $20;
  input country $ zipcode $;
  do i=1 to 20;
    chr=char(zipcode,i);
    if chr=" " then substr(format,i,1)=chr;
    if chr ne " " and lengthn(compress(chr," ","a"))=0 then substr(format,i,1)="A";
    if chr ne " " and lengthn(compress(chr," ","d"))=0 then substr(format,i,1)="N"; 
  end;    
datalines;
United States,55344
United States,TW18 4BQ
Canada,V6C 2X8
France,31360
;
run;

proc sql;
  create table WANT as
  select  A.*,
          case when not exists(select distinct COUNTRY from CODES where COUNTRY=A.COUNTRY and FORMAT=A.FORMAT) then "N"
               else "Y" end as IMPROPER_FLAG
  from    HAVE A;
quit;
LinusH
Tourmaline | Level 20

I assume that the OP doens't have access to compete lists of valid zip codes, hence this half way validation.

If there are many other validations in your process, I'm thinking a data equality product, like DataFlux, could be of interest.

Another way that may work (untested) is to use translate() to convert the zip codes to the pattern, and then do the lookup (join/merge) with your pattern data set, and the test for equality.

Data never sleeps
BrunoMueller
SAS Super FREQ

Here is an example using regular expressions, the code translates the A and N into the corresponding expression and then uses PRXMATCH to test

 

data ctry_format;
  infile cards dlm=",";
  input
    country : $32.
    zip_format : $32.
  ;
  length prx_expression $ 128;
  prx_expression = "/";

  drop i;
  do i = 1 to length(zip_format);
    select (char(zip_format, i));
      when ("N") do;
        prx_expression = cats(prx_expression, "\d");
      end;

      when ("A") do;
        prx_expression = cats(prx_expression, "[A-Z]");
      end;

      when (" ") do;
        prx_expression = cats(prx_expression, "\x20");
      end;
    end;
  end;

  prx_expression = cats(prx_expression, "/");
  cards;
Canada,ANA NAN
France,NNNNN
Switzerland,NNNN
United States,NNNNN
;

data ctry_have;
  infile cards dlm=",";
  input
    country : $32.
    zip : $32.
  ;
  cards;
Canada,A6C 2X8
Canada,A6C-2X8
France,31360
Switzerland,123
United States,55344
United States,TW18 4BQ
;

data want;
  merge ctry_format ctry_have;
  by country;
  prxmatch = ( prxmatch(prx_expression, trim(zip)) > 0 );
run;

Bruno

mohamed_zaki
Barite | Level 11
data have;
infile datalines dlm=",";
length country zip format $20;
input country $ zip $;
retain Rx;
if _N_=1 then Rx=rxparse("$U to A,$D to N"); 
format= rxchange(rx, 1000, zip); 
drop Rx;
cards;
Canada,A6C 2X8
Canada,A6C-2X8
France,31360
Switzerland,123
United States,55344
United States,TW18 4BQ
;run;

data codes;
  infile cards dlm=",";
  input country : $20. code : $20. ;
  cards;
Canada,ANA NAN
France,NNNNN
Switzerland,NNNN
United States,NNNNN
;
run;
data want;
  merge have codes;
  by country;
 if trim(format)= trim(code) then Improper_Flag='N';
 else Improper_Flag='Y';
run;
sasmaverick
Obsidian | Level 7
Thanks a lot! Works for me
Ksharp
Super User

It looks like Perl Regular Expression thing.

 

 

data ctry_format;
  infile cards dlm=",";
  input
    country : $32.
    zip_format : $32.
  ;
  length prx $ 200;
  prx =zip_format ;
  prx =tranwrd(strip(prx ),'A','[a-z]');
  prx =tranwrd(strip(prx ),'N','\d');
  prx =tranwrd(strip(prx ),' ','\s');
  cards;
Canada,ANA NAN
France,NNNNN
Switzerland,NNNN
United States,NNNNN
;
run;
data ctry_have;
  infile cards dlm=",";
  input
    country : $32.
    zip : $32.
  ;
  cards;
Canada,A6C 2X8
Canada,A6C-2X8
France,31360
Switzerland,123
United States,55344
United States,TW18 4BQ
;

data want;
  merge ctry_format ctry_have;
  by country;
if not missing(prx) then do;
  if prxmatch(cats('/^',prx,'$/i'),strip(zip)) then Improper_Flag='N';
   else Improper_Flag='Y';
end;
run;
sasmaverick
Obsidian | Level 7

Thanks a lot Xia! Your approach works too!

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 3003 views
  • 3 likes
  • 6 in conversation