Hello,
I have a large dataset with client information. Zip codes have the following format: 1234 ZT. That is 4 digits, followed by a space and 2 letters.
I would like to extract the first four digits AND delete any observations that have the wrong format.
I have this so far:
data work.clients;
set work.clients;
PC4_=substr(POSTCODE,1,4);
PC4=.;
PC4=input(PC4_,8.);
run;
This works well, but how do I delete the observations with the wrong format?
Thanks,
data want;
set have;
if
length(postcode) = 7 and
notdigit(substr(postcode,1,4)) = 0 and
substr(postcode,5,1) = ' '
;
PC4 = input(substr(postcode,1,4),4.);
run;
You might add code in the subsetting if that checks for allowed letters in substr(postcode,6,2).
data want;
set have;
if
length(postcode) = 7 and
notdigit(substr(postcode,1,4)) = 0 and
substr(postcode,5,1) = ' '
;
PC4 = input(substr(postcode,1,4),4.);
run;
You might add code in the subsetting if that checks for allowed letters in substr(postcode,6,2).
A regular expression should do the trick here
data have;
infile datalines dlm=",";
input code $;
datalines;
1234 ZT
5678 AB
Q123 AB
2345 4R
;
run;
data want(drop=pattern);
set have;
if _n_=1 then pattern=prxparse("/\d{4}\s[A-Z]{2}/");
if not prxmatch(pattern, code) then delete;
pc4=substr(code,1,4);
retain pattern;
run;
Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.
If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website.
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.