Hello,
I have a large dataset with client information. Zip codes have the following format: 1234 ZT. That is 4 digits, followed by a space and 2 letters.
I would like to extract the first four digits AND delete any observations that have the wrong format.
I have this so far:
data work.clients;
set work.clients;
PC4_=substr(POSTCODE,1,4);
PC4=.;
PC4=input(PC4_,8.);
run;
This works well, but how do I delete the observations with the wrong format?
Thanks,
data want;
set have;
if
length(postcode) = 7 and
notdigit(substr(postcode,1,4)) = 0 and
substr(postcode,5,1) = ' '
;
PC4 = input(substr(postcode,1,4),4.);
run;
You might add code in the subsetting if that checks for allowed letters in substr(postcode,6,2).
data want;
set have;
if
length(postcode) = 7 and
notdigit(substr(postcode,1,4)) = 0 and
substr(postcode,5,1) = ' '
;
PC4 = input(substr(postcode,1,4),4.);
run;
You might add code in the subsetting if that checks for allowed letters in substr(postcode,6,2).
A regular expression should do the trick here
data have;
infile datalines dlm=",";
input code $;
datalines;
1234 ZT
5678 AB
Q123 AB
2345 4R
;
run;
data want(drop=pattern);
set have;
if _n_=1 then pattern=prxparse("/\d{4}\s[A-Z]{2}/");
if not prxmatch(pattern, code) then delete;
pc4=substr(code,1,4);
retain pattern;
run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.