Hello,
I have a large dataset with client information. Zip codes have the following format: 1234 ZT. That is 4 digits, followed by a space and 2 letters.
I would like to extract the first four digits AND delete any observations that have the wrong format.
I have this so far:
data work.clients;
set work.clients;
PC4_=substr(POSTCODE,1,4);
PC4=.;
PC4=input(PC4_,8.);
run;
This works well, but how do I delete the observations with the wrong format?
Thanks,
data want;
set have;
if
length(postcode) = 7 and
notdigit(substr(postcode,1,4)) = 0 and
substr(postcode,5,1) = ' '
;
PC4 = input(substr(postcode,1,4),4.);
run;
You might add code in the subsetting if that checks for allowed letters in substr(postcode,6,2).
data want;
set have;
if
length(postcode) = 7 and
notdigit(substr(postcode,1,4)) = 0 and
substr(postcode,5,1) = ' '
;
PC4 = input(substr(postcode,1,4),4.);
run;
You might add code in the subsetting if that checks for allowed letters in substr(postcode,6,2).
A regular expression should do the trick here
data have;
infile datalines dlm=",";
input code $;
datalines;
1234 ZT
5678 AB
Q123 AB
2345 4R
;
run;
data want(drop=pattern);
set have;
if _n_=1 then pattern=prxparse("/\d{4}\s[A-Z]{2}/");
if not prxmatch(pattern, code) then delete;
pc4=substr(code,1,4);
retain pattern;
run;
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.