Hi,
I have the following example, which is only a partial extract of all of the observations for the variable "address":
Dataset:
Two variables are in the project.address dataset - account_number and Address. Note: the example below just contains a partial extract of observations in the Address variable.
Address
FULLNAMEMackenzie M. ThompsonADDRESS60 Boroughbridge RoadTOWNBIRKHALLPOSTCODEAB35 3LW
FULLNAMEEvelyn F. HillADDRESS18 St James BoulevardTOWNHORSEBROOKPOSTCODEST19 5EQ
FULLNAMEElise R. HayADDRESS15 Sloe LaneTOWNCROSTHWAITEPOSTCODELA8 6TP
FULLNAMEAntonia M. HendersonADDRESS3 Exning RoadTOWNHARLEYHOLMPOSTCODEML12 2JZ
Can someone please help me extract the house number and post code, using substring/find/scan/index for the "address" variable, ie. house_number = substr(address, , ) where:
House number
Are the numbers from the third word, split by the space delimiter (this can be either 1 or 2 numbers).
Would the best approach be to find the number(s) coming after 'ADDRESS' in the third word based on the space delimiter?
Post code
Last three or four characters and numbers from the end of the second last word, plus (using the catx function?) the last three characters and numbers from the last word.
Would the best approach be to find what comes after 'POSTCODE' and use substr to find a length of 8 (as this would include 1 or 2 numbers for the start of the post code)?
No problem 🙂
housenum from the inside out:
scan(address, 3😞 The third word in the string
compress(scan(address, 3), '', 'kd'😞 keep only digits ('kd') from the third word of the string.
Input(..., 8.): Convert the above found string to a numeric value.
p: index(address,"POSTCODE") finds the position in the string where the substring POSTCODE first appears.
postcode: substr(address, p+9, length(address)-p) extracts a substring from address starting from p+9 (9 because POSTCODE has 8 characters and we want the next character from there). From that character we read forward length(address)-p
please try the perl regular expressions as below
data want;
input text&:$500.;
housenumber=prxchange('s/(.*address)(\d{1,2})(.*)/$2/i',-1,strip(text));
postcode=prxchange('s/(.*postcode)(.*)/$2/i',-1,strip(text));
cards;
FULLNAMEMackenzie M. ThompsonADDRESS60 Boroughbridge RoadTOWNBIRKHALLPOSTCODEAB35 3LW
FULLNAMEEvelyn F. HillADDRESS18 St James BoulevardTOWNHORSEBROOKPOSTCODEST19 5EQ
FULLNAMEElise R. HayADDRESS15 Sloe LaneTOWNCROSTHWAITEPOSTCODELA8 6TP
FULLNAMEAntonia M. HendersonADDRESS3 Exning RoadTOWNHARLEYHOLMPOSTCODEML12 2JZ
;
Assuming your data is representable and you insist on not using Regular Expressions
data have;
input string $200.;
datalines;
FULLNAMEMackenzie M. ThompsonADDRESS60 Boroughbridge RoadTOWNBIRKHALLPOSTCODEAB35 3LW
FULLNAMEEvelyn F. HillADDRESS18 St James BoulevardTOWNHORSEBROOKPOSTCODEST19 5EQ
FULLNAMEElise R. HayADDRESS15 Sloe LaneTOWNCROSTHWAITEPOSTCODELA8 6TP
FULLNAMEAntonia M. HendersonADDRESS3 Exning RoadTOWNHARLEYHOLMPOSTCODEML12 2JZ
;
data want(drop=p);
set have;
housenum=input(compress(scan(string, 3), '', 'kd'), 8.);
p=index(string,"POSTCODE");
postcode=substr(string, p+9, length(string)-p);
run;
Considering that the data is in a dataset then you can try set statement as below
data want;
set mydata;
text2=tranwrd(tranwrd(lowcase(text),'address','|'),'postcode','#');
housenumber=scan(scan(lowcase(strip(text2)),2,'|'),1,' ');
postcode=upcase(scan(lowcase(strip(text2)),2,'#'));
run;
Thanks for your reply, I've just run your code and it creates two new variables, address and postcode. However, there are just blank observations for the corresponding original examples - is there anything I could change in the code (my dataset just contains an "Account_number" and "address" variable (partial extract was shown in my original post))?
Could you please post the code you attempted, you need to replace the text with the variable which has address info per your data.
The variable containing all of the information is called variable.
data project;
set project.address;
text2=tranwrd(tranwrd(lowcase(text),'ADDRESS','|'),'POSTCODE','#');
housenumber=scan(scan(lowcase(strip(text2)),2,'|'),1,' ');
postcode=upcase(scan(lowcase(strip(text2)),2,'#'));
run;
So what variable contains the string you post above?
Simply replace 'string' with 'address'
data have;
input address $200.;
datalines;
FULLNAMEMackenzie M. ThompsonADDRESS60 Boroughbridge RoadTOWNBIRKHALLPOSTCODEAB35 3LW
FULLNAMEEvelyn F. HillADDRESS18 St James BoulevardTOWNHORSEBROOKPOSTCODEST19 5EQ
FULLNAMEElise R. HayADDRESS15 Sloe LaneTOWNCROSTHWAITEPOSTCODELA8 6TP
FULLNAMEAntonia M. HendersonADDRESS3 Exning RoadTOWNHARLEYHOLMPOSTCODEML12 2JZ
;
data want(drop=p);
set have;
housenum=input(compress(scan(address, 3), '', 'kd'), 8.);
p=index(address,"POSTCODE");
postcode=substr(address, p+9, length(address)-p);
run;
I do not suggest that you use datalines in your actual case. Simply use my code as a template and do:
- insert your actual data set name instead of have in the set statement
- Insert the appropriate variable name (if your actual variable is named Address, then you're good.
Thanks for your code, I've excluded your first part (with the datalines) and it works. Do you mean explaining what is happening for each line of code i.e. housenum (in particular what '' is doing from input(compress(scan(address, 3), '', 'kd'), 8.);), p (what is the purpose of the variable p, as it gives an output of 70 and 65 for the first two observations, for example) and postcode are doing in plain English, so I can understand what processes are happening? Thanks!
No problem 🙂
housenum from the inside out:
scan(address, 3😞 The third word in the string
compress(scan(address, 3), '', 'kd'😞 keep only digits ('kd') from the third word of the string.
Input(..., 8.): Convert the above found string to a numeric value.
p: index(address,"POSTCODE") finds the position in the string where the substring POSTCODE first appears.
postcode: substr(address, p+9, length(address)-p) extracts a substring from address starting from p+9 (9 because POSTCODE has 8 characters and we want the next character from there). From that character we read forward length(address)-p
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.