Solved: Substring to extract house number and post code

jeremy4 · Posted 09-02-2019 05:18 AM

Hi,

I have the following example, which is only a partial extract of all of the observations for the variable "address":

Dataset:

Two variables are in the project.address dataset - account_number and Address. Note: the example below just contains a partial extract of observations in the Address variable.

Address

FULLNAMEMackenzie M. ThompsonADDRESS60 Boroughbridge RoadTOWNBIRKHALLPOSTCODEAB35 3LW

FULLNAMEEvelyn F. HillADDRESS18 St James BoulevardTOWNHORSEBROOKPOSTCODEST19 5EQ

FULLNAMEElise R. HayADDRESS15 Sloe LaneTOWNCROSTHWAITEPOSTCODELA8 6TP

FULLNAMEAntonia M. HendersonADDRESS3 Exning RoadTOWNHARLEYHOLMPOSTCODEML12 2JZ

Can someone please help me extract the house number and post code, using substring/find/scan/index for the "address" variable, ie. house_number = substr(address, , ) where:

House number

Are the numbers from the third word, split by the space delimiter (this can be either 1 or 2 numbers).

Would the best approach be to find the number(s) coming after 'ADDRESS' in the third word based on the space delimiter?

Post code

Last three or four characters and numbers from the end of the second last word, plus (using the catx function?) the last three characters and numbers from the last word.

Would the best approach be to find what comes after 'POSTCODE' and use substr to find a length of 8 (as this would include 1 or 2 numbers for the start of the post code)?

PeterClemmensen · Posted 09-02-2019 07:06 AM

No problem 🙂

housenum from the inside out:

scan(address, 3😞 The third word in the string

compress(scan(address, 3), '', 'kd'😞 keep only digits ('kd') from the third word of the string.

Input(..., 8.): Convert the above found string to a numeric value.

p: index(address,"POSTCODE") finds the position in the string where the substring POSTCODE first appears.

postcode: substr(address, p+9, length(address)-p) extracts a substring from address starting from p+9 (9 because POSTCODE has 8 characters and we want the next character from there). From that character we read forward length(address)-p

The DATA to DATA Step Macro
Blog: SASnrd

View solution in original post

Jagadishkatam · Posted 09-02-2019 05:33 AM

please try the perl regular expressions as below

data want;
input text&:$500.;
housenumber=prxchange('s/(.*address)(\d{1,2})(.*)/$2/i',-1,strip(text));
postcode=prxchange('s/(.*postcode)(.*)/$2/i',-1,strip(text));
cards;
FULLNAMEMackenzie M. ThompsonADDRESS60 Boroughbridge RoadTOWNBIRKHALLPOSTCODEAB35 3LW
FULLNAMEEvelyn F. HillADDRESS18 St James BoulevardTOWNHORSEBROOKPOSTCODEST19 5EQ
FULLNAMEElise R. HayADDRESS15 Sloe LaneTOWNCROSTHWAITEPOSTCODELA8 6TP
FULLNAMEAntonia M. HendersonADDRESS3 Exning RoadTOWNHARLEYHOLMPOSTCODEML12 2JZ
;

Thanks,
Jag

PeterClemmensen · Posted 09-02-2019 05:35 AM

Assuming your data is representable and you insist on not using Regular Expressions

data have;
input string $200.;
datalines;
FULLNAMEMackenzie M. ThompsonADDRESS60 Boroughbridge RoadTOWNBIRKHALLPOSTCODEAB35 3LW
FULLNAMEEvelyn F. HillADDRESS18 St James BoulevardTOWNHORSEBROOKPOSTCODEST19 5EQ
FULLNAMEElise R. HayADDRESS15 Sloe LaneTOWNCROSTHWAITEPOSTCODELA8 6TP
FULLNAMEAntonia M. HendersonADDRESS3 Exning RoadTOWNHARLEYHOLMPOSTCODEML12 2JZ
;

data want(drop=p);
    set have;
    housenum=input(compress(scan(string, 3), '', 'kd'), 8.);
    p=index(string,"POSTCODE");
    postcode=substr(string, p+9, length(string)-p);
run;

The DATA to DATA Step Macro
Blog: SASnrd

jeremy4 · Posted 09-02-2019 05:46 AM

Thanks for your reply, how would the code change if datalines were not used (there are thousands of observations in my dataset), as my example was just a partial extract of observations from the "address" variable, so could it be changed to something along the lines of house_number = substr(address, , )? Thanks!

Jagadishkatam · Posted 09-02-2019 05:49 AM

Considering that the data is in a dataset then you can try set statement as below

data want;
set mydata;
text2=tranwrd(tranwrd(lowcase(text),'address','|'),'postcode','#');
housenumber=scan(scan(lowcase(strip(text2)),2,'|'),1,' ');
postcode=upcase(scan(lowcase(strip(text2)),2,'#'));
run;

Thanks,
Jag

jeremy4 · Posted 09-02-2019 05:54 AM

Thanks for your reply, I've just run your code and it creates two new variables, address and postcode. However, there are just blank observations for the corresponding original examples - is there anything I could change in the code (my dataset just contains an "Account_number" and "address" variable (partial extract was shown in my original post))?

Jagadishkatam · Posted 09-02-2019 06:13 AM

Could you please post the code you attempted, you need to replace the text with the variable which has address info per your data.

Thanks,
Jag

jeremy4 · Posted 09-02-2019 06:16 AM

The variable containing all of the information is called variable.

data project;
set project.address;
text2=tranwrd(tranwrd(lowcase(text),'ADDRESS','|'),'POSTCODE','#');
housenumber=scan(scan(lowcase(strip(text2)),2,'|'),1,' ');
postcode=upcase(scan(lowcase(strip(text2)),2,'#'));
run;

PeterClemmensen · Posted 09-02-2019 06:21 AM

So what variable contains the string you post above?

The DATA to DATA Step Macro
Blog: SASnrd

jeremy4 · Posted 09-02-2019 06:22 AM

It's called Address - I've updated my original post to include extra information.

PeterClemmensen · Posted 09-02-2019 06:26 AM

Simply replace 'string' with 'address'

data have;
input address $200.;
datalines;
FULLNAMEMackenzie M. ThompsonADDRESS60 Boroughbridge RoadTOWNBIRKHALLPOSTCODEAB35 3LW
FULLNAMEEvelyn F. HillADDRESS18 St James BoulevardTOWNHORSEBROOKPOSTCODEST19 5EQ
FULLNAMEElise R. HayADDRESS15 Sloe LaneTOWNCROSTHWAITEPOSTCODELA8 6TP
FULLNAMEAntonia M. HendersonADDRESS3 Exning RoadTOWNHARLEYHOLMPOSTCODEML12 2JZ
;

data want(drop=p);
    set have;
    housenum=input(compress(scan(address, 3), '', 'kd'), 8.);
    p=index(address,"POSTCODE");
    postcode=substr(address, p+9, length(address)-p);
run;

The DATA to DATA Step Macro
Blog: SASnrd

jeremy4 · Posted 09-02-2019 06:29 AM

Thanks for your reply - my example was just a partial extract of the variable, Address, which contains thousands of observations and means that using datalines may not be the best method for me. Is there a method that looks at an observation from the Address variable, instead of using datalines? Thanks!

PeterClemmensen · Posted 09-02-2019 06:32 AM

I do not suggest that you use datalines in your actual case. Simply use my code as a template and do:

- insert your actual data set name instead of have in the set statement

- Insert the appropriate variable name (if your actual variable is named Address, then you're good.

The DATA to DATA Step Macro
Blog: SASnrd

jeremy4 · Posted 09-02-2019 06:37 AM

Thanks for your code, I've excluded your first part (with the datalines) and it works. Do you mean explaining what is happening for each line of code i.e. housenum (in particular what '' is doing from input(compress(scan(address, 3), '', 'kd'), 8.);), p (what is the purpose of the variable p, as it gives an output of 70 and 65 for the first two observations, for example) and postcode are doing in plain English, so I can understand what processes are happening? Thanks!

PeterClemmensen · Posted 09-02-2019 07:06 AM

No problem 🙂

housenum from the inside out:

scan(address, 3😞 The third word in the string

compress(scan(address, 3), '', 'kd'😞 keep only digits ('kd') from the third word of the string.

Input(..., 8.): Convert the above found string to a numeric value.

p: index(address,"POSTCODE") finds the position in the string where the substring POSTCODE first appears.

postcode: substr(address, p+9, length(address)-p) extracts a substring from address starting from p+9 (9 because POSTCODE has 8 characters and we want the next character from there). From that character we read forward length(address)-p

The DATA to DATA Step Macro
Blog: SASnrd

Substring to extract house number and post code

Re: Substring to extract house number and post code

Re: Substring to extract house number and post code

Re: Substring to extract house number and post code

Re: Substring to extract house number and post code

Re: Substring to extract house number and post code

Re: Substring to extract house number and post code

Re: Substring to extract house number and post code

Re: Substring to extract house number and post code

Re: Substring to extract house number and post code

Re: Substring to extract house number and post code

Re: Substring to extract house number and post code

Re: Substring to extract house number and post code

Re: Substring to extract house number and post code

Re: Substring to extract house number and post code

Re: Substring to extract house number and post code

Registration is open

SAS Training: Just a Click Away