Extract part of strings - before or after delimiter

alw12194 · Posted 06-04-2021 04:01 PM

Hi! Sorry if this has already been posted but cannot seem to find a previous post on this.

I have a variable in a large data set (~50,000 obs so too many to manually correct) that can take on 1 of 3 possible structures:

1. name (example: Hartsfield airport)

2. xxxx - name (example: 12345 - Hartsfield airport)

3. name - xxxx (example: Hartsfield airport - 12345)

Here is some code to create a sample in SAS.

data trial;
input have $;
datalines;
"12345 - hartsfield airport"
"hartsfield airport - 12345"
"hartsfield airport"
;

The strings are of many differing lengths and the numbers can be any combo and length so my sample may be oversimplified. I am wanting just the name out of the variable's string. There is no rhyme or reason for there being numbers and a dash before or after the name so I cannot subset the datasets based on another variable and use a simple strip(scan(have, 1, "-")) or strip(scan(have, -1, "-")) to extract just the name. I really wish it was that simple. Is there a way to modify the strip and scan functions or another function I should use to extract name?

I have thought about using something like the FIND, INDEX, or ANYDIGIT function to do a subset and then apply the strip/scan functions but because the numbers can take on any value and the strings have varying lengths, I don't think those will work. Any advice is appreciated. Thanks!

ballardw · Posted 06-04-2021 04:11 PM

@alw12194 wrote:

Hi! Sorry if this has already been posted but cannot seem to find a previous post on this.

I have a variable in a large data set (~50,000 obs so too many to manually correct) that can take on 1 of 3 possible structures:

1. name (example: Hartsfield airport)

2. xxxx - name (example: 12345 - Hartsfield airport)

3. name - xxxx (example: Hartsfield airport - 12345)

Here is some code to create a sample in SAS.
data trial;
input have $;
datalines;
"12345 - hartsfield airport"
"hartsfield airport - 12345"
"hartsfield airport"
;
The strings are of many differing lengths and the numbers can be any combo and length so my sample may be oversimplified. I am wanting just the name out of the variable's string. There is no rhyme or reason for there being numbers and a dash before or after the name so I cannot subset the datasets based on another variable and use a simple strip(scan(have, 1, "-")) or strip(scan(have, -1, "-")) to extract just the name. I really wish it was that simple. Is there a way to modify the strip and scan functions or another function I should use to extract name?

I have thought about using something like the FIND, INDEX, or ANYDIGIT function to do a subset and then apply the strip/scan functions but because the numbers can take on any value and the strings have varying lengths, I don't think those will work. Any advice is appreciated. Thanks!

One way that may get you started:

data trial;
   input have  $ 1-30;
datalines;
"12345 - hartsfield airport"
"hartsfield airport - 12345"
"hartsfield airport"
;

data want;
   set trial;
   newname= strip(compress(have,,'dp'));
run;

The compress removes characters in a list. In this case I did not provide a list of characters but use the options d, to add digits, and p, to add punctuation characters to the "list". Depending on where spaces occur you might have leading space or two so strip removes them.

Caution: if your name contains digits this wouldn't be appropriate.

Reeza · Posted 06-04-2021 05:55 PM

As BallardW indicates it depends on how dirty your data is. I suspect some airport names have a hyphen so scanning on that delimiter is problematic. But airport names are public and there's likely a list out there somewhere. Perhaps instead you want to find that list and then do a lookup or merge of some sort? If you think the data is somewhat clean, as in the airport names are spelled correctly and consistently that may be the most efficient method in terms of your time. If you need to factor in someone who's typing in Harfield airporto, Hartsfield airport that becomes a different issue again.

Reeza · Posted 06-04-2021 05:58 PM

It looks like there are hyphens in the airport name but no numbers (from a cursory scan) so BallardW solution will get you close.
https://en.wikipedia.org/wiki/List_of_airports_in_the_United_States

ErikLund_Jensen · Posted 06-06-2021 06:35 AM

Hi @alw12194

The following code splits the input string into elements, that are "anything" separated by blanks, so the elements can be processed one by one and omitted according to the criterias. Then only elements that passes the tests are included in the output string. It might not be elegant code, but is makes it easy to experiment with different criterias.

The code discards elements consisting of numbers only + elements that are stand-alone hyphens. It seems that hyphens in Airport names are not surrounded by blanks, so the will not be omitted.

data have;
	infile datalines truncover;
	input name $char100.;
	datalines;
12345 - hartsfield airport
hartsfield airport - 12345
hartsfield airport
;

data want (drop=element i err:);
	set have;
	length element newname $100;

	* Split name in elements;
	do i = 1 to countw(name,' ');
		element = scan(name,i,' ');

		* Define conditions for omitting elements;
		err1 = not notdigit(trim(element));
		err2 = (element = '-');

		* Add accepted elements to newname;
		if max(err1,err2) = 0 then newname = catx(' ',newname,element);
	end;
run;

Extract part of strings - before or after delimiter

Re: Extract part of strings - before or after delimiter

Re: Extract part of strings - before or after delimiter

Re: Extract part of strings - before or after delimiter

Re: Extract part of strings - before or after delimiter

Extract part of strings - before or after delimiter

Re: Extract part of strings - before or after delimiter

Re: Extract part of strings - before or after delimiter

Re: Extract part of strings - before or after delimiter

Re: Extract part of strings - before or after delimiter

SAS Innovate 2025: Register Now

SAS Training: Just a Click Away