BookmarkSubscribeRSS Feed
alw12194
Calcite | Level 5

Hi! Sorry if this has already been posted but cannot seem to find a previous post on this. 

 

I have a variable in a large data set (~50,000 obs so too many to manually correct) that can take on 1 of 3 possible structures:

1. name (example: Hartsfield airport) 

2. xxxx - name (example: 12345 - Hartsfield airport) 

3. name - xxxx (example: Hartsfield airport - 12345)

Here is some code to create a sample in SAS.  

data trial;
input have $;
datalines;
"12345 - hartsfield airport"
"hartsfield airport - 12345"
"hartsfield airport"
;

 

The strings are of many differing lengths and the numbers can be any combo and length so my sample may be oversimplified. I am wanting just the name out of the variable's string. There is no rhyme or reason for there being numbers and a dash before or after the name so I cannot subset the datasets based on another variable and use a simple strip(scan(have, 1, "-"))  or strip(scan(have, -1,  "-")) to extract just the name. I really wish it was that simple.  Is there a way to modify the strip and scan functions or another function I should use to extract name? 

 

I have thought about using something like the FIND, INDEX, or ANYDIGIT function to do a subset and then apply the strip/scan functions but because the numbers can take on any value and the strings have varying lengths, I don't think those will work. Any advice is appreciated. Thanks! 

 

 

4 REPLIES 4
ballardw
Super User

@alw12194 wrote:

Hi! Sorry if this has already been posted but cannot seem to find a previous post on this. 

 

I have a variable in a large data set (~50,000 obs so too many to manually correct) that can take on 1 of 3 possible structures:

1. name (example: Hartsfield airport) 

2. xxxx - name (example: 12345 - Hartsfield airport) 

3. name - xxxx (example: Hartsfield airport - 12345)

Here is some code to create a sample in SAS.  

data trial;
input have $;
datalines;
"12345 - hartsfield airport"
"hartsfield airport - 12345"
"hartsfield airport"
;

 

The strings are of many differing lengths and the numbers can be any combo and length so my sample may be oversimplified. I am wanting just the name out of the variable's string. There is no rhyme or reason for there being numbers and a dash before or after the name so I cannot subset the datasets based on another variable and use a simple strip(scan(have, 1, "-"))  or strip(scan(have, -1,  "-")) to extract just the name. I really wish it was that simple.  Is there a way to modify the strip and scan functions or another function I should use to extract name? 

 

I have thought about using something like the FIND, INDEX, or ANYDIGIT function to do a subset and then apply the strip/scan functions but because the numbers can take on any value and the strings have varying lengths, I don't think those will work. Any advice is appreciated. Thanks! 

 

 


One way that may get you started:

data trial;
   input have  $ 1-30;
datalines;
"12345 - hartsfield airport"
"hartsfield airport - 12345"
"hartsfield airport"
;

data want;
   set trial;
   newname= strip(compress(have,,'dp'));
run;

The compress removes characters in a list. In this case I did not provide a list of characters but use the options d, to add digits, and p, to add punctuation characters to the "list". Depending on where spaces occur you might have leading space or two so strip removes them.

 

Caution: if your name contains digits this wouldn't be appropriate.

Reeza
Super User
As BallardW indicates it depends on how dirty your data is. I suspect some airport names have a hyphen so scanning on that delimiter is problematic. But airport names are public and there's likely a list out there somewhere. Perhaps instead you want to find that list and then do a lookup or merge of some sort? If you think the data is somewhat clean, as in the airport names are spelled correctly and consistently that may be the most efficient method in terms of your time. If you need to factor in someone who's typing in Harfield airporto, Hartsfield airport that becomes a different issue again.
Reeza
Super User
It looks like there are hyphens in the airport name but no numbers (from a cursory scan) so BallardW solution will get you close.
https://en.wikipedia.org/wiki/List_of_airports_in_the_United_States
ErikLund_Jensen
Rhodochrosite | Level 12

Hi @alw12194 

 

The following code splits the input string into elements, that are "anything" separated by blanks, so the elements can be processed one by one and omitted according to the criterias. Then only elements that passes the tests are included in the output string. It might not be elegant code, but is makes it easy to experiment with different criterias.

 

The code discards elements consisting of numbers only + elements that are stand-alone hyphens. It seems that hyphens in Airport names are not surrounded by blanks, so the will not be omitted.

 

data have;
	infile datalines truncover;
	input name $char100.;
	datalines;
12345 - hartsfield airport
hartsfield airport - 12345
hartsfield airport
;

data want (drop=element i err:);
	set have;
	length element newname $100;

	* Split name in elements;
	do i = 1 to countw(name,' ');
		element = scan(name,i,' ');

		* Define conditions for omitting elements;
		err1 = not notdigit(trim(element));
		err2 = (element = '-');

		* Add accepted elements to newname;
		if max(err1,err2) = 0 then newname = catx(' ',newname,element);
	end;
run;

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 4 replies
  • 2491 views
  • 3 likes
  • 4 in conversation