Hi! Sorry if this has already been posted but cannot seem to find a previous post on this.
I have a variable in a large data set (~50,000 obs so too many to manually correct) that can take on 1 of 3 possible structures:
1. name (example: Hartsfield airport)
2. xxxx - name (example: 12345 - Hartsfield airport)
3. name - xxxx (example: Hartsfield airport - 12345)
Here is some code to create a sample in SAS.
data trial;
input have $;
datalines;
"12345 - hartsfield airport"
"hartsfield airport - 12345"
"hartsfield airport"
;
The strings are of many differing lengths and the numbers can be any combo and length so my sample may be oversimplified. I am wanting just the name out of the variable's string. There is no rhyme or reason for there being numbers and a dash before or after the name so I cannot subset the datasets based on another variable and use a simple strip(scan(have, 1, "-")) or strip(scan(have, -1, "-")) to extract just the name. I really wish it was that simple. Is there a way to modify the strip and scan functions or another function I should use to extract name?
I have thought about using something like the FIND, INDEX, or ANYDIGIT function to do a subset and then apply the strip/scan functions but because the numbers can take on any value and the strings have varying lengths, I don't think those will work. Any advice is appreciated. Thanks!
@alw12194 wrote:
Hi! Sorry if this has already been posted but cannot seem to find a previous post on this.
I have a variable in a large data set (~50,000 obs so too many to manually correct) that can take on 1 of 3 possible structures:
1. name (example: Hartsfield airport)
2. xxxx - name (example: 12345 - Hartsfield airport)
3. name - xxxx (example: Hartsfield airport - 12345)
Here is some code to create a sample in SAS.
data trial; input have $; datalines; "12345 - hartsfield airport" "hartsfield airport - 12345" "hartsfield airport" ;
The strings are of many differing lengths and the numbers can be any combo and length so my sample may be oversimplified. I am wanting just the name out of the variable's string. There is no rhyme or reason for there being numbers and a dash before or after the name so I cannot subset the datasets based on another variable and use a simple strip(scan(have, 1, "-")) or strip(scan(have, -1, "-")) to extract just the name. I really wish it was that simple. Is there a way to modify the strip and scan functions or another function I should use to extract name?
I have thought about using something like the FIND, INDEX, or ANYDIGIT function to do a subset and then apply the strip/scan functions but because the numbers can take on any value and the strings have varying lengths, I don't think those will work. Any advice is appreciated. Thanks!
One way that may get you started:
data trial; input have $ 1-30; datalines; "12345 - hartsfield airport" "hartsfield airport - 12345" "hartsfield airport" ; data want; set trial; newname= strip(compress(have,,'dp')); run;
The compress removes characters in a list. In this case I did not provide a list of characters but use the options d, to add digits, and p, to add punctuation characters to the "list". Depending on where spaces occur you might have leading space or two so strip removes them.
Caution: if your name contains digits this wouldn't be appropriate.
Hi @alw12194
The following code splits the input string into elements, that are "anything" separated by blanks, so the elements can be processed one by one and omitted according to the criterias. Then only elements that passes the tests are included in the output string. It might not be elegant code, but is makes it easy to experiment with different criterias.
The code discards elements consisting of numbers only + elements that are stand-alone hyphens. It seems that hyphens in Airport names are not surrounded by blanks, so the will not be omitted.
data have;
infile datalines truncover;
input name $char100.;
datalines;
12345 - hartsfield airport
hartsfield airport - 12345
hartsfield airport
;
data want (drop=element i err:);
set have;
length element newname $100;
* Split name in elements;
do i = 1 to countw(name,' ');
element = scan(name,i,' ');
* Define conditions for omitting elements;
err1 = not notdigit(trim(element));
err2 = (element = '-');
* Add accepted elements to newname;
if max(err1,err2) = 0 then newname = catx(' ',newname,element);
end;
run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.