Hi SAS coders,
I am working on this dataset where I am trying to extract the county estimates from a variable string. I've tried using all the extraction functions (SCAN, SUBSTR, etc.) and when I do extract the estimate, it always shows some counties including "Esti" instead of the percentage.
Here is a screenshot to show what I mean. I want to be able to extract all the estimates without this error. I know it's because there are extra spaces between the county names, but I don't know how to extract without the spaces ruining the output.
Here is a dataset that has county names and estimates and with the counties with the extra space.
*This string is named Regional*
LARIMER County/Regional Estimate 14% (95% C.I.: 11.9 - 16.2)
LAS ANIMASCounty/Regional Estimate 16.7% (95% C.I.: 9.6 - 23.8)
FREMONT County/Regional Estimate 15.5% (95% C.I.: 10.3 - 20.7)
CONEJOS County/Regional Estimate 9.9% (95% C.I.: 2.6 - 17.1)
EAGLE County/Regional Estimate 14.5% (95% C.I.: 9.9 - 19.1)
LA PLATA County/Regional Estimate 25.1% (95% C.I.: 19.7 - 30.5)
The code I have used to show that specific output is
DATA want;
SET Have;
LENGTH CountyEstimate $4.;
CountyEstimate =COMPRESS(SCAN(Regional,5, 'e %'));
RUN;
Please advise.
Your code is always extracting the 5th word which of course can't work if county names can consist of one or several words.
This has already been dealt with in proposed answer in your last question here and variations of the answers provided should also work for the case at hand.
If you would deal with a string where the sub-strings you're after are always on the same position (like always the second word) then your current code would work.
If you use the scan() function with counting words from end to start then you're actually dealing with such a consistent structure. For your current case the following should work:
data have;
infile datalines truncover;
input regional $200.;
datalines;
LARIMER County/Regional Estimate 14% (95% C.I.: 11.9 - 16.2)
LAS ANIMASCounty/Regional Estimate 16.7% (95% C.I.: 9.6 - 23.8)
FREMONT County/Regional Estimate 15.5% (95% C.I.: 10.3 - 20.7)
CONEJOS County/Regional Estimate 9.9% (95% C.I.: 2.6 - 17.1)
EAGLE County/Regional Estimate 14.5% (95% C.I.: 9.9 - 19.1)
LA PLATA County/Regional Estimate 25.1% (95% C.I.: 19.7 - 30.5)
;
data want;
set have;
length CountyEstimate 8.;
format CountyEstimate percent8.1;
CountyEstimate = input(scan(strip(regional),-6,' '),percent.);
run;
proc print data=want;
run;
A two step process:
Example:
data have;
string='LA PLATA County/Regional Estimate 25.1% (95% C.I.: 19.7 - 30.5)';
run;
data want;
set have;
where=findw(string,'Estimate',' ');
estimate=scan(substr(string,where+9),1,' %');
run;
Feel free to combine the two steps into one, if you would like. Feel free to turn the answer into a numeric variable, if you want.
Your code is always extracting the 5th word which of course can't work if county names can consist of one or several words.
This has already been dealt with in proposed answer in your last question here and variations of the answers provided should also work for the case at hand.
If you would deal with a string where the sub-strings you're after are always on the same position (like always the second word) then your current code would work.
If you use the scan() function with counting words from end to start then you're actually dealing with such a consistent structure. For your current case the following should work:
data have;
infile datalines truncover;
input regional $200.;
datalines;
LARIMER County/Regional Estimate 14% (95% C.I.: 11.9 - 16.2)
LAS ANIMASCounty/Regional Estimate 16.7% (95% C.I.: 9.6 - 23.8)
FREMONT County/Regional Estimate 15.5% (95% C.I.: 10.3 - 20.7)
CONEJOS County/Regional Estimate 9.9% (95% C.I.: 2.6 - 17.1)
EAGLE County/Regional Estimate 14.5% (95% C.I.: 9.9 - 19.1)
LA PLATA County/Regional Estimate 25.1% (95% C.I.: 19.7 - 30.5)
;
data want;
set have;
length CountyEstimate 8.;
format CountyEstimate percent8.1;
CountyEstimate = input(scan(strip(regional),-6,' '),percent.);
run;
proc print data=want;
run;
Thank you, that makes a lot more sense. Also to add, how do I format it without the % at the end?
Thank you!
@kcvaldez98 wrote:
Thank you, that makes a lot more sense. Also to add, how do I format it without the % at the end?
Remove or change the FORMAT statement.
@kcvaldez98 wrote:
Thank you, that makes a lot more sense. Also to add, how do I format it without the % at the end?
Thank you!
Below some options. Variable option_2 is the same than in the previous answer just without the percent format applied.
Given your source values are in percent I consider option_2 as "correct" - but that's of course just how I see things.
data want;
set have;
length option_1 $6;
option_1 = scan(strip(regional),-6,' %');
/*format option_2 percent8.1;*/
option_2 = input(scan(strip(regional),-6,' '),percent.);
option_3 = option_2*100;
option_4 = input(scan(strip(regional),-6,' %'),best32.);
run;
proc print data=want;
run;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.