Solved: Re: Extract a String Help

kcvaldez98 · Posted 10-31-2023 05:06 PM

Hi SAS coders,

I am working on this dataset where I am trying to extract the county estimates from a variable string. I've tried using all the extraction functions (SCAN, SUBSTR, etc.) and when I do extract the estimate, it always shows some counties including "Esti" instead of the percentage.

Here is a screenshot to show what I mean. I want to be able to extract all the estimates without this error. I know it's because there are extra spaces between the county names, but I don't know how to extract without the spaces ruining the output.

Here is a dataset that has county names and estimates and with the counties with the extra space.

*This string is named Regional*

LARIMER County/Regional Estimate 14% (95% C.I.: 11.9 - 16.2)
LAS ANIMASCounty/Regional Estimate 16.7% (95% C.I.: 9.6 - 23.8)
FREMONT County/Regional Estimate 15.5% (95% C.I.: 10.3 - 20.7)
CONEJOS County/Regional Estimate 9.9% (95% C.I.: 2.6 - 17.1)
EAGLE County/Regional Estimate 14.5% (95% C.I.: 9.9 - 19.1)
LA PLATA County/Regional Estimate 25.1% (95% C.I.: 19.7 - 30.5)

The code I have used to show that specific output is

DATA want;
SET Have;

LENGTH CountyEstimate $4.;
CountyEstimate =COMPRESS(SCAN(Regional,5, 'e %'));
RUN;

Please advise.

Patrick · Posted 10-31-2023 05:42 PM

Your code is always extracting the 5th word which of course can't work if county names can consist of one or several words.

This has already been dealt with in proposed answer in your last question here and variations of the answers provided should also work for the case at hand.

If you would deal with a string where the sub-strings you're after are always on the same position (like always the second word) then your current code would work.

If you use the scan() function with counting words from end to start then you're actually dealing with such a consistent structure. For your current case the following should work:

data have;
  infile datalines truncover;
  input regional $200.;
  datalines;
LARIMER County/Regional Estimate 14% (95% C.I.: 11.9 - 16.2)
LAS ANIMASCounty/Regional Estimate 16.7% (95% C.I.: 9.6 - 23.8)
FREMONT County/Regional Estimate 15.5% (95% C.I.: 10.3 - 20.7)
CONEJOS County/Regional Estimate 9.9% (95% C.I.: 2.6 - 17.1)
EAGLE County/Regional Estimate 14.5% (95% C.I.: 9.9 - 19.1)
LA PLATA County/Regional Estimate 25.1% (95% C.I.: 19.7 - 30.5)
;

data want;
  set have;
  length CountyEstimate 8.;
  format CountyEstimate percent8.1;
  CountyEstimate = input(scan(strip(regional),-6,' '),percent.);
run;

proc print data=want;
run;

View solution in original post

PaigeMiller · Posted 10-31-2023 05:26 PM

A two step process:

Use the FINDW() function to find the location of the word Estimate.
Use the SCAN and SUBSTR function to find first "word" after the location found above plus 9 characters. This assumes there is always a space after the word Estimate.

Example:

data have;
    string='LA PLATA County/Regional Estimate 25.1% (95% C.I.: 19.7 - 30.5)';
run;

data want;
    set have;
    where=findw(string,'Estimate',' ');
    estimate=scan(substr(string,where+9),1,' %');
run;

Feel free to combine the two steps into one, if you would like. Feel free to turn the answer into a numeric variable, if you want.

--
Paige Miller

Patrick · Posted 10-31-2023 05:42 PM

Your code is always extracting the 5th word which of course can't work if county names can consist of one or several words.

This has already been dealt with in proposed answer in your last question here and variations of the answers provided should also work for the case at hand.

If you would deal with a string where the sub-strings you're after are always on the same position (like always the second word) then your current code would work.

If you use the scan() function with counting words from end to start then you're actually dealing with such a consistent structure. For your current case the following should work:

data have;
  infile datalines truncover;
  input regional $200.;
  datalines;
LARIMER County/Regional Estimate 14% (95% C.I.: 11.9 - 16.2)
LAS ANIMASCounty/Regional Estimate 16.7% (95% C.I.: 9.6 - 23.8)
FREMONT County/Regional Estimate 15.5% (95% C.I.: 10.3 - 20.7)
CONEJOS County/Regional Estimate 9.9% (95% C.I.: 2.6 - 17.1)
EAGLE County/Regional Estimate 14.5% (95% C.I.: 9.9 - 19.1)
LA PLATA County/Regional Estimate 25.1% (95% C.I.: 19.7 - 30.5)
;

data want;
  set have;
  length CountyEstimate 8.;
  format CountyEstimate percent8.1;
  CountyEstimate = input(scan(strip(regional),-6,' '),percent.);
run;

proc print data=want;
run;

kcvaldez98 · Posted 11-01-2023 10:15 AM

Thank you, that makes a lot more sense. Also to add, how do I format it without the % at the end?

Thank you!

PaigeMiller · Posted 11-01-2023 10:31 AM

@kcvaldez98 wrote:

Thank you, that makes a lot more sense. Also to add, how do I format it without the % at the end?

Remove or change the FORMAT statement.

--
Paige Miller

Patrick · Posted 11-01-2023 06:49 PM

@kcvaldez98 wrote:

Thank you, that makes a lot more sense. Also to add, how do I format it without the % at the end?

Thank you!

Below some options. Variable option_2 is the same than in the previous answer just without the percent format applied.

Given your source values are in percent I consider option_2 as "correct" - but that's of course just how I see things.

data want;
  set have;
  length option_1 $6;
  option_1 = scan(strip(regional),-6,' %');

  /*format option_2 percent8.1;*/
  option_2 = input(scan(strip(regional),-6,' '),percent.);

  option_3 = option_2*100;

  option_4 = input(scan(strip(regional),-6,' %'),best32.);
run;

proc print data=want;
run;

Extract a String Help

Re: Extract a String Help

Re: Extract a String Help

Re: Extract a String Help

Re: Extract a String Help

Re: Extract a String Help

Re: Extract a String Help

Registration is open