Re: New User-Trouble with Extracting

kcvaldez98 · Posted 10-26-2023 02:36 PM

Hi SAS coders!

I am trying to extract the county from a string variable that is meshed in with the estimate and CI.

So the variable (called Regional) looks like this:

LARIMER County/Regional Estimate 14% (95% C.I.: 11.9 - 16.2)

LAS ANIMASCounty/Regional Estimate 16.7% (95% C.I.: 9.6 - 23.8)

I am trying to extract the LARIMER and LAS ANIMAS into their own variable and call it "COUNTY"

I've used SCAN, SUBSTR, COMPRESS, all the extracting functions and they always come out not ideal. My end result should be

LARIMER

LAS ANIMAS

The code I was using is:

DATA WANT;

SET OLD;

COUNTY = COMPRESS(SCAN(Regional, 1,'C'),'');

RUN;

The end result of this code is

LARIMER

LASANIMAS

I am trying to produce the output where LASANIMAS has a space so it should look like LAS ANIMAS.

Also some of the county names that I am working with start with C like "Cheyenne" and because I put the delimiter as "C" the county name cuts off so it produces an output of "heyenne"

I don't know what I am doing wrong.

PaigeMiller · Posted 10-26-2023 02:56 PM

Find the location of the word COUNTY. On your first row, that word begins at position 9. So then take the text in positions 1 through 8 (and optionally trim the blank off the end)

position=find(regional,'county','i');
county_name=substr(regional,1,position-1);

--
Paige Miller

ballardw · Posted 10-26-2023 06:56 PM

Don't forget to set a length for your county name long enough to hold the longest expected value.

mkeintz · Posted 10-26-2023 08:20 PM

It appears that REGIONAL has the county name, followed by the string "County", followed by a forward slash, and then other information irrelevant to extracting the county name. And the string "County" may or may not be preceded by a blank.

Then

(1) extract the text prior to the "/": = scan(regional,1,'/')

(2) then eliminate the last 6 non-blank characters: =substr(of above,1,length(of above)-6).

data want;
  set have;
  length county $25;
  county=substr(scan(regional,1,'/')
               ,1
               ,length(scan(regional,1,'/'))-6
               );
run;

Now if the word "County" was always spelled in lowercase except the first letter, you could avoid using the substr function in (2). Instead, you could apply a somewhat simpler expression:

data want;
  set have;
  length county $25;
  county=tranwrd(scan(regional,1,'/'),'County','');
run;

This, of course, assumes that the string 'County' does not occur twice (once before the slash, and once as part of the county's proper name) in the REGION variable.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Tom · Posted 10-26-2023 09:22 PM

If you wanted the keep the spaces why did you remove them? Remove the COMPRESS() function call.

Are you sure none of the county names have an uppercase letter C? If they do your SCAN() function call is also not what you want.

Patrick · Posted 10-26-2023 09:28 PM

And below yet another option using a regular expression.

data have;
  infile datalines truncover;
  input have_str $100.;
  datalines;
LARIMER County/Regional Estimate 14% (95% C.I.: 11.9 - 16.2)
LAS ANIMASCounty/Regional Estimate 16.7% (95% C.I.: 9.6 - 23.8)
xxx xxx/Regional Estimate 16.7% (95% C.I.: 9.6 - 23.8)
;

data want(drop=_:);
  set have;
  length county $40.;
  _prxid = prxparse('/^.*(?=county\/)/oi');
  call prxsubstr(_prxid, have_str, _pos, _len);
  if _len>0 then county = substr(have_str, _pos, _len);
run;

proc print data=want;
run;

New User-Trouble with Extracting