topic Re: New User-Trouble with Extracting in New SAS User

New User-Trouble with Extracting

kcvaldez98 — Thu, 26 Oct 2023 18:36:22 GMT

Hi SAS coders!

I am trying to extract the county from a string variable that is meshed in with the estimate and CI.

So the variable (called Regional) looks like this:

LARIMER County/Regional Estimate 14% (95% C.I.: 11.9 - 16.2)

LAS ANIMASCounty/Regional Estimate 16.7% (95% C.I.: 9.6 - 23.8)

I am trying to extract the LARIMER and LAS ANIMAS into their own variable and call it "COUNTY"

I've used SCAN, SUBSTR, COMPRESS, all the extracting functions and they always come out not ideal. My end result should be

LARIMER

LAS ANIMAS

The code I was using is:

DATA WANT;

SET OLD;

COUNTY = COMPRESS(SCAN(Regional, 1,'C'),'');

RUN;

The end result of this code is

LARIMER

LASANIMAS

I am trying to produce the output where LASANIMAS has a space so it should look like LAS ANIMAS.

Also some of the county names that I am working with start with C like "Cheyenne" and because I put the delimiter as "C" the county name cuts off so it produces an output of "heyenne"

I don't know what I am doing wrong.

Re: New User-Trouble with Extracting

PaigeMiller — Thu, 26 Oct 2023 18:56:56 GMT

Find the location of the word COUNTY. On your first row, that word begins at position 9. So then take the text in positions 1 through 8 (and optionally trim the blank off the end)

position=find(regional,'county','i');
county_name=substr(regional,1,position-1);

Re: New User-Trouble with Extracting

ballardw — Thu, 26 Oct 2023 22:56:04 GMT

Don't forget to set a length for your county name long enough to hold the longest expected value.

Re: New User-Trouble with Extracting

mkeintz — Fri, 27 Oct 2023 00:20:40 GMT

It appears that REGIONAL has the county name, followed by the string "County", followed by a forward slash, and then other information irrelevant to extracting the county name. And the string "County" may or may not be preceded by a blank.

Then

(1) extract the text prior to the "/": = scan(regional,1,'/')

(2) then eliminate the last 6 non-blank characters: =substr(of above,1,length(of above)-6).

data want;
  set have;
  length county $25;
  county=substr(scan(regional,1,'/')
               ,1
               ,length(scan(regional,1,'/'))-6
               );
run;

Now if the word "County" was always spelled in lowercase except the first letter, you could avoid using the substr function in (2). Instead, you could apply a somewhat simpler expression:

data want;
  set have;
  length county $25;
  county=tranwrd(scan(regional,1,'/'),'County','');
run;

This, of course, assumes that the string 'County' does not occur twice (once before the slash, and once as part of the county's proper name) in the REGION variable.

Re: New User-Trouble with Extracting

Tom — Fri, 27 Oct 2023 01:22:08 GMT

If you wanted the keep the spaces why did you remove them? Remove the COMPRESS() function call.

Are you sure none of the county names have an uppercase letter C? If they do your SCAN() function call is also not what you want.

Re: New User-Trouble with Extracting

Patrick — Fri, 27 Oct 2023 01:39:36 GMT

And below yet another option using a regular expression.

data have;
  infile datalines truncover;
  input have_str $100.;
  datalines;
LARIMER County/Regional Estimate 14% (95% C.I.: 11.9 - 16.2)
LAS ANIMASCounty/Regional Estimate 16.7% (95% C.I.: 9.6 - 23.8)
xxx xxx/Regional Estimate 16.7% (95% C.I.: 9.6 - 23.8)
;

data want(drop=_:);
  set have;
  length county $40.;
  _prxid = prxparse('/^.*(?=county\/)/oi');
  call prxsubstr(_prxid, have_str, _pos, _len);
  if _len>0 then county = substr(have_str, _pos, _len);
run;

proc print data=want;
run;