BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
pandhandj
Obsidian | Level 7

Guys, could i ask for some guidane again please?

 

I have 122k lines of data where 1 of the variables is a user defined text entry.  Somewhere within the variable is the string Captain or General or Colonel, etc (i have to find 144 differerent strings in total).

 

I need to find each instance of Captain in the text and then put Captain into a Variable called Type.

 

I have been trying:

 

if Text contains 'Captain' then Type = 'Captain' - this gives an arithmetic error.  I dont think i can use contains with IF!

 

How do i find a string in any position in a text string and then put that string into  my newvar?

 

thanks guys,

1 ACCEPTED SOLUTION

Accepted Solutions
PeterClemmensen
Tourmaline | Level 20

Use the INDEX function like this

 

data want;
   set have;

   if index(textvar, 'Captain') > 0 then type = 'Captain';
   else if index(textvar, 'General') > 0 then type = 'General';
   else if index(textvar, 'Colonel') > 0 then type = 'Colonel';
run;

View solution in original post

11 REPLIES 11
PeterClemmensen
Tourmaline | Level 20

Use the INDEX function like this

 

data want;
   set have;

   if index(textvar, 'Captain') > 0 then type = 'Captain';
   else if index(textvar, 'General') > 0 then type = 'General';
   else if index(textvar, 'Colonel') > 0 then type = 'Colonel';
run;
PeterClemmensen
Tourmaline | Level 20

BTW I have assumed that no observation contains both the string 'Captain' and 'General' eg 🙂

monikka1991
Obsidian | Level 7

Just a thought ..

 

 

Regardless of case insensitive characters , use find function with i option .

 

 if find(textvar, 'CAPTAIN',"i") > 0 then type = 'Captain';

 

data have;
input textvar $20.;
datalines;
Captain of Ship
captain of Ship
CAPTAIN india
The Captain
;
run;


data want;
set have;
if find(textvar, 'CAPTAIN',"i") > 0 then type = 'Captain';
run;

 

Astounding
PROC Star

Regardless of which function you choose, you would be better off switching to the version that finds words rather than strings:  FINDW or INDEXW.

 

You don't want your logic to locate strings like these:

 

generally

captaincy

 

In either case, though, you may encounter false positives such as:

 

In general, these problems were handled.

pandhandj
Obsidian | Level 7

my thanks to draycut for the code and to astounding for the inedxw suggestion.

 

i'm cooking with gas now!

ballardw
Super User

You may have a potential issue for compound words with Lieutenant-General or Major-General. If you are searching for Lieutenant or Major you will not want "-" to be included as a delimiter for a word search and do want it for the search for General.

 

Since this is free form text you may have fun for your intended purpose if you only want General as a title or rank and not in context of "General orders" or similar.

 

Also Sergeant-Major is a potential if searching for Major, and Lieutenant-Colonel. If there is Navy data then Lieutenant-Commander is another issue.

 

And don't forget abbreviations such as CPT, LT, LTC, MAJ, GEN, COL, etc. since military runs on TLAs (three-letter acronyms).

pandhandj
Obsidian | Level 7
Guys,

i have come across an issue with my logic (not the code above).

When i run the indexw IF statements above, it places "General", "Captain", etc in my TYPE variable, i can then count each instance of "General", "Captain", etc. However, i cant work out how to get a 0 count for any strings it does not find.
I tried this:

if indexw(userdata, 'Captain') > 0 then do; type = 'Captain'; NotFound = 0;end;
else if indexw(userdata, 'Captain') = 0 then do; type = '.'; NotFound = 'Captain';end;

but it fills the var NotFound with captain.

Can anyone point me in the general direction? (see what i did there!)

thanks.
Astounding
PROC Star

I'm not sure that you're picturing the right end result.  Here's one possible change:

 

type='Captain';

if indexw(userdata, 'Captain') > 0 then Found=1;

else Found=0;

 

If you're going to check the same data for "General", you will need to take extra steps.  Either output multiple observations (one for each TYPE value), or create multiple flags such as CaptainFound instead of Found.

 

pandhandj
Obsidian | Level 7

"I'm not sure that you're picturing the right end result." - i think you might be right!

i dont think my original idea will work for my new criteria.

 

I get the feeling that i'm going to have to go back to the start on this one...

pandhandj
Obsidian | Level 7

trying to think round my issue...

 

i have 144 string i need to search for.  At present my results table gives me a count for the 75 strings that are present in my userdata variable.  

 

Will it be possible to code something that compares the 75 strings in my results table and compare it to my list of 144 searchable strings and then adds the 69 strings that were not found to my results table with a count of 0?

pandhandj
Obsidian | Level 7

Just for completeness... I managed to work this out.

I made a new data set of my 144 search strings, compared it to the results from the indexw code, which gave me the search strings which the indexw code didnt find. I then just set the count for these strings to zero. I now have a complete list of all search strings and how many times the were found.

thanks to all for their help with this one.

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 11 replies
  • 22250 views
  • 6 likes
  • 5 in conversation