- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Guys, could i ask for some guidane again please?
I have 122k lines of data where 1 of the variables is a user defined text entry. Somewhere within the variable is the string Captain or General or Colonel, etc (i have to find 144 differerent strings in total).
I need to find each instance of Captain in the text and then put Captain into a Variable called Type.
I have been trying:
if Text contains 'Captain' then Type = 'Captain' - this gives an arithmetic error. I dont think i can use contains with IF!
How do i find a string in any position in a text string and then put that string into my newvar?
thanks guys,
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Use the INDEX function like this
data want;
set have;
if index(textvar, 'Captain') > 0 then type = 'Captain';
else if index(textvar, 'General') > 0 then type = 'General';
else if index(textvar, 'Colonel') > 0 then type = 'Colonel';
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Use the INDEX function like this
data want;
set have;
if index(textvar, 'Captain') > 0 then type = 'Captain';
else if index(textvar, 'General') > 0 then type = 'General';
else if index(textvar, 'Colonel') > 0 then type = 'Colonel';
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
BTW I have assumed that no observation contains both the string 'Captain' and 'General' eg 🙂
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Just a thought ..
Regardless of case insensitive characters , use find function with i option .
if find(textvar, 'CAPTAIN',"i") > 0 then type = 'Captain';
data have;
input textvar $20.;
datalines;
Captain of Ship
captain of Ship
CAPTAIN india
The Captain
;
run;
data want;
set have;
if find(textvar, 'CAPTAIN',"i") > 0 then type = 'Captain';
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Regardless of which function you choose, you would be better off switching to the version that finds words rather than strings: FINDW or INDEXW.
You don't want your logic to locate strings like these:
generally
captaincy
In either case, though, you may encounter false positives such as:
In general, these problems were handled.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
my thanks to draycut for the code and to astounding for the inedxw suggestion.
i'm cooking with gas now!
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
You may have a potential issue for compound words with Lieutenant-General or Major-General. If you are searching for Lieutenant or Major you will not want "-" to be included as a delimiter for a word search and do want it for the search for General.
Since this is free form text you may have fun for your intended purpose if you only want General as a title or rank and not in context of "General orders" or similar.
Also Sergeant-Major is a potential if searching for Major, and Lieutenant-Colonel. If there is Navy data then Lieutenant-Commander is another issue.
And don't forget abbreviations such as CPT, LT, LTC, MAJ, GEN, COL, etc. since military runs on TLAs (three-letter acronyms).
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
i have come across an issue with my logic (not the code above).
When i run the indexw IF statements above, it places "General", "Captain", etc in my TYPE variable, i can then count each instance of "General", "Captain", etc. However, i cant work out how to get a 0 count for any strings it does not find.
I tried this:
if indexw(userdata, 'Captain') > 0 then do; type = 'Captain'; NotFound = 0;end;
else if indexw(userdata, 'Captain') = 0 then do; type = '.'; NotFound = 'Captain';end;
but it fills the var NotFound with captain.
Can anyone point me in the general direction? (see what i did there!)
thanks.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I'm not sure that you're picturing the right end result. Here's one possible change:
type='Captain';
if indexw(userdata, 'Captain') > 0 then Found=1;
else Found=0;
If you're going to check the same data for "General", you will need to take extra steps. Either output multiple observations (one for each TYPE value), or create multiple flags such as CaptainFound instead of Found.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
"I'm not sure that you're picturing the right end result." - i think you might be right!
i dont think my original idea will work for my new criteria.
I get the feeling that i'm going to have to go back to the start on this one...
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
trying to think round my issue...
i have 144 string i need to search for. At present my results table gives me a count for the 75 strings that are present in my userdata variable.
Will it be possible to code something that compares the 75 strings in my results table and compare it to my list of 144 searchable strings and then adds the 69 strings that were not found to my results table with a count of 0?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Just for completeness... I managed to work this out.
I made a new data set of my 144 search strings, compared it to the results from the indexw code, which gave me the search strings which the indexw code didnt find. I then just set the count for these strings to zero. I now have a complete list of all search strings and how many times the were found.
thanks to all for their help with this one.