Hi,
Let's say my data set is set up like this:
study_id structure_id
1 BLADDER
2 .
3 BOWEL
4 BOWEL_SMALL
5 GTV
6 GTV_small
I want to delete records where structure_id contains "BOWEL" or "GTV".
I tried using this code:
data dvh_edited; set dvh; found=indexc(structure_id, "BOWEL", "GTV"); if found ^=0 then delete; run;
But with that code it is looking for any letter in "BOWEL" or "GTV" so it's tagging many records. I'm not that familiar with PROC SQL so I'd prefer to do this in a data step if it's fairly simple. Thanks!
Use the FIND function
if find(structure_id, "BOWEL", "i") > 0 or find(structure_id,"GTV","i") > 0 then delete;
data have;
input study_id structure_id : $13.;
cards;
1 BLADDER
2 .
3 BOWEL
4 BOWEL_SMALL
5 GTV
6 GTV_small
;
data want;
set have;
found=findw(strip(structure_id), "BOWEL",'_') or findw(strip(structure_id), "GTV",'_');
if found ^=0 then delete;
run;
It's easy if you are willing to code each word separately:
if findw(structure_id, 'GTV', , '_ ,' , 'I') then delete;
if findw(structure_id, 'BOWEL', , '_ ,' , 'I') then delete;
Assuming I got all the pieces of this in the right places:
The third parameter is starting position so can default to beginning of the string.
The fourth parameter is a list of word delimiters.
The fifth parameter asks that case be ignored, so these are both to be deleted: GTV and gtv
Note that this is probably more what you want, rather than INDEX or FIND. You probably don't want to delete observations that contain "HGTV", or "EMBOWEL".
Hmmm...... 3 responses in the first 10 minutes. Might be time for me to shut down the computer and go to lunch.
FINDW finds words, if the input text string is GTVsmall without the underscore, FINDW will not find it. I don't see a problem using FIND instead of FINDW.
I think the answer is to know something about your data. You can construct problem strings for most simple solutions (which includes mine, just for the record):
WatchingTV
Embowelling
Bowel/Intestine
Good point.
Unfortunately the structure_id variable is human-entered freetext so I actually do run into the problem you stated. There are some variables "GTVp" that I also want to eliminate, and the findw code earlier doesn't seem to filter variants like that.
@jahanm wrote:
Unfortunately the structure_id variable is human-entered freetext so I actually do run into the problem you stated. There are some variables "GTVp" that I also want to eliminate, and the findw code earlier doesn't seem to filter variants like that.
Regular expressions allow you to search for text patterns - you just need to be able to formulate the rules for these patterns.
Below RegEx will match any string with bowel or gtv in it. But using a RegEx would also allow you to search for a pattern like a word consisting of GTV plus max. one additional letter.
data have;
input study_id structure_id : $13.;
cards;
1 BLADDER
2 .
3 BOWEL
4 BOWEL_SMALL
5 GTV
6 GTV_small
;
data want;
set have;
if prxmatch('/bowel|gtv/oi',structure_id)>0 then delete;
run;
Catch the best of SAS Innovate 2025 — anytime, anywhere. Stream powerful keynotes, real-world demos, and game-changing insights from the world’s leading data and AI minds.
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.