- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
I have a data file which has two string variables that contain multiple words each:
V1 V2
abc def ghi jkl abc
def ghi jkl abc abc
abcdef abc
abc abc def
Each string can contain one or more words. What I'm trying to do is identify records where any of the full words in V2 is found (also as a full word) in V1. So the first, second and last records would be matches, but the third record would not because the value in V2 does not match to a full word value found in V1.
I did come up with a way to do this using multiple scan functions but it's really clunky and inefficient. I'm hoping for a more efficient and automated approach.
Any help is much appreciated.
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Here is how I would do it
data have;
input V1 $ 1-16 V2 $ 17-25;
datalines;
abc def ghi jkl abc
def ghi jkl abc abc
abcdef abc
abc abc def
;
data want (drop=i);
set have;
do i = 1 to countw(V2);
if findw(V1, scan(V2, i)) then do;
output;
leave;
end;
end;
run;
Result:
V1 V2 abc def ghi jkl abc def ghi jkl abc abc abc abc def
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I did come up with a way to do this using multiple scan functions but it's really clunky and inefficient. I'm hoping for a more efficient and automated approach.
I think you need a DO loop in a DATA step, along with the COUNTW and SCAN and FIND functions. So, it's not clear what you did or whether it can be improved. Can you show us what you have tried?
Paige Miller
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Basically a lot of scan functions:
if scan(v1, 1, ' ') = scan(v2, 1, ' ') or
scan(v1, 2, ' ') = scan(v2, 1, ' ') etc.
This is not a great approach because the strings can have up to 9 words (and it's super clunky).
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Here is how I would do it
data have;
input V1 $ 1-16 V2 $ 17-25;
datalines;
abc def ghi jkl abc
def ghi jkl abc abc
abcdef abc
abc abc def
;
data want (drop=i);
set have;
do i = 1 to countw(V2);
if findw(V1, scan(V2, i)) then do;
output;
leave;
end;
end;
run;
Result:
V1 V2 abc def ghi jkl abc def ghi jkl abc abc abc abc def
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thanks, @PeterClemmensen , this is exactly what I had in mind.
Paige Miller