I have 2 datasets - dataA is a Sentence, dataB is a Name field.
DataA has observations like - "This Apple was bought yesterday", "The Orange is delicious", "Those Carrots are spoilt".
DataB has observations like - "Strawberries", "Oranges", "Potatoes", "Aple", "Grapes", "Carot".
(Note: All the characters are in UPCASE).
I want to fuzzy match both datasets to see if DataB matches any of the strings in DataA. If it matches, I want DataA to have a new Column which says Match = "YES".
Can we match both the datasets without merging?
This answer from @Ksharp on a previous thread is likely a good start for what you need: SAS matching on substring
Its a good answer, but it only works if both datasets are of correct spelling. Is there any way I can incorporate this with combination of some fuzzy functions like COMGED, SPEDIS, SOUNDEX etc?
Like this?
proc sql;
select unique PHRASE /* unique required in case several words are in the phrase */
, (WORD ne ' ') as MATCH
from PHRASES
left join
WORDS
on trim(PHRASE) ? trim(WORD);
Naxt code uses your test data:
data texts;
infile cards;
input row $char50.;
cards;
This Apple was bought yesterday
The Orange is delicious
Those Carrots are spoilt
;
run;
data words;
infile cards;
input word $char15.;
cards;
Strawberries
Oranges
Potatoes
Aple
Grapes
Carot
;
run;
/* match words with rows - a fuzzy match */
proc sql noprint;
create table temp as
select a.*, b.*
from words as a, texts as b;
run;
/***
Base SAS has several functions that you can use
to compare how close words are to each other.
The SPEDIS function, which I assume is an acronym for "spelling distance."
Other SAS functions include: COMPLEV, COMPGED and SOUNDEX.
***/
data tmp2;
set temp;
wn = countw(row);
flag=0;
do i=1 to wn;
cmpl1 = complev(word,scan(row,i));
if cmpl1 =1 then flag=1;
end;
*if flag=1;
keep word row flag;
run;
Thankyou for the response. Currently even I am joining both the datasets and trying to perform the actions.
But I was trying to know any method without joining the 2 datasets because each datasets have more than 1lakh records and joining causes data to explode.
But the below code is of help. I used to break each strings into into separate observations and then perform the analysis like with combinations of COMPGED, COMPLEV, SPEEDIS, SOUNDEX. But this involves lot of data exploding. Your code can atleast help me avoid 2 times data exploding.
data tmp2;
set temp;
wn = countw(row);
flag=0;
do i=1 to wn;
cmpl1 = complev(word,scan(row,i));
if cmpl1 =1 then flag=1;
end;
*if flag=1;
keep word row flag;
run;
COMPGED and the other fuzzy functions are really CPU-heavy.
You might try to vet the data before trying to match. Like:
if first(word)=:scan(row,i)) then if complev(word,scan(row,i))=1 then flag=1;
or
if length(scan(row,i)) > 5 then if complev(word,scan(row,i))=1 then flag=1;
so COMPLEV is not called so often.
Neglecting CASE issues and maybe some others too,
try next code:
data texts;
infile cards;
input row $char50.;
cards;
This Apple was bought yesterday
The Orange is delicious
Those Carrots are spoilt
;
run;
data words;
infile cards;
input word $char15.;
cards;
Strawberries
Oranges
Potatoes
Aple
Grapes
Carot
;
run;
proc sql noprint;
create table temp as
select a.*, b.*
from words as a
join texts as b
on row contains substr(word,1,3) ;
run;
I hope this will let you some more ideas how to overcome explosion of data.
The test data is too small and I believe it does not represent the real one.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.