BookmarkSubscribeRSS Feed
Pumpp
Obsidian | Level 7

I have 2 datasets - dataA is a Sentence, dataB is a Name field.

DataA has observations like - "This Apple was bought yesterday", "The Orange is delicious", "Those Carrots are spoilt".

DataB has observations like - "Strawberries", "Oranges", "Potatoes", "Aple", "Grapes", "Carot".

(Note: All the characters are in UPCASE).

I want to fuzzy match both datasets to see if DataB matches any of the strings in DataA. If it matches, I want DataA to have a new Column which says Match = "YES".

Can we match both the datasets without merging?

 

8 REPLIES 8
ketpt42
Quartz | Level 8

This answer from @Ksharp on a previous thread is likely a good start for what you need: SAS matching on substring 

Pumpp
Obsidian | Level 7

Its a good answer, but it only works if both datasets are of correct spelling. Is there any way I can incorporate this with combination of some fuzzy functions like COMGED, SPEDIS, SOUNDEX etc?

ChrisNZ
Tourmaline | Level 20

Like this?

proc sql; 
  select unique PHRASE            /* unique required in case several words are in the phrase */
              , (WORD ne ' ') as MATCH 
  from PHRASES
         left join
       WORDS
         on trim(PHRASE) ? trim(WORD);
Shmuel
Garnet | Level 18

Naxt code uses your test data:

data texts;
  infile cards;
  input row $char50.;
cards;
This Apple was bought yesterday
The Orange is delicious
Those Carrots are spoilt
;
run;
data words;
  infile cards;
  input word $char15.;
cards;
Strawberries
Oranges
Potatoes
Aple
Grapes
Carot
;
run;
/* match words with rows - a fuzzy match */

proc sql noprint;
  create table temp as
  select a.*, b.* 
  from words as a, texts as b;
run;
/***
Base SAS has several functions that you can use 
to compare how close words are to each other. 
The SPEDIS function, which I assume is an acronym for "spelling distance." 
Other SAS functions include:  COMPLEV, COMPGED and SOUNDEX.
***/
data tmp2;
 set temp;
     wn = countw(row);
     flag=0;
	 do i=1 to wn;
	    cmpl1 = complev(word,scan(row,i));
	    if cmpl1 =1  then flag=1; 
	 end;
	 *if flag=1;
	 keep word row flag;
run;
Pumpp
Obsidian | Level 7

Thankyou for the response. Currently even I am joining both the datasets and trying to perform the actions.

But I was trying to know any method without joining the 2 datasets because each datasets have more than 1lakh records and joining causes data to explode.

But the below code is of help. I used to break each strings into into separate observations and then perform the analysis like with combinations of COMPGED, COMPLEV, SPEEDIS, SOUNDEX. But this involves lot of data exploding. Your code can atleast help me avoid 2 times data exploding.

data tmp2;
 set temp;
     wn = countw(row);
     flag=0;
	 do i=1 to wn;
	    cmpl1 = complev(word,scan(row,i));
	    if cmpl1 =1  then flag=1; 
	 end;
	 *if flag=1;
	 keep word row flag;
run;

 

 

ChrisNZ
Tourmaline | Level 20

COMPGED and the other fuzzy functions are really CPU-heavy.

 

You might try to vet the data before trying to match. Like:

if first(word)=:scan(row,i)) then if complev(word,scan(row,i))=1 then flag=1;

or

 

if length(scan(row,i)) > 5 then if complev(word,scan(row,i))=1 then flag=1;

 

so COMPLEV is not called so often.

Shmuel
Garnet | Level 18

Neglecting CASE issues and maybe some others too,

try next code:

data texts;
  infile cards;
  input row $char50.;
cards;
This Apple was bought yesterday
The Orange is delicious
Those Carrots are spoilt
;
run;
data words;
  infile cards;
  input word $char15.;
cards;
Strawberries
Oranges
Potatoes
Aple
Grapes
Carot
;
run;
proc sql noprint;
  create table temp as
  select a.*, b.* 
  from words as a
  join texts as b
  on row contains substr(word,1,3) ;
run;

I hope this will let you some more ideas how to overcome explosion of data.

The test data is too small and I believe it does not represent the real one.

 

 

Shmuel
Garnet | Level 18
Replace the sql step with:

proc sql noprint;
create table temp as
select a.*, b.*
from words as a, texts as b
where index(lowcase(row),strip(substr(lowcase(word),1,3)));
run;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 763 views
  • 1 like
  • 4 in conversation