Re: Matching two datasets without merging

Pumpp · Posted 07-08-2020 01:17 AM

I have 2 datasets: DataA contains a Sentence field, DataB contains a Name field.

DataA has observations like - "This Apple was bought yesterday", "The Orange is delicious", "Those Carrots are spoilt".

DataB has observations like - "Strawberries", "Oranges", "Potatoes", "Aple", "Grapes", "Carot".

(Note: All the characters are in UPCASE).

I want to fuzzy match both datasets to see if DataB matches any of the strings in DataA. If it matches, I want DataA to have a new Column which says Match = "YES".

Can we match both the datasets without merging?

ketpt42 · Posted 07-08-2020 01:41 AM

This answer from @Ksharp on a previous thread is likely a good start for what you need: SAS matching on substring

Pumpp · Posted 07-08-2020 02:59 AM

It's a good answer, but it only works if both datasets are spelled correctly. Is there any way I can combine this with fuzzy functions like COMPGED, SPEDIS, SOUNDEX, etc.?

ChrisNZ · Posted 07-08-2020 01:54 AM

Like this?

proc sql; 
  select unique PHRASE            /* unique required in case several words are in the phrase */
              , (WORD ne ' ') as MATCH 
  from PHRASES
         left join
       WORDS
         on trim(PHRASE) ? trim(WORD);
quit;

High-Performance SAS Coding - Third Edition

Shmuel · Posted 07-08-2020 02:39 AM

The next code uses your test data:

data texts;
  infile cards;
  input row $char50.;
cards;
This Apple was bought yesterday
The Orange is delicious
Those Carrots are spoilt
;
run;
data words;
  infile cards;
  input word $char15.;
cards;
Strawberries
Oranges
Potatoes
Aple
Grapes
Carot
;
run;
/* match words with rows - a fuzzy match */

proc sql noprint;
  create table temp as
  select a.*, b.* 
  from words as a, texts as b;
quit;
/***
Base SAS has several functions that you can use 
to compare how close words are to each other. 
One is the SPEDIS function, which I assume is an acronym for "spelling distance." 
Other SAS functions include:  COMPLEV, COMPGED and SOUNDEX.
***/
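data tmp2;
  set temp;
  wn = countw(row);                      /* number of words in the sentence */
  flag = 0;
  do i = 1 to wn;
    /* Levenshtein distance between WORD and the i-th word of ROW */
    cmpl1 = complev(word, scan(row, i));
    if cmpl1 = 1 then flag = 1;          /* flag pairs exactly one edit apart */
  end;
  *if flag=1;                            /* uncomment to keep only matches */
  keep word row flag;
run;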

Pumpp · Posted 07-08-2020 04:14 AM

Thank you for the response. I am currently joining both datasets as well and then performing these steps.

But I was hoping for a method that avoids joining the 2 datasets, because each dataset has more than 1 lakh (100,000) records and joining makes the data explode.

But the code below does help. I used to break each string into separate observations and then run the analysis with combinations of COMPGED, COMPLEV, SPEDIS and SOUNDEX, which also makes the data explode. Your code at least helps me avoid exploding the data twice.

data tmp2;
 set temp;
     wn = countw(row);
     flag=0;
	 do i=1 to wn;
	    cmpl1 = complev(word,scan(row,i));
	    if cmpl1 =1  then flag=1; 
	 end;
	 *if flag=1;
	 keep word row flag;
run;

ChrisNZ · Posted 07-08-2020 05:26 AM

COMPGED and the other fuzzy functions are really CPU-heavy.

You might try to vet the data before trying to match. Like:

if first(word)=:scan(row,i) then if complev(word,scan(row,i))=1 then flag=1;

or

if length(scan(row,i)) > 5 then if complev(word,scan(row,i))=1 then flag=1;

so COMPLEV is not called so often.
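
A minimal sketch of how the first check might slot into the loop from the earlier post. It is not tested on real volumes. The small TEMP table below is only a stand-in for the joined table, and TOKEN is a hypothetical helper variable. The first-letter test is made case-insensitive here, which is an assumption about what is wanted.

```sas
/* Tiny stand-in for the joined table TEMP from the earlier posts. */
data temp;
  length word $15 row $50;
  infile datalines dsd dlm='|' truncover;
  input word row;
datalines;
Aple|This Apple was bought yesterday
Grapes|This Apple was bought yesterday
Oranges|The Orange is delicious
;
run;

data tmp2;
  set temp;
  length token $50;                 /* hypothetical helper variable */
  flag = 0;
  /* stop scanning the sentence as soon as one word has matched */
  do i = 1 to countw(row, ' ') while (flag = 0);
    token = scan(row, i, ' ');
    /* cheap test first: same first letter, ignoring case */
    if upcase(first(word)) = upcase(first(token)) then
      /* expensive test only for the pairs that survive */
      if complev(word, token) = 1 then flag = 1;
  end;
  keep word row flag;
run;
```

The nested IF is deliberate: COMPLEV is only evaluated when the cheap comparison has already passed.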


Shmuel · Posted 07-08-2020 06:56 AM

Ignoring case issues, and maybe some others too,

try the next code:

data texts;
  infile cards;
  input row $char50.;
cards;
This Apple was bought yesterday
The Orange is delicious
Those Carrots are spoilt
;
run;
data words;
  infile cards;
  input word $char15.;
cards;
Strawberries
Oranges
Potatoes
Aple
Grapes
Carot
;
run;
proc sql noprint;
  create table temp as
  select a.*, b.* 
  from words as a
  join texts as b
  on row contains substr(word,1,3) ;
quit;

I hope this gives you some more ideas on how to avoid the explosion of data.

The test data is too small, and I believe it does not represent the real data.

Shmuel · Posted 07-08-2020 08:24 AM

Replace the sql step with:

proc sql noprint;
create table temp as
select a.*, b.*
from words as a, texts as b
where index(lowcase(row),strip(substr(lowcase(word),1,3)));
quit;
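
On the original question of matching without merging, here is a rough sketch that is not taken from any of the posts above. It reads TEXTS once and re-reads WORDS by observation number with POINT=, so no joined table is ever written. The names TEXTS_FLAGGED, MATCH, TOKEN, W and NWORDS are made up for illustration. Using a distance of at most 1 (so that exact matches count too) and ignoring case are assumptions, not something the thread settled on.

```sas
/* Test data copied from the thread. */
data texts;
  infile datalines truncover;
  input row $char50.;
datalines;
This Apple was bought yesterday
The Orange is delicious
Those Carrots are spoilt
;
run;

data words;
  infile datalines truncover;
  input word $char15.;
datalines;
Strawberries
Oranges
Potatoes
Aple
Grapes
Carot
;
run;

/* One pass over TEXTS; every WORDS observation is fetched directly,
   and the loops stop as soon as a match is found for the sentence. */
data texts_flagged;
  set texts;
  length match $3 token $50;
  match = 'NO';
  do w = 1 to nwords while (match = 'NO');
    set words point=w nobs=nwords;
    do i = 1 to countw(row, ' ') while (match = 'NO');
      token = scan(row, i, ' ');
      /* cheap first-letter test before the costly COMPLEV call */
      if upcase(first(word)) = upcase(first(token)) then
        if complev(upcase(word), upcase(token)) <= 1 then match = 'YES';
    end;
  end;
  keep row match;
run;
```

This still compares every sentence with every word in the worst case, so it saves disk space rather than comparisons. On large data, loading WORDS into a temporary array or a hash object would probably be faster than repeated POINT= reads.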
