topic Re: Matching two datasets without merging in SAS Programming

Matching two datasets without merging

Pumpp — Wed, 08 Jul 2020 05:17:38 GMT

I have 2 datasets - dataA is a Sentence, dataB is a Name field.

DataA has observations like - "This Apple was bought yesterday", "The Orange is delicious", "Those Carrots are spoilt".

DataB has observations like - "Strawberries", "Oranges", "Potatoes", "Aple", "Grapes", "Carot".

(Note: All the characters are in UPCASE).

I want to fuzzy match both datasets to see if DataB matches any of the strings in DataA. If it matches, I want DataA to have a new Column which says Match = "YES".

Can we match both the datasets without merging?

Re: Matching two datasets without merging

ketpt42 — Wed, 08 Jul 2020 05:41:40 GMT

This answer from @Ksharp on a previous thread is likely a good start for what you need: SAS matching on substring

Re: Matching two datasets without merging

ChrisNZ — Wed, 08 Jul 2020 05:54:06 GMT

Like this?

proc sql; 
  select unique PHRASE            /* unique required in case several words are in the phrase */
              , (WORD ne ' ') as MATCH 
  from PHRASES
         left join
       WORDS
         on trim(PHRASE) ? trim(WORD);

Re: Matching two datasets without merging

Shmuel — Wed, 08 Jul 2020 06:39:53 GMT

Naxt code uses your test data:

data texts;
  infile cards;
  input row $char50.;
cards;
This Apple was bought yesterday
The Orange is delicious
Those Carrots are spoilt
;
run;
data words;
  infile cards;
  input word $char15.;
cards;
Strawberries
Oranges
Potatoes
Aple
Grapes
Carot
;
run;
/* match words with rows - a fuzzy match */

proc sql noprint;
  create table temp as
  select a.*, b.* 
  from words as a, texts as b;
run;
/***
Base SAS has several functions that you can use 
to compare how close words are to each other. 
The SPEDIS function, which I assume is an acronym for "spelling distance." 
Other SAS functions include:  COMPLEV, COMPGED and SOUNDEX.
***/
data tmp2;
 set temp;
     wn = countw(row);
     flag=0;
	 do i=1 to wn;
	    cmpl1 = complev(word,scan(row,i));
	    if cmpl1 =1  then flag=1; 
	 end;
	 *if flag=1;
	 keep word row flag;
run;

Re: Matching two datasets without merging

Pumpp — Wed, 08 Jul 2020 06:59:49 GMT

Its a good answer, but it only works if both datasets are of correct spelling. Is there any way I can incorporate this with combination of some fuzzy functions like COMGED, SPEDIS, SOUNDEX etc?

Re: Matching two datasets without merging

Pumpp — Wed, 08 Jul 2020 08:14:23 GMT

Thankyou for the response. Currently even I am joining both the datasets and trying to perform the actions.

But I was trying to know any method without joining the 2 datasets because each datasets have more than 1lakh records and joining causes data to explode.

But the below code is of help. I used to break each strings into into separate observations and then perform the analysis like with combinations of COMPGED, COMPLEV, SPEEDIS, SOUNDEX. But this involves lot of data exploding. Your code can atleast help me avoid 2 times data exploding.

data tmp2;
 set temp;
     wn = countw(row);
     flag=0;
	 do i=1 to wn;
	    cmpl1 = complev(word,scan(row,i));
	    if cmpl1 =1  then flag=1; 
	 end;
	 *if flag=1;
	 keep word row flag;
run;

Re: Matching two datasets without merging

ChrisNZ — Wed, 08 Jul 2020 09:26:51 GMT

COMPGED and the other fuzzy functions are really CPU-heavy.

You might try to vet the data before trying to match. Like:

if first(word)=:scan(row,i)) then if complev(word,scan(row,i))=1 then flag=1;

if length(scan(row,i)) > 5 then if complev(word,scan(row,i))=1 then flag=1;

so COMPLEV is not called so often.

Re: Matching two datasets without merging

Shmuel — Wed, 08 Jul 2020 10:56:39 GMT

Neglecting CASE issues and maybe some others too,

try next code:

data texts;
  infile cards;
  input row $char50.;
cards;
This Apple was bought yesterday
The Orange is delicious
Those Carrots are spoilt
;
run;
data words;
  infile cards;
  input word $char15.;
cards;
Strawberries
Oranges
Potatoes
Aple
Grapes
Carot
;
run;
proc sql noprint;
  create table temp as
  select a.*, b.* 
  from words as a
  join texts as b
  on row contains substr(word,1,3) ;
run;

I hope this will let you some more ideas how to overcome explosion of data.

The test data is too small and I believe it does not represent the real one.

Re: Matching two datasets without merging

Shmuel — Wed, 08 Jul 2020 12:24:28 GMT

Replace the sql step with:

proc sql noprint;
create table temp as
select a.*, b.*
from words as a, texts as b
where index(lowcase(row),strip(substr(lowcase(word),1,3)));
run;