Muddling through long irregular strings to find matches

texasmfp · Posted 05-01-2023 09:00 PM

I have two datasets that contain a date, a customer name and a narrative description of merchandise. For each line in data1 I am looking for a matching line in data2. Manually, it works, but it is time-consuming so that approach is for small datasets. Now I have thousands of lines. The primary obstacle is that the narrative descriptions are long, contain irrelevant words, and have inconsistent formats and errors. The key words within the description are a model # (which may or may not exist in either data1 or data 1), an overt indicator for the number of pieces (i.e., 51 PCE), and most importantly, dimensions. However, the dimensions are inconsistent: one dataset may use x between dimensions (1500mm x 350mm x 20mm) while another uses *. There are other issues such as no or inconsistent spaces in the dimensions. Nothing is in a set position, sometimes there is not a model, the unimportant words vary, and occasionally the dimensions switch between mm and cm. Plain and simple, the data is dirty.

My idea is to build a SAS program that, goes line by line in data1, trying each and every term within the description string and try to match against each and every term in every line of data2 and keep a count of the matches. I then rank lets say the Top 10 possible matches for each line. While this is a new knowledge area for me in terms of SAS, I bet I am not the first to encounter this obstacle. I would appreciate any insight.

Here is some sample data. For simplicity, I have deleted the date and customer name field:

Data1:

2QG850-19 natural topaz stone, size 1099 x 648 mm, 2cm thick (62 PCE), 100% brand new

Data2

EGNL01 natural topaz stone, thickness 2cm, size 2990x1280mm (25 PCE), 100% brand new

Natural topaz stone, size 578 mm x 648 mm, thickness 30 mm, 100% new

natural stone, size 991 mm x 610 mm, thickness 30 mm, 100% new

topaz stone, size 1905 mm x 921 mm, thickness 30 mm, 100% new

SASKiwi · Posted 05-02-2023 12:13 AM

You need to define what you mean by a match. It looks like you have four attributes to compare: description, size, thickness and newness. What is an acceptable match when you compare all four attributes? Presumably an exact match counts as would a match that ignores text case. What is then left is fuzzy matching - how many attributes match and how close does the match have to be?

SAS has a product called SAS Data Quality that is designed exactly for this type of task, but I doubt it would be available to you and then there is the learning curve to consider. Most likely a custom Base SAS solution is the way to go. For a start, defining the four attributes should be easy since they are delimited by commas:

data Want;
  string = 'EGNL01 natural topaz stone, thickness 2cm, size 2990x1280mm (25 PCE), 100% brand new';
  Description = scan(string,1,',');
  do i = 1 to 4;
    word = scan(string,i,',');
    if indexw(upcase(word),'THICKNESS') then Thickness = word; 
    else if indexw(upcase(word),'SIZE') then Size = word;
    else if indexw(upcase(word),'NEW') then Newness = word;
  end;
  put _all_;
run;

Oligolas · Posted 05-02-2023 08:09 AM

Hi,

give a try to the SAS COMPGED and COMPLEV in-built functions for determining the edit distance between the strings and see if it helps you

Example:

DATA ds1;
infile datalines dsd;
input String1 : $200.;
datalines;
"2QG850-19 natural topaz stone, size 1099 x 648 mm, 2cm thick (62 PCE), 100% brand new"
;
RUN;

DATA ds2;
infile datalines dsd;
input String2 : $200.;
datalines;
"EGNL01 natural topaz stone, thickness 2cm, size 2990x1280mm (25 PCE), 100% brand new"
"Natural topaz stone, size 578 mm x 648 mm, thickness 30 mm, 100% new"
"natural stone, size 991 mm x 610 mm, thickness 30 mm, 100% new"
"topaz stone, size 1905 mm x 921 mm, thickness 30 mm, 100% new"
"ABC"
"I have two datasets that contain a date, a customer name and a narrative description of merchandise"
;
RUN;

PROC SQL;
   CREATE TABLE have AS
      SELECT *
      FROM ds1,ds2
   ;
QUIT;

DATA want;
   set have;
   GED=compged(string1, string2); /*Returns the generalized edit distance between two strings*/
   LEV=complev(string1, string2,'LN');/*Returns the Levenshtein edit distance between two strings*/
   SUM=GED+LEV;
RUN;

PROC SORT data=want; by SUM ; RUN;

________________________
- Cheers -

Muddling through long irregular strings to find matches

Re: Muddling through long irregular strings to find matches

Re: Muddling through long irregular strings to find matches

Muddling through long irregular strings to find matches

Re: Muddling through long irregular strings to find matches

Re: Muddling through long irregular strings to find matches

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away