margautz
Fluorite | Level 6

Dear All,

 

I have a question about DataFlux (Data Management Studio 2.4).

 

I am comparing two lists and need to find which names appear in both, using match codes at different sensitivity levels ("Sensitivity XX%") to identify possible matches.

 

I noticed that when I use the match code tool, it reads the fields from left to right to create the match code.

E.g.:

 

1- BANQUE POPULAIRE BOURGOGNE FRANCHE COMTE

2- BANQUE POPULAIRE ALSACE LORRAINE CHAMPAGNE

3- BANQUE POPULAIRE LORRAINE CHAMPAGNE

At 65% sensitivity, lines 1 and 2 match, but line 3 does not.

 

If I change the order of the words:

1- BANQUE POPULAIRE BOURGOGNE FRANCHE COMTE

2- BANQUE POPULAIRE LORRAINE CHAMPAGNE ALSACE 

3- BANQUE POPULAIRE LORRAINE CHAMPAGNE

At 65% sensitivity, lines 2 and 3 match, but line 1 does not.

 

Is it possible to create a match code (or a workaround) that does not take word order into account, at the cost of a lower match sensitivity? I mean, I do not expect BANQUE POPULAIRE to equal POPULAIRE BANQUE at sensitivity=100%, but at a lower sensitivity, maybe yes.

 

Thank you for your help and for your time.

 

Best regards,

Margherita

3 REPLIES
RonAgresta
SAS Employee

Hi,

 

You can use the Customize component in DM Studio to see exactly what is happening at each step of the match code generation. Access it through the "Administration" riser in DM Studio, then expand "Quality Knowledge Bases," open the locale you are using, find the match definition and open it, and finally add some sample values in the lower left corner of the application and step through the definition actions to see where changes are being made. You can use Customize to modify the behavior of the match code generation but make sure you make a copy of the definition (or the QKB first).

 

Ron

 

margautz
Fluorite | Level 6

Hello Ron,

 

Thank you for your answer. In fact, I used the Customize component to understand the behaviour before writing my question.

 

I know that I can change the code for the match code generation, but I would prefer to avoid it.

I thought I could split/parse the names and match on each part, but I do not know upfront how many fields/words I would need. Or maybe I could exclude some recurrent words from the match code generation, but I do not know how. Or maybe there are other workarounds.

 

I noticed that if I lower the sensitivity to 50%, the match is found, but it is too broad. I tried it, and with my two lists it matches BANQUE POPULAIRE ALSACE LORRAINE CHAMPAGNE with 30 different names that start with BANQUE POPULAIRE.

 

Thank you anyway for your advice.

 

Best regards,

Margherita

 

RonAgresta
SAS Employee

Some take the approach of parsing on white space, removing "noise" words, alphabetizing the remaining words, generating match codes for those, and then clustering.


Ron
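
The pre-processing part of that approach (parse on white space, remove noise words, alphabetize) can be sketched outside DataFlux. The Python below is only an illustration under stated assumptions: the noise-word list and the function names are hypothetical, and grouping on an identical key is a simple stand-in for DataFlux match codes and clustering, not a reproduction of them.

```python
from collections import defaultdict

# Hypothetical noise-word list: pick the words that recur across your own lists.
NOISE_WORDS = {"BANQUE", "POPULAIRE"}


def order_independent_key(name, noise_words=NOISE_WORDS):
    """Split on white space, drop noise words, and alphabetize what remains."""
    tokens = [t for t in name.upper().split() if t not in noise_words]
    return " ".join(sorted(tokens))


def cluster_by_key(names):
    """Group names whose order-independent keys are identical.

    Stand-in for the match code + clustering steps, which would
    normally run on the key rather than on the raw name.
    """
    clusters = defaultdict(list)
    for name in names:
        clusters[order_independent_key(name)].append(name)
    return dict(clusters)


names = [
    "BANQUE POPULAIRE BOURGOGNE FRANCHE COMTE",
    "BANQUE POPULAIRE ALSACE LORRAINE CHAMPAGNE",
    "BANQUE POPULAIRE LORRAINE CHAMPAGNE ALSACE",
    "BANQUE POPULAIRE LORRAINE CHAMPAGNE",
]

for key, members in cluster_by_key(names).items():
    print(key, "->", members)
```

With this key, the two ALSACE/LORRAINE/CHAMPAGNE variants fall into the same group regardless of word order. In a real job the alphabetized key, rather than the raw name, would presumably be fed to the match code step, so that the sensitivity setting still handles spelling variation.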
