topic Re: data flux match code in SAS Data Management

data flux match code

margautz — Tue, 08 Nov 2016 13:52:26 GMT

Dear All,

I have a question about DataFlux (Data Management Studio 2.4).

I am comparing two lists to find which names appear in both, using match codes at different "Sensitivity XX%" settings on the name to find possible matches.

I noticed that the match code tool reads the field from left to right when it creates the match code.

E.g.:

1- BANQUE POPULAIRE BOURGOGNE FRANCHE COMTE

2- BANQUE POPULAIRE ALSACE LORRAINE CHAMPAGNE

3- BANQUE POPULAIRE LORRAINE CHAMPAGNE

At 65% sensitivity, lines 1 and 2 match, but line 3 does not.

If I change the order of the words:

1- BANQUE POPULAIRE BOURGOGNE FRANCHE COMTE

2- BANQUE POPULAIRE LORRAINE CHAMPAGNE ALSACE

3- BANQUE POPULAIRE LORRAINE CHAMPAGNE

At 65% sensitivity, lines 2 and 3 match, but line 1 does not.

Is it possible to create a match code (or a workaround) that does not take the order of the words into account, perhaps at a lower sensitivity? I mean, I do not expect BANQUE POPULAIRE to equal POPULAIRE BANQUE at sensitivity=100%, but at a lower sensitivity, yes.

Thank you for your help and for your time.

Best regards,

Margherita

Re: data flux match code

RonAgresta — Wed, 09 Nov 2016 15:58:10 GMT

Hi,

You can use the Customize component in DM Studio to see exactly what is happening at each step of the match code generation. Access it through the "Administration" riser in DM Studio, then expand "Quality Knowledge Bases," open the locale you are using, and find the match definition and open it. Finally, add some sample values in the lower left corner of the application and step through the definition actions to see where changes are being made. You can use Customize to modify the behavior of the match code generation, but make sure you make a copy of the definition (or the QKB) first.

Ron

Re: data flux match code

margautz — Mon, 14 Nov 2016 11:21:24 GMT

Hello Ron,

Thank you for your answer. Indeed, before writing the question I used the Customize component to understand the behaviour.

I know that I can change the code for the match code generation, but I would prefer to avoid it.

I thought that I could split/parse the names and do a match on each part, but I do not know upfront how many fields/words I need. Or maybe I can exclude some words (recurrent ones) from the match code generation, but I do not know “how”. Or maybe there are other workarounds.

I noticed that if I lower the sensitivity to 50%, the match is found, but it is too loose. I tried it, and with my two lists it matches BANQUE POPULAIRE ALSACE LORRAINE CHAMPAGNE with 30 different entries that start with BANQUE POPULAIRE.

Thank you anyway for your advice.

Best regards,

Margherita

Re: data flux match code

RonAgresta — Mon, 14 Nov 2016 15:00:52 GMT

Some take the approach of parsing on white space, removing "noise" words, alphabetizing the remaining words, generating match codes for those, and then clustering.
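
The preprocessing part of that approach (parse, remove noise words, alphabetize) can be sketched outside DataFlux, for example in Python. This is only an illustration, not DataFlux code: the function name normalize_name and the NOISE_WORDS list are hypothetical, and treating BANQUE and POPULAIRE as noise is an assumption based on the examples in this thread. The match code generation and clustering would still be done on the normalized values afterwards.

```python
import re
import unicodedata

# Hypothetical noise-word list for illustration only; a real list would come
# from profiling the data (e.g. the most frequent words across both lists).
NOISE_WORDS = {"BANQUE", "POPULAIRE"}


def normalize_name(name, noise_words=NOISE_WORDS):
    """Return an order-independent version of a name (assumed helper)."""
    # Upper-case and strip accents so "Comté" and "COMTE" compare equal.
    text = unicodedata.normalize("NFKD", name.upper())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Parse on white space and punctuation, dropping empty pieces.
    tokens = [t for t in re.split(r"[^A-Z0-9]+", text) if t]
    # Remove noise words, then alphabetize what is left.
    kept = sorted(t for t in tokens if t not in noise_words)
    return " ".join(kept)


names = [
    "BANQUE POPULAIRE BOURGOGNE FRANCHE COMTE",
    "BANQUE POPULAIRE ALSACE LORRAINE CHAMPAGNE",
    "BANQUE POPULAIRE LORRAINE CHAMPAGNE ALSACE",
    "BANQUE POPULAIRE LORRAINE CHAMPAGNE",
]
for name in names:
    print(name, "->", normalize_name(name))
```

With this normalization, the two ALSACE variants produce the same string regardless of word order, so a match code generated from the normalized value no longer depends on the original order. A name made up only of noise words would normalize to an empty string, so such rows would need separate handling.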

Ron