DataFlux match job review and best practices

New Contributor
Posts: 2

We are a fairly inexperienced team working with DataFlux 2.7 to create a job that reads a long CSV file containing primarily individual dependent names (already parsed into first, middle, and last) as well as generic fields for alias names. We do not know whether a given alias is a first or a last name, and at times it may be a combination of both. Our business requirements call for match codes for various combinations of these names. We have noticed that we get more reliable match codes when we combine a last name with a first name and run the name match on the combination, rather than trying to match the individual parts. The output will be a text file of the original names and many, many match codes: two for each combination of name parts requested in the requirements.

 

So, with the many different fields available to us, we are finding the job we need to create is becoming quite complex:

1.  If alias 1 is not null, combine it with the dependent last name and create match codes at sensitivities 90 and 75 for a name that looks like "alias as first name + dependent last name"; otherwise, pass back null match codes.

2.  If alias 1 is not null, combine it with the dependent first name and create match codes at sensitivities 90 and 75 for a name that looks like "dependent first name + alias as last name"; otherwise, pass back null match codes.

3.  Repeat these steps for each of the 10 alias fields, generating a match code only when that alias is filled in.

4.  Perform additional matching based on dependent names only, etc.
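
To make the expected output shape of rules 1 to 3 concrete, here is a minimal sketch in Python rather than in DataFlux itself. Everything in it is an assumption for illustration: `toy_match_code` is a hypothetical stand-in and not the DataFlux match code algorithm, and the field names (`dep_first`, `dep_last`, `alias1` through `alias10`) are invented.

```python
from typing import Optional

SENSITIVITIES = (90, 75)  # sensitivities named in the requirements
ALIAS_COUNT = 10          # number of alias fields per row


def toy_match_code(name: str, sensitivity: int) -> str:
    """Hypothetical stand-in for a real match code generator.

    Keeps a longer prefix of the normalized name at higher sensitivity.
    This is NOT how DataFlux computes match codes.
    """
    normalized = "".join(ch for ch in name.upper() if ch.isalpha())
    keep = max(1, len(normalized) * sensitivity // 100)
    return normalized[:keep]


def combo_codes(first: Optional[str], last: Optional[str]) -> dict:
    """Return one code per sensitivity, or None for each if a part is missing."""
    if not first or not first.strip() or not last or not last.strip():
        return {s: None for s in SENSITIVITIES}
    full_name = f"{first.strip()} {last.strip()}"
    return {s: toy_match_code(full_name, s) for s in SENSITIVITIES}


def row_codes(row: dict) -> dict:
    """Build every alias/dependent match code column for one input row."""
    out = {}
    for i in range(1, ALIAS_COUNT + 1):
        alias = row.get(f"alias{i}")
        # Rule 1: alias as first name + dependent last name
        for s, code in combo_codes(alias, row.get("dep_last")).items():
            out[f"alias{i}_as_first_mc{s}"] = code
        # Rule 2: dependent first name + alias as last name
        for s, code in combo_codes(row.get("dep_first"), alias).items():
            out[f"alias{i}_as_last_mc{s}"] = code
    return out
```

Each row always yields the same 40 columns (10 aliases x 2 combinations x 2 sensitivities), with nulls wherever an alias is empty, so the output file layout stays fixed regardless of how many aliases are populated.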

 

We are trying to create a job that performs well, and we want to understand best practices for using the Expression node and branching.  We have a few older jobs that call out to other jobs and are wondering whether this is an approach to consider.

 

Any hints or best practices, or the chance to talk with someone who has ideas to assist our design, would be appreciated.

SAS Super FREQ
Posts: 90

Re: DataFlux match job review and best practices

Hi,

 

It has been my experience with name matching that you should try to generate match codes on the full name, so you are headed in the right direction with that. You can use the Customize component to see exactly what the match code algorithm does to single-token names, full names, and combinations. A few other thoughts:

 

  • If you're attempting to speed up processing by generating match codes only for "valid" name parts, you can call the match code algorithm explicitly in expression code, wrapped inside logic that tests for the presence of data in each row/column. This may remove some of the need for branching.
  • If you do use a lot of branching, you can allocate more memory to those nodes to help performance.
  • Have you looked at using cross-field clustering? This allows the clustering node to look across columns for potential matching values. It would, for example, be able to match a person's name from one column with a similar name in one or more alias columns.
  • I don't know that using embedded jobs buys you anything in this case except for modularity. I don't believe you will see performance improvements with that approach.
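
To illustrate the cross-field idea only, here is a rough Python sketch, not a description of how the DataFlux clustering node is implemented. It assumes match codes have already been generated into columns (the column names in the usage note are hypothetical) and puts two rows in the same cluster whenever any listed column of one row holds the same value as any listed column of the other.

```python
def cross_field_clusters(rows, columns):
    """Return a cluster id per row; rows sharing a value in ANY of the
    given columns (not necessarily the same column) get the same id."""
    parent = list(range(len(rows)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(a, b):
        # Merge two clusters, keeping the smaller row index as the root.
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    first_seen = {}  # value -> index of the first row that carried it
    for idx, row in enumerate(rows):
        for col in columns:
            value = row.get(col)
            if not value:
                continue  # skip null or empty codes
            if value in first_seen:
                union(idx, first_seen[value])
            else:
                first_seen[value] = idx
    return [find(i) for i in range(len(rows))]
```

For example, with `columns=["name_mc", "alias_mc"]`, a row whose `alias_mc` equals another row's `name_mc` ends up in that row's cluster, which is the behavior a same-column comparison would miss.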

Ron

New Contributor
Posts: 2

Re: DataFlux match job review and best practices

Ron,

Thank you for responding to my questions and confirming the matching on full names.  We will continue using full name matches for this job.

 

We have decided to use branching to call an embedded job that determines match codes for the alias-dependent name combinations when the alias is not null.  This creates a fairly easy-to-support job, albeit one with several branches for the various aliases.

 

If this job does not test well, I will follow up for more information on how to use the expression as you described in your original response.  So far, we are not seeing issues in testing.

 

I appreciate the help!

Tracy Bauer
