DataFlux Standardization issues - turns Jo Ann into Ann Jo

Occasional Contributor
Posts: 6

Hi,

I am having issues with the Standardization node in DataFlux. Names like Jo Ann or Mary Jo are standardized as Ann Jo and Jo Mary, and it doesn't seem to matter which scheme I use with the definition. We are still using QKB CI 26, by the way, with DM Studio 2.6 on SAS 9.2.

 

I have had similar issues in the past (since we upgraded to 2.6) with addresses standardizing oddly, e.g. 'Meadow Farm Road' standardizing as 'Meadow Road (Farm)' or 'PO BOX' as 'Box (PO)'. I just send the PO addresses through a branch that handles only PO boxes, using a definition and a scheme. Using a scheme seems to take care of the other weirdness.

 

Can anyone tell me why this is happening or what I can do to stop it? Do you need more info to answer?

 

Thanks!


All Replies
SAS Super FREQ
Posts: 90

Re: DataFlux Standardization issues - turns Jo Ann into Ann Jo

Hi,

 

I'm curious to know what your expectations are when you standardize people's names. Are you trying to get the casing correct or standardizing prefixes or suffixes like Mister/Mr or JR/Jr.?

 

To help answer your question, can you share the following:

 

  • QKB version
  • QKB definition in use
  • Sample input value
  • Expected output value

Ron

Occasional Contributor
Posts: 6

Re: DataFlux Standardization issues - turns Jo Ann into Ann Jo

Hi, thanks for the response.

 

I am really just expecting to change the case. I'd really like to be able to correct typos like Elizabth -> Elizabeth, but was not really expecting that to work. I am expecting LOU ANN to become Lou Ann or stay LOU ANN, not Ann Lou. It does not happen with all compound names, just some.

 

I am using QKB CI 26. The definition is NAME for all; the schemes are:

1. EN Given Name Common Compound (Matching-Low Sens..)

2. EN Given Name Common Compounds (Matching)

3. EN Given Name Spelling

4. ENUSE Given Name Spelling (with Freq)

5. EN Given Names Propercase

6. EN Given Name (Matching-Combination Matching)

 

 

Here is a sample of what I am seeing:

The first column is the incoming name to be standardised.

 

Input      | Scheme 1   | Scheme 2   | Scheme 3   | Scheme 4   | Scheme 5   | Scheme 6
Laura Beth | Laura Beth | Laura Beth | Laura Beth | Laura Beth | Laura Beth | Laura Beth
Lee Ann    | Ann Lee    | Ann Lee    | Ann Lee    | Ann Lee    | Ann Lee    | Ann Lee
Lee Anne   | Anne Lee   | Anne Lee   | Anne Lee   | Anne Lee   | Anne Lee   | Anne Lee
Lila Beth  | Lila Beth  | Lila Beth  | Lila Beth  | Lila Beth  | Lila Beth  | Lila Beth
Lily Belle | Lily Belle | Lily Belle | Lily Belle | Lily Belle | Lily Belle | Lily Belle
Liu Xiang  | Xiang Liu  | Xiang Liu  | Xiang Liu  | Xiang Liu  | Xiang Liu  | Xiang Liu
Lou Ann    | Ann Lou    | Ann Lou    | Ann Lou    | Ann Lou    | Ann Lou    | Ann Lou
Mary Alice | Alice Mary | Alice Mary | Alice Mary | Alice Mary | Alice Mary | Alice Mary

Thanks,

Cathryn

Solution
‎09-27-2016 04:56 PM
SAS Super FREQ
Posts: 90

Re: DataFlux Standardization issues - turns Jo Ann into Ann Jo

I think I understand now what you are seeing. Some hints:

 

  • You can certainly apply individual standardization schemes, but many of them are used in specific ways inside standardization definitions. Standardization definitions incorporate things like casing, parsing, and regular expressions, in addition to standardization schemes, to correct your data. If a definition exists for the data type you are working with, it's generally better to use a definition than a scheme.
  • In the case of your example, because you are attempting to standardize just given name information, you will want to use the Standardization (Parsed) node. In it you can map your given name to the expected Given Name token, and the appropriate standardization will then be done for the given name. In some cases, when you send just a given name into the "Name" definition (which is really built for full names), the definition will reorder compound given names, thinking a surname is involved somehow.
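
As a rough illustration of the second point, here is a toy Python sketch of why a full-name definition can reorder a compound given name while a parsed standardization leaves the order alone. It is only a mental model and not how the QKB "Name" definition is actually implemented; the word list and both function names are hypothetical.

```python
# Toy model only: NOT the real QKB logic.
# Hypothetical list of words a full-name parser might also read as surnames.
SURNAME_LIKE = {"lee", "lou", "jo", "liu", "mary"}


def standardize_full_name(value):
    """Toy full-name definition: guess the tokens, then rebuild as 'Given Family'."""
    words = value.split()
    if len(words) == 2 and words[0].lower() in SURNAME_LIKE:
        # The first word is guessed to be a family name, so the output is reordered.
        family, given = words
        words = [given, family]
    return " ".join(w.capitalize() for w in words)


def standardize_parsed(tokens):
    """Toy parsed standardization: the caller names each token, so nothing is guessed."""
    return {
        token_name: " ".join(w.capitalize() for w in value.split())
        for token_name, value in tokens.items()
    }


print(standardize_full_name("LOU ANN"))
print(standardize_parsed({"Given Name": "LOU ANN"}))
```

In this sketch the full-name path swaps the two words because it guesses that a surname is present, while the parsed path only fixes the casing because the whole value was mapped to the Given Name token up front.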

 

I attached some screenshots to show how using a standardization definition with the Standardization (Parsed) node will get you the results you are looking for.


Ron


Attachments: S1.png, S2.png, S3.png

Occasional Contributor
Posts: 6

Re: DataFlux Standardization issues - turns Jo Ann into Ann Jo

Thank you. That worked. I should have joined this forum long ago. Our support people took 2 weeks and gave me the wrong answer!

 

Occasional Contributor
Posts: 6

Re: DataFlux Standardization issues - turns Jo Ann into Ann Jo

One more little thing, though: some non-English names still come out with odd capitalization, e.g. Salah al din -> Salah AL D I N, Yu Ku -> Yu K U (but not Yu Huan), and Bat Yam -> B A T Yam. I can fix this down the road with an extra step if I have to, but I am wondering if there is a better way to deal with these sorts of names. Thanks again.
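
For reference, the "extra step" mentioned above could look something like the following Python sketch, run on the standardized output outside DataFlux. The function name is hypothetical, and the approach is an assumption about what such a cleanup might do: it collapses runs of single letters separated by spaces back into one proper-cased word. It would also collapse genuine initials such as "A B", so it is only suitable where that is acceptable, and it does not touch other casing (the "AL" in the first example stays as it is).

```python
import re

# Two or more single letters separated by single spaces, e.g. "D I N".
_SPACED_LETTERS = re.compile(r"\b(?:[A-Za-z] )+[A-Za-z]\b")


def collapse_spaced_letters(name):
    """Join spaced-out single letters into one proper-cased word."""
    def _join(match):
        return match.group(0).replace(" ", "").capitalize()

    return _SPACED_LETTERS.sub(_join, name)


for value in ["Salah AL D I N", "Yu K U", "B A T Yam", "Yu Huan"]:
    print(value, "->", collapse_spaced_letters(value))
```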

SAS Super FREQ
Posts: 90

Re: DataFlux Standardization issues - turns Jo Ann into Ann Jo

We're getting into advanced topics now!

 

There's a component in DM Studio called Customize. Using it, you can see where the standardization definition makes the change you see. In the case of "Salah al din", for example, a standardization scheme called "EN Given Names (Abbreviations Standardization)" is being applied to the name. It changes "din" to "D I N" for some reason (I'm sure a very good one, at least in most cases). To change the behavior, you could modify the standardization definition and remove the scheme altogether (not advised), or you could edit the scheme and remove the transformations that don't make sense for your scenario. Back up the scheme file and the definition first if you plan to make changes.
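
To make the "edit the scheme" option a little more concrete, here is a small Python sketch that treats a scheme as a plain lookup from data value to standard value. This is only an analogy and not the QKB file format or the Customize tool. Only the "din" to "D I N" mapping comes from this thread; the other entries and both function names are hypothetical.

```python
# Hypothetical excerpt of a scheme: data value -> standard value.
# Only the DIN entry is taken from the thread; the rest are assumed for illustration.
scheme = {"DIN": "D I N", "KU": "K U", "BAT": "B A T", "JR": "Jr"}


def apply_scheme(name, mapping):
    """Replace each word that has a scheme entry; leave other words alone."""
    return " ".join(mapping.get(word.upper(), word) for word in name.split())


def without_entries(mapping, unwanted):
    """Return a trimmed copy; the original mapping stays untouched as a backup."""
    drop = {value.upper() for value in unwanted}
    return {key: std for key, std in mapping.items() if key not in drop}


trimmed = without_entries(scheme, ["din", "ku", "bat"])
print(apply_scheme("Salah Al Din", scheme))
print(apply_scheme("Salah Al Din", trimmed))
```

Working on a copy mirrors the advice to back up the scheme before changing it: the untrimmed mapping is still available if removing an entry turns out to hurt other names.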

 

Attached images show the step in Customize that made the unwanted change and the scheme value itself.

 

Ron


Attachments: S4.png, S5.png