topic Re: SAS Dataflux Data Management Studio 2.4 in SAS Data Management

SAS Dataflux Data Management Studio 2.4

stevennair — Fri, 15 Nov 2013 13:31:46 GMT

Dear all,

I have 2 sets of data from two different sources.

I would like to merge these 2 sets of data.

However, the basis for merging the data is by company name.

As you may know, company names are, however, not standardized (for example, ABC Ltd, ABC Limited,etc).

Thus, I have used SAS Dataflux Data Management Studio 2.4 with the hope that it will assist me.

Since I am new to the software, I find it hard. In fact, I have already spent few days to understand this software.

Attached is a sample of my data (in excel).

The first column (Comp_name) refers to company name from database 1 while the second column (Org_name) refers to company name from database 2.

What I would like to have is:

1) If a company exists in both databases, then company name as per column 2 (Org_name) appears in third column.

2) If a company exists only in first database, then company name as per column 1 (Comp_name) appears in third column

3)If a company exists only in second database, then company name as per column 1 (Comp_name) appears in third column.

I hope any of you with good knowledge in Dataflux could assist me. My sincerest advance thank you for your time.

Regards,

Steven

Re: SAS Dataflux Data Management Studio 2.4

skillman — Fri, 15 Nov 2013 14:16:16 GMT

Steven,

Simply add an expression node with the following code:

/*pre-expression*/

string column_name_3

/*expression*/

if not isNull(column_name_1) and not isNull(column_name_2)

then column_name_3 = column_name_2

else

if not isNull(column_name_1) and isNull(column_name_2)

then column_name_3 = column_name_1

else

if isNull(column_name_1) and not isNull(column_name_2)

then column_name_3 = column_name_2

This checks for non-null values in company_name_1 and company_name_2 and populates the correct company name in the company_name_3 column.

-shawn

Re: SAS Dataflux Data Management Studio 2.4

stevennair — Wed, 20 Nov 2013 08:45:14 GMT

Hi Skillman,

Thanks for replying and appreciate that.

Sorry for my earlier post that did not mention my intention clearly.

What I really wanted was sas dataflux to do the matching in column 1 and column 2 since the company names are not standardized (for eg: ABC Ltd with ABC Limited, etc).

After matching, only then copy the the matched companies to column 3.

thanks.

regards,

steven

Re: SAS Dataflux Data Management Studio 2.4

MikeFrost — Tue, 26 Nov 2013 15:09:33 GMT

Steven,

Data Management Studio has a few nodes in Data Jobs, when used in conjunction, can help you solve your problem. The first node to be aware of is the Matchcode node which is under the Entity Resolution section. This node generates fuzzy representations of input data which can be used for matching purposes. In your case, you have provided a single field of organization names, so you would generate a matchcode using the “Organization” definition (if your locale and definition list is blank, then you will need to download a Quality Knowledge Base from the SAS website to enable this functionality). Since you have only provided a single field, you will need to push that sensitivity up to 95 to prevent over matching as much as possible. One word of warning, you will get lots of overmatching when matching on a single field. Matchcode definitions have been designed to use with combinations of fields such as Name and Address or Name and Phone. If you have other attributes available for you to match on, go ahead and include those as well. You will need to generate matchcodes for each of your data sources that you are joining together. In your example, this would be both columns of your dataset.

Now that you have a fuzzy representation of the data, you need to bring the data together for matching purposes. This now comes down you your personal preference on how you want the data brought together. The first approach is to Union (under Data Integration) the sources together. This will bring the matchcode and the company names together in a single column. You can then use a Clustering (under Entity Resolution) step on matchcode to group like items together and assign them a common cluster id and sorting the records by that cluster id. Finally, you can use a Surviving Record Identification (under Entity Resolution). This step allows you to choose a best record per the rules you have set.

Another option is to Join (under Data Integration) your two data sources together where the matchcodes from both sources match each other. Then carry forward the Company name from data source A and the Org name from data source B. By default, this is an inner join so this node will only output matching fields. In your case, you wanted records that didn’t match as well, so you would need to perform a full outer join. For non-matching records, the field name will be null for the other table. Then you can pass these through an expression statement that will create a new string column for your final company name and issues an if statement to pull the right data:

     string company
     if isnull(comp_name)
       company = org_name
     if isnull(org_name)
       company = comp_name
     else
       company = comp_name

This approach will result in overmatching as well as records being repeated records. For instance FX Alliance Inc in your sample is a Fuzzy match to FX Solutions and FX Alliance. This is an overmatch because Solutions and Alliance are both noise words in the Organization definition, thus are stripped and look the same (see warning above).

Regards,
Mike