<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Data Brilliance Unleashed: SAS Data Quality against Databricks - Precision, Performance, Perfection in SAS Viya</title>
    <link>https://communities.sas.com/t5/SAS-Viya/Data-Brilliance-Unleashed-SAS-Data-Quality-against-Databricks/m-p/927475#M2372</link>
    <description>&lt;P&gt;Today the quality of data is paramount. Every decision, every insight hinges on the reliability and accuracy of the underlying data.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Many of the tools and techniques for improving data quality are locked behind deep technical knowledge and programming skills in several languages. What if I told you it does not have to be that way?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Let me lead you into the world of Entity Resolution, where you will learn about techniques such as Data Identification, Parsing, Standardization, Matching, Clustering and Surviving Records. &lt;SPAN&gt;Entity Resolution attempts to identify different representations of the same data and provides a normalized, standardized master record, which will improve downstream analysis.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Entity Resolution has many use cases, a number of them within Fraud Detection and Anti Money Laundering.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;How do we identify the same entity within data and then enhance this entity with information from "contributor" rows, &lt;STRONG&gt;all without writing a single line of code and only using out-of-the-box functionality&lt;/STRONG&gt;? Let us get started!&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;I will be using a table found in Databricks that has some obvious data quality issues, which I would not be able to solve with the tools available in Databricks unless I had deep knowledge of SQL and Python programming as well as specific Python packages. By now you know how easy it is to connect to Databricks, as explained in my good colleague Cecily Hoffritz's blog:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&lt;A href="https://communities.sas.com/t5/SAS-Communities-Library/SAS-and-Databricks-Your-Practical-Guide-to-Data-Access-and/ta-p/923733" target="_blank" rel="noopener"&gt;SAS and Databricks: Your Practical Guide to Data Access and Analysis - SAS Support Communities&lt;/A&gt; &lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The table I will be working with contains customers, both individuals and organizations. To keep it simple, I will focus on the individual customers and how I can find the entities, remove duplicates, standardize the data, and create my master records.&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="patric_1-1715153512036.png" style="width: 400px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/96282i4971A88AA0935381/image-size/medium?v=v2&amp;amp;px=400" role="button" title="patric_1-1715153512036.png" alt="patric_1-1715153512036.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;To help me on this journey I will be using the &lt;A href="https://support.sas.com/en/software/quality-knowledge-base-support.html" target="_blank" rel="noopener"&gt;SAS Quality Knowledge Base&lt;/A&gt; (QKB) and &lt;A href="https://www.sas.com/en_us/software/studio.html" target="_blank" rel="noopener"&gt;SAS Studio&lt;/A&gt; for engineering. The QKB contains a large collection of predefined data quality rules and logic, and in SAS Studio I can build modern, visual data pipelines.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H2&gt;&lt;SPAN&gt;Identification&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;&lt;SPAN&gt;The first step is to identify whether a customer is an individual or an organization, for which I use the &lt;A href="https://go.documentation.sas.com/doc/en/sasstudiocdc/v_050/webeditorcdc/webeditorflows/p0bge0vxhrdgpdn1njohwlej05qd.htm#n0g2liyq671xkjn13j93zhdi7g0k" target="_blank" rel="noopener"&gt;Clean Data&lt;/A&gt; Step and its Identification Analysis.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-left" image-alt="patric_4-1715154016579.png" style="width: 400px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/96286i752CD1AA22D619F1/image-size/medium?v=v2&amp;amp;px=400" role="button" title="patric_4-1715154016579.png" alt="patric_4-1715154016579.png" /&gt;&lt;/span&gt;&lt;EM&gt;Using the Name column to identify whether it is an Individual or Organization&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="patric_5-1715154031609.png" style="width: 289px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/96289i6CDFB180AB4DD3BC/image-dimensions/289x355?v=v2" width="289" height="355" role="button" title="patric_5-1715154031609.png" alt="patric_5-1715154031609.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Once identified, I use the &lt;A href="https://go.documentation.sas.com/doc/en/webeditorcdc/v_042/webeditorflows/n1ivpt3accqtyzn1e2gad97a8gow.htm" target="_blank" rel="noopener"&gt;Branch Rows&lt;/A&gt; Step to split individuals and organizations into separate tables.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="patric_6-1715154124844.png" style="width: 400px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/96290iBF171110A7FB554C/image-size/medium?v=v2&amp;amp;px=400" role="button" title="patric_6-1715154124844.png" alt="patric_6-1715154124844.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="patric_7-1715154155195.png" style="width: 999px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/96291i0708120A9D2E9FD4/image-size/large?v=v2&amp;amp;px=999" role="button" title="patric_7-1715154155195.png" alt="patric_7-1715154155195.png" /&gt;&lt;/span&gt;&lt;/P&gt;
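&lt;P&gt;For readers who like to see the idea in code, here is a rough Python sketch of what identification and branching boil down to. It is not how the Clean Data or Branch Rows steps work internally: the marker list, the function names identify and branch_rows, and the two sample rows are all made up for illustration, and the QKB uses far richer rules.&lt;/P&gt;
&lt;PRE&gt;
```python
# Hypothetical organization markers; the QKB identification rules are far richer.
ORG_MARKERS = {"ab", "hb", "kb", "aktiebolag", "inc", "ltd", "gmbh"}


def identify(name):
    """Return 'ORGANIZATION' or 'INDIVIDUAL' for a raw name string."""
    words = name.lower().replace(".", " ").replace(",", " ").split()
    if any(word in ORG_MARKERS for word in words):
        return "ORGANIZATION"
    return "INDIVIDUAL"


def branch_rows(rows):
    """Split rows into (individuals, organizations) based on the Name column."""
    individuals, organizations = [], []
    for row in rows:
        if identify(row["Name"]) == "ORGANIZATION":
            organizations.append(row)
        else:
            individuals.append(row)
    return individuals, organizations


# Made-up sample rows, only to show the call.
customers = [
    {"Name": "Patrik Svensson"},
    {"Name": "Svensson Bygg AB"},
]
individuals, organizations = branch_rows(customers)
```
&lt;/PRE&gt;
&lt;P&gt;The sketch only looks for a handful of company markers in the name, so it shows the shape of the step rather than its accuracy.&lt;/P&gt;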
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H2&gt;&lt;SPAN&gt;Parsing&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;&lt;SPAN&gt;To make sense of this data, I need a way of parsing (dividing) it into tokens (components). I apply different rules to these tokens, because the information in them varies and must be treated differently.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Fortunately, SAS Studio has a component (step) called &lt;A href="https://go.documentation.sas.com/doc/en/webeditorcdc/v_042/webeditorflows/n097cs77nmmffan1tt3jxjpiuejd.htm" target="_blank" rel="noopener"&gt;Parse Data&lt;/A&gt; that does just this for me!&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="patric_8-1715154205095.png" style="width: 400px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/96292iC9F73330D8FF604A/image-size/medium?v=v2&amp;amp;px=400" role="button" title="patric_8-1715154205095.png" alt="patric_8-1715154205095.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;As my data is in Swedish, I will be using the &lt;A href="https://support.sas.com/documentation/onlinedoc/qkb/33/QKBCI33/Help/qkb-help.html#qkb-generaldoc/qkb-locstructure.html?TocPath=About%2520SAS%2520Quality%2520Knowledge%2520Base%257C_____1" target="_blank" rel="noopener"&gt;Swedish Locale&lt;/A&gt; from the QKB to parse the columns &lt;STRONG&gt;Name&lt;/STRONG&gt; and &lt;STRONG&gt;Address&lt;/STRONG&gt;.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;From the Name column, I keep the Given Name (Name_GIVENNAME) and Family Name (Name_FAMILYNAME) tokens. Extra tokens such as title, prefix, suffix, and additional info can also be kept, but they are not needed for this Entity Resolution example.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="patric_9-1715154274801.png" style="width: 400px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/96293iFB15A357481B432F/image-size/medium?v=v2&amp;amp;px=400" role="button" title="patric_9-1715154274801.png" alt="patric_9-1715154274801.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;From the Address column, I keep the Street Name (Address_STREETNAME), City (Address_CITY), Postal Number, and Street Number tokens. Notice that the Street Name has standardization issues, and there are missing values for City and Postal Number. Let us fix that!&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="patric_10-1715154314026.png" style="width: 999px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/96294iBC32D8ADDC0B10DC/image-size/large?v=v2&amp;amp;px=999" role="button" title="patric_10-1715154314026.png" alt="patric_10-1715154314026.png" /&gt;&lt;/span&gt;&lt;/P&gt;
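&lt;P&gt;To make the notion of tokens concrete, here is a minimal Python sketch of parsing, under some loud assumptions: names are written either as "Given Family" or "Family, Given", and addresses follow the layout "street name street number, postal number city". The function names, the regular expressions, and the Address_STREETNUMBER key are mine and not part of the Parse Data step, which relies on the QKB parse definitions for the chosen locale.&lt;/P&gt;
&lt;PRE&gt;
```python
import re


def parse_name(name):
    """Split a personal name into given name and family name tokens."""
    if "," in name:
        # Assumed "Family, Given" layout.
        family, _, given = name.partition(",")
    else:
        # Assumed "Given Family" layout: the last word is the family name.
        given, _, family = name.strip().rpartition(" ")
    return {"Name_GIVENNAME": given.strip(), "Name_FAMILYNAME": family.strip()}


def parse_address(address):
    """Split an address into street, number, postal number and city tokens."""
    street_part, _, rest = address.partition(",")
    # Street name followed by an optional number such as "12" or "3 B".
    street = re.match(r"\s*(.*?)\s*(\d+\s*[A-Za-z]?)?\s*$", street_part.strip())
    # Optional five-digit postal number ("111 22" or "11122"), then the city.
    locality = re.match(r"\s*(\d{3}\s?\d{2})?\s*(.*?)\s*$", rest)
    return {
        "Address_STREETNAME": street.group(1) or None,
        "Address_STREETNUMBER": street.group(2) or None,
        "Address_POSTALNUMBER": locality.group(1) or None,
        "Address_CITY": locality.group(2) or None,
    }
```
&lt;/PRE&gt;
&lt;P&gt;Tokens that cannot be found come back as None, which mirrors the missing City and Postal Number values noted above.&lt;/P&gt;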
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H2&gt;&lt;SPAN&gt;Standardization&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;&lt;SPAN&gt;For the standardization issues I will use the &lt;A href="https://go.documentation.sas.com/doc/en/sasstudiocdc/v_050/webeditorcdc/webeditorflows/p0bge0vxhrdgpdn1njohwlej05qd.htm#n0g2liyq671xkjn13j93zhdi7g0k" target="_blank" rel="noopener"&gt;Clean Data&lt;/A&gt; Step and its &lt;A href="https://support.sas.com/documentation/onlinedoc/qkb/33/QKBCI33/Help/qkb-help.html#qkb-generaldoc/qkb-stddef.html?TocPath=Definition%2520Types%257C_____9" target="_blank" rel="noopener"&gt;Standardization&lt;/A&gt; capabilities.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-left" image-alt="patric_11-1715154358069.png" style="width: 400px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/96296i54DC17A4923E966E/image-size/medium?v=v2&amp;amp;px=400" role="button" title="patric_11-1715154358069.png" alt="patric_11-1715154358069.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt; &lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="patric_12-1715154368020.png" style="width: 200px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/96298i8624C31813600823/image-size/small?v=v2&amp;amp;px=200" role="button" title="patric_12-1715154368020.png" alt="patric_12-1715154368020.png" /&gt;&lt;/span&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="patric_13-1715154390660.png" style="width: 200px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/96299iACBCFF35FE6E8B7E/image-size/small?v=v2&amp;amp;px=200" role="button" title="patric_13-1715154390660.png" alt="patric_13-1715154390660.png" /&gt;&lt;/span&gt;&lt;/P&gt;
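&lt;P&gt;As a hedged illustration of what standardization does to a street name, the Python sketch below normalizes whitespace and casing and expands two abbreviations. The suffix rules and the function name standardize_street are assumptions of mine. The actual standardization definitions come from the QKB and cover far more variation.&lt;/P&gt;
&lt;PRE&gt;
```python
# Hypothetical abbreviation rules; the QKB standardization definitions are
# locale-specific and far more extensive.
SUFFIX_RULES = {"g.": "gatan", "v.": "vägen"}


def standardize_street(street_name):
    """Return a street name with uniform spacing, casing and suffix."""
    # Collapse repeated whitespace and compare in lower case.
    cleaned = " ".join(street_name.split()).lower()
    for abbreviation, full in SUFFIX_RULES.items():
        if cleaned.endswith(abbreviation):
            cleaned = cleaned[: -len(abbreviation)] + full
            break
    # Capitalize each word for the standardized form.
    return cleaned.title()
```
&lt;/PRE&gt;
&lt;P&gt;With these rules, different spellings of the same street end up as one value, which is what makes the later matching step reliable.&lt;/P&gt;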
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Now that I have identified, parsed, and standardized my data, I am halfway through my Entity Resolution process.&lt;/SPAN&gt;&lt;/P&gt;
&lt;H2&gt;&lt;SPAN&gt;Match Code Creation &lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;&lt;SPAN&gt;The next step is to create &lt;A href="https://support.sas.com/documentation/onlinedoc/qkb/33/QKBCI33/Help/qkb-help.html#qkb-generaldoc/qkb-mtchdef.html?TocPath=Definition%2520Types%257C_____6" target="_blank" rel="noopener"&gt;Match Codes&lt;/A&gt;. A match code is an encrypted string that represents portions of the original input string. During match code creation, techniques such as phonetic rules, noise word removal, standardization, and normalization are used.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;You can use different sensitivity levels to determine the amount of information stored in the match code. Use lower sensitivity levels to sort data into general categories, or higher sensitivity levels for a closer match.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-left" image-alt="patric_14-1715154436359.png" style="width: 400px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/96300i14393AA3F5439B91/image-size/medium?v=v2&amp;amp;px=400" role="button" title="patric_14-1715154436359.png" alt="patric_14-1715154436359.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="patric_15-1715154459043.png" style="width: 400px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/96301i2FEE8CED24BFD937/image-size/medium?v=v2&amp;amp;px=400" role="button" title="patric_15-1715154459043.png" alt="patric_15-1715154459043.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;Phonetic rules are applied in the matching process and translate Patric, Patrick, and Patrikk to Patrik.&lt;/EM&gt;&lt;/P&gt;
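&lt;P&gt;The idea behind a match code can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not the QKB algorithm: the phonetic rewrites are invented, the resulting code is readable rather than encrypted, and I model sensitivity loosely as the percentage of the normalized string that is kept.&lt;/P&gt;
&lt;PRE&gt;
```python
import re

# Hypothetical phonetic rewrites, applied in order; real QKB match
# definitions are locale-specific and much richer.
PHONETIC_RULES = [("ck", "k"), ("c", "k"), ("ph", "f"), ("th", "t")]


def match_code(value, sensitivity=85):
    """Build a simplified match code; higher sensitivity keeps more characters."""
    # Normalize: lower case and drop everything that is not a letter.
    text = re.sub(r"[^a-zåäö]", "", value.lower())
    for pattern, replacement in PHONETIC_RULES:
        text = text.replace(pattern, replacement)
    # Collapse repeated letters so "Patrikk" and "Patrik" agree.
    text = re.sub(r"(.)\1+", r"\1", text)
    # Keep only a prefix whose length depends on the sensitivity.
    keep = max(1, round(len(text) * sensitivity / 100))
    return text[:keep].upper()


names = ["Patric", "Patrick", "Patrikk", "Patrik"]
codes = {name: match_code(name) for name in names}
```
&lt;/PRE&gt;
&lt;P&gt;In this sketch all four spellings receive the same code, and a lower sensitivity keeps a shorter prefix, so more values share a code.&lt;/P&gt;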
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H2&gt;&lt;SPAN&gt;Clustering&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;&lt;SPAN&gt;With the match codes generated, I move on to clustering the data. I create two kinds of cluster: a Name cluster, based on the Match Codes from the Name column, and an Address cluster, based on the combination of Match Codes from Street Name and Street Number. Rows with the same Match Code end up in the same cluster.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;For clustering I use a &lt;A href="https://go.documentation.sas.com/doc/en/webeditorcdc/v_042/webeditorsteps/n1rlf6jqdi2dt4n1ur1avntzf2q0.htm" target="_blank" rel="noopener"&gt;Custom Step&lt;/A&gt; called &lt;A href="https://github.com/sassoftware/sas-studio-custom-steps/tree/main/DQ%20-%20Clustering" target="_blank" rel="noopener"&gt;DQ – Clustering&lt;/A&gt; that is publicly available in the &lt;A href="https://github.com/sassoftware/sas-studio-custom-steps" target="_blank" rel="noopener"&gt;SAS GitHub&lt;/A&gt; repository for Custom Steps.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="patric_16-1715154505291.png" style="width: 400px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/96302i2D4054276F18E5EB/image-size/medium?v=v2&amp;amp;px=400" role="button" title="patric_16-1715154505291.png" alt="patric_16-1715154505291.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="patric_17-1715154528033.png" style="width: 400px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/96303i3213171D7317A1A3/image-size/medium?v=v2&amp;amp;px=400" role="button" title="patric_17-1715154528033.png" alt="patric_17-1715154528033.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Once the clustering process has executed, we can clearly see that we have two different people (NAME_CLUSTER) living at the same address (ADDRESS_CLUSTER).&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="patric_18-1715154553507.png" style="width: 400px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/96304i804A27C32F181B9A/image-size/medium?v=v2&amp;amp;px=400" role="button" title="patric_18-1715154553507.png" alt="patric_18-1715154553507.png" /&gt;&lt;/span&gt;&lt;/P&gt;
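&lt;P&gt;The rule "same match code, same cluster" can be sketched in Python as follows. This is only an illustration of the principle and makes no claim about how the DQ – Clustering step is implemented. The match code column names (NAME_MC, STREETNAME_MC, STREETNUMBER_MC) and the sample rows are hypothetical, while NAME_CLUSTER and ADDRESS_CLUSTER are the output columns shown above.&lt;/P&gt;
&lt;PRE&gt;
```python
def assign_clusters(rows, key_columns, cluster_column):
    """Give rows that share the same match code values the same cluster ID."""
    cluster_ids = {}
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        # A new key gets the next free ID; a known key reuses its ID.
        row[cluster_column] = cluster_ids.setdefault(key, len(cluster_ids) + 1)
    return rows


# Made-up match codes for three rows at one address.
rows = [
    {"NAME_MC": "PATRI", "STREETNAME_MC": "STORGATAN", "STREETNUMBER_MC": "12"},
    {"NAME_MC": "PATRI", "STREETNAME_MC": "STORGATAN", "STREETNUMBER_MC": "12"},
    {"NAME_MC": "ANNA", "STREETNAME_MC": "STORGATAN", "STREETNUMBER_MC": "12"},
]
assign_clusters(rows, ["NAME_MC"], "NAME_CLUSTER")
assign_clusters(rows, ["STREETNAME_MC", "STREETNUMBER_MC"], "ADDRESS_CLUSTER")
```
&lt;/PRE&gt;
&lt;P&gt;The sample gives two name clusters and one address cluster, which is the same pattern as two people living at one address.&lt;/P&gt;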
&lt;H2&gt;&lt;SPAN&gt;Master Record Creation&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;&lt;SPAN&gt;Now it is time for me to create a master record for each person and enhance it with the best information from the “contributor” rows. Looking closer at the Address column, I see that there is missing information: some rows do not have a Postal Number, while others are missing the City. This means it is cherry-picking time, where I pick the cherries (values) that suit best!&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;To aid me in the cherry-picking process, I use a &lt;A href="https://go.documentation.sas.com/doc/en/webeditorcdc/v_042/webeditorsteps/n1rlf6jqdi2dt4n1ur1avntzf2q0.htm" target="_blank" rel="noopener"&gt;Custom Step&lt;/A&gt; from the &lt;A href="https://github.com/sassoftware/sas-studio-custom-steps/tree/main/DQ%20-%20Surviving%20Record" target="_blank" rel="noopener"&gt;SAS GitHub&lt;/A&gt; site called &lt;A href="https://github.com/sassoftware/sas-studio-custom-steps/tree/main/DQ%20-%20Surviving%20Record" target="_blank" rel="noopener"&gt;DQ – Surviving Record&lt;/A&gt;. This Step allows me to apply rules in the process of creating a surviving (master) record for each entity (person). As an example, I let the majority (the most frequently occurring value) of the “contributor” rows decide which Given Name is used for each cluster. Also, I do not allow any missing values in the columns Address_CITY and Address_POSTALNUMBER.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-left" image-alt="patric_19-1715154591089.png" style="width: 400px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/96305iC3DBA68D4648D3FC/image-size/medium?v=v2&amp;amp;px=400" role="button" title="patric_19-1715154591089.png" alt="patric_19-1715154591089.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="patric_20-1715154607343.png" style="width: 400px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/96306iD4F5E0BBBB090D76/image-size/medium?v=v2&amp;amp;px=400" role="button" title="patric_20-1715154607343.png" alt="patric_20-1715154607343.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="patric_21-1715154615929.png" style="width: 400px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/96307i14C892680EB20658/image-size/medium?v=v2&amp;amp;px=400" role="button" title="patric_21-1715154615929.png" alt="patric_21-1715154615929.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Based on the rules in the DQ – Surviving Record step, values are selected for each cluster, and a master record is created from each cluster.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="patric_22-1715154654008.png" style="width: 999px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/96308i8A442A1AA6714818/image-size/large?v=v2&amp;amp;px=999" role="button" title="patric_22-1715154654008.png" alt="patric_22-1715154654008.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;And once the cherry-picking process is done, I can review the fruit of my labor in my two master records.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="patric_23-1715154691464.png" style="width: 999px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/96309i844B954E46BB9FD9/image-size/large?v=v2&amp;amp;px=999" role="button" title="patric_23-1715154691464.png" alt="patric_23-1715154691464.png" /&gt;&lt;/span&gt;&lt;/P&gt;
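&lt;P&gt;The two survivorship rules I used can be written down as a small Python sketch: the most frequent Given Name wins, and City and Postal Number take a non-missing value from any contributor row. This is my own simplification under stated assumptions, not the code behind the DQ – Surviving Record step. In particular, picking the first non-missing value is just one possible reading of "no missing values", and the sample rows are invented.&lt;/P&gt;
&lt;PRE&gt;
```python
from collections import Counter, defaultdict


def most_frequent(values):
    """Return the most common non-missing value, or None if all are missing."""
    present = [value for value in values if value]
    return Counter(present).most_common(1)[0][0] if present else None


def first_present(values):
    """Return the first non-missing value, or None if all are missing."""
    return next((value for value in values if value), None)


def surviving_records(rows, cluster_column="NAME_CLUSTER"):
    """Build one master record per cluster from its contributor rows."""
    clusters = defaultdict(list)
    for row in rows:
        clusters[row[cluster_column]].append(row)
    masters = []
    for cluster_id, members in clusters.items():
        masters.append({
            cluster_column: cluster_id,
            "Name_GIVENNAME": most_frequent(m["Name_GIVENNAME"] for m in members),
            "Address_CITY": first_present(m["Address_CITY"] for m in members),
            "Address_POSTALNUMBER": first_present(
                m["Address_POSTALNUMBER"] for m in members
            ),
        })
    return masters


# Made-up contributor rows: three for cluster 1, one for cluster 2.
contributors = [
    {"NAME_CLUSTER": 1, "Name_GIVENNAME": "Patrik", "Address_CITY": None, "Address_POSTALNUMBER": "111 22"},
    {"NAME_CLUSTER": 1, "Name_GIVENNAME": "Patrik", "Address_CITY": "Stockholm", "Address_POSTALNUMBER": None},
    {"NAME_CLUSTER": 1, "Name_GIVENNAME": "Patric", "Address_CITY": "Stockholm", "Address_POSTALNUMBER": None},
    {"NAME_CLUSTER": 2, "Name_GIVENNAME": "Anna", "Address_CITY": "Stockholm", "Address_POSTALNUMBER": "111 22"},
]
masters = surviving_records(contributors)
```
&lt;/PRE&gt;
&lt;P&gt;Each cluster yields exactly one master record, and no master record is left with a missing City or Postal Number as long as at least one contributor row supplies the value.&lt;/P&gt;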
&lt;H2&gt;&lt;SPAN&gt;Summary&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;&lt;SPAN&gt;We have now traveled through the Entity Resolution process, all the way from our Databricks table with poor data quality, to a refined result with master records, ready to be used for downstream analytics. And I did not have to write a single line of code!&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Of course, in SAS Studio I can access all these capabilities with code as well, if that is what I prefer.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;But in the era of democratization, why not democratize functionality as well as data?&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H4&gt;&lt;STRONG&gt;Learn more about SAS and Databricks&lt;/STRONG&gt;&lt;/H4&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;SPAN&gt;&lt;A href="https://communities.sas.com/t5/SAS-Communities-Library/Harness-the-analytical-power-of-your-Databricks-platform-with/ta-p/921540" target="_blank" rel="noopener"&gt;Harness the analytical power of your Databricks platform with SAS&lt;/A&gt;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;&lt;A href="https://communities.sas.com/t5/SAS-Communities-Library/Data-everywhere-and-anyhow-Gain-insights-from-across-the-clouds/ta-p/921121" target="_blank" rel="noopener"&gt;Data everywhere and anyhow! Gain insights from across the clouds with SAS&lt;/A&gt;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;&lt;A href="https://communities.sas.com/t5/SAS-Communities-Library/Elevated-efficiency-and-reduced-cost-SAS-in-the-era-of-Cloud/ta-p/921943" target="_blank" rel="noopener"&gt;Elevated efficiency and reduced cost: SAS in the era of Cloud Adoption&lt;/A&gt;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;&lt;A href="https://communities.sas.com/t5/SAS-Communities-Library/SAS-and-Databricks-Your-Practical-Guide-to-Data-Access-and/ta-p/923733" target="_blank" rel="noopener"&gt;SAS and Databricks: Your Practical Guide to Data Access and Analysis&lt;/A&gt;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;&lt;A href="https://communities.sas.com/t5/SAS-Communities-Library/Data-to-Databricks-No-need-to-recode-get-your-existing-SAS-jobs/ta-p/924639" target="_blank" rel="noopener"&gt;Data to Databricks? No need to recode - get your existing SAS jobs to SAS Viya in the cloud&lt;/A&gt;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;&lt;A href="https://communities.sas.com/t5/SAS-Communities-Library/Maximize-Coding-and-Data-Freedom-with-SAS-Python-and-Databricks/ta-p/925503" target="_blank" rel="noopener"&gt;Maximize Coding and Data Freedom with SAS, Python and Databricks&lt;/A&gt;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/OL&gt;</description>
    <pubDate>Wed, 08 May 2024 09:02:13 GMT</pubDate>
    <dc:creator>patric</dc:creator>
    <dc:date>2024-05-08T09:02:13Z</dc:date>
    <item>
      <title>Data Brilliance Unleashed: SAS Data Quality against Databricks - Precision, Performance, Perfection</title>
      <link>https://communities.sas.com/t5/SAS-Viya/Data-Brilliance-Unleashed-SAS-Data-Quality-against-Databricks/m-p/927475#M2372</link>
      <description>&lt;P&gt;Today the quality of data is paramount. Every decision, every insight hinges on the reliability and accuracy of the underlying data.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Many of the tools and techniques for improving data quality are locked behind deep technical knowledge and programming skills in several languages. What if I told you it does not have to be that way?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Let me lead you into the world of Entity Resolution, where you will learn about techniques such as Data Identification, Parsing, Standardization, Matching, Clustering and Surviving Records. &lt;SPAN&gt;Entity Resolution attempts to identify different representations of the same data and provides a normalized, standardized master record, which will improve downstream analysis.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Entity Resolution has many use cases, a number of them within Fraud Detection and Anti Money Laundering.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;How do we identify the same entity within data and then enhance this entity with information from "contributor" rows, &lt;STRONG&gt;all without writing a single line of code and only using out-of-the-box functionality&lt;/STRONG&gt;? Let us get started!&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;I will be using a table found in Databricks that has some obvious data quality issues, which I would not be able to solve with the tools available in Databricks unless I had deep knowledge of SQL and Python programming as well as specific Python packages. By now you know how easy it is to connect to Databricks, as explained in my good colleague Cecily Hoffritz's blog:&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;&lt;A href="https://communities.sas.com/t5/SAS-Communities-Library/SAS-and-Databricks-Your-Practical-Guide-to-Data-Access-and/ta-p/923733" target="_blank" rel="noopener"&gt;SAS and Databricks: Your Practical Guide to Data Access and Analysis - SAS Support Communities&lt;/A&gt; &lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The table I will be working with contains customers, both individuals and organizations. To keep it simple, I will focus on the individual customers and how I can find the entities, remove duplicates, standardize the data, and create my master records.&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="patric_1-1715153512036.png" style="width: 400px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/96282i4971A88AA0935381/image-size/medium?v=v2&amp;amp;px=400" role="button" title="patric_1-1715153512036.png" alt="patric_1-1715153512036.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;To help me on this journey I will be using the &lt;A href="https://support.sas.com/en/software/quality-knowledge-base-support.html" target="_blank" rel="noopener"&gt;SAS Quality Knowledge Base&lt;/A&gt; (QKB) and &lt;A href="https://www.sas.com/en_us/software/studio.html" target="_blank" rel="noopener"&gt;SAS Studio&lt;/A&gt; for engineering. The QKB contains a large collection of predefined data quality rules and logic, and in SAS Studio I can build modern, visual data pipelines.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H2&gt;&lt;SPAN&gt;Identification&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;&lt;SPAN&gt;The first step is to identify whether a customer is an individual or an organization, for which I use the &lt;A href="https://go.documentation.sas.com/doc/en/sasstudiocdc/v_050/webeditorcdc/webeditorflows/p0bge0vxhrdgpdn1njohwlej05qd.htm#n0g2liyq671xkjn13j93zhdi7g0k" target="_blank" rel="noopener"&gt;Clean Data&lt;/A&gt; Step and its Identification Analysis.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-left" image-alt="patric_4-1715154016579.png" style="width: 400px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/96286i752CD1AA22D619F1/image-size/medium?v=v2&amp;amp;px=400" role="button" title="patric_4-1715154016579.png" alt="patric_4-1715154016579.png" /&gt;&lt;/span&gt;&lt;EM&gt;Using the Name column to identify whether it is an Individual or Organization&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="patric_5-1715154031609.png" style="width: 289px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/96289i6CDFB180AB4DD3BC/image-dimensions/289x355?v=v2" width="289" height="355" role="button" title="patric_5-1715154031609.png" alt="patric_5-1715154031609.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Once identified, I use the &lt;A href="https://go.documentation.sas.com/doc/en/webeditorcdc/v_042/webeditorflows/n1ivpt3accqtyzn1e2gad97a8gow.htm" target="_blank" rel="noopener"&gt;Branch Rows&lt;/A&gt; Step to split individuals and organizations into separate tables.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="patric_6-1715154124844.png" style="width: 400px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/96290iBF171110A7FB554C/image-size/medium?v=v2&amp;amp;px=400" role="button" title="patric_6-1715154124844.png" alt="patric_6-1715154124844.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="patric_7-1715154155195.png" style="width: 999px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/96291i0708120A9D2E9FD4/image-size/large?v=v2&amp;amp;px=999" role="button" title="patric_7-1715154155195.png" alt="patric_7-1715154155195.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H2&gt;&lt;SPAN&gt;Parsing&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;&lt;SPAN&gt;To make sense of this data, I need a way of parsing (dividing) it into tokens (components). I apply different rules to these tokens, because the information in them varies and must be treated differently.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Fortunately, SAS Studio has a component (step) called &lt;A href="https://go.documentation.sas.com/doc/en/webeditorcdc/v_042/webeditorflows/n097cs77nmmffan1tt3jxjpiuejd.htm" target="_blank" rel="noopener"&gt;Parse Data&lt;/A&gt; that does just this for me!&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="patric_8-1715154205095.png" style="width: 400px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/96292iC9F73330D8FF604A/image-size/medium?v=v2&amp;amp;px=400" role="button" title="patric_8-1715154205095.png" alt="patric_8-1715154205095.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;As my data is in Swedish, I will be using the &lt;A href="https://support.sas.com/documentation/onlinedoc/qkb/33/QKBCI33/Help/qkb-help.html#qkb-generaldoc/qkb-locstructure.html?TocPath=About%2520SAS%2520Quality%2520Knowledge%2520Base%257C_____1" target="_blank" rel="noopener"&gt;Swedish Locale&lt;/A&gt; from the QKB to parse the columns &lt;STRONG&gt;Name&lt;/STRONG&gt; and &lt;STRONG&gt;Address&lt;/STRONG&gt;.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;From the Name column, I keep the Given Name (Name_GIVENNAME) and Family Name (Name_FAMILYNAME) tokens. Extra tokens such as title, prefix, suffix and additional info can also be added, but they are not needed for this Entity Resolution example.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="patric_9-1715154274801.png" style="width: 400px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/96293iFB15A357481B432F/image-size/medium?v=v2&amp;amp;px=400" role="button" title="patric_9-1715154274801.png" alt="patric_9-1715154274801.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;From the Address column, I keep the Street Name (Address_STREETNAME), City (Address_CITY), Postal Number (Address_POSTALNUMBER) and Street Number tokens. Notice that the Street Name has standardization issues, and that there are missing values for City and Postal Number. Let us fix that!&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="patric_10-1715154314026.png" style="width: 999px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/96294iBC32D8ADDC0B10DC/image-size/large?v=v2&amp;amp;px=999" role="button" title="patric_10-1715154314026.png" alt="patric_10-1715154314026.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H2&gt;&lt;SPAN&gt;Standardization&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;&lt;SPAN&gt;For the standardization issues I will use the &lt;A href="https://go.documentation.sas.com/doc/en/sasstudiocdc/v_050/webeditorcdc/webeditorflows/p0bge0vxhrdgpdn1njohwlej05qd.htm#n0g2liyq671xkjn13j93zhdi7g0k" target="_blank" rel="noopener"&gt;Clean Data&lt;/A&gt; Step and its &lt;A href="https://support.sas.com/documentation/onlinedoc/qkb/33/QKBCI33/Help/qkb-help.html#qkb-generaldoc/qkb-stddef.html?TocPath=Definition%2520Types%257C_____9" target="_blank" rel="noopener"&gt;Standardization&lt;/A&gt; capabilities.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-left" image-alt="patric_11-1715154358069.png" style="width: 400px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/96296i54DC17A4923E966E/image-size/medium?v=v2&amp;amp;px=400" role="button" title="patric_11-1715154358069.png" alt="patric_11-1715154358069.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt; &lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="patric_12-1715154368020.png" style="width: 200px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/96298i8624C31813600823/image-size/small?v=v2&amp;amp;px=200" role="button" title="patric_12-1715154368020.png" alt="patric_12-1715154368020.png" /&gt;&lt;/span&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="patric_13-1715154390660.png" style="width: 200px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/96299iACBCFF35FE6E8B7E/image-size/small?v=v2&amp;amp;px=200" role="button" title="patric_13-1715154390660.png" alt="patric_13-1715154390660.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Now that I have identified, parsed, and standardized my data, I am halfway through my Entity Resolution process.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H2&gt;&lt;SPAN&gt;Match Code Creation &lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;&lt;SPAN&gt;The next step is to create &lt;A href="https://support.sas.com/documentation/onlinedoc/qkb/33/QKBCI33/Help/qkb-help.html#qkb-generaldoc/qkb-mtchdef.html?TocPath=Definition%2520Types%257C_____6" target="_blank" rel="noopener"&gt;Match Codes&lt;/A&gt;. A match code is an encoded string that represents portions of the original input string. During match code creation, techniques such as phonetic rules, noise word removal, standardization and normalization are used.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;You can use different sensitivity levels to determine the amount of information stored in the match code. Use lower sensitivity levels to sort data into general categories, or higher sensitivity levels for closer matches.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-left" image-alt="patric_14-1715154436359.png" style="width: 400px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/96300i14393AA3F5439B91/image-size/medium?v=v2&amp;amp;px=400" role="button" title="patric_14-1715154436359.png" alt="patric_14-1715154436359.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="patric_15-1715154459043.png" style="width: 400px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/96301i2FEE8CED24BFD937/image-size/medium?v=v2&amp;amp;px=400" role="button" title="patric_15-1715154459043.png" alt="patric_15-1715154459043.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;Phonetic rules are applied in the matching process and translate Patric, Patrick and Patrikk to Patrik.&lt;/EM&gt;&lt;/P&gt;
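&lt;P&gt;&lt;SPAN&gt;For readers curious about what happens under the hood, the idea behind a match code can be sketched in a few lines of Python. This is only a toy illustration and not the QKB algorithm: the function name toy_match_code, the three rewrite rules and the way the sensitivity value shortens the code are all my own assumptions, chosen so that the spellings of Patrik mentioned above end up with the same code.&lt;/SPAN&gt;&lt;/P&gt;
&lt;PRE&gt;
```python
import re

# Hypothetical rewrite rules, invented for this illustration only.
PHONETIC_RULES = [
    (r"ck", "k"),        # Patrick becomes Patrik
    (r"c$", "k"),        # Patric becomes Patrik
    (r"(.)\1+", r"\1"),  # Patrikk becomes Patrik (collapse repeated letters)
]


def toy_match_code(value, sensitivity=100):
    """Return a simplified, illustrative match code for a name."""
    # Normalization: lower-case the value and keep letters only.
    text = "".join(ch for ch in value.lower() if ch.isalpha())
    # Apply the toy phonetic rules in order.
    for pattern, replacement in PHONETIC_RULES:
        text = re.sub(pattern, replacement, text)
    # Assumption: a lower sensitivity keeps a shorter prefix of the code.
    keep = max(1, len(text) * sensitivity // 100)
    return text[:keep].upper()


names = ["Patric", "Patrick", "Patrikk", "Patrik"]
codes = {name: toy_match_code(name) for name in names}
print(codes)
```
&lt;/PRE&gt;
&lt;P&gt;&lt;SPAN&gt;With these made-up rules, all four spellings produce the same code. A lower sensitivity keeps fewer characters, so more names fall into the same general category.&lt;/SPAN&gt;&lt;/P&gt;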
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H2&gt;&lt;SPAN&gt;Clustering&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;&lt;SPAN&gt;With the match codes generated, I move on to clustering the data. I create two sets of clusters: one based on the Match Codes from the Name column, and the other based on the combination of Match Codes from Street Name and Street Number. Rows with the same Match Code are put into the same cluster, so the result is a Name cluster and an Address cluster.&lt;/SPAN&gt;&lt;/P&gt;
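&lt;P&gt;&lt;SPAN&gt;The grouping idea itself is simple and can be sketched in Python. This is an illustration under my own assumptions and not the code behind any SAS step: the function assign_clusters, the match code column names NAME_MC, STREETNAME_MC and STREETNUMBER_MC, and the sample values are all hypothetical.&lt;/SPAN&gt;&lt;/P&gt;
&lt;PRE&gt;
```python
def assign_clusters(rows, key_columns, cluster_column):
    """Give rows that share the same match code key the same cluster id."""
    cluster_ids = {}
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        # A key seen for the first time gets the next free cluster id.
        row[cluster_column] = cluster_ids.setdefault(key, len(cluster_ids) + 1)
    return rows


# Invented sample rows; the match code values are placeholders.
rows = [
    {"NAME_MC": "PATRIK", "STREETNAME_MC": "STORGATAN", "STREETNUMBER_MC": "1"},
    {"NAME_MC": "PATRIK", "STREETNAME_MC": "STORGATAN", "STREETNUMBER_MC": "1"},
    {"NAME_MC": "ANNA", "STREETNAME_MC": "STORGATAN", "STREETNUMBER_MC": "1"},
]
assign_clusters(rows, ["NAME_MC"], "NAME_CLUSTER")
assign_clusters(rows, ["STREETNAME_MC", "STREETNUMBER_MC"], "ADDRESS_CLUSTER")
for row in rows:
    print(row["NAME_CLUSTER"], row["ADDRESS_CLUSTER"])
```
&lt;/PRE&gt;
&lt;P&gt;&lt;SPAN&gt;In this made-up sample, the first two rows share a Name cluster, the third row gets its own, and all three rows share one Address cluster.&lt;/SPAN&gt;&lt;/P&gt;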
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;For clustering, I use a &lt;A href="https://go.documentation.sas.com/doc/en/webeditorcdc/v_042/webeditorsteps/n1rlf6jqdi2dt4n1ur1avntzf2q0.htm" target="_blank" rel="noopener"&gt;Custom Step&lt;/A&gt; called &lt;A href="https://github.com/sassoftware/sas-studio-custom-steps/tree/main/DQ%20-%20Clustering" target="_blank" rel="noopener"&gt;DQ – Clustering&lt;/A&gt; that is publicly available in the &lt;A href="https://github.com/sassoftware/sas-studio-custom-steps" target="_blank" rel="noopener"&gt;SAS GitHub&lt;/A&gt; repository for Custom Steps.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="patric_16-1715154505291.png" style="width: 400px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/96302i2D4054276F18E5EB/image-size/medium?v=v2&amp;amp;px=400" role="button" title="patric_16-1715154505291.png" alt="patric_16-1715154505291.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="patric_17-1715154528033.png" style="width: 400px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/96303i3213171D7317A1A3/image-size/medium?v=v2&amp;amp;px=400" role="button" title="patric_17-1715154528033.png" alt="patric_17-1715154528033.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Once the clustering process has run, we can clearly see that we have two different persons (NAME_CLUSTER) living at the same address (ADDRESS_CLUSTER).&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="patric_18-1715154553507.png" style="width: 400px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/96304i804A27C32F181B9A/image-size/medium?v=v2&amp;amp;px=400" role="button" title="patric_18-1715154553507.png" alt="patric_18-1715154553507.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H2&gt;&lt;SPAN&gt;Master Record Creation&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;&lt;SPAN&gt;Now it is time for me to create a master record for each person and enhance it with the best information from “Contributor” rows. Looking closer at the Address column, I see that there is missing information: some rows do not have a Postal Number, while others are missing the City. This means it is cherry-picking time, where I pick the cherries (values) that suit best!&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;To aid me in the cherry-picking process, I use a &lt;A href="https://go.documentation.sas.com/doc/en/webeditorcdc/v_042/webeditorsteps/n1rlf6jqdi2dt4n1ur1avntzf2q0.htm" target="_blank" rel="noopener"&gt;Custom Step&lt;/A&gt; from the &lt;A href="https://github.com/sassoftware/sas-studio-custom-steps/tree/main/DQ%20-%20Surviving%20Record" target="_blank" rel="noopener"&gt;SAS GitHub&lt;/A&gt; site called &lt;A href="https://github.com/sassoftware/sas-studio-custom-steps/tree/main/DQ%20-%20Surviving%20Record" target="_blank" rel="noopener"&gt;DQ – Surviving Record&lt;/A&gt;. This Step allows me to apply rules in the process of creating a surviving (master) record for each entity (person). As an example, I let the majority (highest occurrence) of “contributor” rows decide which Given Name is used for each cluster. Also, I do not allow any missing values in the columns Address_CITY and Address_POSTALNUMBER.&lt;/SPAN&gt;&lt;/P&gt;
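&lt;P&gt;&lt;SPAN&gt;The two rules described above can be sketched in Python as follows. This is a simplified illustration and not the implementation of the Custom Step: the function names, the sample rows and the choice to take the first non-missing value for City and Postal Number are my own assumptions.&lt;/SPAN&gt;&lt;/P&gt;
&lt;PRE&gt;
```python
from collections import Counter


def most_frequent(values):
    """Return the most common non-missing value, or None if all are missing."""
    present = [value for value in values if value]
    return Counter(present).most_common(1)[0][0] if present else None


def first_non_missing(values):
    """Return the first non-missing value, or None if all are missing."""
    return next((value for value in values if value), None)


def surviving_records(rows, cluster_column="NAME_CLUSTER"):
    """Build one master record per cluster from its contributor rows."""
    clusters = {}
    for row in rows:
        clusters.setdefault(row[cluster_column], []).append(row)
    masters = []
    for cluster_id, members in clusters.items():
        masters.append({
            cluster_column: cluster_id,
            "Name_GIVENNAME": most_frequent(m["Name_GIVENNAME"] for m in members),
            "Address_CITY": first_non_missing(m["Address_CITY"] for m in members),
            "Address_POSTALNUMBER": first_non_missing(
                m["Address_POSTALNUMBER"] for m in members
            ),
        })
    return masters


# Invented contributor rows; empty strings stand for missing values.
contributors = [
    {"NAME_CLUSTER": 1, "Name_GIVENNAME": "Patrik", "Address_CITY": "", "Address_POSTALNUMBER": "11122"},
    {"NAME_CLUSTER": 1, "Name_GIVENNAME": "Patric", "Address_CITY": "Stockholm", "Address_POSTALNUMBER": ""},
    {"NAME_CLUSTER": 1, "Name_GIVENNAME": "Patrik", "Address_CITY": "", "Address_POSTALNUMBER": ""},
    {"NAME_CLUSTER": 2, "Name_GIVENNAME": "Anna", "Address_CITY": "Stockholm", "Address_POSTALNUMBER": "11122"},
]
masters = surviving_records(contributors)
print(masters)
```
&lt;/PRE&gt;
&lt;P&gt;&lt;SPAN&gt;In this made-up sample, the master record for the first cluster takes the Given Name from the majority of rows, and the City and Postal Number from whichever contributor row has them.&lt;/SPAN&gt;&lt;/P&gt;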
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-left" image-alt="patric_19-1715154591089.png" style="width: 400px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/96305iC3DBA68D4648D3FC/image-size/medium?v=v2&amp;amp;px=400" role="button" title="patric_19-1715154591089.png" alt="patric_19-1715154591089.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="patric_20-1715154607343.png" style="width: 400px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/96306iD4F5E0BBBB090D76/image-size/medium?v=v2&amp;amp;px=400" role="button" title="patric_20-1715154607343.png" alt="patric_20-1715154607343.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="patric_21-1715154615929.png" style="width: 400px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/96307i14C892680EB20658/image-size/medium?v=v2&amp;amp;px=400" role="button" title="patric_21-1715154615929.png" alt="patric_21-1715154615929.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Based on the rules in the DQ – Surviving Record step, values are selected within each cluster, and a Master Record is created for each cluster.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="patric_22-1715154654008.png" style="width: 999px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/96308i8A442A1AA6714818/image-size/large?v=v2&amp;amp;px=999" role="button" title="patric_22-1715154654008.png" alt="patric_22-1715154654008.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;And once the cherry-picking process is done, I can review the fruit of my labor in my two master records.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="patric_23-1715154691464.png" style="width: 999px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/96309i844B954E46BB9FD9/image-size/large?v=v2&amp;amp;px=999" role="button" title="patric_23-1715154691464.png" alt="patric_23-1715154691464.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H2&gt;&lt;SPAN&gt;Summary&lt;/SPAN&gt;&lt;/H2&gt;
&lt;P&gt;&lt;SPAN&gt;We have now traveled through the Entity Resolution process, all the way from our Databricks table with poor data quality, to a refined result with master records, ready to be used for downstream analytics. And I did not have to write a single row of code!&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Of course, in SAS Studio I can access all these capabilities with code as well, if that is what I prefer.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;But in the era of democratization, why not democratize functionality as well as data?&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H4&gt;&lt;STRONG&gt;Learn more about SAS and Databricks&lt;/STRONG&gt;&lt;/H4&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;SPAN&gt;&lt;A href="https://communities.sas.com/t5/SAS-Communities-Library/Harness-the-analytical-power-of-your-Databricks-platform-with/ta-p/921540" target="_blank" rel="noopener"&gt; Harness the analytical power of your Databricks platform with SAS&lt;/A&gt;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;&lt;A href="https://communities.sas.com/t5/SAS-Communities-Library/Data-everywhere-and-anyhow-Gain-insights-from-across-the-clouds/ta-p/921121" target="_blank" rel="noopener"&gt; Data everywhere and anyhow! Gain insights from across the clouds with SAS&lt;/A&gt;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;&lt;A href="https://communities.sas.com/t5/SAS-Communities-Library/Elevated-efficiency-and-reduced-cost-SAS-in-the-era-of-Cloud/ta-p/921943" target="_blank" rel="noopener"&gt; Elevated efficiency and reduced cost: SAS in the era of Cloud Adoption&lt;/A&gt;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;&lt;A href="https://communities.sas.com/t5/SAS-Communities-Library/SAS-and-Databricks-Your-Practical-Guide-to-Data-Access-and/ta-p/923733" target="_blank" rel="noopener"&gt; SAS and Databricks: Your Practical Guide to Data Access and Analysis&lt;/A&gt; &lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;&lt;A href="https://communities.sas.com/t5/SAS-Communities-Library/Data-to-Databricks-No-need-to-recode-get-your-existing-SAS-jobs/ta-p/924639" target="_blank" rel="noopener"&gt; Data to Databricks? No need to recode - get your existing SAS jobs to SAS Viya in the cloud&lt;/A&gt;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt; &lt;A href="https://communities.sas.com/t5/SAS-Communities-Library/Maximize-Coding-and-Data-Freedom-with-SAS-Python-and-Databricks/ta-p/925503" target="_blank" rel="noopener"&gt;Maximize Coding and Data Freedom with SAS, Python and Databricks&lt;/A&gt;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/OL&gt;</description>
      <pubDate>Wed, 08 May 2024 09:02:13 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Viya/Data-Brilliance-Unleashed-SAS-Data-Quality-against-Databricks/m-p/927475#M2372</guid>
      <dc:creator>patric</dc:creator>
      <dc:date>2024-05-08T09:02:13Z</dc:date>
    </item>
  </channel>
</rss>

