Is your lakehouse turning into a data dump? Bring some order into the chaos with SAS Data Governance

The data platform landscape has seen many new entrants in recent years. One of the most popular is Databricks, which promotes the lakehouse concept: storage that brings together the best qualities of a data lake and a data warehouse. The beauty of versatile storage is flexible loading of almost any kind of data. In that ease also lies the risk of uncontrolled data hoarding, as we already witnessed in the heyday of Hadoop. With great data comes great responsibility, and that responsibility is best implemented with data governance. SAS has helped many of our customers connect these two powerful data and analytics platforms and surround them with seamless data access, data governance, and data quality.

Let’s face it, data is only useful when you can properly access it. SAS Viya provides powerful ways to access all the commonly used data sources today, but in this blog we focus on Databricks. Databricks can be accessed through a specific Databricks data connection, the Spark connection or the JDBC connection. I’m calling them connections for simplicity, while in most cases there is a SAS/ACCESS interface under the hood.

The image above is from SAS Viya’s Data Explorer and shows how simple it is to define new data connections. My colleague Cecily explains this hands-on in her blog: SAS and Databricks: Your Practical Guide to Data Access and Analysis. For clarity, SAS Viya’s connection to Spark is delivered with a JDBC driver for Databricks and enables out-of-the-box connectivity. Using this embedded driver, a Databricks connection can be achieved either by defining a Spark LIBNAME statement that specifies the connection options or by defining a Spark LIBNAME statement that specifies a JDBC URL for the target data source in the URL= option.
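To make the second option more concrete, here is a minimal Python sketch that assembles the kind of JDBC URL you might place in the URL= option. It assumes personal-access-token authentication and the key names used by the Databricks JDBC driver (httpPath, AuthMech, UID, PWD). The helper name and the host, path and token values are hypothetical, and the URL prefix depends on the driver version (older drivers use jdbc:spark:// instead), so check the prefix your driver expects before using it.

```python
def databricks_jdbc_url(host, http_path, token, schema="default", port=443):
    """Assemble a Databricks JDBC URL (hypothetical helper, token-based auth assumed)."""
    options = {
        "transportMode": "http",
        "ssl": "1",
        "httpPath": http_path,  # HTTP path of the cluster or SQL warehouse
        "AuthMech": "3",        # user/password mechanism
        "UID": "token",         # literal string "token" when a personal access token is used
        "PWD": token,           # the token itself; keep it out of source control
    }
    option_text = ";".join(f"{key}={value}" for key, value in options.items())
    return f"jdbc:databricks://{host}:{port}/{schema};{option_text}"


if __name__ == "__main__":
    # Placeholder values only, not a real workspace.
    print(databricks_jdbc_url("adb-123.azuredatabricks.net",
                              "/sql/1.0/warehouses/abc",
                              "<personal-access-token>"))
```

In practice you would read the token from a secret store or a SAS authentication domain rather than pasting it into code.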

SAS/ACCESS interfaces in general offer very good performance, but finding the optimal configuration may take some planning and testing, because a connection from SAS Viya to Databricks can be created through at least the ODBC, JDBC and Spark interfaces. SAS continuously updates and improves the SAS/ACCESS interfaces to maintain compatibility and optimize performance.

Once connected to the data, SAS Viya provides capabilities to track the lineage of data from various sources, such as Databricks. This helps organizations understand how data is used and where it comes from, supporting transparency and traceability. Lineage also helps to understand the effect of planned changes to the data process. A typical scenario would be a requested change to the data model of a source table. That change needs to be carried through the data process all the way to the end result. The example below shows a simple data process. By following the steps in the lineage flow we learn the following things:

Source tables are accessed in Databricks with the target table under the same schema
A 2-table join is executed by Databricks, and result data also remains in Databricks
The result table is then loaded into SAS Viya’s in-memory engine called CAS (Cloud Analytic Services)
A SAS Visual Analytics report has been created based on the CAS in-memory table
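The impact analysis that lineage enables can be sketched as a simple graph traversal. The Python example below is only an illustration of the idea, not how SAS Viya stores or queries lineage: the asset names and the LINEAGE dictionary are hypothetical and mirror the four steps above.

```python
from collections import deque

# Hypothetical lineage edges (upstream -> downstream) mirroring the steps above.
LINEAGE = {
    "databricks.sales.orders": ["databricks.sales.orders_customers"],
    "databricks.sales.customers": ["databricks.sales.orders_customers"],
    "databricks.sales.orders_customers": ["cas.public.orders_customers"],
    "cas.public.orders_customers": ["va.report.sales_overview"],
}


def downstream_assets(lineage, start):
    """Return every asset reachable from `start`, in breadth-first order."""
    seen = []
    queue = deque(lineage.get(start, []))
    while queue:
        asset = queue.popleft()
        if asset in seen:
            continue  # already visited through another path
        seen.append(asset)
        queue.extend(lineage.get(asset, []))
    return seen


if __name__ == "__main__":
    # Which assets are affected if the customers source table changes?
    for asset in downstream_assets(LINEAGE, "databricks.sales.customers"):
        print(asset)
```

For a change to the customers table, the traversal lists the joined table in Databricks, the CAS table and the report, which is exactly the chain a data engineer would need to review.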

In addition to providing lineage, SAS Viya offers robust access control mechanisms to ensure that only authorized users have access to Databricks (and any other) data that has been introduced to SAS Viya. SAS Viya allows organizations to define and enforce their data governance policies. This includes policies related to data quality, security, and compliance. With role-based access control and tight integration with enterprise authentication systems, SAS Viya can streamline your access to Databricks. Single sign-on is available to authenticate connections from SAS Viya to Databricks in Azure: SAS Viya’s credential services obtain a Microsoft Entra ID token and use it to allow seamless access.

SAS Viya supports data quality monitoring and profiling of any data, including data in Databricks. This makes it simple for Data Engineers and Data Stewards to assess and monitor the quality of data, identify data anomalies, and have the necessary tools to take corrective actions, for example with Clean Data and Parse Data steps in SAS Studio flows.
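To illustrate what profiling means at the column level, here is a small, self-contained Python sketch. It is not the SAS profiling implementation; the profile_column helper and the sample rows are hypothetical and only show the kind of metrics (missing values, distinct values, completeness) a profile typically reports.

```python
def profile_column(values):
    """Compute simple profiling metrics for one column given as a list."""
    total = len(values)
    # Treat None and empty strings as missing values.
    present = [v for v in values if v is not None and v != ""]
    return {
        "rows": total,
        "missing": total - len(present),
        "distinct": len(set(present)),
        "completeness_pct": round(100 * len(present) / total, 1) if total else 0.0,
    }


# Hypothetical sample data with typical anomalies: a duplicate key,
# missing values and inconsistent casing.
rows = [
    {"customer_id": 1, "country": "FI"},
    {"customer_id": 2, "country": ""},
    {"customer_id": 3, "country": "fi"},
    {"customer_id": 3, "country": None},
]

profile = {col: profile_column([r[col] for r in rows]) for col in rows[0]}

if __name__ == "__main__":
    for column, metrics in profile.items():
        print(column, metrics)
```

Even this toy profile surfaces issues worth correcting: customer_id has fewer distinct values than rows, and country is only half populated with two spellings of the same code, which is where standardization steps come in.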

My colleague Patric has explained the data quality process in detail in his data quality blog here: Data Brilliance Unleashed: SAS Data Quality against Databricks - Precision, Performance, Perfection. In his blog Patric takes you through the whole quality improvement process, including identification, splitting, standardization, match code creation, clustering and entity resolution, so it’s a wholeheartedly recommended read!

For those developers who prefer to do their data quality in code, SAS Studio includes a collection of data quality code snippets. They can be run as-is or embedded into SAS Studio flows as code steps. You can read more about efficient use of snippets here: Working with snippets

A data catalog is a central metadata repository that helps users discover, understand, and manage all their data assets. SAS Viya includes SAS Information Catalog to discover catalogued data from any supported data source, for example Databricks, making it much easier for data users to find relevant data and understand its context. If you come from a SAS9 background, you are most likely familiar with the concept of metadata. While SAS Viya does not have a Metadata Server like the one in SAS9 to manage both technical and business metadata, rest assured, the metadata is still there. SAS Information Catalog is based on discovery agents set up by the platform administrator that work hard to gather metadata on the data assets connected to your environment.

As SAS Viya’s discovery agents gather the metadata, they also go through a data profiling process that extracts the descriptive data metrics and quality indicators on your data. SAS Information Catalog provides a centralized view of all your metadata, thus helping you to understand the characteristics of your data. A good understanding of the total data asset is key in building comprehensive data governance, and the asset dashboard in SAS Information Catalog does exactly that:

The above image is borrowed from the great blog post SAS Information Catalog: All your information assets under one roof by my colleague Rajeeve Narula. What is great about SAS Information Catalog is that once you find the data you’re looking for, you can instantly view the analyzed data metrics with a one-click drill-down. An example of a typical column-level analysis is shown below:

Much like the SAS Viya platform in general, SAS Information Catalog provides REST APIs accessible from SAS, Python, or shell scripts. These REST APIs enable searching and identification of files, tables and other assets based on specific criteria. Developers and data engineers can leverage them to incorporate files and tables into data management tasks, as well as to trigger actions or automate workflows. The APIs can also gather insightful metadata and provide a comprehensive view of your data landscape. With the information gained, data users can explore their data ecosystem effectively, get a high-level overview of assets and make informed decisions. Depending on your task, you can interact with several REST endpoints; examples are shown in the image below:

For further insight into how to utilize SAS Information Catalog’s REST APIs, have a look at my colleague Bogdan Teleuca’s informative blog post: Leveraging SAS® Information Catalog REST APIs: Programmatically Discovering Data.
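As a rough idea of what calling the catalog from Python could look like, here is a minimal sketch using only the standard library. It assumes a search endpoint at /catalog/search that takes the search text in a q parameter and an OAuth bearer token for authentication; treat the path, the parameter and the helper names as assumptions and confirm them against the SAS Information Catalog API reference for your release. The base URL and token are placeholders.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_catalog_search(base_url, access_token, query):
    """Build (but do not send) a GET request for a catalog search.

    The /catalog/search path and the q parameter are assumptions to verify.
    """
    params = urlencode({"q": query})
    return Request(
        f"{base_url.rstrip('/')}/catalog/search?{params}",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Accept": "application/json",
        },
        method="GET",
    )


def search_catalog(base_url, access_token, query):
    """Send the search request and return the parsed JSON response."""
    with urlopen(build_catalog_search(base_url, access_token, query)) as response:
        return json.load(response)


if __name__ == "__main__":
    # Placeholder host and token; this only prints the URL that would be called.
    request = build_catalog_search("https://viya.example.com", "<access-token>", "customer orders")
    print(request.full_url)
```

Splitting request construction from sending keeps the sketch easy to inspect and test without a live SAS Viya environment.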

Having firm control of your metadata is crucial, but it’s difficult to understand the big picture without a link to the real world. This is where a data glossary comes in with the ability to manage your business terms and link them with your data assets. SAS Information Catalog has a glossary component that enables you to manage business terms and more importantly build connections to the actual data assets. The glossary supports a collaborative approach to managing this information and allows you to:

Create and maintain term types and add attributes to them
Create new terms and import delimited lists of terms
Establish relationships between terms and term types
Review terms and term types in the Glossary window
Assign terms to SAS Information Catalog assets
Search for terms in the Search field

A typical hierarchy of business terms in the SAS Viya Glossary looks like the image below:

With solutions like the ones above, you can bring control and governance to your data. The key thing about having a lot of data is finding the right way to make it work for you. Unless you’re a data engineer like me and do it for the kicks, in the real world there is always a use case to implement and a business goal to accomplish. Data without access, quality and governance is just idle ones and zeroes. No matter where you collect, store, and maintain your data, you will always need tools to manage it in a controlled and governed manner. Once you have established control and monitoring procedures for your data lake, there is less chance of it gradually becoming a data dump. A solid and governed data foundation lets you both sleep better at night and get faster from data to value!
