BookmarkSubscribeRSS Feed

Kick-start Your Data Exploration Journey with SAS Information Catalog

Started 3 weeks ago by
Modified 2 weeks ago by
Views 549

Different types of users get started with SAS Information Catalog at different points of the process. Administrative users, for example, begin by running discovery agents on the libraries that need to be included in the catalog. Other users are more focused on searching for the information assets that they need or on fully investigating the assets that they find.

 

As a part of data exploration, while working in SAS Visual Analytics you might want to generate descriptive measures for the selected table. In Visual Analytics starting LTS 2023.10 the Data pane no longer provides the option to view Measure details. The SAS Information Catalog helps you obtain the needed descriptive measures and much more to understand your data better. It gives you the ability to ingest, integrate, and enrich metadata from the assets that are distributed across your enterprise. In fact, everyone might find some relevant information which is useful in their line of work. In this blog, I will discuss just the steps to generate descriptive measures for your data.

 

Accessing SAS Information Catalog

 

Launch SAS Viya platform application. Click Applications menu (three horizontal bars) in the upper left corner and select Discover Information Assets. SAS Information Catalog opens into the Catalog Home window. Go to the Catalog Home window when you need to access the information assets that you have already discovered, search for assets, or manage the SAS Information Catalog application. The window is shown below-

 

ms_1_Welcomepage_Edited-1024x437.png

 

The information assets that you have recently discovered and the assets that you have marked as favorites are listed in the center of the Catalog Home window¹. The field used for entering search queries and the menu for search type selection and search help are located in the center of the Catalog Home window².

 

Ad Hoc Analysis

 

You can analyze data feature to run an analysis on the metadata of database assets that were not analyzed when they were added to the information catalog. Starting with the 2023.12 release, all authenticated users can perform an ad hoc analysis on any table that they can read. You can access this feature from SAS Information Catalog in multiple ways and are discussed below:

 

  1. the Collections section of the Catalog Home window (select the box next to a data set and click on Actions button to select Analyze data).

    ms_2_Collectionsection_Edited-1024x444.png

  2. the Search Results window.

  3. the Overview tab in the details window for a database asset (from the Catalog Home window, click on the data set name to access Overview tab and Analyze data button)

    ms_3_Overviewsection_Edited-1024x480.png

Analyze data can be run on CAS database assets by any authorized SAS Information Catalog user. You must have an Information Governance license to run analyze data on other types of database assets. COMMSDATA is a CAS table asset in the CASUSER library. This table has not yet been analyzed, but you can use ad hoc analysis to perform an analysis and access more detailed metadata for the asset. Using any of the three above mentioned approaches, click Analyze data to start the ad hoc analysis of the asset.

 

ms_4_Adhoc-request-message.png

Select any image to see a larger version.
Mobile users: To view the images, select the "Full" version at the bottom of the page.

 

 

You get an ad hoc request message that displays information about the job request for the analysis. Click Submit request button for a standard profile analysis. If you want to select a basic analysis or configure options for a profile analysis, you can click Show advanced options for this dataset.

 

ms_5_Adhoc_AnalysisMode-300x199.png

 

 

You can use Analysis mode field if you want to select either a Profile analysis or a Basic metadata analysis for the selected data set. The Profile analysis option generates the most information and is displayed by default. The Basic option generates a quick analysis that includes minimal table and column information. If you are satisfied with the options displayed for the data set analysis, you can click Hide advanced options for this dataset to return to initial version of the Analyze Data window. Now click on Submit request button. When the ad hoc analysis job completes successfully, the message at the top of the window changes. Click View the analysis link in the window to see the results.

 

ms_6_View-Analysis_Edited-1024x131.png

 

The details view of an information asset is opened in the Overview tab.

 

ms_7_Result_Analysis_Overview-1024x469.png

 

The top of this updated window contains general information about the physical properties, completeness, and status of the selected result. It also includes the Actions drop-down menu, which enables you to open the result in a related application such as SAS Visual Analytics, Model Studio, SAS Lineage Viewer or to close the asset details window. This detailed view of a selected information asset, in turn, is divided into three tabs that provide distinct Information about the asset:

 

Overview tab - displays statistics, a description, a data types panel, and properties.

Column Analysis tab - provides access to descriptive measures, metadata measures, and data quality measures.

Sample data tab - displays sample data for the selected information asset.

 

Overview Tab Components

 

The Overview tab contains several components. The Data Types Panel at the top displays the data type frequency and number of columns for each data type. You can click on the colored bar in the treemap to filter on the data type. This displays only columns that match the selected data type. Hovering mouse on the blue bar indicates that out of 128 columns there are 111-numeric and 17-character columns in the COMMSDATA table. If you have licensed SAS Information Governance, the information asset Overview tab contains the Information Privacy, Time Period Covered, Top Areas Covered, and Top Languages fields.

 

Information Privacy- Indicates whether the asset includes potentially private information about individuals. The asset can be designated as Sensitive, Private, or Candidate. Columns with no privacy concerns display the string (none). In COMMSDATA columns like Customer ID, Upsell_Xsell, Churn and Lifetime Value etc. has information privacy labelled as ‘None’. They do not convey any personal information but if you scroll down and examine other variables you find that many variables in the data set have information privacy value labelled as ‘Private’ or ‘Candidate’.

 

Time Period Covered- Displays information about the time interval covered in the asset.

Top Areas Covered- Displays information about the spatial area covered in the asset.

Top Languages - Reports the detected languages and the analyzed columns.

Summary - These descriptions enable you to review meaningful information about the whole asset without going through the information for each column individually.

 

Column Analysis Tab Components

 

Click the Column Analysis tab to see more detailed information about the table columns. You can access Descriptive Measures, Metadata Measures, and Data Quality Measures tables from column analysis tab. You can drill into a column to view the column graph. The Descriptive Measures table displays several statistics for numeric and character data items. For numeric data items it displays statistics like minimum, maximum, number of distinct values, mean, median, standard deviation, quartiles, skewness, kurtosis, number of missing values and outliers among others. For character data items it displays number of distinct values, number of missing values, number of blank rows in the selected column and number of values that do not match (mismatched) the actual type of the column etc. We have these statistics /metrics displayed for every single column in the table and it’s nice to see them all side by side so that you can compare them across the column. These statistics aid analysts to better understand data and decide about data pre-processing strategies, if any.

 

ms_8_Result_Column-Analysis-600x220.png

 

 

Correlation and outlier indicators are displayed next to the column names in this table. Correlations and outliers are marked by applicable icons such as the small arc with a dot in the middle is used for correlation and an unfilled dot separated from cloud of filled dots for outliers. There are many variables in COMMSDATA that contains outliers and / or are correlated with other variables. The correlations that you see at any given time are displayed for the currently selected row, which in this case is calls_care_3mavg_acct (Number Calls Care Center 3 Month Avg). You can scroll down through the table to see whether any other columns are correlated to the selected column. Alternatively, you can select the corresponding row and click correlation icon in the Properties pane (to the right) to see a list of all of the columns that are correlated with the selected column. In the figure above, when you click on the correlation icon for calls_care_3mavg_acct variable you see that it is strongly correlated with res_calls_3mavg_acct, calls_care_6mavg_acct, and res_calls_6mavg_acct. The magnitude of correlation is also displayed alongside.

 

Alternatively, you can also place the cursor over the correlation icon for one of the correlated columns to see the degree of correlation. Similarly, you can place your cursor over the outliers icon to see the number of outliers for the column in a data tip. Outlier indicators are also visible in the Outliers column in the Descriptive Measures view and the Properties pane for a selected column.

 

Now, let’s have a look at the Metadata Measures table-

 

ms_9_MetadataMeasures-1024x390.png

 

It displays the Name, Label, Type, Logical Type, Format, and Information Privacy for numeric as well as character columns. Logical data types are sometimes referred to as statistical data types. Possible values are Nominal, Interval, Unary, Binary, and ID. In COMMSDATA, the Logical Type for the first variable Customer_ID is identified as ID and Churn as Binary. The Actual Type (applies to character data item only) identifies the SQL type of the data contained in a column. Possible values are Boolean, Date, Integer, Real, and String. This table also helps identify a Primary Key Candidate variable, i.e., if a column has high enough uniqueness percentage to be used as a key column; available values are Yes and No.

 

Finally, we take a look at the Data Quality Measures table-

 

ms_10_DataQualityMeasure-1024x462.png

 

Data Quality Measures table displays Name, Completeness (% of values in the column that are complete), Uniqueness (%of values in the column that are unique), Most/Least Common Value, Semantic Type, and Information Privacy for both numeric and character data items. Semantic type indicates the classification of the column’s data determined by the Quality Knowledge Base (QKB) identification analysis definitions. The column’s semantic type considers both the column’s name (if present) and the column’s content to ultimately classify the data in the column. For example, a column that contains last names might get a semantic type of Family name. Note that these semantic type classifications are not perfect, but it gives you a good idea without much effort. Starting with release 2023.10, you can specify or change the semantic type assigned to a column. You must have a SAS Information Governance license to work with semantic types.

 

Data Quality Measures table also displays Pattern Count for character data items. Pattern count indicates the number of unique word or character patterns that occur in the column.

 

ms_11_PatternCount-1024x305.png

 

 

For example, the variable handset (Handset Mfg) has a pattern count of 5. This means handset variable indicates 5 unique words (i.e., handset manufacturers names) or has 5 unique character patterns. As shown in the figure above, one of the character patterns (Aaaaa) indicates a five-letter word that starts with a capital letter (example-Apple), similarly you see a character pattern (AA) that consists of 2 capital letters (example- LG) and so on.

 

Column Graphs

 

Click the link beneath any column listed in the Descriptive Measures, Metadata Measures, or Data Quality Measures tables in the Column Analysis tab to see a set of graphs for the column. (You can also right-click the column and select View column details in the pop-up menu.) The graphs that are displayed depend on whether the selected column is a numeric or a character column. For numeric columns you get graphs that include Frequency distribution, Deviation from Normality, Full distribution, Completeness and Uniqueness.

 

ms_12_ColumnGraph_numeric-1024x456.png

 

In the top left corner, you can see distinct values, completeness & uniqueness. So, you have 19,867 distinct values for 56,557 rows and 98% completeness (i.e., 2% missing values). Uniqueness is calculated to be 35% (which is 19,867 of 56,557). In the bottom left, you see a nice chart between kurtosis and skewness. Putting your cursor on the small dot in the chart displays the values of kurtosis and skewness for the selected column. On the top right you see a frequency distribution chart that shows most of the users have their average revenue for the past 3 months value to be 51.77 dollars. Finally, on the bottom right you see the full distribution of chosen column which is represented by a Box and Whisker plot. Next, we take a look at the character column.

 

For character data items graphs displayed are Completeness and Uniqueness, Mismatched, Frequency distribution, Pattern frequency. However, for a long text column it also includes a word cloud, Languages Detected and Sentiment Analysis.

 

ms_13_ColumnGraph_Character-1024x458.png

 

 

The above picture displays graph for a long text column namely verbatims (Survey Verbatims). This variable represents feedback from customers via call centers. The word cloud on the bottom left displays key words that are frequently used in the feedback. On the bottom right, you can find the Sentiment analysis graph (not captured in the graph above) that displays the sentiment polarity of the input document. You can quickly see the percentages for the content that registers as positive, negative, or neutral sentiments for the long text column.

 

Sample Data Tab Components

 

Click the Sample Data tab to see sample data for the selected information asset. By default, it displays 100 rows to give you a snapshot of your dataset. You can enter a different value (greater than 0 or less than or equal to 1000) for sample rows.

 

Conclusions

 

The SAS Information Catalog enables users with a powerful search engine to find the needed data assets. Many assets that are created by SAS Viya applications, such as tables, data plans, reports, and models, are added automatically to SAS Information Catalog. It helps to remove hurdles to data discovery and acquisition. You can quickly take the decision if a certain table is suitable to use for analysis or not by judging the data preparation efforts and data quality issues. It helps you identify if the data includes private information. Moreover, you can open the information assets that you find using SAS Information Catalog in related applications that help you to better understand how your data is used. For example, you can open an information asset that you locate using SAS Information Catalog in SAS Visual Analytics or Model Studio for further analysis.

 

Additional Resources

 

Comments

The article explains Information Catalog well, with screenshots and details. It is helpful for preparing the course SAS Viya Overview.

Version history
Last update:
2 weeks ago
Updated by:
Contributors

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Free course: Data Literacy Essentials

Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning  and boost your career prospects.

Get Started

Article Tags