
How to Collect Metadata with the SAS Information Catalog

Started ‎03-16-2021 by
Modified ‎08-29-2021 by

bt_1_24_bt_200_SAS_Information_Catalog_action-1024x230.png

 

The SAS Information Catalog is finally here! Read the article to understand how metadata is collected with a discovery agent (crawler) and added to the catalog. Learn what content can be crawled from a caslib and a SAS compute library and how to monitor the agents. The post gives you a preview of the SAS Viya version 2020.1.3 – February 17th, 2021 release.

 

Search the Catalog

In 7 Ways to Use the New SAS Information Catalog you might have read how catalog users can search the catalog and assess and understand the data assets. This post focuses on the metadata collection process.

 

Collect Metadata

The collected metadata is stored in the information catalog. You add information to the catalog by running discovery agents on libraries. These agents, also known as bots or crawlers, "crawl" through the content of caslibs or SAS compute libraries. They collect the metadata from the physical tables or files inside the library and calculate many metrics in the process.

Caslibs and SAS Compute libraries

SAS Information Catalog discovery agents can ingest metadata from global CAS libraries (caslibs) or SAS compute libraries. As a consequence, data sources covered by a Data Connector, a Data Connect Accelerator, or a SAS/ACCESS engine become discoverable. (SAS/ACCESS for Hadoop needs some extra path options.) Let’s see two examples:

  1. Caslib.
  2. SAS Compute library.

Caslib discovery agent

Findings:

  • The caslib must be global (promoted).
  • The caslib must be visible to a SAS Administrator (at least Read authorization).
  • The agent crawls only physical files or tables from caslibs. It does not (yet) catalog tables that exist only in memory.

Examples:

  • If your CASLIB is of Database type (ODBC) you can crawl the physical tables.


bt_2_350_CASLIB_ODBC_Crawlable_files_tables.png


 

  • If your CASLIB is of Filesystem type (PATH) you can crawl the physical files: txt, csv, jmp, xls, xlsx, sas7bdat, sashdat, dta, sav, parquet, orc. See the documentation for more info.

bt_3_350_CASLIB_PATH_Crawlable_files_tables.png
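
To get a feel for what a PATH caslib would expose to the agent, here is a minimal Python sketch that filters a list of file names by the extensions listed above. The function names are hypothetical, and the case-insensitive match is an assumption; the agent's real matching rules are not described in this post.

```python
from pathlib import PurePosixPath

# File extensions this post lists as crawlable in a PATH caslib.
CRAWLABLE_EXTENSIONS = {
    "txt", "csv", "jmp", "xls", "xlsx", "sas7bdat",
    "sashdat", "dta", "sav", "parquet", "orc",
}


def is_crawlable(file_name: str) -> bool:
    """Return True if the file extension is in the list above.

    Assumption: the comparison ignores upper/lower case.
    """
    suffix = PurePosixPath(file_name).suffix.lower().lstrip(".")
    return suffix in CRAWLABLE_EXTENSIONS


def crawlable_files(file_names):
    """Keep only the files whose extension appears in the list."""
    return [name for name in file_names if is_crawlable(name)]
```

A call such as `crawlable_files(["cars.sas7bdat", "notes.docx"])` keeps only the first entry.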

 

To discover new assets with the SAS Information Catalog:

  • Add a new discovery agent.
  • Select the library created.
  • Fill in a description (and, optionally, a physical region).
  • Save and run the discovery agent.
  • Monitor the run.
  • Consult the new assets added.

Create a caslib

You can create global caslibs:

  • In SAS Information Catalog, there is an Import Data button, which will re-direct you to SAS Environment Manager.
  • In SAS Data Explorer or SAS Environment Manager. See Making Data Available to CAS in SAS Data Explorer: User’s Guide.
  • By writing your own CASL code in SAS Studio.
  • By using another client, for example Python with the SAS SWAT package.
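
As an illustration of the last option, the sketch below prepares the parameters for the CAS `table.addCaslib` action and runs it on an open connection. It is a sketch under assumptions, not the only way to do it: the helper names, host, port, and path are placeholders, and `conn` is expected to be a `swat.CAS` connection. Setting `session=False` asks CAS for a global caslib, which is what the discovery agent needs.

```python
def addcaslib_params(name, path):
    """Build the parameters for the CAS table.addCaslib action.

    session=False requests a global caslib rather than a
    session-scoped one. srcType "path" makes it a PATH caslib.
    """
    return {
        "name": name,
        "path": path,
        "dataSource": {"srcType": "path"},
        "session": False,
    }


def create_global_caslib(conn, name, path):
    """Run table.addCaslib on an open CAS connection (e.g. swat.CAS)."""
    return conn.table.addcaslib(**addcaslib_params(name, path))


# Usage (requires the swat package and a reachable CAS server;
# host, port, credentials, and path below are placeholders):
#
#   import swat
#   conn = swat.CAS("cas.example.com", 5570, "user", "password")
#   create_global_caslib(conn, "sales", "/data/sales")
```

The parameters are built in a separate function so you can inspect them before anything is sent to the server.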

 

Add a new discovery agent

In SAS Information Catalog, create a discovery agent. Choose or search the caslib to be discovered:

bt_4_320_SAS_Information_Catalog_discovery_agent_caslib_description-1024x492.png

 

Run the discovery agent

When the job status is Idle, the job has completed. 

  • If the discovery agent run is successful, you might see new assets added to the catalog. "Might" because the agent only adds assets that are not yet cataloged, or assets that were cataloged earlier and have been modified since the last run.
  • If the discovery agent run fails, the failure is visible only in SAS Environment Manager.
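
The "new or modified since the last run" rule can be pictured with a few lines of Python. This is only a sketch of the selection logic described above; the function name and the data structures are hypothetical and say nothing about how the agent stores its state.

```python
from datetime import datetime


def assets_to_catalog(assets, cataloged, last_run):
    """Pick the assets a run would add or refresh.

    assets    -- dict mapping asset name to its last-modified datetime
    cataloged -- set of asset names already in the catalog
    last_run  -- datetime of the previous run, or None for a first run
    """
    selected = []
    for name, modified in assets.items():
        is_new = name not in cataloged
        is_changed = last_run is not None and modified > last_run
        if is_new or is_changed:
            selected.append(name)
    return sorted(selected)
```

If nothing is new and nothing has changed, the list is empty, which matches a successful run that adds no assets.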

 

Monitor the discovery agent

You can trace the execution in SAS Environment Manager, in the Jobs and Flows section:

  • The most recent discovery agent run (green) executed successfully: the library contained at least one physical file or table that can be crawled.
  • The first discovery agent run (red) failed: the library contained only in-memory tables, or files and tables of unsupported types.

 

bt_5_341_SAS_Environment_Manager_discovery_agent_jobs_status-1024x488.png

 

  • The discovery agent can also complete successfully but with warnings (green). For example, some columns may have been too long and were discarded, or an authorization issue may have blocked the agent entirely.

 

Each discovery agent is made of jobs. The jobs run in this order:

  1. A job with the same name as the discovery agent kicks in. This job is internally called the CATALOG-TABLE-BOT.
  2. The Crawl <discovery agent name> connection job creates a list of the physical files or tables to be analyzed.
  3. The Analyze <discovery agent name> connection job calculates the catalog metrics and, if successful, stores them in a PostgreSQL database. These metrics are surfaced in the catalog interface.
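
The three-job sequence can be sketched as a small Python simulation. Everything here is illustrative: the function names, the job labels, the item layout, and the single `rowCount` metric are assumptions made for the example, and the real jobs run inside SAS Viya, not in Python.

```python
def crawl(library_items):
    """Job 2: list the physical files or tables to analyze."""
    return [item for item in library_items if item["physical"]]


def analyze(items):
    """Job 3: compute metrics per item (a row count as a stand-in)."""
    return {item["name"]: {"rowCount": len(item["rows"])} for item in items}


def run_discovery_agent(agent_name, library_items):
    """Run the jobs in order and report which ones ran."""
    jobs = [agent_name]                    # job 1: named after the agent
    jobs.append(f"Crawl {agent_name}")     # job 2: build the work list
    to_analyze = crawl(library_items)
    if not to_analyze:
        # Nothing crawlable in the library: the run fails.
        return {"jobs": jobs, "status": "failed", "metrics": {}}
    jobs.append(f"Analyze {agent_name}")   # job 3: compute the metrics
    return {"jobs": jobs, "status": "completed", "metrics": analyze(to_analyze)}
```

A library that holds only in-memory tables stops after the crawl step, which mirrors the failed (red) run described earlier.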

 

Parameters

The jobs have a series of parameters. You cannot (yet) adjust them in the interface:

  • Quality Knowledge Base (QKB) locale is applied, for data identification, by default: ENUSA.
  • The size of the data set sent to the profiling engine, by default sampleSize = 10000 records.
  • Limit the amount of data coming from the analyze scripts.
  • A threshold is used to decide when to sample for other metrics: rowSizeThreshold = 200000.
  • For sampling on large data sets, a percent is applied: samplePercent=20.
  • A cutoff for statistical values calculated: intervalCutoff = 20; bestChartCutoff = 20.
  • Whether or not SAS recommends a time series graph.
  • How many levels to calculate for topN and bottomN and for the frequency chart type in graphs: topnlevelsToGet=20.
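
To make the defaults easier to reason about, the sketch below collects the values listed above in a Python dataclass and shows one plausible reading of the sampling threshold. The attribute names are Python-style renderings of the parameter names, and the rule in `rows_for_metrics` is an assumption: this post does not spell out exactly how `rowSizeThreshold` and `samplePercent` interact.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentParameters:
    """Default values as listed in this post."""
    qkb_locale: str = "ENUSA"
    sample_size: int = 10_000           # rows sent to the profiling engine
    row_size_threshold: int = 200_000   # above this, sample for other metrics
    sample_percent: int = 20
    interval_cutoff: int = 20
    best_chart_cutoff: int = 20
    topn_levels_to_get: int = 20


def rows_for_metrics(row_count, params=AgentParameters()):
    """Assumed rule: use every row up to the threshold, else a percentage."""
    if row_count <= params.row_size_threshold:
        return row_count
    return row_count * params.sample_percent // 100


def top_levels(values, params=AgentParameters()):
    """Most frequent values, limited to topnlevelsToGet entries."""
    return Counter(values).most_common(params.topn_levels_to_get)
```

Under this reading, a table of one million rows would have its other metrics computed on 200,000 rows, while a table of 150,000 rows would be read in full.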

The metrics produced by the agent are shown in 7 Ways to Use the New SAS Information Catalog.

SAS Compute library discovery agent

The other library type you can crawl is a SAS Compute library. You can create compute libraries in SAS Studio. See Working with Libraries in SAS Studio: User’s Guide. In the following example, you might define a BASE (V9) SAS library. Tips:

  • If you are using SAS Studio to create the library, the option Assign and connect to data sources at start-up must be selected.
  • To avoid a cross-site encoding error, add INENCODING=ANY and OUTENCODING=ANY to your LIBNAME options (the input and output encoding boxes below).

 

bt_7_350_SAS_Studio_new_SAS_library.png
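
If you prefer to see the library definition as code, the following Python helper builds the text of an equivalent LIBNAME statement. It is a sketch that assumes the two encoding boxes map to the INENCODING= and OUTENCODING= LIBNAME options; the helper name, libref, and path are placeholders.

```python
import re


def libname_statement(libref, path, engine="BASE"):
    """Return the text of a SAS LIBNAME statement with both encodings set to ANY.

    A libref is 1 to 8 characters, starts with a letter or underscore,
    and contains only letters, digits, and underscores.
    """
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]{0,7}", libref):
        raise ValueError(f"invalid libref: {libref!r}")
    return f'libname {libref} {engine} "{path}" inencoding=any outencoding=any;'
```

You could paste the resulting statement into a SAS Studio program, although the library in this example is defined through the SAS Studio dialog shown above.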

 

SAS Compute library discovery agent (behind the scenes)

The crawl process is similar to the one for caslibs. Only this time, the whole process runs in SAS Compute:

 

When the third step, the Analyze <discovery agent name> connection job, kicks in, the metrics are collected mostly with:

 

Watch the discovery agent at work in this 1-minute video:

 

 

Notes

License

While the product is called SAS Information Catalog, there are two licenses:

  • SAS Information Catalog (base features).
  • SAS Information Governance (advanced features).

 

Please note you need a SAS Information Governance license to:

  • Crawl SAS compute libraries.
  • Assign semantic types (data identification using the QKB).
  • Classify information privacy.

See the following resource for the complete set of features.

 

Metrics Storage

  • The metrics are stored in PostgreSQL in this version of SAS Information Catalog and in future versions.
  • In a future release, SAS Information Governance (advanced features) will store the results in JanusGraph. JanusGraph is a scalable graph database optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster. Graph databases are particularly suitable for discovering relationships and paths.

 

Roles

By default, SAS Administrator group members can create discovery agents that capture and enrich metadata for information assets. 
Users can only search the catalog, explore details about assets, take actions on assets, and build collections of searches.
These groups and roles might evolve in a future release.

In the current role matrix, administrators perform the following tasks:
  • Identify or create libraries for assets that should be indexed in the catalog.
  • Add a discovery agent for each library.
  • Monitor discovery agent jobs in SAS Environment Manager.
  • Schedule discovery agents. 
  • Respond to notifications, such as a notification to rebuild the catalog index.
  • Assign a review status to an asset or an extra description.

What is New in the SAS Information Catalog SAS Viya 2020.1.4 Release

The latest stable release of SAS Viya (2020.1.4) added the following to the SAS Information Catalog: Locale selection for Discovery Agents, Information Privacy, Time Period, Area Covered for Assets. Read more about it here.

Conclusions

In the brand-new SAS Information Catalog on SAS Viya version 2020.1.3, you can run discovery agents on libraries (caslibs or SAS compute libraries). These agents crawl through the libraries. They collect the metadata from the tables and calculate many metrics in the process. The agents run as jobs that can be monitored in SAS Environment Manager. They bring the metadata into the SAS Information Catalog. Then you can use the powerful search engine to find the data assets you need. See the videos in 7 Ways to Use the New SAS Information Catalog.

Resources

Acknowledgements: Nancy Rausch, Kumar Thangamuthu and Vincent Rejany.

 

Thank you for your time reading this post. If you liked the post, give it a thumbs up. Please comment and tell us what you think about the new SAS Information Catalog.

 

Find more articles from SAS Global Enablement and Learning here.
