The SAS Information Catalog is finally here! This article gives you a preview of the SAS Viya version 2020.1.3 release (February 17th, 2021). Read it to learn how to make good use of this brand-new SAS product. The SAS Information Catalog helps you uncover the data you need for your business purpose. It gives you a place to ingest metadata from data sources. You can then use that metadata to find data relevant to your business goals and to understand the data sets you need.
There is no fixed playbook for how the SAS Information Catalog should be used, or by whom. In fact, everyone might find a piece that is useful in their line of work. Let's look at just a few of them:
SAS Information Catalog supports standard and syntax search. Both search methods return a list of results with the highest-scored results listed first. Let’s assume you are a data scientist and you need to forecast household water consumption in a certain region. You need to find the most relevant data sets for your purpose: water meter data, consumption in cubic meters, meter location, and so on. You don’t know the data sources, and there are a huge number of them. How do you find the needle in the haystack? The powerful magnet that attracts the needle is the search.
For example, you can search for parts of words without using wildcard characters. Instead of the word “water” you can enter wat and still see items for water included in the search results. Standard search also supports fuzzy logic, which means that closely related strings such as watr also match terms like water.
Search for water*
Search for watr
Several results appear. The tables are listed top-down by relevance. You would get a similar top-three ranking if you searched for tables with water data. Standard search supports free text entry, which enables you to enter any word or phrase to form a query. You can then run the query at the table or column level. This approach enables you to use conversational language to describe the information asset that you need, without specialized phrasing or syntax. Elasticsearch is at work behind the scenes. If you are looking for a specific column, for example the water volume in cubic meters (m3), you can try a wildcard search:
Search for *m3*
The results are now even more relevant: only tables with a column called Daily_W_C_M3 are displayed.
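To make the fuzzy matching of watr and water more concrete, here is a minimal Python sketch of Levenshtein edit distance, the measure that Elasticsearch fuzzy queries are based on. The function names and the max_edits threshold are illustrative assumptions; the article does not say how the catalog configures fuzziness.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, or substitutions that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def fuzzy_match(term: str, candidate: str, max_edits: int = 1) -> bool:
    """Hypothetical helper: accept a candidate within max_edits of the term."""
    return edit_distance(term.lower(), candidate.lower()) <= max_edits


print(edit_distance("watr", "water"))
print(fuzzy_match("watr", "water"))
print(fuzzy_match("watr", "cluster"))
```

Note that wat and water are two edits apart, so a one-edit fuzzy match alone would not connect them. Partial-word matching is a separate mechanism from fuzzy matching.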
Search for tables created between the 12th and 14th of January 2021. Try dateCreated: [2021-01-12 TO 2021-01-14]
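If you build such range queries often, a small helper can format the dates for you. The following Python sketch only produces the query text that you would paste into the search box; the function name is hypothetical. In the Lucene Query Syntax, square brackets denote an inclusive range.

```python
from datetime import date


def date_range_query(field: str, start: date, end: date) -> str:
    """Build an inclusive range query string such as
    'dateCreated: [2021-01-12 TO 2021-01-14]'."""
    if start > end:
        raise ValueError("start must not be after end")
    return f"{field}: [{start.isoformat()} TO {end.isoformat()}]"


# Reproduces the query from the article.
query = date_range_query("dateCreated", date(2021, 1, 12), date(2021, 1, 14))
print(query)
```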
Search for tables having a keyword in the table label. Try label:"water"
The table containing "water" in the table label is shown.
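Search for assets that contain the keyword “water” or “cluster” in the name. Try name:"water"^3 OR name:"cluster"
Here “water” is boosted by a factor of 3, so a match on “water” contributes more to the relevance score than a match on “cluster”.
You can refine the query even more: name:"water"^3 OR name:"cluster" AND type:casTable
When you mix OR and AND, consider grouping clauses with parentheses, for example (name:"water"^3 OR name:"cluster") AND type:casTable, so that the intended precedence is explicit.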
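Queries like these can also be composed programmatically. The Python sketch below assembles the query text from small building blocks. The helper names (term, any_of, all_of) are hypothetical and are not part of any SAS or Lucene API. The sketch wraps the OR clauses in parentheses, which the query above does not, so that the grouping is unambiguous.

```python
from typing import Optional


def term(field: str, phrase: str, boost: Optional[int] = None) -> str:
    """Build a quoted field clause such as name:"water"^3."""
    # Escape backslashes and double quotes inside the phrase.
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    clause = f'{field}:"{escaped}"'
    return f"{clause}^{boost}" if boost is not None else clause


def any_of(*clauses: str) -> str:
    """Join clauses with OR and group them in parentheses."""
    return "(" + " OR ".join(clauses) + ")"


def all_of(*clauses: str) -> str:
    """Join clauses with AND."""
    return " AND ".join(clauses)


query = all_of(
    any_of(term("name", "water", boost=3), term("name", "cluster")),
    "type:casTable",  # unquoted value, written as in the article
)
print(query)
```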
You get the idea: the search is pretty powerful, and you have several options to refine it and return useful results. Syntax search is based on the Lucene Query Syntax (LQS). You can find more details in the SAS Information Catalog 2020.1.3 production documentation and in Apache Lucene - Query Parser Syntax.
SAS Information Catalog uses the Elasticsearch engine. The default configuration for Elasticsearch provides a good experience for most users, although administrators might want to change some options. For general information about Elasticsearch configuration options, see the SAS Viya deployment notes on Elasticsearch and the Elasticsearch documentation. Elasticsearch has been used for a while inside SAS Visual Investigator. Now, for the first time, it is part of a Data Management product.
The search saved you time and narrowed the results. It is time to take a closer look and see what data you can use. Explore the results:
Open the selected search result and drill down into a screen that contains a table overview. The Overview tab contains summarized textual and graphical information derived from the item’s metadata:
The overview might contain (ideally) some collective knowledge, collected from other users:
This knowledge can give you extra confidence that the table is a good candidate for your task. Notes:
In Column Analysis (Descriptive Measures), each column presents statistics about its content. In a few seconds you can assess the content and whether it matches the intended use. Looking at the column metrics, the table contains:
The consumption is for a range of dates in 2014 and 2015 (Year minimum and maximum).
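To give a feel for what such descriptive measures represent, here is a minimal Python sketch that summarizes a column. It is only an illustration under stated assumptions and does not show how SAS Information Catalog computes its metrics. The sample rows are made up, and only the column names Year and Daily_W_C_M3 come from the article.

```python
from statistics import mean


def describe_column(values):
    """Summarize one column: row count, missing values, distinct values,
    minimum, maximum, and (for numeric columns) the mean."""
    present = [v for v in values if v is not None]
    summary = {
        "count": len(values),
        "missing": len(values) - len(present),
        "unique": len(set(present)),
    }
    if present:
        summary["min"] = min(present)
        summary["max"] = max(present)
        if all(isinstance(v, (int, float)) for v in present):
            summary["mean"] = mean(present)
    return summary


# Hypothetical sample rows, invented for illustration only.
rows = [
    {"Year": 2014, "Daily_W_C_M3": 0.42},
    {"Year": 2014, "Daily_W_C_M3": None},
    {"Year": 2015, "Daily_W_C_M3": 0.38},
]

for column in ("Year", "Daily_W_C_M3"):
    print(column, describe_column([row[column] for row in rows]))
```

A minimum and maximum of 2014 and 2015 for Year is the kind of signal that tells you the date range the table covers.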
You can also drill down for more information about a selected column. A few examples: A numeric column:
A second string:
A column containing latitude and longitude:
A sample data tab enables you to browse a few sample rows, the same as in SAS Data Explorer.
The Column Analysis (Metadata Measures) helps you assess if:
The Column Analysis (Data Quality Measures) can answer these questions:
The same tab, Column Analysis (Data Quality Measures), can inform you of private data in the data set. Data identification is at work behind the scenes! The semantic type tells you what private, or potentially private, data you have in your data set. In this example, address, postal code, city, and coordinates are assessed as information privacy candidates.
In another table, these columns are assessed as private under information privacy. Depending on local laws, organizations may not use personal data for a purpose other than the original intent without securing additional permission from the consumer. If you are forecasting water consumption, there might be no reason to process names and phone numbers. It is always best to check with your Data Protection Officer.
Ideally, after you analyze the data, you enrich the collective knowledge and share your data discovery with others:
Adding to the business description helps this syntax search:
Search for description:"cubic meters" OR description:"m3" and the results will include data sets with these keywords in the description. As mentioned, features such as knowledge sharing, collaboration, and tagging are not part of this release. They will be part of a future release (no dates or version communicated yet). Finally, you can move to the next level: further explore and visualize, prepare, or manage the data, build models, or explore the lineage.
To collect metadata, SAS Information Catalog needs to discover (or “crawl”) these assets. Want to know how to crawl your own caslibs and SAS Compute libraries? Or monitor these agents? Read more in How to Collect Metadata with the SAS Information Catalog.
The latest stable release of SAS Viya (2020.1.4) added the following to the SAS Information Catalog: Information Privacy, Time Period, Area Covered for Assets, Locale selection for Discovery Agents. Read more about it here.
The brand-new SAS Information Catalog on SAS Viya 2020.1.3 comes with a powerful search engine to help you find the data assets you need. The catalog brings together a series of metrics calculated in different applications. The interface helps you assess the usability of the data, understand the content, drill down into column details, and view sample data. It also shortens the time you need to decide whether to use a certain table, because you can judge the data preparation effort, assess the data quality, and identify private data.
The SAS Information Catalog product will be offered in two variants, basic and advanced:
Acknowledgements: Nancy Rausch, Vincent Rejany, Kumar Thangamuthu and Ashish Sharma.
Thank you for your time reading this article. If you liked the article, give it a thumbs up.
Please comment and tell us what you think about the new SAS Information Catalog.
Find more articles from SAS Global Enablement and Learning here.