Ingesting External Files with Ease and Grace...Intro into DIAP

3 Likes

Preamble

This article is about the "Data Ingestion Auto Pilot - DIAP" custom step which is freely available in the public GitHub SAS Studio Custom Step repository for download. DIAP automatically determines the separators (comma, semicolon, pipe, tab, exclamation mark, hash, blank) when ingesting a text file and can deal with XLSX, JMP, SHP, JSON, XML, CSV, TXT files. This list is constantly expanding.

Take a look at the following animation to get a first idea of how interacting with DIAP could look like. And don't forget to continue reading this article to learn more about additional DIAP features:

Running DIAP with Default Settings

What is Your First Reaction When You Have to Ingest External Files?

It has been interesting for me to observe (first within myself, but then also with others), that the task of ingesting external files often comes with a form of frustration and can form a stumbling block in the progress of a project.

Even though there are many out-of-the-box capabilities for importing external files (making it supposedly easy), in many cases though there is more to be done. It quickly becomes cumbersome when there are many different files to be read in. Sometimes reading in even only a few files can cause headaches, because there are impurities in the file (on the meta level, but also in the actual content) that can cause unexpected outcomes.

And in some cases, it might be required to go through a whole directory tree to read in all files.

Dealing with similar issues repeatedly, I was determined to create a tool for resolving it. My nature, that I must automate something when I must do something twice, took over - DIAP, the Data Ingestion Auto Pilot was born. The purpose is to take care of those re-occurring headaches in an automated way.

While DIAP helps to drastically reduce the time to get insight into new data (from weeks down to hours), it also provides all the functionality necessary to use it for regular scheduled imports, since it can be set up so it only imports those external files updated since the last import.

So, What are The DIAP Features?

DIAP only requires pointing to a root directory. After that, it traverses all directories and subdirectories, if desired, (similar to a file crawler) finding all available external files, and making all of them available in the SAS ecosystem. This could be in a CAS library or a Compute library.

Now, the beauty is DIAP attempts to read in all the files it can find while automatically taking care of all the things that usually cause headaches, like:

Creating

suitable target table name that adheres to SAS naming conventions.
suitable variable names taking care of "weird" characters (like: $@&%., and other), if desired.

Dealing with

missing variables names and/or missing values for an entire column.
unwanted characters in text files that can cause an error prone import (e.g., carriage returns in comment fields).
very wide files (10,000s of variables).

Trying to guess if data starts in row 1 or row 2.
Checking for "all numeric values" in a character variable, and allows to transform into a numeric variable (happens often with “quoted text files”)

While the naming of target tables and variable names happen automatically, there are many ways to influence the naming to adhere to the SAS naming conventions and personal preferences. And, DIAP is also equipped to deal with fixed width files (see further down a potential workflow around that).

And There is More - DIAP Reference Tables

DIAP is doing all of this while keeping track of the operations using reference tables that are automatically created, either locally, or in a central location. Those reference tables help to:

show which source files were loaded into which SAS tables at what date,
keep track of required variable name changes and to make sure that newly created variable names stay unique.

And as mentioned in the intro, DIAP automatically determines the separators (comma, semicolon, pipe, tab, exclamation mark, hash, blank) when ingesting a text file and can deal with XLSX, JMP, SHP, JSON, XML, CSV, TXT files. This list is constantly expanding.

Also, using the newly created “Guided Automated Execution” DIAP mode, it allows to specifically point to files across SAS Server and SAS Content at the same time and read them in as provided in the control file.

And for Viya 4, it is Very Easy to Get Your Hands on DIAP

DIAP is available as a custom step, which means the “step” file only needs to be uploaded somewhere into SAS Content (see detailed upload instructions here ), and it becomes automatically available under “Shared” Steps in SAS Studio.

The DIAP SAS Studio Custom Step can be downloaded here.

Read more about the public SAS Studio Custom Step repository here.

Interacting with DIAP

Following are a few screenshots of the user experience when using the custom step. Each screenshot shows one of available tabs in the custom step.

Essential Tab

The “Essential” tab contains all the parameters to set for an initial run. All the other parameters in the other tabs could be used in their default setting.

Settings in the Essential Tab

Just provide the "Root Directory" and the "Target Libname", and DIAP is ready to go.

Learn more about the different execution modes further down showcasing a potential workflow in an animation.

General Tab

Here are a few settings that control the general flow of DIAP. Most of them speak for themselves.

Settings in the General Tab

Source Specific Tab

These settings control the execution with regards on how to handle the source files waiting to be read in.

Setting in Source Specific Tab

The screenshot shows that DIAP allows many ways to exclude files from the ingestion process depending on the filename beginning or ending. It also shows how a check for certain characters can be initiated so the import process is flawless. In this case, as can be seen in the screenshot, DIAP is checking for carriage returns, that could occur in comment fields, and replace those characters with a Null character.

Target Specific Tab

Here are all kinds of settings available that control how the data is stored in the destination.

Available Settings in Target Specific Tab

Some of the other parameters on there speak for themselves, "Substitute String" will omit the provided characters in the target table name in case the original would be too long. Often times, by omitting these kinds of characters, the target table names become short enough without losing its readability.

The screenshot shows that DIAP allows many ways to influence the created table names and variable names.

Admin Tab

Some administrative settings control where to store the generated metadata, and where to store the log file of a run, and some other useful settings.

Setting in Admin Tab

About Tab

This tab contains all the necessary information for using DIAP. It also contains a description of all the parameters.

Information Excerpt in About Tab

For a deeper look into all the parameters, please check out the DIAP Custom Step documentation.

Take a Peek - DIAP in Action

DIAP has various execution modes. Please watch the following animations for inspiration.

How to Run DIAP with Default Settings

The following animation shows a potential workflow when running in default settings. Only set the parameters in the "Essentials" tab and hit run:

How to Run DIAP in Default Settings

How to Ingest Fixed Width Files

Reading in fixed-width files is made easy with DIAP. It requires a specific named control file ("_DIAP_Special_File_Definition.csv") to be available anywhere in the directory tree of the selected root directory, and with that DIAP will first follow the instructions in that control file, and after that read in the non fixed-width files automatically.

And, DIAP has the option to create a template control file for you (DIAP execution mode #4), which can then be placed somewhere in the directory tree where DIAP can find it.

The following animation provides more insight about a potential workflow:

Ingesting Fixed Width Files with DIAP

Ingesting External Files Using Guided Execution Mode (Semi-Automated)

The guided execution mode in DIAP allows to a more selective ingestion of external files with more control for the user. This mode uses a control file in csv-format and DIAP will create a template for you. Please see the animation to find out:

how to create the control file with DIAP support.
how to use the control file for the guided automated DIAP execution.

Using Guided Execution Mode to Ingest External Files with DIAP

Taking a Look at the Reference Tables - Flag for Suspicious Source Files

Watch the following animation to explore the reference tables. The flag "check_source_file_flag" is helpful for suspicious files in the upload_log table. This points out certain source files that might need some adjustments or should be skipped. Examples for adjustments are when column names in Excel files go over multiple rows, or when csv files contain header information, etc...

Exploring the Reference Tables with DIAP

Where Can I Play around with DIAP?

Everything you need to run DIAP is available on Github in the SAS Studio Custom Step repository. This also contains a README file with more information about all the parameters.

Final Thoughts

Please leave a comment and let me know what you think. Maybe you have some feature ideas? Also, share your experience with importing external files. I am curious to hear about your headaches when reading in external files and can't wait to hear from you.