This article is about the "Data Ingestion Auto Pilot - DIAP" custom step which is freely available in the public GitHub SAS Studio Custom Step repository for download. DIAP automatically determines the separators (comma, semicolon, pipe, tab, exclamation mark, hash, blank) when ingesting a text file and can deal with XLSX, JMP, SHP, JSON, XML, CSV, TXT files. This list is constantly expanding.
Take a look at the following animation to get a first idea of how interacting with DIAP could look like. And don't forget to continue reading this article to learn more about additional DIAP features:
It has been interesting for me to observe (first within myself, but then also with others), that the task of ingesting external files often comes with a form of frustration and can form a stumbling block in the progress of a project.
Even though there are many out-of-the-box capabilities for importing external files (making it supposedly easy), in many cases though there is more to be done. It quickly becomes cumbersome when there are many different files to be read in. Sometimes reading in even only a few files can cause headaches, because there are impurities in the file (on the meta level, but also in the actual content) that can cause unexpected outcomes.
And in some cases, it might be required to go through a whole directory tree to read in all files.
Dealing with similar issues repeatedly, I was determined to create a tool for resolving it. My nature, that I must automate something when I must do something twice, took over - DIAP, the Data Ingestion Auto Pilot was born. The purpose is to take care of those re-occurring headaches in an automated way.
While DIAP helps to drastically reduce the time to get insight into new data (from weeks down to hours), it also provides all the functionality necessary to use it for regular scheduled imports, since it can be set up so it only imports those external files updated since the last import.
DIAP only requires pointing to a root directory. After that, it traverses all directories and subdirectories, if desired, (similar to a file crawler) finding all available external files, and making all of them available in the SAS ecosystem. This could be in a CAS library or a Compute library.
Now, the beauty is DIAP attempts to read in all the files it can find while automatically taking care of all the things that usually cause headaches, like:
While the naming of target tables and variable names happen automatically, there are many ways to influence the naming to adhere to the SAS naming conventions and personal preferences. And, DIAP is also equipped to deal with fixed width files (see further down a potential workflow around that).
DIAP is doing all of this while keeping track of the operations using reference tables that are automatically created, either locally, or in a central location. Those reference tables help to:
And as mentioned in the intro, DIAP automatically determines the separators (comma, semicolon, pipe, tab, exclamation mark, hash, blank) when ingesting a text file and can deal with XLSX, JMP, SHP, JSON, XML, CSV, TXT files. This list is constantly expanding.
Also, using the newly created “Guided Automated Execution” DIAP mode, it allows to specifically point to files across SAS Server and SAS Content at the same time and read them in as provided in the control file.
DIAP is available as a custom step, which means the “step” file only needs to be uploaded somewhere into SAS Content (see detailed upload instructions here ), and it becomes automatically available under “Shared” Steps in SAS Studio.
The DIAP SAS Studio Custom Step can be downloaded here.
Read more about the public SAS Studio Custom Step repository here.
Following are a few screenshots of the user experience when using the custom step. Each screenshot shows one of available tabs in the custom step.
The “Essential” tab contains all the parameters to set for an initial run. All the other parameters in the other tabs could be used in their default setting.
Just provide the "Root Directory" and the "Target Libname", and DIAP is ready to go.
Learn more about the different execution modes further down showcasing a potential workflow in an animation.
Here are a few settings that control the general flow of DIAP. Most of them speak for themselves.
These settings control the execution with regards on how to handle the source files waiting to be read in.
The screenshot shows that DIAP allows many ways to exclude files from the ingestion process depending on the filename beginning or ending. It also shows how a check for certain characters can be initiated so the import process is flawless. In this case, as can be seen in the screenshot, DIAP is checking for carriage returns, that could occur in comment fields, and replace those characters with a Null character.
Here are all kinds of settings available that control how the data is stored in the destination.
Some of the other parameters on there speak for themselves, "Substitute String" will omit the provided characters in the target table name in case the original would be too long. Often times, by omitting these kinds of characters, the target table names become short enough without losing its readability.
The screenshot shows that DIAP allows many ways to influence the created table names and variable names.
Some administrative settings control where to store the generated metadata, and where to store the log file of a run, and some other useful settings.
This tab contains all the necessary information for using DIAP. It also contains a description of all the parameters.
For a deeper look into all the parameters, please check out the DIAP Custom Step documentation.
DIAP has various execution modes. Please watch the following animations for inspiration.
The following animation shows a potential workflow when running in default settings. Only set the parameters in the "Essentials" tab and hit run:
Reading in fixed-width files is made easy with DIAP. It requires a specific named control file ("_DIAP_Special_File_Definition.csv") to be available anywhere in the directory tree of the selected root directory, and with that DIAP will first follow the instructions in that control file, and after that read in the non fixed-width files automatically.
And, DIAP has the option to create a template control file for you (DIAP execution mode #4), which can then be placed somewhere in the directory tree where DIAP can find it.
The following animation provides more insight about a potential workflow:
The guided execution mode in DIAP allows to a more selective ingestion of external files with more control for the user. This mode uses a control file in csv-format and DIAP will create a template for you. Please see the animation to find out:
Watch the following animation to explore the reference tables. The flag "check_source_file_flag" is helpful for suspicious files in the upload_log table. This points out certain source files that might need some adjustments or should be skipped. Examples for adjustments are when column names in Excel files go over multiple rows, or when csv files contain header information, etc...
Everything you need to run DIAP is available on Github in the SAS Studio Custom Step repository. This also contains a README file with more information about all the parameters.
Please leave a comment and let me know what you think. Maybe you have some feature ideas? Also, share your experience with importing external files. I am curious to hear about your headaches when reading in external files and can't wait to hear from you.
Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.
If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website.
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.