7 ways to make your data management process portable

Whether you’re moving a data management job from your design client to a remote server or promoting your job from a development environment to a test or production environment, it will have to accommodate the differing environment settings and constraints found on each. Here we will explore some of the places where you will have to make design decisions to ensure your data management process works as expected wherever it might run. Many of these examples refer to functionality specific to the SAS Data Management Platform but are equally valid in concept, regardless of the data management technology you may choose to use.

Macros

Macros in the Data Management Platform are variables that can be set at design time, at run time or programmatically as needed. Macros make it possible to replace values with new values that need to change based on where a process is running or who is running it. Use macros to support portability in the following areas:

DSNs, schemas, and tables.
Quality Knowledge Base (QKB) location and locale designation.
Node property settings.
Memory allocation values.
Filenames and directories.

Data connections

Removing data connection information from data management processes improves portability by allowing someone to update data connection settings without editing the data job itself. Here’s how to keep data access and data transformation logic separate.

Use saved credential functionality in the Data Management Platform to remove connection string information from within your jobs.
Rather than having individual data source connections for each user, which can be difficult to manage in data jobs (or with macros), use a shared connection that many users can take advantage of without having to change job data access settings.
In cases where options like connection pooling or cache sizes speed up some processes but slow down others, use different settings for a set of DSNs. Some jobs will use one DSN with certain driver settings while other jobs will use a different DSN with different driver settings, depending on what the job is designed to do.
Use macros with data connection information like DSNs and schema names. Also use them to set the maximum number of rows that will be read from a particular data source. This is helpful when you don’t want to process millions of records at design time but do want to process the entire data set on the production system.

System global values

Jobs have access to a number of system global values that can be used to execute job logic differently in disparate environments. Key off these values in your Data Management job expression code or node properties to alter the way the job behaves, depending on where it’s running or who started the process. Here are some of the more commonly used system global values:

Operating system details.
System resource availability.
Operating system “path” information.
Network identification information.
Temp directory location.
Java version and location.

Parallelization/memory

Because different computing environments can have widely varying resources available to them, where possible, design your jobs in such a way that you can scale to take advantage of additional memory or processors/cores. Consider using this flexibility in the following areas:

Process job parallel iterator – make the number of parallel processes variable depending on system resources available.
Node memory settings – many Data Management data job nodes have settings for memory allocation, like the Clustering node and Data Sorting nodes. Use macros to set these differently on your design time and production server environments. The difference between running a process all in memory versus partially in memory and partially on disk can be substantial.
Data quality settings – things like parsing depth values and address verification memory data load settings should be configured to scale with the resources available on the system.
Bulk data loading – you may choose to enable bulk data loading on systems set up to handle it. Likewise, commit levels for data access nodes may be set differently depending on the database to which the process is writing.

Username and password information

Removing hard-coded username and password information from your data jobs is always a best practice. Ensuring security should be on the top of your list, especially when sensitive data is being accessed. A few things to keep in mind:

Use saved credentials for data source connections.
Create run time macros where you want to force a user to provide a username and password each time a process is invoked.
If using macros files (rather than macro values supplied at run time) protect the macro file directory with appropriate access control permissions.
When invoking command line applications from within your job, you may need to deal with username and password values. Take advantage of encryption native to the application or use a third party application to manage the sensitive connection information for you.

Filename and paths

Windows and UNIX operating systems deal with filenames and directories differently. While Windows can be more forgiving (usually no problems handling both forward slashes and backslashes, for example), it’s best to design your code to handle filenames and file paths appropriately for the system where the job is expected to run. Remember to:

Use relative paths where possible to avoid having to replicate directory structures on different systems.
Pay attention to forward slash and backslash characters in directory names. You may also need to escape these characters in special ways depending on the environment.
Create a standard for casing filenames and directories that will be followed consistently.
Use macro variables to represent filenames or directories that may change due to operating system differences. Files like referenced Data Management jobs, external input or output text files, and report output may all need to be set up such that the referenced location can be found regardless of where the process is executed.

Localization

While we often think of localization when we reflect on the need to show the appropriate language for end users in web applications, localized values must be supported in job logic too. Consider the following:

If you are explicitly setting string variable lengths in your Data Management job expression code, be sure to accommodate the lengths of data in different character encodings.
Where data quality definitions are in play, design your job logic so that the appropriate QKB locale and definition can be loaded and used. There are advanced properties in all QKB-based nodes to handle dynamically switching to the correct locale based on values found in the data.
Take care to set the proper character encoding when reading and writing text files in the Data Management Platform. An “encoding” property is available in these nodes and should be explicitly set to match the expected data and the operating system language settings.

There you have it – a brief guide to designing data management processes to run successfully across any number of related but divergent environments. Are there others areas not discussed here where your project was sidetracked because you didn’t plan for portability?