We have a pretty basic ETL architecture, and all the DI extract jobs I've written to date have been straightforward: the source datasets I use are recreated overnight with the same physical dataset name.
We have a new requirement to extract data from datasets that are delivered daily but have a datestamp suffix as part of the dataset name, e.g. orders_yyyymmdd. The dataset structure will always be identical. All the DI job needs is a table loader and source and target table metadata objects. Is it possible to register the source table in metadata to cope with the daily changing dataset name?
I've written Base SAS code to work out the name of the dataset to be processed before writing to the target dataset via a DATA step in a User Written Code transform linked to the target table metadata object. This doesn't seem a particularly satisfactory way of doing it. I've also considered moving the filename-identification code to the job or Table Loader precode sections and setting a macro variable to store the required dataset name, but I'm unsure how, or whether, the Loader could reference it. Is there a standard way of handling this scenario? Could the same techniques be applied if the data source were a text file rather than a SAS dataset? We have DI Studio v4.21.
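For illustration, the dataset-name derivation in my precode boils down to something like this (the libref and naming pattern are just placeholders):

```sas
/* derive today's dataset name, e.g. orders_20240315 */
%let dsname = orders_%sysfunc(today(), yymmddn8.);
%put Source dataset to process: srclib.&dsname.;
```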
This is just the kind of requirement the Loop (and possibly Library Contents) transformations are suited to:
Assuming you're dealing with SAS datasets that carry a date/time stamp, you can use Library Contents to run a "query" that returns all the datasets in a library matching a particular pattern. Behind the scenes, this queries the dictionary.tables table.
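If you wanted to see what Library Contents is doing, the equivalent hand-written query would look something like this (the libref SRCLIB and the ORDERS_ prefix are illustrative):

```sas
/* list all ORDERS_* datasets in the source library */
proc sql;
  create table work.ds_list as
    select memname
    from dictionary.tables
    where libname = 'SRCLIB'                     /* librefs are stored upper case */
      and memname like 'ORDERS^_%' escape '^';   /* ^ escapes the _ wildcard */
quit;
```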
If you're dealing with text files instead of SAS datasets, you'll need a bit of user-written code. For these types of files, you can use a PIPE engine as part of a fileref to capture a directory listing, as illustrated below in an example you can adapt (the path and pattern are placeholders):

filename files pipe 'dir /B c:\yourpath\orders_*.txt';

data work.file_list;
  length fileName $100;
  infile files truncover;
  input fileName $100.;
run;
Either of these techniques will produce a work table with a list of tables or files to process. You can pass this table to the Loop transform. The Loop transform accepts parameters in the form of an input table. These parameters can be passed to a parameterized job, which will be called for each row in the parameter table described above.
Since the data structures are the same across all the tables/files, the parameterized job can be written once and reused for every file. In your case, you would at least parameterize the input file/table. You do this by defining a parameter on the job. Each parameter becomes a macro variable you can use anywhere in the job; here, it would hold the file name or table name. You put the macro reference in the appropriate field of the metadata object:
The file name section of the External File object
The table name on the Physical Storage tab for a table
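For instance, with a job parameter named datestamp (the name is illustrative), the metadata entries would look something like:

```sas
/* External File object, file name field:     */
/*   c:\yourpath\orders_&datestamp..txt       */
/* Table Physical Storage tab, table name:    */
/*   orders_&datestamp.                       */
```

Note the double dot in the file name: the first dot terminates the macro variable reference, the second begins the extension.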
You might also need to parameterize your output; it depends. If you don't need the iterations to run in parallel, you can simply append to the output table from the parameterized job. If you do want them to run in parallel, you'd need to parameterize the output table to avoid concurrent-write errors. The Loop transformation has options governing parallel execution.
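A minimal sketch of the sequential (non-parallel) case, assuming the inner job receives &src_table from Loop and appends to a fixed target (all names here are illustrative):

```sas
/* append this iteration's data to the cumulative target */
proc append base=tgtlib.orders_all
            data=srclib.&src_table.;
run;
```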
Thanks guys. These are both really useful suggestions which are much appreciated and I'm going to experiment with both. I'm intrigued by the possibility of parameterised jobs as this isn't something we've done before. To be honest I got a bit confused by the topic in the User guide but will try to work through that and research the topic further.
No problem. If you get hung up getting parameterized jobs and Loop to work, just post a question. After you get one of these working you'll be comfortable with it and use it all the time. The three keys to remember are:
A job parameter is really just a macro variable. Loop sets the macro variable before the parameterized (inner) job runs, so it can be used anywhere in the inner job (file/table names, augmenting the WHERE clause in queries, etc.).
Every row passed to the Loop transformation results in an execution of the parameterized (inner) job and every column can be (but doesn't have to be) mapped to a parameter of the parameterized job as part of Loop configuration.
If you only need one iteration - meaning you just want to invoke a parameterized job in several contexts, from several jobs, to reuse logic - you don't need multiple rows going into Loop. In fact, all you have to do is make sure the job-parameter macro variables are set before you invoke the parameterized job, which can be done with Loop or with plain user-written code.
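So for that single-iteration reuse case, user-written precode as simple as this is enough before the parameterized job runs (parameter names are illustrative and must match the ones defined on the inner job):

```sas
/* set the job-parameter macro variables by hand, then invoke the inner job */
%let src_table = orders_%sysfunc(today(), yymmddn8.);
%let tgt_table = orders_all;
```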