SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

Need help with iterative job

Accepted Solution Solved
Reply
New Contributor
Posts: 2
Accepted Solution

Need help with iterative job

Hi guys,

My company recently started using DIS 4.21, but no one has that much experience with it. We had outside help getting what we wanted to do set up, but I am at a novice level with the package. I've got some pictures to illustrate what I'm trying to accomplish.

So basically the process we have in place now is to create some datasets, structure them for upload to our server, and then run a transformation to upload the data.

Picture1.png

Right now, I have to run RTEC1 through the process, wait for it to finish, connect RTEC2 to the process, rinse and repeat. I've read through the documentation for DIS and I know that I want to create an iterative job to cycle through all of the datasets.

Picture2.png

I created a library (Base) that contains all of the datasets and I used a Library Contents transformation to create a Control Table. Each line of the control table corresponds to each dataset,

Picture3.png

I then connected the control table to a Loop to run the tables through the job (the job is illustrated in the first picture). This is where I get lost and my luck with the documentation runs out. In my mind, the loop should read the name of the dataset from the control table, load that dataset into the Table Loader, run it through the upload transformation, and go get the next dataset from the control table. Repeat until finished.

The problem is that I don't know what settings or conditions I need to make to get the Loop to load the first and subsequent datasets into the job. I've been using SAS for awhile so I am comfortable with coding if that's what needs to be done here, but I have very little experience with DIS. Any help will be greatly appreciated. Thanks!!!

Chris


Accepted Solutions
Solution
‎10-19-2011 10:30 AM
SAS Employee
Posts: 51

Need help with iterative job

ChrisC,

I don't know if you're still looking for an answer to this, but the previous post were on target.  To answer in a bit more detail:

  1. The first step in creating jobs that use the Loop transformation is to create an "Inner Job".  I see that you've done that (I assume that's your 1st screen shot in your post).
  2. After you've gotten your inner job running and tested with some sample data, you need to add parameters to the job.  You do this by accessing the parameter tab in the job Properties.  Any parameter you create results in a macro variable that will be set by Loop before the job runs.
  3. You need to use these macro variables in your inner job.  In your case, you want to process a group of tables.  This tells me that you need to parameterize at least one location, perhaps two:
    • The name of the input table must be parameterized.  To do this, you would define a parameter (possible called "inTbl"), open the properties of the input table and go to the "Physical Storage" tab and replace the value for "Physical name" with a reference to the macro variable for the parameter:  &inTbl.  This will allow the inner job to process a different input table each time it runs, with Loop passing in the value for the table name.
    • You may need to do something similar for any output tables - it depends on what you're trying to accomplish.  If you're attempting to create a unique output table for each input table, you'd want to parameterize it as described above.   If you're trying to create a single table from all inputs, you wouldn't parameterize, but would instead set the "Load Style" of the Table Loader to "Append".  If this is what you're doing, there are implications in how you set parameters for the Loop transformation, depending on the target type of your table.
  4. Once you've done the 3 steps above, you're ready to create your "Outer Job", which will use Loop.   From what I can see in the screen shot, you've done this correctly at a macro level.  The main thing with the steps that precede loop is that you need to produce a table that has all of the required parameters that must be passed to the "Inner" job.  In your case, the name of each input table would certainly be one parameter, but perhaps you also need others.  All the columns in the table that is input to Loop are potential parameters that can be passed to the inner job.  All rows will result in an execution of the inner job by the loop transformation.
  5. Next you need to perform settings on two tabs in the Loop transformation
    • Parameter Mapping:  Map columns in the input table to Loop to parameters defined in the Inner job.  In your case, at least "inTbl" mapping would be required.
    • Edit Options on the "Loop Options" screen:
      • The main decision to make here whether and how to use parallel processing.  By default, all iterations will execute sequentially.  You can choose, however, to execute some or all of the iterations in parallel.  There are several considerations and settings:
        • If you want to execute in parallel, you have to make sure there will be no locking issues in your inner job.  Primarily, this means that you can't try to write to the same physical table in two concurrent iterations unless the table is stored in a database that allows parallel writes. 
          • As I mentioned above, if you're trying to create a different output table for each input, you could parameterize the output table name, which would have each parallel iteration writing to a different table, eliminating any write contention.
          • If, however, you're trying to append to the same table with each iteration, you'd have to be writing to a relational database like Oracle, which allows parallel writes to the same table If you're writing to a SAS dataset, you'd need to run all iterations sequentially, since without SAS/SHARE, a process writing to a table will lock the whole table and all other processes would fail.  There are patterns for dealing with this even with SAS tables if speed is important.
        • The considerations in the bullet point above are the most important to get correct.
      • In addition to making sure you avoid write contention:
        • You have to decide how many jobs to run in parallel.  The option "One process for each available CPU node" is a good one when just starting with this transform.  If you're using Platform scheduler or another advanced scheduler, other options can be considered - let me know if you'd like to learn more.
        • You need to provide a directory where each parallel job will write its logs - this is the "Location on host for log and output files" setting.  The job automatically names each log file with a unique (though not intuitive) name like L37.log.  You can use PROC PRINTTO in your inner job to redirect to a more friendly log name if necessary.

That's probably a lot to take in, so feel free to post follow up questions.  Loop is one of the more useful and versatile transformations in DI Studio, so while there is a bit of a learning curve, the payoff is worth it in the number of ways you'll find to use this pattern.

Thanks,

Tim Stearn

View solution in original post


All Replies
Super Contributor
Posts: 356

Need help with iterative job

Off the top of my head, may have missed something...

You need to pass parameters from the loop to the inner job that is in the loop (SVP Stage Loader).

How?

Setup paramters in the inner job (one of the tabs on the inner job), this will create macro vars that you can use in that job . 

Link the parameters created in the inner job to the loop (one of the tabs on the loop) and all should be good

Barry

Respected Advisor
Posts: 3,887

Re: Need help with iterative job

Hi

As Barry said: The inner job needs a parameter defined (which is a prompt of type "data source").

Capture1.PNG

In the outer job in the loop transformation you then link this parameter (which in the code is a macro variable) to the source column of your control table.

Capture2.PNG

In the inner job you then can use this macro variable (in the example here &ThisSex) to work your magic.

HTH

Patrick

Solution
‎10-19-2011 10:30 AM
SAS Employee
Posts: 51

Need help with iterative job

ChrisC,

I don't know if you're still looking for an answer to this, but the previous post were on target.  To answer in a bit more detail:

  1. The first step in creating jobs that use the Loop transformation is to create an "Inner Job".  I see that you've done that (I assume that's your 1st screen shot in your post).
  2. After you've gotten your inner job running and tested with some sample data, you need to add parameters to the job.  You do this by accessing the parameter tab in the job Properties.  Any parameter you create results in a macro variable that will be set by Loop before the job runs.
  3. You need to use these macro variables in your inner job.  In your case, you want to process a group of tables.  This tells me that you need to parameterize at least one location, perhaps two:
    • The name of the input table must be parameterized.  To do this, you would define a parameter (possible called "inTbl"), open the properties of the input table and go to the "Physical Storage" tab and replace the value for "Physical name" with a reference to the macro variable for the parameter:  &inTbl.  This will allow the inner job to process a different input table each time it runs, with Loop passing in the value for the table name.
    • You may need to do something similar for any output tables - it depends on what you're trying to accomplish.  If you're attempting to create a unique output table for each input table, you'd want to parameterize it as described above.   If you're trying to create a single table from all inputs, you wouldn't parameterize, but would instead set the "Load Style" of the Table Loader to "Append".  If this is what you're doing, there are implications in how you set parameters for the Loop transformation, depending on the target type of your table.
  4. Once you've done the 3 steps above, you're ready to create your "Outer Job", which will use Loop.   From what I can see in the screen shot, you've done this correctly at a macro level.  The main thing with the steps that precede loop is that you need to produce a table that has all of the required parameters that must be passed to the "Inner" job.  In your case, the name of each input table would certainly be one parameter, but perhaps you also need others.  All the columns in the table that is input to Loop are potential parameters that can be passed to the inner job.  All rows will result in an execution of the inner job by the loop transformation.
  5. Next you need to perform settings on two tabs in the Loop transformation
    • Parameter Mapping:  Map columns in the input table to Loop to parameters defined in the Inner job.  In your case, at least "inTbl" mapping would be required.
    • Edit Options on the "Loop Options" screen:
      • The main decision to make here whether and how to use parallel processing.  By default, all iterations will execute sequentially.  You can choose, however, to execute some or all of the iterations in parallel.  There are several considerations and settings:
        • If you want to execute in parallel, you have to make sure there will be no locking issues in your inner job.  Primarily, this means that you can't try to write to the same physical table in two concurrent iterations unless the table is stored in a database that allows parallel writes. 
          • As I mentioned above, if you're trying to create a different output table for each input, you could parameterize the output table name, which would have each parallel iteration writing to a different table, eliminating any write contention.
          • If, however, you're trying to append to the same table with each iteration, you'd have to be writing to a relational database like Oracle, which allows parallel writes to the same table If you're writing to a SAS dataset, you'd need to run all iterations sequentially, since without SAS/SHARE, a process writing to a table will lock the whole table and all other processes would fail.  There are patterns for dealing with this even with SAS tables if speed is important.
        • The considerations in the bullet point above are the most important to get correct.
      • In addition to making sure you avoid write contention:
        • You have to decide how many jobs to run in parallel.  The option "One process for each available CPU node" is a good one when just starting with this transform.  If you're using Platform scheduler or another advanced scheduler, other options can be considered - let me know if you'd like to learn more.
        • You need to provide a directory where each parallel job will write its logs - this is the "Location on host for log and output files" setting.  The job automatically names each log file with a unique (though not intuitive) name like L37.log.  You can use PROC PRINTTO in your inner job to redirect to a more friendly log name if necessary.

That's probably a lot to take in, so feel free to post follow up questions.  Loop is one of the more useful and versatile transformations in DI Studio, so while there is a bit of a learning curve, the payoff is worth it in the number of ways you'll find to use this pattern.

Thanks,

Tim Stearn

New Contributor
Posts: 2

Need help with iterative job

Sorry for the delayed response, but I've been traveling. I got the job running and really appreciate everyone's help.

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 4 replies
  • 313 views
  • 6 likes
  • 4 in conversation