ChrisC, I don't know if you're still looking for an answer to this, but the previous posts were on target. To answer in a bit more detail:

The first step in creating jobs that use the Loop transformation is to create an "Inner Job". I see that you've done that (I assume that's your first screen shot). After you've gotten your inner job running and tested with some sample data, you need to add parameters to the job. You do this on the Parameters tab of the job Properties. Each parameter you create becomes a macro variable that Loop sets before the job runs, and you need to reference those macro variables in your inner job. In your case, you want to process a group of tables, which tells me you need to parameterize at least one location, perhaps two:

- The name of the input table must be parameterized. To do this, define a parameter (possibly called "inTbl"), open the properties of the input table, go to the "Physical Storage" tab, and replace the value for "Physical name" with a reference to the parameter's macro variable: &inTbl. This lets the inner job process a different input table each time it runs, with Loop passing in the value for the table name.
- You may need to do something similar for any output tables - it depends on what you're trying to accomplish. If you're attempting to create a unique output table for each input table, you'd want to parameterize it as described above. If you're trying to create a single table from all inputs, you wouldn't parameterize, but would instead set the "Load Style" of the Table Loader to "Append". If this is what you're doing, there are implications for how you set options on the Loop transformation, depending on the target type of your table.

Once you've done the steps above, you're ready to create your "Outer Job", which will use Loop. From what I can see in the screen shot, you've done this correctly at a macro level.
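To make the macro-variable piece concrete, here's a minimal sketch of what the inner job effectively does at run time. All names (mylib, SALES_2019, work.staged) are hypothetical; in a real run, Loop sets &inTbl for you based on the parameter mapping, and the %let below only simulates that for illustration:

```
/* Loop sets &inTbl before each iteration; this %let only
   simulates that so the sketch can stand alone. */
%let inTbl = SALES_2019;

/* Because the input table's "Physical name" is &inTbl,
   the generated step resolves like this: */
data work.staged;
   set mylib.&inTbl;   /* reads mylib.SALES_2019 on this iteration */
run;
```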
The main thing about the steps that precede Loop is that they need to produce a table containing all of the parameter values that must be passed to the "Inner" job. In your case, the name of each input table would certainly be one parameter, but perhaps you also need others. Every column in the table that is input to Loop is a potential parameter that can be passed to the inner job, and every row results in one execution of the inner job by the Loop transformation.

Next, you need to configure two tabs in the Loop transformation:

- Parameter Mapping: map columns in Loop's input table to parameters defined in the inner job. In your case, at least the "inTbl" mapping would be required.
- Loop Options: the main decision to make here is whether and how to use parallel processing. By default, all iterations will execute sequentially. You can choose, however, to execute some or all of the iterations in parallel. If you want to execute in parallel, you have to make sure there will be no locking issues in your inner job. Primarily, this means you can't write to the same physical table in two concurrent iterations unless the table is stored in a database that allows parallel writes. As I mentioned above, if you're trying to create a different output table for each input, you could parameterize the output table name, which would have each parallel iteration writing to a different table, eliminating any write contention. If, however, you're trying to append to the same table with each iteration, you'd have to be writing to a relational database like Oracle, which allows parallel writes to the same table. If you're writing to a SAS data set, you'd need to run all iterations sequentially: without SAS/SHARE, a process writing to a table locks the whole table, and all other processes would fail. There are patterns for dealing with this even with SAS tables if speed is important.
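As a hedged sketch, the control table that drives Loop could be built with a simple DATA step ahead of the transformation. The table and value names here are hypothetical; the key point is that the column name must match the parameter you map on the Parameter Mapping tab:

```
/* One row per inner-job execution; the inTbl column is mapped
   to the inner job's inTbl parameter on the Parameter Mapping tab. */
data work.loop_control;
   length inTbl $ 32;
   input inTbl $;
   datalines;
SALES_2019
SALES_2020
SALES_2021
;
run;
```

With three rows in this table, Loop would run the inner job three times, once per table name.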
The write-contention considerations above are the most important to get correct. In addition to avoiding write contention:

- You have to decide how many jobs to run in parallel. The option "One process for each available CPU node" is a good one when just starting with this transformation. If you're using Platform Scheduler or another advanced scheduler, other options can be considered - let me know if you'd like to learn more.
- You need to provide a directory where each parallel job will write its logs - this is the "Location on host for log and output files" setting. Loop automatically gives each log file a unique (though not intuitive) name like L37.log. You can use PROC PRINTTO in your inner job to redirect to a friendlier log name if necessary.

That's probably a lot to take in, so feel free to post follow-up questions. Loop is one of the more useful and versatile transformations in DI Studio, so while there is a bit of a learning curve, the payoff is worth it in the number of ways you'll find to use this pattern.

Thanks,
Tim Stearn
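P.S. Here's a minimal sketch of the PROC PRINTTO redirection mentioned above. The log path is hypothetical, and &inTbl is the job parameter that Loop passes in (note the double period: the first dot ends the macro variable reference, the second is the literal dot in ".log"):

```
/* Redirect this iteration's log to a friendlier, per-table name. */
proc printto log="/logs/inner_&inTbl..log" new;
run;

/* ... inner job steps run here ... */

/* Restore the default log destination. */
proc printto;
run;
```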