twocanbazza
Quartz | Level 8

Without looking at your log - do you have an index on the table?

sunilreddy
Fluorite | Level 6

Yes, I have an index with the update/insert load style. The load style properties are also attached.

sunilreddy
Fluorite | Level 6

I've uploaded the log file and the table loader index.

SASKiwi
PROC Star

Your log suggests that 23979334 observations were added and none updated. Every time SAS adds a new row it has to update the table index, and this is a very resource-intensive process. I would be inclined to update without an index, using merge-type processing to avoid data duplicates. Experiment with a small sample of data to confirm you get the required results with improved performance.

sunilreddy
Fluorite | Level 6

Sometimes I will have updated records. I run this job every day during the month. The first time in a month, I will mostly have new records. But I would still prefer to use the update/index technique to load, because duplicates might exist the first time as well.

Could you please explain the merge-type processing in your statement above? The first time each month, I start with an empty dataset, so I added 23979334 observations (as mentioned in the log). The first run of each month has a huge number of records; the remaining runs in the month do not have many observations, so I don't have any performance issues using update/insert except for the first run of the month.

Patrick
Opal | Level 21

Looking at your log I can see that real time is much higher than CPU time. This is a very strong indication that I/O is the bottleneck.

The next time you run your job, use the FULLSTIMER option to get even more information around CPU and I/O.
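
That's a single line at the top of the program:

/* report full CPU, memory and I/O statistics for each step in the log */
options fullstimer;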

A few things you could try:

- Under Load Technique, set Constraint Condition (and/or Index Condition) to "As is". There shouldn't be a need to re-build constraints or indexes.

- Make sure that the table is created with the option compress=yes.

- Consider storing the data using the SPDE engine and, if possible, use more than one disk to store the "fragments" (spreading I/O over several disks). Storing the data in a library using the SPDE engine also allows for multi-threading, which could reduce real time even more (see the sketch after this list).
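
A minimal sketch of such a setup; the libref, paths and table names below are placeholders to adjust to your environment:

/* SPDE library: metadata on the primary path, data partitions spread */
/* over two disks so the engine can read/write with multiple threads  */
libname spdelib spde '/disk1/spde/meta'
        datapath=('/disk1/spde/data' '/disk2/spde/data')
        indexpath=('/disk2/spde/index');

/* create the target table compressed from the start */
data spdelib.target (compress=yes);
  set work.source;
run;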

In your log the loader transformation first creates a view over the source table and then uses this view as the source in the main data step. This adds to CPU/memory usage. In my DIS4.5 version, with the conditions set to "As is", this view does not get created and the physical SAS table is used instead. I'm not sure if this is a DIS version thing or if changing your conditions would also give you such code.

Forget about this merge idea. The modify statement created by the loader transformation is more efficient than any merge alternative.

Let us know how you go. If your table has a lot of long character fields which are not "filled up", then re-creating the table with option compress=yes should already make a huge impact. You need to re-create the table because compress=yes affects how SAS data is physically stored, so once the table is created this can't be changed anymore. Just define this on your table metadata under Options, then delete the physical table and re-run your job.
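
Outside of DIS, a quick way to test the impact is to re-create a copy of the table with compression turned on. The library and table names here are placeholders:

/* re-create the table compressed; the log reports the compression ratio */
/* note: replacing a table like this drops its indexes, so they would    */
/* need to be re-created afterwards                                      */
data mylib.target(compress=yes);
  set mylib.target;
run;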

LinusH
Tourmaline | Level 20

I think that the difference between real and CPU time can be explained by the work the Base engine is doing when updating the index.

Moving to SPDE is probably the easiest (only?) way to improve performance without changing the loader logic.

Even though SPDE cannot write data in parallel, it can update the index in parallel, which could improve your performance.

Data never sleeps
sunilreddy
Fluorite | Level 6

Hi Patrick,

I've tried your suggestions in EG instead of DI (4.1.3). I tried the scenarios below; a sample EG log with the FULLSTIMER option is attached.

1) Tried with compress=yes and used a physical table instead of a view.

2) Used a condition to create the index and constraints only if it's a new SAS dataset, so the index is not created every time.

3) Also tried multiple iterations over the source, loading 300000 observations per iteration through the table loader. In this case, I observed that the overall real time is almost the same as before.

What I observed from the above is that each iteration looks into the existing dataset, so overall it takes the same real time.

We are not using the SPDE engine, so I have not tried that option.

Any suggestions on creating the index after loading the data, or any other solution? The first run in a month will mostly have newly added records and might have a few duplicates. From the second run onwards in a month, there will be around 300000 observations and no performance issues.

LinusH
Tourmaline | Level 20

SPDE comes with Base SAS; it just needs its own directory and a new libref, so I suggest that you try it!
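
For example, the minimal setup is a single LIBNAME statement (the path below is just a placeholder):

/* assign an SPDE library; the directory must already exist */
libname spdelib spde '/saslib/spde';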

Data never sleeps
Patrick
Opal | Level 21

Hi sunilreddy

Based on your code I've done some testing. Linus is spot on: the majority of the time is spent maintaining the index. I was also under the impression that checking the primary key constraint cost additional resources.

The variables which make up your primary key and index are anything but ideal, as these are all character variables with quite a length. What happens is that your index file takes up almost as much storage space as your data file. Such an index is pretty useless.
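
You can see this on the server itself: with the Base engine the index lives in a separate .sas7bndx file next to the .sas7bdat data file. A sketch of the check, assuming a UNIX host and placeholder paths:

/* compare the data file and index file sizes (paths are placeholders) */
filename fls pipe "ls -l /saslib/target.sas7bdat /saslib/target.sas7bndx";
data _null_;
  infile fls;
  input;
  putlog _infile_;
run;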

Is there any alternative key available which takes up less storage space?

I understand you need to design and implement for an initial load and then for incremental delta loads. What volumes (how many rows) could these delta loads have and what's the load frequency (expected target table growth)?

About duplicates: I believe you need to de-dup the source before loading (for efficiency reasons) and then load with the indexes turned off. You also need to consider setting up a one-off job for the initial load which then does a simple PROC APPEND. The indexes can then be created after the load (if needed at all).
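
A sketch of that one-off initial load; the library, table and key variable names are assumptions to replace with your own:

/* 1. de-dup the source on the key variables */
proc sort data=work.source out=work.source_dedup nodupkey;
  by key1 key2;
run;

/* 2. bulk load while no index exists on the target */
proc append base=mylib.target data=work.source_dedup;
run;

/* 3. build the composite index once, after the load (if needed at all) */
proc datasets lib=mylib nolist;
  modify target;
  index create compkey=(key1 key2);
quit;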

For the incremental loads:

Not sure yet which approach to take. This depends on your DIS version and on how much memory you have available. It also depends on the data volumes for the incremental loads. Additionally: will you have only inserts, or also updates ("real" updates, not duplicates or the same identical record sent more than once)?

I've got DIS4.5 and played around with the SCD Type 1 Loader. There are a few good things in it, and the way to improve performance might be to use a digest value together with an Xref table. Unfortunately the update code generated by this loader (in my un-patched version) appears to be very inefficient. So if worst comes to worst, this might end up using the SCD Type 1 code as a starting point for a user-written code node for loading.
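
To illustrate the digest idea (this is not the attached job; the column names are made up): store a fixed-length hash of the non-key columns, so change detection compares one short value per row instead of several long character fields.

/* compute a 32-character hex digest over the non-key columns */
data work.delta_digest;
  set work.delta;
  length digest $32;
  digest = put(md5(catx('|', col1, col2, col3)), $hex32.);
run;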

I'm attaching my DIS4.5 test job here. Maybe it's of some use for someone else.

Can you please answer the questions above and also give us any further insights you've gained since your last post?

Thanks, Patrick

P.S.: I've now also added a .sas code version for testing, based on the log you posted, so people without DIS can also give you some hands-on support.

sunilreddy
Fluorite | Level 6

Is there any alternative technique to improve the performance in terms of real time?

SASKiwi
PROC Star

The merge technique in its simplest form looks like this:

data mastertable;
  /* both input datasets must be sorted by uniquekey beforehand */
  merge mastertable
        transactiontable;
  by uniquekey;
  /* keep the last record per key, so transaction rows overwrite master rows */
  if last.uniquekey;
run;

However, I don't have access to DIS so I can't suggest how you would implement it as a DI task.
