SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

di job table loader performance issue

Reply
Frequent Contributor
Posts: 115

di job table loader performance issue

Hi Patrick,

I've tried with your suggestions in EG instead of DI(4.1.3). I've tried below scenario and sample eg log as attached with fullstimer option.

1) Tried with compress= yes and used physical table instead of view

2) using condition to create index and constarins, if its new sas dataset. so this index is not creating every time

3) Also i tried multiple iterations for for source. I tried with 300000 observations in each iteration for enite table loader. In this case, i observed overall real time is almost taking as existing.
  
what i observed from above is, each iteration is looking into existing dataset, overall its taking same real time

We are not using SPDE engine so not tried this option.

Any suggestions to use index after load the data or any other solution?. for first time in a month will mostly have new added records and might be few duplicates. from second time onwards in a month, will have around 300000 observations and no issues in performance


Attachment
Super Contributor
Posts: 356

Re: di job table loader performance issue

Posted in reply to sunilreddy

Without looking at your log - Do u have an index on the table?

Frequent Contributor
Posts: 115

Re: di job table loader performance issue

Posted in reply to twocanbazza

Yes, i have index with update/insert load style. also load style properties attached

Frequent Contributor
Posts: 115

Re: di job table loader performance issue

Posted in reply to sunilreddy

Uploaded log file and table loader index

Attachment
Super User
Posts: 3,252

Re: di job table loader performance issue

Posted in reply to sunilreddy

Your log suggests that 23979334 observations were added and none updated. Every time SAS adds a new row it has to update the table index and this is a very resource-intensive process. I would be inclined to update without an index using merge-type processing to avoid data duplicates. Experiment with a small sample of data to confirm you get the required results with improved performance.  

Frequent Contributor
Posts: 115

Re: di job table loader performance issue

Some times i will have updated records. I will run this job every day in a month. For first time in a month, mostly i will have new records. But i will always prefer to use update /index techique to load. because duplicates might be exist in first time also.

Could you please explain me about merge-type processing in your above stmt. first time in every month, i will have empty dataset. so i added 23979334 observations(as mentioned in log). First time in  every month  will have huge records and remaining times in every month will not have more obervations.so i dont have any perfomane issues to use update/insert except first time in a month.

Respected Advisor
Posts: 4,173

Re: di job table loader performance issue

Posted in reply to sunilreddy

Looking at your log I can see that real time is much higher than CPU time. This is a very strong indication that I/O is the bottleneck.

When running your job the next time use option FULLSTIMER to get even more information around CPU and I/O.

A few thing you could try:

- Under Load Technique/Constraint Condition(and or Index Condition set everything to "As is". There shouldn't be a need to re-build constraints or indexes.

- Make sure that when creating the table option compress=yes is set.

- Consider storing the data using the SPDE engine and if possible use more than one disk to store the "fragments" (so spreading I/O over several disks). Storing the data under a library using the SPDE engine also allows for multi-threading which could reduce real time even more.

In your log the loader transformation creates first a view over the source table and then uses this view as source in the main datastep. This adds to cpu/memory usage. In my DIS4.5 version and with the conditions "As is" this view does not get created but the physical SAS table is used instead. Not sure if this is a DIS version thing or if changing your conditions would give you also such code.

Forget about this merge idea. The modify statement created by the loader transformation is more efficient than any merge alternative.

Let us know how you go. If your table has a lot of long character fields which are not "filled up" then re-creating the table with option compress=yes should already make a huge impact. You need to re-create the table because compress=yes impacts on the physical storage of SAS data. So once the table is created this can't be changed anymore. Just define this on your table metadata under options, then delete the physical table and re-run your job.

Super User
Posts: 5,427

Re: di job table loader performance issue

I think that the difference between real and CPU can be explained by the job Base engine is doing when updating the index.

Moving to SPDE is probably the easiest (only?) way to improve performance without changing the loader logic.

Even SPDE cannot write data in parallel, it can update the index in parallel, which could improve your performance.

Data never sleeps
Frequent Contributor
Posts: 115

Re: di job table loader performance issue

Hi Patrick,

I've tried with your suggestions in EG instead of DI(4.1.3). I've tried below scenario and sample eg log as attached with fullstimer option.

1) Tried with compress= yes and used physical table instead of view

2) using condition to create index and constarins, if its new sas dataset. so this index is not creating every time

3) Also i tried multiple iterations for for source. I tried with 300000 observations in each iteration for enite table loader. In this case, i observed overall real time is almost taking as existing.
  
what i observed from above is, each iteration is looking into existing dataset, overall its taking same real time

We are not using SPDE engine so not tried this option.

Any suggestions to use index after load the data or any other solution?. for first time in a month will mostly have new added records and might be few duplicates. from second time onwards in a month, will have around 300000 observations and no issues in performance

Super User
Posts: 5,427

Re: di job table loader performance issue

Posted in reply to sunilreddy

SPDE comes with Base SAS, it just needs an own directory, and a new libref, so I suggest that you try it!

Data never sleeps
Respected Advisor
Posts: 4,173

Re: di job table loader performance issue

Posted in reply to sunilreddy

Hi sunilreddy

Based on your code I've done some testing. Linus is spot on: The majority of time spent is for maintaining the index. I was also under the impression that checking for the primary key constraint costed additional resources.

The variables which make up your primary key and index are everything else than ideal as these are all chacter variables with quite a length. What happens is that your index file takes up almost as much storage space as your data file. Such an index is pretty useless.

Is there any alternative key available which takes up less storage space?

I understand you need to design and implement for an initial load and then for incremental delta loads. What volumes (how many rows) could these delta loads have and what's the load frequency (expected target table growth)?

About duplicates: I believe you need to de-dup the source before loading (for efficiency reasons) and then to load with the indexes turned off. You need also to consider to set up a one off job for the initial load which then does a simple Proc Append. The indexes can then get created after the load (if needed at all).

For the incremental loads:

Not sure yet which approach to take. This depends on your DIS version and how much memory you have available (?). It also depends on the data volumes for incremental loads. Additionally: Will you inserts or also updates ("real" updates not duplicates or the same identical record sent more than once)?

I've got DIS4.5 and played around with the SCD Type 1 loader. There are a few good things in it and the way to improve performance might be using a digest value together with an Xref table. Unfortunately the update code generated by this loader (in my un-patched version) appears to be very inefficient. So if worst comes to worst this might end up using the SCD Type 1 code as a starting point to set up a user written code node for loading.

I'm attaching my DIS4.5 test job here. My be it's of some use for someone else.

Can you please answer above questions and also give us any further insights you've gained since your last post?

Thanks, Patrick

P.S: I've added now also a .sas code version for testing which is based on your log posted so also people without DIS can eventually give you some hands-on support.

Attachment
Attachment
Frequent Contributor
Posts: 115

Re: di job table loader performance issue

Posted in reply to sunilreddy

Any alternative technique to improve the performance in terms of real time,

Super User
Posts: 3,252

Re: di job table loader performance issue

Posted in reply to sunilreddy

The merge technique in its simplest form looks like this:

data mastertable;

  merge mastertable

        transactiontable

        ;

  by uniquekey;

  if last.uniquekey;

run;

However I don't have access to DIS so I can't suggest how you would implement it as a DI task.

Ask a Question
Discussion stats
  • 12 replies
  • 1307 views
  • 0 likes
  • 5 in conversation