DATA Step, Macro, Functions and more

Using views to improve performance

Reply
Frequent Contributor
Posts: 122

Using views to improve performance

Hi,

 

I am using views in some data steps, in order to reduce I/O operations and improve performance when using large datasets.

 

I have tested this in some programs, in my program I have a result datasets and other than are intermediate (stored in work library).

I am changing the intermediate dataset for a view and I have found that if I change this intermediate dataset for a view in one or two steps then the performance is better, but if I made more than two steps with views then the performance is equal than using datasets or even worse. In both situations I create a results dataset (nor view) in the final step. 

 

I would like to know best practices of using views, if there are good noly for one or tow steps or for some type of operations. I am doing data /set with new fileds, joins, agregations.

 

Thanks in advance

Trusted Advisor
Posts: 1,022

Re: Using views to improve performance

Posted in reply to juanvg1972

If you re-use a data set view, then it is reconstructed each time from the original source, which is probably why you don't see expected benefits sometimes when using the view multiple times.  This would be especially true when the view is a merge of multiple datasets, or the view is a small subset of the original.

 

One way to mitigate this problem is to simultaneously define a matching data set view (VNEED) and a data set file (NEED), as below.  The first use (proc freq) call VNEED, which in the background would read BIGDATA, stream in VNEED, and write NEED.  Note the proc freq doesn't have to wait for dataset NEED to be completely generated.  The subsequent univariate proc would read dataset file NEED.

 

  data need  vneed / view=vneed;

    set bigdata;
    where ....;

     ....

   run;

 

  proc freq data=vneed;

     tables .... ;

  run;

 

  proc univariate data=need;

     ....

  run;

 

  1. The input/output cost of the above is that 
    1. BIGDATA is read once
    2. NEED is written to disk once, and
    3. NEED is read from disk once
  2. But if you used view both times, then
    1. BIGDATA is read twice
  3. Or if you used the data set file NEED twice, then
    1. BIGDATA is read once
    2. NEED is written to disk once
    3. NEED is read from disk twice

So depending on the relative size of bigdata vs need, you might save considerable time by using the data file for 2nd and further access, and use the view only for the first access.

Ask a Question
Discussion stats
  • 1 reply
  • 104 views
  • 1 like
  • 2 in conversation