09-30-2017 01:42 PM
I am using DATA step views in some of my programs in order to reduce I/O operations and improve performance when working with large datasets.
I have tested this in a few programs. Each program produces one results dataset; the other datasets are intermediate and stored in the WORK library.
I have been replacing intermediate datasets with views, and I have found that if I replace an intermediate dataset with a view in one or two steps, performance improves; but if I use views in more than two steps, performance is the same as with datasets, or even worse. In both situations the final step creates a results dataset (not a view).
I would like to know the best practices for using views: whether they are good only for one or two steps, or only for certain types of operations. My steps do DATA/SET with new fields, joins, and aggregations.
Thanks in advance
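For reference, the pattern described above (several chained intermediate views feeding a final physical dataset) might look like the sketch below; the dataset and variable names are hypothetical:

```
/* Hypothetical sketch of chaining DATA step views:            */
/* each intermediate step is a view, and only the final step   */
/* writes a physical results dataset.                          */

data v_step1 / view=v_step1;     /* view: derive new fields */
  set work.bigdata;
  profit = revenue - cost;
run;

data v_step2 / view=v_step2;     /* view: subset rows */
  set v_step1;
  where profit > 0;
run;

proc summary data=v_step2 nway;  /* final step: aggregate into a real dataset */
  class region;
  var profit;
  output out=work.results sum=total_profit;
run;
```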
09-30-2017 02:43 PM
If you re-use a DATA step view, it is reconstructed each time from the original source, which is probably why you sometimes don't see the expected benefit when using the view multiple times. This is especially true when the view is a merge of multiple datasets, or when the view selects only a small subset of the original.
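To illustrate the re-execution cost: in the sketch below (dataset and variable names are hypothetical), BIGDATA is read in full twice, once for each procedure that references the view:

```
data v_small / view=v_small;   /* view: small subset of a large table */
  set work.bigdata;
  where year = 2017;
run;

proc freq data=v_small;        /* first use: BIGDATA is read in full here */
  tables region;
run;

proc means data=v_small;       /* second use: BIGDATA is read in full again */
  var sales;
run;
```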
One way to mitigate this problem is to simultaneously define a matching DATA step view (VNEED) and a physical dataset (NEED) in the same step, as below. The first use (PROC FREQ) calls VNEED, which in the background reads BIGDATA, streams the rows through VNEED, and writes NEED. Note that PROC FREQ doesn't have to wait for dataset NEED to be completely generated. The subsequent PROC UNIVARIATE then reads the physical dataset NEED.
data need vneed / view=vneed;  /* compiles view VNEED; NEED is written when VNEED first executes */
  set bigdata;
run;

proc freq data=vneed;          /* first use: reads BIGDATA through the view and writes NEED */
  tables .... ;
run;

proc univariate data=need;     /* later use: reads the physical dataset NEED */
run;
So depending on the relative size of BIGDATA vs. NEED, you might save considerable time by using the view only for the first access and the physical data file for the second and subsequent accesses.