A where clause on a view should be passed on for processing by the vie...

ChrisNZ · ‎08-02-2016

All tables are indexed, so this takes no time.


data ACCOUNT_444513219790;
  set XODSM.ACCOUNT_201101-XODSM.ACCOUNT_201112
      XODSM.ACCOUNT_201201-XODSM.ACCOUNT_201212
      XODSM.ACCOUNT_201301-XODSM.ACCOUNT_201312
      XODSM.ACCOUNT_201401-XODSM.ACCOUNT_201412
      XODSM.ACCOUNT_201501-XODSM.ACCOUNT_201512
      XODSM.ACCOUNT_201601-XODSM.ACCOUNT_201606;
  where IP_LOCAL_ID='444513219790';
run;

If I use a view, it takes minutes since the where clause is processed in the second data step.

data _V_ACCOUNTS/view=_V_ACCOUNTS;
  set XODSM.ACCOUNT_201101-XODSM.ACCOUNT_201112
      XODSM.ACCOUNT_201201-XODSM.ACCOUNT_201212
      XODSM.ACCOUNT_201301-XODSM.ACCOUNT_201312
      XODSM.ACCOUNT_201401-XODSM.ACCOUNT_201412
      XODSM.ACCOUNT_201501-XODSM.ACCOUNT_201512
      XODSM.ACCOUNT_201601-XODSM.ACCOUNT_201606;
run;
data ACCOUNT_444513219790; 
  set _V_ACCOUNTS;
  where IP_LOCAL_ID='444513219790';
run;

I would be better if the where clause was processed by the view.

Suggestion: Any where clause on a view should be passed on for processing by the view whenever possible.

An example of a case when this is not possible is when data sets include the where=option.

Another example is when the point= or key= options are used in the view.

RW9 · ‎08-03-2016

Errm, I don't think so. When you create a view, you are creating a piece of code which is stored. What you are saying is that some logic from other parts of the program should either be passed into, or overwrite the code which is in that file. This then invalidates the view, as code that you write could be different to code someone else writes and the view would then return different items. This should not be the case. I person a sets the view and person b sets the view they should both return *exactly* the same thing, regardless of what person a or b is doing with afterwards with that data. Take your first example, if I ran it and you ran it would give the same records. In your second example if you run that code, and I just set the data we will have different data, that is fundamentally against the point of views.

ChrisNZ · ‎08-03-2016

I don't understand what you are saying.

The view will always pass the same data as defined, but it doesn't have to be dumb and static, and can process clauses that are passed through to it.

The concept is hardly revolutionary and has existed for a long time. SAS/ACCESS views do this and more, and can pass (through) code including where clauses to the underlying database.

This does not mean that the data provided by the view has changed, just that it can receive and process some of the downstream code for better efficiency.

PGStats · ‎08-03-2016

@ChrisNZ, your example shows a view being used to regroup account data kept in separate tables. I would never consider such a strategy. You would have to provide a more compelling example to justify the development effort required to implement your suggestion.

Quentin · ‎08-03-2016

I think it's compelling. : )

I haven't tested, but sounds like the general case is:

Have a dataset with an index
Have a view of that dataset
Use the view in a DATA/PROC step that has a where clause

I would expect the where clause to make use of the index. As said, it's similar to implicit pass-through. It's just SAS deciding where to filter the data.

I note the docs have a section on using an Index with SAS views, that includes:

"You cannot create an index for a SAS view; it must be a data file. However, if a SAS view is created from an indexed data file, index usage is available. ***That is, if the view definition includes a WHERE expression using a key variable, then SAS attempts to use the index.*** There are other ways to take advantage of a key variable when using a SAS view. "

proc sql;
   create view stat as
   select * from crime
   where murder > 7;
quit;

proc sql;
  select * from stat where state > 42;
quit;

In that example CRIME is indexed, the WHERE clause is defined as part of the view STAT when it is created. When the second SQL step executes, the docs say the index will be used to subset by State, i.e. WHERE clause is pushed down.

From the *** sentence in the docs, looks like it's important that the view be defined with a where clause referencing and index key, which seems unfortunate. But maybe you could try adding WHERE not missing(Key); to your view definition, and see if that helps. Also looks like it varies between SQL views and data step views. The docs seems to say a DATA step/PROC step where clause would not be pushed down through a SQL view....

I haven't tested anything, but seems worth exploring some more

ChrisNZ · ‎08-03-2016

@PGStats This is bread and butter coding where I am now. You have snapshots, and if you want the history you use several snapshots. There are many uses for this history, such as building predictive models.
One benefit of an efficient view is that we could maintain a view that points to all the tables as they are added/removed so people wouldn't have to name them all each time (there are many more snapshots than my example's 5 years, and the list changes every month of course).

Anyway, this is just en example. The idea, more generally, is to make views more efficient. One would never think of removing the passthrough feature of SAS/ACCESS view, would one? It is so beneficial to pass on some of the processing upstream that it seems like an obvious thing to do. Likewise, I think that once this exists, it will seem obvious.

@Quentin Indeed! I suppose that "There are other ways to take advantage of a key variable when using a SAS view" means for example if the view contains KEY=. But it must be pre-defined in the view, which is very limiting. Hence my suggestion to enhance the way views work.

PhilC · ‎08-15-2016

The best way I can see to do this now is by using macros as in

%MACRO define_V_ACCOUNTS;
   %global _V_ACCOUNTS; 
   %let DSO=(where=(&&where_Statement));
   %let _V_ACCOUNTS=
      XODSM.ACCOUNT_201101-XODSM.ACCOUNT_201112 &DSO
      XODSM.ACCOUNT_201201-XODSM.ACCOUNT_201212 &DSO
      XODSM.ACCOUNT_201301-XODSM.ACCOUNT_201312 &DSO
      XODSM.ACCOUNT_201401-XODSM.ACCOUNT_201412 &DSO
      XODSM.ACCOUNT_201501-XODSM.ACCOUNT_201512 &DSO
      XODSM.ACCOUNT_201601-XODSM.ACCOUNT_201606 &DSO;
%MEND;%define_V_ACCOUNTS;

data ACCOUNT_444513219790;
  %let Where_statement=IP_LOCAL_ID='444513219790';
  set &_V_ACCOUNTS;
run;

I have thought of this too, I've figured it would be whole other evolution on a dataset view: A "dataset-function-view" if you will.

or a dataset option

data ACCOUNT_444513219790; 
  set _V_ACCOUNTS (special-view-where=(IP_LOCAL_ID='444513219790'));
run;

I like the idea.

ChrisNZ · ‎08-15-2016

That's a smart way to do it, @PhilC.

But not necessarily easy enough for users.

I don't think we need to change views into "dataset-function-views".

I think this suggestion is for an optimisation of the interpreter, which can complete the definition of the view of the fly, like sas/access views can.

The main reasons for this optimisation of the treatment of views are where clauses and keep/drop conditions.

When SPDE supports views (when will that be? when?) we can also pass order by clauses so they are processed by the source as well.

A where clause on a view should be passed on for processing by the view whenever possible