Solved: Re: Confusion about Where expression and By statement and writing orde...

ResoluteCarbon · Posted 04-02-2021 08:52 AM

Hi all SAS Users,

Today I faced this material, and this material says that I can not use where to deal with first.variable and last.variable from a dataset which has been sorted by this variable.

I am surprising about the yellowed sentence in this picture. For me, in the compilation phase, the compiler will go through the whole datastep to decide what will be the attributes (varname, length, type) for the output Table. It means that, to me, when the compiler go through the By statement, there should be the first.variable and last.variable overthere already?

I am wondering if I fall into a fallacy.

Many thanks in advance.

mkeintz · Posted 04-02-2021 08:18 PM

@ResoluteCarbon wrote:

Hi @Tom and @mkeintz

Thank you for your explanantion, I have some points not clear here

@Tom wrote:
In theory SAS might be able to allow FIRST. and LAST. references in WHERE statements to work, since at the time the WHERE executes it will be reading the data. But that would not work if the WHERE is going to push into a remote database.

Can you please explain this quotation for me, I am still ambiguous about that.

So if I filter out the some records with a WHERE statement then the FIRST.GROUP flag means the first observation in the group that makes it through the filter

Yes, it is right.

But what I really want to ask is: when the complier goes through the by statement, it means that the compiler already read the first.observation and last.observation to the attributes of the output datatset.

The compiler does NOT go through the the dataset - it only prepares executable code that goes through the dataset.

Let's say an example:
data singlestate;
set pg2.np_acres;
by ParkName State;
where first.state=1 AND last.state=1 ;
run;
Regarding the code below, before going to the first iteration, the compiler will read the whole datastep to decide what is in the attribute of the output datatset (outsourcing columns, flag columns, _error_,_n_). So, my understanding is that the attributes will include: all the variables from dataset pg2.np_acres,_error_,_n_, and first.state, and last.state. Then afterward, at the excution phase, where can treat first.state or last.state easily . But yeah, we all know that's wrong but I am not clear about that yet.

I am aware that we can use if clause easily in this situation, but I want to clarify how it works behind the scene that I can control the code down the roads,

Warmest regards,

To have an observation filtered on first.byvar and/or last.byvar you have to compare the observation-in-hand with the prior or subsequent observations - yet those observations would not have passed the where filter that you want to employ. Remember, the first.byvar and last.byvar dummies are built by the data step processor AFTER the data has got through the access engine.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

Tom · Posted 04-02-2021 09:35 AM

When compiling the data step SAS can know things like like the metadata of the dataset (aka "table"). So it can know the variable names and types. It does not attempt to actually look at the values of variables.

In theory SAS might be able to allow FIRST. and LAST. references in WHERE statements to work, since at the time the WHERE executes it will be reading the data. But that would not work if the WHERE is going to push into a remote database.

And also it would change the values of the FIRST. and LAST. flags. With the current implementation those flags are set based on the values that actually make it past the WHERE clause filter into the data step. So if I filter out the some records with a WHERE statement then the FIRST.GROUP flag means the first observation in the group that makes it through the filter. If you tried to use it in the WHERE clause it could really only mean the first observation in the group regardless of whether it meets the filtering requirement or not.

So just use a subsetting IF for any filtering that you need to do based on FIRST./LAST. flags.

ResoluteCarbon · Posted 04-02-2021 05:12 PM

Hi @Tom and @mkeintz

Thank you for your explanantion, I have some points not clear here

@Tom wrote:
In theory SAS might be able to allow FIRST. and LAST. references in WHERE statements to work, since at the time the WHERE executes it will be reading the data. But that would not work if the WHERE is going to push into a remote database.

Can you please explain this quotation for me, I am still ambiguous about that.

So if I filter out the some records with a WHERE statement then the FIRST.GROUP flag means the first observation in the group that makes it through the filter

Yes, it is right.

But what I really want to ask is: when the complier goes through the by statement, it means that the compiler already read the first.observation and last.observation to the attributes of the output datatset.

Let's say an example:

data singlestate;
set pg2.np_acres;
by ParkName State;
where first.state=1 AND last.state=1 ;
run;

Regarding the code below, before going to the first iteration, the compiler will read the whole datastep to decide what is in the attribute of the output datatset (outsourcing columns, flag columns, _error_,_n_). So, my understanding is that the attributes will include: all the variables from dataset pg2.np_acres,_error_,_n_, and first.state, and last.state. Then afterward, at the excution phase, where can treat first.state or last.state easily . But yeah, we all know that's wrong but I am not clear about that yet.

I am aware that we can use if clause easily in this situation, but I want to clarify how it works behind the scene that I can control the code down the roads,

Warmest regards,

Tom · Posted 04-02-2021 07:42 PM

@ResoluteCarbon wrote:

Hi @Tom and @mkeintz

Thank you for your explanantion, I have some points not clear here

@Tom wrote:
In theory SAS might be able to allow FIRST. and LAST. references in WHERE statements to work, since at the time the WHERE executes it will be reading the data. But that would not work if the WHERE is going to push into a remote database.

Can you please explain this quotation for me, I am still ambiguous about that.

Think of the connection to the remote database as a straw. SAS is going to suck the data through that straw. If SAS is forced to interpret the WHERE conditions it must suck the whole table through the straw and then filter out the observations to keep. But if it can push the WHERE condition into the remote database to run then it has to suck fewer observation through the straw saving a lot of time. But that means the WHERE condition has to be translated into the language of the remote database. Implementing FIRST./LAST. logic in that language might be hard or even impossible.

So if I filter out the some records with a WHERE statement then the FIRST.GROUP flag means the first observation in the group that makes it through the filter

Yes, it is right.

But what I really want to ask is: when the complier goes through the by statement, it means that the compiler already read the first.observation and last.observation to the attributes of the output datatset.

Let's say an example:
data singlestate;
set pg2.np_acres;
by ParkName State;
where first.state=1 AND last.state=1 ;
run;
Regarding the code below, before going to the first iteration, the compiler will read the whole datastep to decide what is in the attribute of the output datatset (outsourcing columns, flag columns, _error_,_n_). So, my understanding is that the attributes will include: all the variables from dataset pg2.np_acres,_error_,_n_, and first.state, and last.state. Then afterward, at the excution phase, where can treat first.state or last.state easily . But yeah, we all know that's wrong but I am not clear about that yet.

I am aware that we can use if clause easily in this situation, but I want to clarify how it works behind the scene that I can control the code down the roads,

Warmest regards,

The compiler just needs to setup the variables type/length/label/format/informat. For FIRST. and LAST. variable that is trivial as they are always numeric with no formats or labels attached and they are always excluded from being written the the remote database.

As to how SAS implements the FIRST./LAST. logic I am not sure that is published anywhere. But you can envision that it needs to look at the next observations after the one it is currently passing to the SET/MERGE/UPDATE statement in order the set the LAST. flags. And remember the values from the previous observation to set the FIRST. flags.

Speaking of compilers a shout out to Aho and Ullman who recently won the Turing prize.

https://awards.acm.org/about/2020-turing

mkeintz · Posted 04-02-2021 08:18 PM

@ResoluteCarbon wrote:

Hi @Tom and @mkeintz

Thank you for your explanantion, I have some points not clear here

@Tom wrote:
In theory SAS might be able to allow FIRST. and LAST. references in WHERE statements to work, since at the time the WHERE executes it will be reading the data. But that would not work if the WHERE is going to push into a remote database.

Can you please explain this quotation for me, I am still ambiguous about that.

So if I filter out the some records with a WHERE statement then the FIRST.GROUP flag means the first observation in the group that makes it through the filter

Yes, it is right.

But what I really want to ask is: when the complier goes through the by statement, it means that the compiler already read the first.observation and last.observation to the attributes of the output datatset.

The compiler does NOT go through the the dataset - it only prepares executable code that goes through the dataset.

Let's say an example:
data singlestate;
set pg2.np_acres;
by ParkName State;
where first.state=1 AND last.state=1 ;
run;
Regarding the code below, before going to the first iteration, the compiler will read the whole datastep to decide what is in the attribute of the output datatset (outsourcing columns, flag columns, _error_,_n_). So, my understanding is that the attributes will include: all the variables from dataset pg2.np_acres,_error_,_n_, and first.state, and last.state. Then afterward, at the excution phase, where can treat first.state or last.state easily . But yeah, we all know that's wrong but I am not clear about that yet.

I am aware that we can use if clause easily in this situation, but I want to clarify how it works behind the scene that I can control the code down the roads,

Warmest regards,

To have an observation filtered on first.byvar and/or last.byvar you have to compare the observation-in-hand with the prior or subsequent observations - yet those observations would not have passed the where filter that you want to employ. Remember, the first.byvar and last.byvar dummies are built by the data step processor AFTER the data has got through the access engine.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

ResoluteCarbon · Posted 04-03-2021 05:50 AM

Hi @mkeintz

Thank you for your explanantion,

Yes, you are right, I made a mistake that SAS does not go through the dataset in the compilation phase but the program....

Yes, and not yet gone through the dataset so no way to generate the first. and lag. var. It makes a clear sense now.

Many thanks and warm regards!

mkeintz · Posted 04-02-2021 12:39 PM

Remember that the WHERE statement (or WHERE dataset name parameter) is outsourced to the data engine (maybe excel, postgres, sas, et.).

Because it is outsourced to any of a large number of engines, SAS can't guarantee that each engine can look ahead (or look back) in order to detect changes in values of by-variables.

But besides the benefit of faster transfer of the desired subset of data, outsourcing where has another benefit: it can be used with most PROC steps as well as the data step.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Confusion about Where expression and By statement and writing order to PDV ?

Re: Confusion about Where expression and By statement and writing order to PDV ?

Re: Confusion about Where expression and By statement and writing order to PDV ?

Re: Confusion about Where expression and By statement and writing order to PDV ?

Re: Confusion about Where expression and By statement and writing order to PDV ?

Re: Confusion about Where expression and By statement and writing order to PDV ?

Re: Confusion about Where expression and By statement and writing order to PDV ?

Re: Confusion about Where expression and By statement and writing order to PDV ?

Classroom Training Available!