BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
ResoluteCarbon
Obsidian | Level 7

Hi all SAS Users,

 

Today I faced this material, and this material says that I can not use where to deal with first.variable and last.variable from a dataset which has been sorted by this variable.

ResoluteCarbon_1-1617367657012.png

 

I am surprising about the yellowed sentence in this picture. For me, in the compilation phase, the compiler will go through the whole datastep to decide what will be the attributes (varname, length, type) for the output Table. It means that, to me, when the compiler go through the By statement, there should be the first.variable and last.variable overthere already?

I am wondering if I fall into a fallacy.

 

Many thanks in advance.

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

@ResoluteCarbon wrote:

 Hi @Tom and @mkeintz 

Thank you for your explanantion, I have some points not clear here


@Tom wrote:

In theory SAS might be able to allow FIRST. and LAST. references in WHERE statements to work, since at the time the WHERE executes it will be reading the data.  But that would not work if the WHERE is going to push into a remote database.


Can you please explain this quotation for me, I am still ambiguous about that.

So if I filter out the some records with a WHERE statement then the FIRST.GROUP flag means the first observation in the group that makes it through the filter

Yes, it is right.

But what I really want to ask is: when the complier goes through the by statement, it means that the compiler already read the first.observation and last.observation to the attributes of the output datatset.

 

The compiler does NOT go through the the dataset - it only prepares executable code that goes through the dataset.  

 

Let's say an example:

data singlestate;
set pg2.np_acres;
by ParkName State;
where first.state=1 AND last.state=1 ;
run;

Regarding the code below, before going to the first iteration, the compiler will read the whole datastep to decide what is in the attribute of the output datatset (outsourcing columns, flag columns, _error_,_n_). So, my understanding is that the attributes will include: all the variables from dataset pg2.np_acres,_error_,_n_, and first.state, and last.state. Then afterward, at the excution phase, where can treat first.state or last.state easily . But yeah, we all know that's wrong but I am not clear about that yet.

 

I am aware that we can use if clause easily in this situation, but I want to clarify how it works behind the scene that I can control the code down the roads,

 

Warmest regards, 


To have an observation filtered on first.byvar and/or last.byvar you have to compare the observation-in-hand with the prior or subsequent observations - yet those observations would not have passed the where filter that you want to employ.  Remember, the first.byvar and last.byvar dummies are built by the data step processor AFTER the data has got through the access engine.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

6 REPLIES 6
Tom
Super User Tom
Super User

When compiling the data step SAS can know things like like the metadata of the dataset (aka "table"). So it can know the variable names and types.  It does not attempt to actually look at the values of variables.

 

In theory SAS might be able to allow FIRST. and LAST. references in WHERE statements to work, since at the time the WHERE executes it will be reading the data.  But that would not work if the WHERE is going to push into a remote database.

 

And also it would change the values of the FIRST. and LAST. flags.  With the current implementation those flags are set based on the values that actually make it past the WHERE clause filter into the data step.  So if I filter out the some records with a WHERE statement then the FIRST.GROUP flag means the first observation in the group that makes it through the filter.  If you tried to use it in the WHERE clause it could really only mean the first observation in the group regardless of whether it meets the filtering requirement or not.

 

So just use a subsetting IF for any filtering that you need to do based on FIRST./LAST. flags.

ResoluteCarbon
Obsidian | Level 7

 Hi @Tom and @mkeintz 

Thank you for your explanantion, I have some points not clear here


@Tom wrote:

In theory SAS might be able to allow FIRST. and LAST. references in WHERE statements to work, since at the time the WHERE executes it will be reading the data.  But that would not work if the WHERE is going to push into a remote database.


Can you please explain this quotation for me, I am still ambiguous about that.

So if I filter out the some records with a WHERE statement then the FIRST.GROUP flag means the first observation in the group that makes it through the filter

Yes, it is right.

But what I really want to ask is: when the complier goes through the by statement, it means that the compiler already read the first.observation and last.observation to the attributes of the output datatset.

Let's say an example:

data singlestate;
set pg2.np_acres;
by ParkName State;
where first.state=1 AND last.state=1 ;
run;

Regarding the code below, before going to the first iteration, the compiler will read the whole datastep to decide what is in the attribute of the output datatset (outsourcing columns, flag columns, _error_,_n_). So, my understanding is that the attributes will include: all the variables from dataset pg2.np_acres,_error_,_n_, and first.state, and last.state. Then afterward, at the excution phase, where can treat first.state or last.state easily . But yeah, we all know that's wrong but I am not clear about that yet.

 

I am aware that we can use if clause easily in this situation, but I want to clarify how it works behind the scene that I can control the code down the roads,

 

Warmest regards, 

Tom
Super User Tom
Super User

@ResoluteCarbon wrote:

 Hi @Tom and @mkeintz 

Thank you for your explanantion, I have some points not clear here


@Tom wrote:

In theory SAS might be able to allow FIRST. and LAST. references in WHERE statements to work, since at the time the WHERE executes it will be reading the data.  But that would not work if the WHERE is going to push into a remote database.


Can you please explain this quotation for me, I am still ambiguous about that.

Think of the connection to the remote database as a straw.  SAS is going to suck the data through that straw.  If SAS is forced to interpret the WHERE conditions it must suck the whole table through the straw and then filter out the observations to keep.  But if it can push the WHERE condition into the remote database to run then it has to suck fewer observation through the straw saving a lot of time.  But that means the WHERE condition has to be translated into the language of the remote database. Implementing FIRST./LAST. logic in that language might be hard or even impossible.

 

So if I filter out the some records with a WHERE statement then the FIRST.GROUP flag means the first observation in the group that makes it through the filter

Yes, it is right.

But what I really want to ask is: when the complier goes through the by statement, it means that the compiler already read the first.observation and last.observation to the attributes of the output datatset.

Let's say an example:

data singlestate;
set pg2.np_acres;
by ParkName State;
where first.state=1 AND last.state=1 ;
run;

Regarding the code below, before going to the first iteration, the compiler will read the whole datastep to decide what is in the attribute of the output datatset (outsourcing columns, flag columns, _error_,_n_). So, my understanding is that the attributes will include: all the variables from dataset pg2.np_acres,_error_,_n_, and first.state, and last.state. Then afterward, at the excution phase, where can treat first.state or last.state easily . But yeah, we all know that's wrong but I am not clear about that yet.

 

I am aware that we can use if clause easily in this situation, but I want to clarify how it works behind the scene that I can control the code down the roads,

 

Warmest regards, 


The compiler just needs to setup the variables type/length/label/format/informat.  For FIRST. and LAST. variable that is trivial as they are always numeric with no formats or labels attached and they are always excluded from being written the the remote database.   

 

As to how SAS implements the FIRST./LAST. logic I am not sure that is published anywhere.  But you can envision that it needs to look at the next observations after the one it is currently passing to the SET/MERGE/UPDATE statement in order the set the LAST. flags.  And remember the values from the previous observation to set the FIRST. flags.

 

Speaking of compilers a shout out to Aho and Ullman who recently won the Turing prize.

  https://awards.acm.org/about/2020-turing

image.png

mkeintz
PROC Star

@ResoluteCarbon wrote:

 Hi @Tom and @mkeintz 

Thank you for your explanantion, I have some points not clear here


@Tom wrote:

In theory SAS might be able to allow FIRST. and LAST. references in WHERE statements to work, since at the time the WHERE executes it will be reading the data.  But that would not work if the WHERE is going to push into a remote database.


Can you please explain this quotation for me, I am still ambiguous about that.

So if I filter out the some records with a WHERE statement then the FIRST.GROUP flag means the first observation in the group that makes it through the filter

Yes, it is right.

But what I really want to ask is: when the complier goes through the by statement, it means that the compiler already read the first.observation and last.observation to the attributes of the output datatset.

 

The compiler does NOT go through the the dataset - it only prepares executable code that goes through the dataset.  

 

Let's say an example:

data singlestate;
set pg2.np_acres;
by ParkName State;
where first.state=1 AND last.state=1 ;
run;

Regarding the code below, before going to the first iteration, the compiler will read the whole datastep to decide what is in the attribute of the output datatset (outsourcing columns, flag columns, _error_,_n_). So, my understanding is that the attributes will include: all the variables from dataset pg2.np_acres,_error_,_n_, and first.state, and last.state. Then afterward, at the excution phase, where can treat first.state or last.state easily . But yeah, we all know that's wrong but I am not clear about that yet.

 

I am aware that we can use if clause easily in this situation, but I want to clarify how it works behind the scene that I can control the code down the roads,

 

Warmest regards, 


To have an observation filtered on first.byvar and/or last.byvar you have to compare the observation-in-hand with the prior or subsequent observations - yet those observations would not have passed the where filter that you want to employ.  Remember, the first.byvar and last.byvar dummies are built by the data step processor AFTER the data has got through the access engine.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
ResoluteCarbon
Obsidian | Level 7

Hi @mkeintz 

Thank you for your explanantion,

Yes, you are right, I made a mistake that SAS does not go through the dataset in the compilation phase but the program....

ResoluteCarbon_0-1617443331370.png

Yes, and not yet gone through the dataset so no way to generate the first. and lag. var. It makes a clear sense now.

 

Many thanks and warm regards!

 

mkeintz
PROC Star

Remember that the WHERE statement (or WHERE dataset name parameter) is outsourced to the data engine (maybe excel, postgres, sas, et.).

 

Because it is outsourced to any of a large number of engines, SAS can't  guarantee that each engine can look ahead (or look back) in order to detect changes in values of by-variables.  

 

But besides the benefit of faster transfer of the desired subset of data, outsourcing where has another benefit: it can be used  with most PROC steps as well as the data step.

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 643 views
  • 4 likes
  • 3 in conversation