What does the By statement do in a data step?

Accepted Solution Solved
Reply
Contributor
Posts: 37
Accepted Solution

What does the By statement do in a data step?

I know this is a really stupid question. I've googled it but I'm havin trouble understanding what the by statement actually does in a data step!

 

I know that you need to proc sort first by the variable...

 

but what is it actually doing?

 

So for example in a data step like this:

 

data test;
	set test;
	by mrn program serv_date provider amount description; 
	firstmrn  = first.mrn;
	firstdesc = first.description;
	if firstdesc = 0 then delete;
run;

What is the by mrn program sev_date etc actually doing? How is it different from the proc sort?

 

please help! thank you!


Accepted Solutions
Solution
‎05-17-2017 10:03 PM
PROC Star
Posts: 7,471

Re: What does the By statement do in a data step?

Posted in reply to confused_saser

Your data set, test, has at least the following variables: 

mrn program serv_date provider amount description

Assuming the data were sorted by those variables, in that order, each record with the same mrn is now together and, within those records, all of the programs related to that mrn are together and, within them, the records for each serv_date (for that provider within that mrn) are together, etc, etc.

The by statement lets your datastep know that the records are ordered that way.

Within each group, first.variable (e.g., first.description) is equal to one for the first record for a particular description (with the same mrn, program, serv_date, provider and amount), and 0 for all other records with that same description.

As such, where a description is repeated (within a particular mrn, program, serv_date, provider and amount, your code is deleting all but the first description.

 

Art, CEO, AnalystFinder.com

 

View solution in original post


All Replies
Super User
Super User
Posts: 7,039

Re: What does the By statement do in a data step?

Posted in reply to confused_saser

It really does the same thing that it does when you use it with the MERGE or UPDATE statement.

It will just test that the data is sorted by the variables listed. If the data is out of order the data step will stop.

It creates the FIRST. and LAST. automatic flag variables that you can use to check when the BY variables change.

 

The FIRST. and LAST. variables will have values of either 0 or 1 and you can test for those values like in your IF/THEN statement. But it is probably easier to just treat them as the boolean values that they are.  So your IF statement could be re-written as 

if not first.description then delete;

and it will be more descriptive.

 

In your example you have 6 variables in your BY statement so there are 12 of these automatic variables.  Your code references just two of them that you save into permanent dataset variables. You then are eliminating all but the first record in each group by deleting the observations that are not FIRST.DESCRIPTION. So other than saving those two variables (one of which will always be 1) your code is the same as using the NODUPKEY option on PROC SORT.

 

 

Super User
Super User
Posts: 7,942

Re: What does the By statement do in a data step?

Posted in reply to confused_saser

You dont actually need to sort first, you can use the unsorted option:
by x y z notsorted;

 

What is basically does is allow processing over groups of data, by assigning a first and last marker to each block.  You can think of it like this:
by <group1>;

 

GROUP1  FIRST.GROUP1  LAST.GROUP1   

...            1                        .

...             .                       1

 

If ther are multiple groups on the by list, then it works left as highest group going right down the grouping, so:

GROUP1 GROUP2 GROUP3   FIRST.GROUP1  LAST.GROUP1  FIRST.GROUP2 LAST.GROUP2  FIRST.GROUP3 LAST.GROUP3

...            ...           ...             1                       .                       1                      .                       1                       .

...            ...           ...             .                                                .                                               1                      1

...            ...           ...             .                                                1                                              1                      1

 

Bit hard to draw out, but you can seefirst is assigned as 1 for the group only on the first occurency, and 1 for the last.  Then within the group for each second group 1 for first and 1 for last etc.

Thus you can do logic such as do logic on first record:

if first.group1 then ...;

 

Solution
‎05-17-2017 10:03 PM
PROC Star
Posts: 7,471

Re: What does the By statement do in a data step?

Posted in reply to confused_saser

Your data set, test, has at least the following variables: 

mrn program serv_date provider amount description

Assuming the data were sorted by those variables, in that order, each record with the same mrn is now together and, within those records, all of the programs related to that mrn are together and, within them, the records for each serv_date (for that provider within that mrn) are together, etc, etc.

The by statement lets your datastep know that the records are ordered that way.

Within each group, first.variable (e.g., first.description) is equal to one for the first record for a particular description (with the same mrn, program, serv_date, provider and amount), and 0 for all other records with that same description.

As such, where a description is repeated (within a particular mrn, program, serv_date, provider and amount, your code is deleting all but the first description.

 

Art, CEO, AnalystFinder.com

 

Contributor
Posts: 37

Re: What does the By statement do in a data step?

Thanks @art297. I think this makes sense to me. However, can you please explain what would happen if I dind't use the by statement?

 

Why do I need to tell my datastep that those variables are sorted? If I dind't use the by statement would it still intuitively know that it's sorted?

PROC Star
Posts: 7,471

Re: What does the By statement do in a data step?

Posted in reply to confused_saser

The first.variable (and there's also a last.variable) don't exist without declaring those variables in a by statement. However, to confuse you further, the data don't have to be sorted for those to work. One can specify a by statement and include the option 'notsorted', and the first.variable(s) and last.variable(s) will still exist and indicated when there is a change from one record to the next.

 

Art, CEO, AnalystFinder.com

 

Super User
Super User
Posts: 7,039

Re: What does the By statement do in a data step?

Posted in reply to confused_saser

If you don't use a BY statement then the data step doesn't have any reason to care if the data is sorted or not.

 

One other imapct that a BY statement can have is that if the input data set has an index for the BY variables then SAS will use the index to read the data in the order specified even if it is not physically sorted in that order.  Similarly if you are reading from an external database, like ORACLE or Teradata, then SAS can ask the database to sort the records before passing them to your data step.

 

You can mimic the calculation of the FIRST. variable of the BY statement in the NOTSORTED mode by using the LAG() function.  Try this little program.

data x ;
  set sashelp.class ;
  by sex notsorted ;
  f1= first.sex ;
  f2= sex ne lag(sex) ;
run;

proc print; 
  var name sex f1 f2 ;
run;
☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 6 replies
  • 109 views
  • 2 likes
  • 4 in conversation