BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
confused_saser
Obsidian | Level 7

I know this is a really stupid question. I've googled it but I'm havin trouble understanding what the by statement actually does in a data step!

 

I know that you need to proc sort first by the variable...

 

but what is it actually doing?

 

So for example in a data step like this:

 

data test;
	set test;
	by mrn program serv_date provider amount description; 
	firstmrn  = first.mrn;
	firstdesc = first.description;
	if firstdesc = 0 then delete;
run;

What is the by mrn program sev_date etc actually doing? How is it different from the proc sort?

 

please help! thank you!

1 ACCEPTED SOLUTION

Accepted Solutions
art297
Opal | Level 21

Your data set, test, has at least the following variables: 

mrn program serv_date provider amount description

Assuming the data were sorted by those variables, in that order, each record with the same mrn is now together and, within those records, all of the programs related to that mrn are together and, within them, the records for each serv_date (for that provider within that mrn) are together, etc, etc.

The by statement lets your datastep know that the records are ordered that way.

Within each group, first.variable (e.g., first.description) is equal to one for the first record for a particular description (with the same mrn, program, serv_date, provider and amount), and 0 for all other records with that same description.

As such, where a description is repeated (within a particular mrn, program, serv_date, provider and amount, your code is deleting all but the first description.

 

Art, CEO, AnalystFinder.com

 

View solution in original post

6 REPLIES 6
Tom
Super User Tom
Super User

It really does the same thing that it does when you use it with the MERGE or UPDATE statement.

It will just test that the data is sorted by the variables listed. If the data is out of order the data step will stop.

It creates the FIRST. and LAST. automatic flag variables that you can use to check when the BY variables change.

 

The FIRST. and LAST. variables will have values of either 0 or 1 and you can test for those values like in your IF/THEN statement. But it is probably easier to just treat them as the boolean values that they are.  So your IF statement could be re-written as 

if not first.description then delete;

and it will be more descriptive.

 

In your example you have 6 variables in your BY statement so there are 12 of these automatic variables.  Your code references just two of them that you save into permanent dataset variables. You then are eliminating all but the first record in each group by deleting the observations that are not FIRST.DESCRIPTION. So other than saving those two variables (one of which will always be 1) your code is the same as using the NODUPKEY option on PROC SORT.

 

 

RW9
Diamond | Level 26 RW9
Diamond | Level 26

You dont actually need to sort first, you can use the unsorted option:
by x y z notsorted;

 

What is basically does is allow processing over groups of data, by assigning a first and last marker to each block.  You can think of it like this:
by <group1>;

 

GROUP1  FIRST.GROUP1  LAST.GROUP1   

...            1                        .

...             .                       1

 

If ther are multiple groups on the by list, then it works left as highest group going right down the grouping, so:

GROUP1 GROUP2 GROUP3   FIRST.GROUP1  LAST.GROUP1  FIRST.GROUP2 LAST.GROUP2  FIRST.GROUP3 LAST.GROUP3

...            ...           ...             1                       .                       1                      .                       1                       .

...            ...           ...             .                                                .                                               1                      1

...            ...           ...             .                                                1                                              1                      1

 

Bit hard to draw out, but you can seefirst is assigned as 1 for the group only on the first occurency, and 1 for the last.  Then within the group for each second group 1 for first and 1 for last etc.

Thus you can do logic such as do logic on first record:

if first.group1 then ...;

 

art297
Opal | Level 21

Your data set, test, has at least the following variables: 

mrn program serv_date provider amount description

Assuming the data were sorted by those variables, in that order, each record with the same mrn is now together and, within those records, all of the programs related to that mrn are together and, within them, the records for each serv_date (for that provider within that mrn) are together, etc, etc.

The by statement lets your datastep know that the records are ordered that way.

Within each group, first.variable (e.g., first.description) is equal to one for the first record for a particular description (with the same mrn, program, serv_date, provider and amount), and 0 for all other records with that same description.

As such, where a description is repeated (within a particular mrn, program, serv_date, provider and amount, your code is deleting all but the first description.

 

Art, CEO, AnalystFinder.com

 

confused_saser
Obsidian | Level 7

Thanks @art297. I think this makes sense to me. However, can you please explain what would happen if I dind't use the by statement?

 

Why do I need to tell my datastep that those variables are sorted? If I dind't use the by statement would it still intuitively know that it's sorted?

art297
Opal | Level 21

The first.variable (and there's also a last.variable) don't exist without declaring those variables in a by statement. However, to confuse you further, the data don't have to be sorted for those to work. One can specify a by statement and include the option 'notsorted', and the first.variable(s) and last.variable(s) will still exist and indicated when there is a change from one record to the next.

 

Art, CEO, AnalystFinder.com

 

Tom
Super User Tom
Super User

If you don't use a BY statement then the data step doesn't have any reason to care if the data is sorted or not.

 

One other imapct that a BY statement can have is that if the input data set has an index for the BY variables then SAS will use the index to read the data in the order specified even if it is not physically sorted in that order.  Similarly if you are reading from an external database, like ORACLE or Teradata, then SAS can ask the database to sort the records before passing them to your data step.

 

You can mimic the calculation of the FIRST. variable of the BY statement in the NOTSORTED mode by using the LAG() function.  Try this little program.

data x ;
  set sashelp.class ;
  by sex notsorted ;
  f1= first.sex ;
  f2= sex ne lag(sex) ;
run;

proc print; 
  var name sex f1 f2 ;
run;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 1159 views
  • 2 likes
  • 4 in conversation