Hello, I wonder what first.var1 represents in this code?
proc sort data = have;
	by var1 var2 var3;
run;
data want;
	set have;
	by var1 var2 var3;
	if first.var1;
run;
Hi:
You have to read about BY group processing in the DATA step. FIRST. and LAST. variables are automatic variables that are "turned on" when you use BY group processing in the DATA step. These variables are not automatically written to the output dataset. So you have to capture the values, if you want to examine them.
Basically, when the value of FIRST.byvar is 1, the current row is the first of the BY group, and when the value of FIRST.byvar is 0, the current row is NOT the first of the group. Similarly, when LAST.byvar=1 the current row is the last of the group, and when LAST.byvar=0 the current row is NOT the last of the group.
This is easier to see with some actual data, where you capture the values as shown in the program below. I modified your program to create 2 datasets: one that shows ALL the rows in the original input data, and another that is output after the subsetting IF statement:
data have;
  infile datalines;
  input var1 $ var2 $ var3 $;
datalines;
aaa 11a  111
aaa 11a  222
aaa 11a  222
bbb 11b  333
bbb 11b  444
ccc 22a  111
ccc 22a  222
ccc 22b  333
;
run;
  
proc sort data = have;
   by var1 var2 var3;
run;
data wantall wantonlyfirstvar1 ;
  set have;
  by var1 var2 var3;
  first_var1 = first.var1;
  last_var1 = last.var1;
  first_var2 = first.var2;
  last_var2 = last.var2;
  first_var3 = first.var3;
  last_var3 = last.var3;
  output wantall;
  if first.var1;
  output wantonlyfirstvar1;
run;
  
proc print data=wantall;
  title 'show all values for first. and last. automatic variables for all rows in original HAVE dataset';
run;
   
proc print data=wantonlyfirstvar1;
  title 'show result of using first.var1 subsetting if to see why only get 3 obs in this output dataset';
run;
title;
Of course, the second half of your question is implied and has to do with this statement:
if first.var1;
This is a subsetting IF statement. It acts like a gate that lets observations pass, or not pass, to the rest of the logic in the program. In this instance the subsetting IF controls what is output. Based on the sample data above, there are only 3 rows where FIRST.VAR1=1, so the "gate" allows only those 3 rows to reach the end of the program, where they are output to the final dataset. My program has EXPLICIT OUTPUT statements so that I can create 2 datasets; your program has an IMPLICIT output at the end of the DATA step, so only the observations that pass the subsetting IF are written to your dataset WANT.
Here's the output from the above program.
cynthia
This is a "must have" tool if you're going to program using SAS.
first.var1 is created by the BY statement in the DATA step. Its value is always 1 or 0. As the DATA step progresses through the incoming data, whenever VAR1 takes on a new value, first.var1 is 1. Otherwise, first.var1 is 0.
The IF statement considers 1 to be true, and 0 to be false. So the DATA step is selecting the first observation for each value of VAR1.
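As a rough sketch of what that IF statement is doing, the three steps below should all keep the same rows. The dataset names WANT_A, WANT_B and WANT_C are made up for illustration, and the sketch assumes HAVE is already sorted by var1 var2 var3.

```sas
/* Sketch only: three ways to keep the first observation of each VAR1 group. */
/* Assumes HAVE is already sorted by var1 var2 var3.                         */
/* WANT_A, WANT_B and WANT_C are hypothetical names used for illustration.   */

data want_a;
  set have;
  by var1 var2 var3;
  if first.var1;                  /* subsetting IF: continue only when first.var1 is 1 */
run;

data want_b;
  set have;
  by var1 var2 var3;
  if first.var1 = 1 then output;  /* an explicit OUTPUT turns off the implicit output at the end of the step */
run;

data want_c;
  set have;
  by var1 var2 var3;
  if not first.var1 then delete;  /* DELETE ends this iteration, so the row is never written */
run;
```

The first form is the shortest, which is probably why you see it most often.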
The final basic: To be allowed to use a BY statement in a DATA step, your observations have to be in order. Usually that means running PROC SORT first, but if the observations are already in order for any reason, you don't have to use PROC SORT on top of that.
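Here is a small sketch of the "already in order" case. The dataset names INORDER and FIRSTROWS and the four data rows are made up for illustration; the point is only that no PROC SORT is needed when the rows arrive in BY order.

```sas
/* Sketch only: the rows are typed in already ordered by var1, */
/* so the BY statement works without a PROC SORT step.         */
/* INORDER and FIRSTROWS are hypothetical dataset names.       */
data inorder;
  input var1 $ var2 $;
datalines;
aaa 11a
aaa 11a
bbb 11b
ccc 22a
;
run;

data firstrows;
  set inorder;
  by var1;        /* valid because the observations are already in VAR1 order */
  if first.var1;  /* keeps one row each for aaa, bbb and ccc */
run;
```

If the rows were not in order, the second DATA step would stop with an error in the log saying that the BY variables are not properly sorted.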
These are the basics only, but definitely enough to get started.
Thank you. Does the order of the BY variables matter, as in the code below? Here, can we say that we want the first occurrence of var1 within each var2?
proc sort data = have;
	by var2 var1 var3;
run;
data want;
	set have;
	by var2 var1 var3;
	if first.var1;
run;
Try it.
Create a variable that holds the value of first.var1 and explore how it changes as you change your BY groupings.
proc sort data = have;
   by var2 var1 var3;
run;
 
data want;
  set have;
  by var2 var1 var3;
  first_var1=first.var1;
run;
proc print data=want;
var var2 var1 first_var1;
run;
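If it helps, here is a fuller sketch that rebuilds the sample HAVE data from earlier in the thread and captures both flags under the new BY order. HAVE_V2V1 and CHECK are made-up dataset names. With BY var2 var1 var3, first.var1 is set to 1 whenever var1 changes or whenever var2 (which is listed before it) changes, so it marks the first row of each var2/var1 combination.

```sas
/* Sketch only: same sample rows as the HAVE dataset earlier in the thread. */
data have;
  input var1 $ var2 $ var3 $;
datalines;
aaa 11a  111
aaa 11a  222
aaa 11a  222
bbb 11b  333
bbb 11b  444
ccc 22a  111
ccc 22a  222
ccc 22b  333
;
run;

/* HAVE_V2V1 and CHECK are hypothetical names used for illustration. */
proc sort data=have out=have_v2v1;
  by var2 var1 var3;
run;

data check;
  set have_v2v1;
  by var2 var1 var3;
  first_var2 = first.var2;  /* 1 on the first row of each VAR2 value */
  first_var1 = first.var1;  /* 1 on the first row of each VAR2/VAR1 combination */
run;

proc print data=check;
  var var2 var1 var3 first_var2 first_var1;
run;
```

With this sample data, first_var1 should be 1 on 4 rows instead of 3, because the ccc row under 22b starts a new var2 group and is flagged again. So yes, the order of the BY variables matters.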
