Motivation:
In my recent interactions with SAS customers (programmers), a question came up a few times. ‘Does SAS support SQL Window Functions?’. ‘I got some code that was handed to me with SQL Window functions. How can I port them to run in SAS?’. I realized that while it might be second nature to some, it can be daunting to those who haven’t heard of SQL Window Function or are beginner programmers in SAS or both. This motivated me to create this content about some common, basic SQL Window Functions and how to implement them in SAS code.
Introduction:
According to the Oracle documentation, a SQL window function performs an aggregate-like operation on a set of query rows. However, whereas an aggregate operation groups query rows into a single result row, a window function produces a result for each query row.
As per PostgreSQL documentation, a window function performs a calculation across a set of table rows that are somehow related to the current row.
SQL Window functions are supported by a lot of different databases but the example code shown here were executed in Oracle.
Aggregation Vs Window Function: Consider the following data from sashelp.class dataset.
If we aggregated (say took the average of) the height and weight by age and sex, the output would like this. One output record is created for each unique value of Age and Sex variables and the average height and weight for those distinct groups are displayed. As an example, the meanHeight and meanWeight for the group Sex=’F’ and Age=13 was calculated using the height and weight values for the 2 students in this group, Alice and Barbara.
|
In contrast, a window function applies aggregated value to each of the rows without collapsing it into a single row per group.
|
While in SQL there are Window functions that allow you to create output as shown above, SAS’ Proc SQL does not support SQL Window functions. Note that you can run SQL Window functions in explicit PROC SQL passthru code submitted to external databases, but they do not work in implicit Passthru code or code that works on SAS datasets.
Methods:
In this article, we will review the following commonly used SQL Window functions and how to implement it in SAS code.
In each example, we will look at:
Note: The dataset sashelp.class was loaded into Oracle so the output from SQL Window Functions and the equivalent SAS output can be compared.
Example 1: A simple case: Windowing function to aggregate over a partition
Let us review the code below to understand what the SQL Window function is doing.
/* Window function in Oracle SQL */
/* calculate mean height, mean weight by the partition variables and assign it to each record in the table */
SELECT name,
sex,
age,
height,
weight,
AVG(height) over (partition BY sex,age) AS meanheight,
AVG(weight) over (partition BY sex,age) AS meanWeight
FROM oradat1.CLASS;
In the code above, the “over (partition by sex, age)” clause indicates the use of SQL Window function. In simple words, it means, partition the data into a group defined by distinct values of sex and age and calculate the average height and average weight for each group (without collapsing the records into unique values).
The output from this SQL query in Oracle looks like this.
Before we proceed further, it is important to understand that the output of the Window function shown above is simply an aggregation performed over the partition variables and remerged back with the original dataset on the same partition variables. Here is the aggregation code and the output from SQL.
/* Average height and weight by sex and age - Oracle */
SELECT sex,
age,
AVG(height),
AVG(weight)
FROM oradat1.class
GROUP BY sex, age;
Query Output:
By simply merging this above aggregation output with the original CLASS dataset on ‘sex’ and ‘age’ variables, we can achieve the result of the SQL Window function in example 1. The process of joining a full table with its own summary is referred to as ‘remerging’ in SAS terminology.
Method 1: Using Proc SQL
/* mimic SQL Window function in SAS Proc SQL */
proc sql;
select *,
mean(height) as meanHeight,
mean(weight) as meanWeight
from sashelp.class
group by age, sex;
quit;
The code above is select all the columns but only calculating the meanHeight and meanWeight on age and sex columns.
The log contains the following message.
SAS Proc SQL Output is the same as the output using SQL window function executed in Oracle:
Method 2: Using Proc SQL (explicity joining the table with its summary table).
I am including this alternate approach since some programmers prefer to use this method to perform the remerging.
/* mimic SQL Window function in proc SQL */
/* remerge with aggregated values */
proc sql;
select a.*,b.meanWeight,b.meanHeight from
sashelp.class a,
(
select age, sex, mean(height) as meanHeight,
mean(weight) as meanWeight
from sashelp.class
group by age, sex
) b
where
a.age=b.age and a.sex=b.sex;
quit;
The full sashelp.class dataset (alias a) is joined (remerged) with the query which computes the average Height and Weight by age and sex (alias b) on age and sex.
Output:
Method 3: proc means to calculate averages and remerge using data step.
First, we sort the data on the BY variables. Then PROC MEANS is used to calculate the average HEIGHT and WEIGHT by AGE and SEX variables and saved to a summary dataset. Finally the DATA STEP Merge statement is used to merge the summary dataset back to the original dataset by AGE and SEX.
proc sort data=sashelp.class out=class;
by sex age;
run;
/* calculate summary stats */
proc means data=class;
by age sex;
var height weight;
output out=classAvg mean(height weight)=meanHeight meanWeight;
run;
/* remerge summary data back with original dataset*/
data class2;
merge class (in=in1) classAvg (in=in2);
by age sex;
if in1 and in2;
drop _type_ _freq_;
run;
Equivalent SAS Output (after sorting to match the original order):
Note that both the PROC SQLs and SAS DATA STEP methods produced the same output as the SQL Window function.
Example 2: Select row_number() over (order by vars1, vars2)
As the name suggests the row_number() window function simply generates row numbers. When used as shown below with only an order by, this sorts the rows by the variables and adds a row number.
SELECT name,
sex,
age,
height,
weight,
row_number() over (order BY sex,age) AS rowNumber
FROM oradat1.CLASS;
SQL Window function output:
Equivalent SAS Code:
/*sort the data by sex and age*/
proc sort data=sashelp.class out=class;
by sex age;
run;
/*add rownumber in the datastep*/
data classOut;
set class;
by sex age;
RowNumber=_N_;
run;
RowNumber is added to the dataset by using the _N_ DATASTEP variable.
Equivalent SAS output:
Example 3: Select row_number() over (partition by vars1 order by var2)
/*Add rowNumber for each AGE partition order by sex*/
SELECT name,
sex,
age,
height,
weight,
row_number() over (partition BY age order by sex) AS rowNumber
FROM oradat1.CLASS;
Using the row_number() window function with ‘partition by age sort by sex’, the data is sorted by sex within each AGE partition (one partition per unique value of AGE) and the rows are order within each partition starting from 1 and onward.
SQL Window function output:
Notice that within each AGE group, the records are sorted by SEX and the numbering for each AGE group (partition) begins at 1.
Equivalent SAS Code:
/*sort the data by AGE (partition var first) and SEX (sort var next)*/
proc sort data=sashelp.class out=class;
by age sex;
run;
/*reset rownumber for each new AGE group */
data classOut;
set class;
by age sex;
retain RowNumber 1;
if first.age then RowNumber=1;
else RowNumber=RowNumber+1;
run;
Retain statement is used to retain the value of RowNumber from one iteration to the next. Sum statement may be used instead of an explicit retain statement.
SAS output:
Example 4: Select rank() over (partition by vars1 order by var2)
Rank() window function is a bit different than the row_number() in that row_number() will assign a new rowNumber value even if there are duplicate values based on the order by variables. However, rank will assign the same value when it encounters duplicates based on the order by variables.
In order to illustrate this, the following updated dataset is used.
Notice that the height value for Robert and Mary have been updated to create a duplicate value within the age group and sex.
/*Difference between row_number() and rank() */
SELECT name,
sex,
age,
height,
weight,
row_number() over (partition BY age order by height) AS rowNumber,
rank() over (partition BY age order by height) AS rank
FROM oradat1.CLASS;
Output of SQL window function:
Note that output from the row_number() function incremented the value of RowNumber when there was a duplicate value based on Height (order by variable) whereas the rank() function did not increment the value when the duplicate was encountered. Thus the row_number() function is used to identify the total number of records in the partition, whereas the rank() function is used to identify the total number of unique records in a partition based on the order by variable.
Equivalent SAS Code:
/*sort the data by AGE (partition var first) and HEIGHT (sort var next)*/
proc sort data=class;
by age height;
run;
/*reset rownumber for each new AGE group */
/* Only increment rank if first.height is true */
data classOut;
set class;
by age height;
retain RowNumber 1 Rank 1;
if first.age then do;
RowNumber=1;
Rank=1;
end;
else do;
RowNumber=RowNumber+1;
if first.height then Rank=RowNumber;
end;
run;
SAS output:
The output is exactly the same as the Window function and is achieved easily using the data step functionality.
Example 5: Select dense_rank() over (partition by vars1 order by var2)
Dense_rank() window function is a slightly different than the rank() function, in that the former will not skip any values after it encounters duplicates whereas the latter does.
Here is the code and output comparison between the 2 functions.
/*Difference between row_number(), rank() and dense_rank() */
SELECT name,
sex,
age,
height,
weight,
row_number() over (partition BY age order by height) AS rowNumber,
rank() over (partition BY age order by height) AS rank,
dense_rank() over (partition BY age order by height) AS denseRank
FROM oradat1.CLASS;
Output of SQL Window Function:
Note that while the rank and dense_rank() functions produce the same value for the first set of duplicate value for Height, for any subsequent records in the same partition (AGE group here), dense_rank() does not skip any values and increments to 4, whereas rank() skips a number to indicate the number of duplicates and increments to 5 for the next record.
Equivalent SAS Code:
/*DENSE_RANK()*/
/*sort the data by AGE (partition var first) and HEIGHT (sort var next)*/
proc sort data=class;
by age height;
run;
/*reset rownumber for each new AGE group */
/* Only increment rank if first.height is true */
data classOut;
set class;
by age height;
retain RowNumber 1 Rank 1 DenseRank 1;
if first.age then do;
RowNumber=1;
Rank=1;
DenseRank=1;
end;
else do;
RowNumber=RowNumber+1;
if first.height then do;
Rank=RowNumber;
DenseRank=DenseRank + 1;
end;
end;
run;
SAS Output:
Note that the SAS output is the same as the SQL Window function output.
Example 6: lag() over partition() to access previous records
Consider the following table (partial output) which has records for height and weight for each student for 2 consecutive years.
The use of lag() function allows the access of the prior record from the current record to compare the change in height or weight values from one year to the previous year.
SELECT *
FROM
(SELECT name,
lag(height,1) over (partition BY name order by age) AS HeightLastYear,
height AS heightThisYear
FROM class3
) a
WHERE a.HeightLastYear IS NOT NULL;
The inner select statement uses the lag() function with a parameter of 1, meaning look at the immediate previous record. The over partition variable is name, since we want to compare the change in height for the same person (and here the names are unique) and we order by age, since we want the data ordered by age. The first record for each person will have missing values for previous year, so that is removed using the outer SQL select statement.
SQL Window function Output:
Equivalent SAS Code:
proc sort data=class3 (drop=sex weight); by name age; run; data class4 (rename=(height=heightThisYear));
set class3;
by name age;
HeightLastYear=lag(height);
/* if there are no duplicate records by name, then there is no last year value */
if first.name and last.name then HeightLastYear=.;
/* if there are duplicate values, then delete the first record*/
if first.name and not last.name then delete;
drop age;run;
Equivalent SAS output:
Example 7: lead() over partition() to access next record
The use of lead() function allows the access of the next record from the current record to compare the change in height or weight values from one year to the next.
We will use the input table from the previous example which has records for height and weight for each student for 2 consecutive years (except instead of looking to the previous record, we will look at the next record).
SELECT *
FROM
(SELECT name,
lead(height,1) over (partition BY name order by age) AS HeightNextYear,
height AS heightThisYear
FROM class3
) a
WHERE a.HeightNextYear IS NOT NULL;
The inner select statement uses the lead() function with a parameter of 1, meaning look at the immediate next record. The over partition variable is name, since we want to compare the change in height for the same person (and here the names are unique) and we order by age, since we want the data ordered by age. The last record for each person will have missing values for next year, so that is removed using the outer SQL select statement.
SQL Window function Output:
Equivalent SAS Code:
Implementing lead() function in SAS is not as straightforward as the lag() function because there is no function in SAS that allows you to look forward to the next record. However, there are programming techniques that allows you to implement the lead() function in SAS.
/*lead() function implementation using data step*/
data class4;
set class3 (rename=(height=heightThisYear));
by name age;
if eof=0 then
set class3(firstobs=2 keep=height rename=(height=heightNextYear)) end=eof;
else heightNextYear=.;
if last.name then
delete;
drop age;
run;
There is a lot happening in this code, so let us dissect it a bit more.
The first set statement is reading the input dataset by name and immediately renaming the variable height to HeightThisYear.
The second set statement is also reading the same input dataset but starts reading from observation #2 (firstobs=2).
It only keeps the height variable and renames it to HeightNextYear.
In essence, when the first statement is reading the first record of the input dataset, the second set statement is reading the height from the next record, thus implementing the lead() function.
Since the last record in the group (name) will not have any value for HeightNextYear, we delete the records where last.name=True.
Note: There are other methods to implement the lead() function in SAS but this might be the easiest.
Equivalent SAS output:
Conclusion:
SQL Window Functions provides some excellent capabilities. Although these are not directly available in SAS’s PROC SQL for SAS Datasets and Implicit passthru queries, the examples above illustrate that it can be fairly easily implemented in SAS using PROC SQL, Data Step and other Procs. Hope these are useful in your SAS exploration and journey.
Happy coding!
References:
Acknowlegements:
I'd like to thank @joeFurbee for his guidance on publishing the article here in the community library page.
Nice article, If I may, just my 2 cents:
1) SQL needs window functions because there is no BY-group processing ides in SQL, in contrary to SAS.
The moment "non-SAS-users" understand that we have BY-group processing in SAS, they will immediately understand this is the way we do "windows".
2) Just for fun:
/* @ Example 1. By Hash Tables. No sorting required, linear data pass. */
data want;
declare hash H();
H.defineKey("age","sex");
H.defineData("__sumH","__nH");
H.defineData("__sumW","__nW");
H.definedone();
do until(_E1_);
set sashelp.class end=_E1_;
if H.Find() then
do;
__sumH = height;
__sumW = weight;
__nH = height>.z;
__nW = weight>.z;
end;
else
do;
__sumH + height;
__sumW + weight;
__nH + height>.z;
__nW + weight>.z;
end;
H.replace();
put age= sex= (__:)(=);
end;
do until(_E2_);
set sashelp.class end=_E2_;
__rc =H.Find();
meanHeight = divide(__sumH,__nH);
meanWeight = divide(__sumW,__nW);
output;
end;
stop;
drop __:;
run;
proc print data= want;
run;
Bart
'The moment "non-SAS-users" understand that we have BY-group processing in SAS, they will immediately understand this is the way we do "windows".'
My hope was to ease the transition for "non-SAS-users" into using SAS native ways of doing Window functions, but you have captured the essence of the issue really well! Getting to know the BY-groups well is the key!
What a wonderful and fun piece of code using hash tables and methods and following even lesser known SAS conventions! I learnt something new (>.z convention) from your code!
Thank you for commenting and summarizing the issue succinctly, Bart!
-Vijay
@Vj_Govindarajan Many thanks for this fundamental topic ! A cross-boundary approach was - still is, in many ways - required to understand how SQL implementations differ for the same purpose. I came across the subject once or twice with Teradata and Snowflake when I was shown external SQL code which introduced the subject for me. A very seasoned team mate recently told me about it while refactoring SAS code. As mentioned, "Group By", "Remerging" are well-known expressions if and only if you know SAS already, since these concepts are peculiar to SAS and unintelligible for most other SQL users, whereas Window Functions are - by far in my opinion - more commonly known. Your article will therefore serve as reference on this subject and definitely help SAS as well as non SAS folks speak the same code, like a famous stone !
I have a small suggestion to make which might improve the readability of the text : could you systematically split up the SAS code inside the code blocks over multiple lines ? There are a couple of examples difficult to read on a single line.
Thank you Ronan for your wonderful comment and I fully agree with what you said. Its only easy if you know it! 🙂
Also thanks for pointing out how the SAS code is rendered. I will update that if it allows me to.
Cheers,
Vijay
Great article showcasing the power of SAS ...but: I feel given how many DB SQL flavours implemented windowing functions it would still be great to have them also in SAS SQL to ease conversion from one SQL flavour to another.
Thanks Patrick! I'm a bit torn on this myself since this is something you can code in SAS with a little bit of SAS coding knowledge, but have also heard strong customer sentiments on this subject.
@Vj_Govindarajan Thanks for your consideration and reformatting efforts : much easier to read now ! 🙂 I fully agree with @Patrick : a SAS implementation would be a nice addition.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Data Literacy is for all, even absolute beginners. Jump on board with this free e-learning and boost your career prospects.