Is there a way to mimic a Postgre SQL Window Function
For example
DATSET
JobType Salary
Farmer $50k
Farmer $10k
Banker $100k
Banker $300k
Select JOBTYPE, SALARY, MAX(SALARY) over (PARTITION BY JOBTYPE) AS MAXSALARY,
AVG(SALARY) over (PARTITION BY JOBTYPE) AS AVGSALARY
FROM DATASET;
NEW DATASET
JobType Salary MAXSALARY AVGSALARY
Farmer $50k $50k $30k
Farmer $10k $50k $30k
Banker $100k $300k $200k
Banker $300k $300k $200k
I was thinking perhaps there was a way to sort by jobtype and then run a data step with a BY JOBTYPE statement;
Thank you for the help.
However, the answer to the question whether it can be mimicked using the DATA step, the answer is "yes", provided that one is willing to (or can) do a modicum of procedural programming (as opposed to logical set programming done via SQL). Using the hash object, it is also possible to construct a step that does the job without relying on a sorted key order - just like SQL doesn't, either.
However, programming provisions must be made to account for the missing values of analytical variables (in your case, salary) properly, in order to prevent the averages from being corrupted. Note that below, I've added a few extra records with missing values of salary to the sample data to test the null handling. It is done by skipping missing salary records on the aggregation (i.e. first) pass through the data using the WHERE clause. On the second pass, when the accumulated stats are merged back to the original data, this clause is skipped.
data have ; input jobtype $ salary :dollar. ; cards ; Farmer $50,000 Farmer $10,000 Banker $100,000 Banker $300,000 Idler . Idler . Clerk $40,000 Clerk . ; run ; data want (drop = sum N) ; if _n_ = 1 then do ; dcl hash h () ; h.definekey ("jobtype") ; h.definedata ("maxsalary", "sum", "N") ; h.definedone () ; do until (z) ; set have (where=(N(salary))) end = z ; if h.find() ne 0 then call missing (maxsalary, sum, N) ; sum = sum (sum, salary) ; N = sum (N, N (salary)) ; maxsalary = max (maxsalary, salary) ; h.replace() ; end ; end ; set have ; _iorc_ = h.find() ; avgsalary = divide (sum, N) ; run ;
Note the advantage of using the DIVIDE function compared to the division operator: If the divisor is missing or zero, the function doesn't generate errors and/or warnings but simply an appropriate missing value. That is, if the divisor is 0, it generates the special missing value .I (standing for +infinity) if the numerator is positive and .M (standing for -infinity) if it is negative; if the divisor is a standard missing value (aka "dot-blank"), a standard missing value is generated. (There're other variants listed in the documentation.)
Kind regards
Paul D.
SAS does that almost automatically, no window function needed.
proc sql;
create table want as
select *, max(salary) as maxSalary, mean(salary) as meanSalary
from have
group by JobType;
quit;
@whs278 wrote:
Is there a way to mimic a Postgre SQL Window Function
For example
DATSET
JobType Salary
Farmer $50k
Farmer $10k
Banker $100k
Banker $300k
Select JOBTYPE, SALARY, MAX(SALARY) over (PARTITION BY JOBTYPE) AS MAXSALARY,
AVG(SALARY) over (PARTITION BY JOBTYPE) AS AVGSALARY
FROM DATASET;
NEW DATASET
JobType Salary MAXSALARY AVGSALARY
Farmer $50k $50k $30k
Farmer $10k $50k $30k
Banker $100k $300k $200k
Banker $300k $300k $200k
I was thinking perhaps there was a way to sort by jobtype and then run a data step with a BY JOBTYPE statement;
Thank you for the help.
This definitely solves my earlier question. Really simple and a good trick to know. Thank you very much!
However, I also have a follow up question.
Is there a way to calculate a statistic based on a subgroup of the main groups and then assign that statistic for every member of the main group.
For example, given a dataset....
DATA HAVE;
LENGTH JOBTYPE EDUCATION $16;
INPUT JOBTYPE $ EDUCATION $ SALARY;
DATALINES;
Farmer High_School 50000
Farmer High_School 30000
Farmer College 45000
Farmer College 65000
Banker High_School 100000
Banker High_School 300000
Banker College 250000
Banker College 450000
;
RUN;
CREATE THIS NEW DATASET....
JOBTYPE EDUCATION SALARY AVG_SALARY_HS MAX_SALARY_HS
Farmer High_School 50000 40000 50000
Farmer High_School 30000 40000 50000
Farmer College 45000 40000 50000
Farmer College 65000 40000 50000
Banker High_School 100000 200000 300000
Banker High_School 300000 200000 300000
Banker College 250000 200000 300000
Banker College 450000 200000 300000
Yes, it's easy. To account for a "subgroup", you only need to add that "subgroup" key to the key portion of the hash table - i.e. the portion whose hash variables are described by calling the DEFINEKEY method. In this particular case, this "subgroup" is the education variable, and thus:
h.definekey ("jobtype", "education") ;
Not a single other change is needed to be done to my code to work in this new case. In principle, it's no different from just adding another key to a GROUP BY clause in SQL. After another key is added, the hash object treats each distinct pair of [jobtype,education] as a combined composite key-value and computes the aggregates accordingly. When they are merged back to the original data on the second pass (enacted by the second SET statement), the H.FIND() method looks up [jobtype,education] in the hash table for the values of both variables in the current record having just been read in. This is the default behavior since no keys are specified to FIND when it's called (the parentheses are empty), and so it auto-accepts the current PDV values of the variables defined in DEFINEDATA as the key-values to search for in the table..
Kind regards
Paul D.
You probably want to use CASE so you decide what values are fed to the aggregate function.
CREATE TABLE WANT AS
SELECT JOBTYPE
, AVG(case when (EDUCATION = "High_School") then SALARY else . end) as AVG_SALARY
FROM HAVE
GROUP BY JOBTYPE
Thank you, Tom. This gives the exact output that I want.
Still curious if it is possible to do the same thing using the Hash-Object Programming technique from Paul's code.
However, the answer to the question whether it can be mimicked using the DATA step, the answer is "yes", provided that one is willing to (or can) do a modicum of procedural programming (as opposed to logical set programming done via SQL). Using the hash object, it is also possible to construct a step that does the job without relying on a sorted key order - just like SQL doesn't, either.
However, programming provisions must be made to account for the missing values of analytical variables (in your case, salary) properly, in order to prevent the averages from being corrupted. Note that below, I've added a few extra records with missing values of salary to the sample data to test the null handling. It is done by skipping missing salary records on the aggregation (i.e. first) pass through the data using the WHERE clause. On the second pass, when the accumulated stats are merged back to the original data, this clause is skipped.
data have ; input jobtype $ salary :dollar. ; cards ; Farmer $50,000 Farmer $10,000 Banker $100,000 Banker $300,000 Idler . Idler . Clerk $40,000 Clerk . ; run ; data want (drop = sum N) ; if _n_ = 1 then do ; dcl hash h () ; h.definekey ("jobtype") ; h.definedata ("maxsalary", "sum", "N") ; h.definedone () ; do until (z) ; set have (where=(N(salary))) end = z ; if h.find() ne 0 then call missing (maxsalary, sum, N) ; sum = sum (sum, salary) ; N = sum (N, N (salary)) ; maxsalary = max (maxsalary, salary) ; h.replace() ; end ; end ; set have ; _iorc_ = h.find() ; avgsalary = divide (sum, N) ; run ;
Note the advantage of using the DIVIDE function compared to the division operator: If the divisor is missing or zero, the function doesn't generate errors and/or warnings but simply an appropriate missing value. That is, if the divisor is 0, it generates the special missing value .I (standing for +infinity) if the numerator is positive and .M (standing for -infinity) if it is negative; if the divisor is a standard missing value (aka "dot-blank"), a standard missing value is generated. (There're other variants listed in the documentation.)
Kind regards
Paul D.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.