Solved: Mimic SQL Window Function in Data Step (MAX, AVG)

whs278 · Posted 11-05-2019 11:46 AM

Is there a way to mimic a Postgre SQL Window Function

For example

DATSET

JobType Salary

Farmer $50k

Farmer $10k

Banker $100k

Banker $300k

Select JOBTYPE, SALARY, MAX(SALARY) over (PARTITION BY JOBTYPE) AS MAXSALARY,

AVG(SALARY) over (PARTITION BY JOBTYPE) AS AVGSALARY

FROM DATASET;

NEW DATASET

JobType Salary MAXSALARY AVGSALARY

Farmer $50k $50k $30k

Farmer $10k $50k $30k

Banker $100k $300k $200k

Banker $300k $300k $200k

I was thinking perhaps there was a way to sort by jobtype and then run a data step with a BY JOBTYPE statement;

Thank you for the help.

hashman · Posted 11-05-2019 04:09 PM

@whs278: What @Reeza said.

However, the answer to the question whether it can be mimicked using the DATA step, the answer is "yes", provided that one is willing to (or can) do a modicum of procedural programming (as opposed to logical set programming done via SQL). Using the hash object, it is also possible to construct a step that does the job without relying on a sorted key order - just like SQL doesn't, either.

However, programming provisions must be made to account for the missing values of analytical variables (in your case, salary) properly, in order to prevent the averages from being corrupted. Note that below, I've added a few extra records with missing values of salary to the sample data to test the null handling. It is done by skipping missing salary records on the aggregation (i.e. first) pass through the data using the WHERE clause. On the second pass, when the accumulated stats are merged back to the original data, this clause is skipped.

data have ;                                                                                                                             
  input jobtype $ salary :dollar. ;                                                                                                     
  cards ;                                                                                                                               
Farmer   $50,000                                                                                                                        
Farmer   $10,000                                                                                                                        
Banker  $100,000                                                                                                                        
Banker  $300,000                                                                                                                        
Idler          .                                                                                                                        
Idler          .                                                                                                                        
Clerk    $40,000                                                                                                                        
Clerk          .                                                                                                                        
;                                                                                                                                       
run ;                                                                                                                                   
                                                                                                                                        
data want (drop = sum N) ;                                                                                                              
  if _n_ = 1 then do ;                                                                                                                  
    dcl hash h () ;                                                                                                                     
    h.definekey ("jobtype") ;                                                                                                           
    h.definedata ("maxsalary", "sum", "N") ;                                                                                            
    h.definedone () ;                                                                                                                   
    do until (z) ;                                                                                                                      
      set have (where=(N(salary))) end = z ;                                                                                         
      if h.find() ne 0 then call missing (maxsalary, sum, N) ;                                                                          
      sum  = sum (sum, salary) ;                                                                                                        
      N    = sum (N, N (salary)) ;                                                                                                      
      maxsalary = max (maxsalary, salary) ;                                                                                             
      h.replace() ;                                                                                                                     
    end ;                                                                                                                               
  end ;                                                                                                                                 
  set have ;                                                                                                                            
  _iorc_ = h.find() ;                                                                                                                   
  avgsalary = divide (sum, N) ;                                                                                                         
run ;

Note the advantage of using the DIVIDE function compared to the division operator: If the divisor is missing or zero, the function doesn't generate errors and/or warnings but simply an appropriate missing value. That is, if the divisor is 0, it generates the special missing value .I (standing for +infinity) if the numerator is positive and .M (standing for -infinity) if it is negative; if the divisor is a standard missing value (aka "dot-blank"), a standard missing value is generated. (There're other variants listed in the documentation.)

Kind regards

Paul D.

View solution in original post

Reeza · Posted 11-05-2019 12:05 PM

SAS does that almost automatically, no window function needed.

proc sql;
create table want as
select *, max(salary) as maxSalary, mean(salary) as meanSalary
from have
group by JobType;
quit;

@whs278 wrote:

Is there a way to mimic a Postgre SQL Window Function

For example

DATSET

JobType Salary

Farmer $50k

Farmer $10k

Banker $100k

Banker $300k

Select JOBTYPE, SALARY, MAX(SALARY) over (PARTITION BY JOBTYPE) AS MAXSALARY,

AVG(SALARY) over (PARTITION BY JOBTYPE) AS AVGSALARY

FROM DATASET;

NEW DATASET

JobType Salary MAXSALARY AVGSALARY

Farmer   $50k $50k $30k

Farmer $10k $50k $30k

Banker $100k $300k $200k

Banker     $300k $300k    $200k

I was thinking perhaps there was a way to sort by jobtype and then run a data step with a BY JOBTYPE statement;

Thank you for the help.

whs278 · Posted 11-05-2019 04:14 PM

This definitely solves my earlier question. Really simple and a good trick to know. Thank you very much!

However, I also have a follow up question.

Is there a way to calculate a statistic based on a subgroup of the main groups and then assign that statistic for every member of the main group.

For example, given a dataset....

DATA HAVE;
LENGTH JOBTYPE EDUCATION $16;
INPUT JOBTYPE $ EDUCATION $ SALARY;

DATALINES;

Farmer High_School 50000
Farmer High_School 30000
Farmer College 45000
Farmer College 65000
Banker High_School 100000
Banker High_School 300000
Banker College 250000
Banker College 450000
;
RUN;

CREATE THIS NEW DATASET....

JOBTYPE EDUCATION SALARY AVG_SALARY_HS MAX_SALARY_HS
Farmer High_School 50000 40000 50000
Farmer High_School 30000 40000 50000
Farmer College 45000 40000 50000
Farmer College 65000 40000 50000
Banker High_School 100000 200000 300000
Banker High_School 300000 200000 300000
Banker College 250000 200000 300000
Banker College 450000 200000 300000

Reeza · Posted 11-05-2019 04:44 PM

Not easily in SQL, but PROC MEANS allows you to control the different levels of classifications using TYPES/WAYS statements and have more functionality.

hashman · Posted 11-05-2019 04:48 PM

@whs278:

Yes, it's easy. To account for a "subgroup", you only need to add that "subgroup" key to the key portion of the hash table - i.e. the portion whose hash variables are described by calling the DEFINEKEY method. In this particular case, this "subgroup" is the education variable, and thus:

    h.definekey ("jobtype", "education") ;

Not a single other change is needed to be done to my code to work in this new case. In principle, it's no different from just adding another key to a GROUP BY clause in SQL. After another key is added, the hash object treats each distinct pair of [jobtype,education] as a combined composite key-value and computes the aggregates accordingly. When they are merged back to the original data on the second pass (enacted by the second SET statement), the H.FIND() method looks up [jobtype,education] in the hash table for the values of both variables in the current record having just been read in. This is the default behavior since no keys are specified to FIND when it's called (the parentheses are empty), and so it auto-accepts the current PDV values of the variables defined in DEFINEDATA as the key-values to search for in the table..

Kind regards

Paul D.

whs278 · Posted 11-07-2019 09:53 AM

Thank you Paul for the follow up response.

However, I just want to clarify. Would your code calculate a separate average and maximum for each jobtype, education category? In other words, would it be equivalent to the following SQL code?

CREATE TABLE WANT AS
SELECT AVG(SALARY), MAX(SALARY)
FROM HAVE
GROUP BY JOBTYPE, EDUCATION

What I really want to do is calculate a maximum and average within each job type that only includes cases whose highest education level is High School. Then I want to assign those values to all the rows for that job type. So the AVG_SALARY_HS column would represent the average salary of cases whose highest education was a HS diploma grouped by category.

For example, the $40,000 in the farmer category is the average of HS-educated farmers. However, I want to assign this $40,000 to all members of the parent group (i.e. all farmers) regardless of their education level. The HS suffix in the category name indicates that these values are derived only from the high school education level.

It would basically be the same in SQL as first creating a table like this.

CREATE TABLE WANT AS
SELECT AVG(SALARY) AS AVG_SALARY_HS,
MAX(SALARY) AS MAX_SALARY_HS
FROM HAVE
WHERE EDUCATION = "High_School"
GROUP BY JOBTYPE

and then performing a one-to-many merge with the original dataset on the jobtype key only. In fact this what I did with my actual dataset. However, the dataset is very large, so I was wondering if using a data step would be more efficient.

Again thank you for your help. I apologize for asking so many questions. However, I am dealing with large datasets with a lot of 'nested' data, so this issue comes up a lot.

Tom · Posted 11-07-2019 10:34 AM

You probably want to use CASE so you decide what values are fed to the aggregate function.

CREATE TABLE WANT AS
SELECT JOBTYPE
     , AVG(case when (EDUCATION = "High_School") then SALARY else . end) as AVG_SALARY
FROM HAVE
GROUP BY JOBTYPE

whs278 · Posted 11-07-2019 03:06 PM

Thank you, Tom. This gives the exact output that I want.

Still curious if it is possible to do the same thing using the Hash-Object Programming technique from Paul's code.

hashman · Posted 11-05-2019 04:09 PM

@whs278: What @Reeza said.

However, the answer to the question whether it can be mimicked using the DATA step, the answer is "yes", provided that one is willing to (or can) do a modicum of procedural programming (as opposed to logical set programming done via SQL). Using the hash object, it is also possible to construct a step that does the job without relying on a sorted key order - just like SQL doesn't, either.

However, programming provisions must be made to account for the missing values of analytical variables (in your case, salary) properly, in order to prevent the averages from being corrupted. Note that below, I've added a few extra records with missing values of salary to the sample data to test the null handling. It is done by skipping missing salary records on the aggregation (i.e. first) pass through the data using the WHERE clause. On the second pass, when the accumulated stats are merged back to the original data, this clause is skipped.

data have ;                                                                                                                             
  input jobtype $ salary :dollar. ;                                                                                                     
  cards ;                                                                                                                               
Farmer   $50,000                                                                                                                        
Farmer   $10,000                                                                                                                        
Banker  $100,000                                                                                                                        
Banker  $300,000                                                                                                                        
Idler          .                                                                                                                        
Idler          .                                                                                                                        
Clerk    $40,000                                                                                                                        
Clerk          .                                                                                                                        
;                                                                                                                                       
run ;                                                                                                                                   
                                                                                                                                        
data want (drop = sum N) ;                                                                                                              
  if _n_ = 1 then do ;                                                                                                                  
    dcl hash h () ;                                                                                                                     
    h.definekey ("jobtype") ;                                                                                                           
    h.definedata ("maxsalary", "sum", "N") ;                                                                                            
    h.definedone () ;                                                                                                                   
    do until (z) ;                                                                                                                      
      set have (where=(N(salary))) end = z ;                                                                                         
      if h.find() ne 0 then call missing (maxsalary, sum, N) ;                                                                          
      sum  = sum (sum, salary) ;                                                                                                        
      N    = sum (N, N (salary)) ;                                                                                                      
      maxsalary = max (maxsalary, salary) ;                                                                                             
      h.replace() ;                                                                                                                     
    end ;                                                                                                                               
  end ;                                                                                                                                 
  set have ;                                                                                                                            
  _iorc_ = h.find() ;                                                                                                                   
  avgsalary = divide (sum, N) ;                                                                                                         
run ;

Note the advantage of using the DIVIDE function compared to the division operator: If the divisor is missing or zero, the function doesn't generate errors and/or warnings but simply an appropriate missing value. That is, if the divisor is 0, it generates the special missing value .I (standing for +infinity) if the numerator is positive and .M (standing for -infinity) if it is negative; if the divisor is a standard missing value (aka "dot-blank"), a standard missing value is generated. (There're other variants listed in the documentation.)

Kind regards

Paul D.

Mimic SQL Window Function in Data Step (MAX, AVG)

Re: Mimic SQL Window Function in Data Step (MAX, AVG)

Re: Mimic SQL Window Function in Data Step (MAX, AVG)

Re: Mimic SQL Window Function in Data Step (MAX, AVG)

Re: Mimic SQL Window Function in Data Step (MAX, AVG)

Re: Mimic SQL Window Function in Data Step (MAX, AVG)

Re: Mimic SQL Window Function in Data Step (MAX, AVG)

Re: Mimic SQL Window Function in Data Step (MAX, AVG)

Re: Mimic SQL Window Function in Data Step (MAX, AVG)

Re: Mimic SQL Window Function in Data Step (MAX, AVG)

Classroom Training Available!