BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
whs278
Quartz | Level 8

Is there a way to mimic a Postgre SQL Window Function

 

For example 

 

DATSET

JobType     Salary

Farmer       $50k

Farmer       $10k

Banker       $100k

Banker       $300k

 

Select JOBTYPE, SALARY, MAX(SALARY) over (PARTITION BY JOBTYPE)  AS MAXSALARY,

              AVG(SALARY) over (PARTITION BY JOBTYPE)  AS AVGSALARY

FROM DATASET;

 

NEW DATASET

 

JobType        Salary     MAXSALARY    AVGSALARY

Farmer            $50k           $50k                  $30k

Farmer           $10k            $50k                 $30k

Banker           $100k         $300k                 $200k

Banker          $300k         $300k                    $200k

 

 

I was thinking perhaps there was a way to sort by jobtype and then run a data step with a BY JOBTYPE statement;

 

Thank you for the help.  

 

1 ACCEPTED SOLUTION

Accepted Solutions
hashman
Ammonite | Level 13

@whs278: What @Reeza said.

However, the answer to the question whether it can be mimicked using the DATA step, the answer is "yes", provided that one is willing to (or can) do a modicum of procedural programming (as opposed to logical set programming done via SQL). Using the hash object, it is also possible to construct a step that does the job without relying on a sorted key order - just like SQL doesn't, either.

 

However, programming provisions must be made to account for the missing values of analytical variables (in your case, salary) properly, in order to prevent the averages from being corrupted. Note that below,  I've added a few extra records with missing values of salary to the sample data to test the null handling. It is done by skipping missing salary records on the aggregation (i.e. first) pass through the data using the WHERE clause. On the second pass, when the accumulated stats are merged back to the original data, this clause is skipped.

data have ;                                                                                                                             
  input jobtype $ salary :dollar. ;                                                                                                     
  cards ;                                                                                                                               
Farmer   $50,000                                                                                                                        
Farmer   $10,000                                                                                                                        
Banker  $100,000                                                                                                                        
Banker  $300,000                                                                                                                        
Idler          .                                                                                                                        
Idler          .                                                                                                                        
Clerk    $40,000                                                                                                                        
Clerk          .                                                                                                                        
;                                                                                                                                       
run ;                                                                                                                                   
                                                                                                                                        
data want (drop = sum N) ;                                                                                                              
  if _n_ = 1 then do ;                                                                                                                  
    dcl hash h () ;                                                                                                                     
    h.definekey ("jobtype") ;                                                                                                           
    h.definedata ("maxsalary", "sum", "N") ;                                                                                            
    h.definedone () ;                                                                                                                   
    do until (z) ;                                                                                                                      
      set have (where=(N(salary))) end = z ;                                                                                         
      if h.find() ne 0 then call missing (maxsalary, sum, N) ;                                                                          
      sum  = sum (sum, salary) ;                                                                                                        
      N    = sum (N, N (salary)) ;                                                                                                      
      maxsalary = max (maxsalary, salary) ;                                                                                             
      h.replace() ;                                                                                                                     
    end ;                                                                                                                               
  end ;                                                                                                                                 
  set have ;                                                                                                                            
  _iorc_ = h.find() ;                                                                                                                   
  avgsalary = divide (sum, N) ;                                                                                                         
run ;                                                  

Note the advantage of using the DIVIDE function compared to the division operator: If the divisor is missing or zero, the function doesn't generate errors and/or warnings but simply an appropriate missing value. That is, if the divisor is 0, it generates the special missing value .I (standing for +infinity) if the numerator is positive and .M (standing for -infinity) if it is negative; if the divisor is a standard missing value (aka "dot-blank"), a standard missing value is generated. (There're other variants listed in the documentation.)

 

Kind regards

Paul D. 

 

View solution in original post

8 REPLIES 8
Reeza
Super User

SAS does that almost automatically, no window function needed.

 

proc sql;
create table want as
select *, max(salary) as maxSalary, mean(salary) as meanSalary
from have
group by JobType;
quit;

@whs278 wrote:

Is there a way to mimic a Postgre SQL Window Function

 

For example 

 

DATSET

JobType     Salary

Farmer       $50k

Farmer       $10k

Banker       $100k

Banker       $300k

 

Select JOBTYPE, SALARY, MAX(SALARY) over (PARTITION BY JOBTYPE)  AS MAXSALARY,

              AVG(SALARY) over (PARTITION BY JOBTYPE)  AS AVGSALARY

FROM DATASET;

 

NEW DATASET

 

JobType        Salary     MAXSALARY    AVGSALARY

Farmer            $50k           $50k                  $30k

Farmer           $10k            $50k                 $30k

Banker           $100k         $300k                 $200k

Banker          $300k         $300k                    $200k

 

 

I was thinking perhaps there was a way to sort by jobtype and then run a data step with a BY JOBTYPE statement;

 

Thank you for the help.  

 


 

whs278
Quartz | Level 8

This definitely solves my earlier question.  Really simple and a good trick to know.  Thank you very much!

 

However, I also have a follow up question.

 

Is there a way to calculate a statistic based on a subgroup of the main groups and then assign that statistic for every member of the main group.  

 

For example, given a dataset....

 

DATA HAVE;
LENGTH JOBTYPE EDUCATION $16;
INPUT JOBTYPE $ EDUCATION $ SALARY;

DATALINES;

Farmer High_School 50000
Farmer High_School 30000
Farmer College 45000
Farmer College 65000
Banker High_School 100000
Banker High_School 300000
Banker College 250000
Banker College 450000
;
RUN;

 

CREATE THIS NEW DATASET....

 

 

JOBTYPE    EDUCATION          SALARY        AVG_SALARY_HS        MAX_SALARY_HS
Farmer         High_School          50000            40000                             50000
Farmer         High_School          30000            40000                             50000
Farmer         College                  45000            40000                             50000
Farmer         College                  65000            40000                             50000
Banker         High_School        100000           200000                          300000
Banker         High_School        300000           200000                          300000
Banker         College                250000           200000                          300000
Banker         College                450000           200000                          300000

Reeza
Super User
Not easily in SQL, but PROC MEANS allows you to control the different levels of classifications using TYPES/WAYS statements and have more functionality.
hashman
Ammonite | Level 13

@whs278:

Yes, it's easy. To account for a "subgroup", you only need to add that "subgroup" key to the key portion of the hash table - i.e. the portion whose hash variables are described by calling the DEFINEKEY method. In this particular case, this "subgroup" is the education variable, and thus: 

    h.definekey ("jobtype", "education") ;

Not a single other change is needed to be done to my code to work in this new case. In principle, it's no different from just adding another key to a GROUP BY clause in SQL. After another key is added, the hash object treats each distinct pair of  [jobtype,education] as a combined composite key-value and computes the aggregates accordingly. When they are merged back to the original data on the second pass (enacted by the second SET statement), the H.FIND() method looks up [jobtype,education] in the hash table for the values of both variables in the current record having just been read in. This is the default behavior since no keys are specified to FIND when it's called (the parentheses are empty), and so it auto-accepts the current PDV values of the variables defined in DEFINEDATA as the key-values to search for in the table..   

 

Kind regards

Paul D.

whs278
Quartz | Level 8
Thank you Paul for the follow up response.

However, I just want to clarify. Would your code calculate a separate average and maximum for each jobtype, education category? In other words, would it be equivalent to the following SQL code?

CREATE TABLE WANT AS
SELECT AVG(SALARY), MAX(SALARY)
FROM HAVE
GROUP BY JOBTYPE, EDUCATION

What I really want to do is calculate a maximum and average within each job type that only includes cases whose highest education level is High School. Then I want to assign those values to all the rows for that job type. So the AVG_SALARY_HS column would represent the average salary of cases whose highest education was a HS diploma grouped by category.

For example, the $40,000 in the farmer category is the average of HS-educated farmers. However, I want to assign this $40,000 to all members of the parent group (i.e. all farmers) regardless of their education level. The HS suffix in the category name indicates that these values are derived only from the high school education level.

It would basically be the same in SQL as first creating a table like this.

CREATE TABLE WANT AS
SELECT AVG(SALARY) AS AVG_SALARY_HS,
MAX(SALARY) AS MAX_SALARY_HS
FROM HAVE
WHERE EDUCATION = "High_School"
GROUP BY JOBTYPE

and then performing a one-to-many merge with the original dataset on the jobtype key only. In fact this what I did with my actual dataset. However, the dataset is very large, so I was wondering if using a data step would be more efficient.

Again thank you for your help. I apologize for asking so many questions. However, I am dealing with large datasets with a lot of 'nested' data, so this issue comes up a lot.
Tom
Super User Tom
Super User

You probably want to use CASE so you decide what values are fed to the aggregate function.

CREATE TABLE WANT AS
SELECT JOBTYPE
     , AVG(case when (EDUCATION = "High_School") then SALARY else . end) as AVG_SALARY
FROM HAVE
GROUP BY JOBTYPE
whs278
Quartz | Level 8

Thank you, Tom.  This gives the exact output that I want.

 

Still curious if it is possible to do the same thing using the Hash-Object Programming technique from Paul's code.  

hashman
Ammonite | Level 13

@whs278: What @Reeza said.

However, the answer to the question whether it can be mimicked using the DATA step, the answer is "yes", provided that one is willing to (or can) do a modicum of procedural programming (as opposed to logical set programming done via SQL). Using the hash object, it is also possible to construct a step that does the job without relying on a sorted key order - just like SQL doesn't, either.

 

However, programming provisions must be made to account for the missing values of analytical variables (in your case, salary) properly, in order to prevent the averages from being corrupted. Note that below,  I've added a few extra records with missing values of salary to the sample data to test the null handling. It is done by skipping missing salary records on the aggregation (i.e. first) pass through the data using the WHERE clause. On the second pass, when the accumulated stats are merged back to the original data, this clause is skipped.

data have ;                                                                                                                             
  input jobtype $ salary :dollar. ;                                                                                                     
  cards ;                                                                                                                               
Farmer   $50,000                                                                                                                        
Farmer   $10,000                                                                                                                        
Banker  $100,000                                                                                                                        
Banker  $300,000                                                                                                                        
Idler          .                                                                                                                        
Idler          .                                                                                                                        
Clerk    $40,000                                                                                                                        
Clerk          .                                                                                                                        
;                                                                                                                                       
run ;                                                                                                                                   
                                                                                                                                        
data want (drop = sum N) ;                                                                                                              
  if _n_ = 1 then do ;                                                                                                                  
    dcl hash h () ;                                                                                                                     
    h.definekey ("jobtype") ;                                                                                                           
    h.definedata ("maxsalary", "sum", "N") ;                                                                                            
    h.definedone () ;                                                                                                                   
    do until (z) ;                                                                                                                      
      set have (where=(N(salary))) end = z ;                                                                                         
      if h.find() ne 0 then call missing (maxsalary, sum, N) ;                                                                          
      sum  = sum (sum, salary) ;                                                                                                        
      N    = sum (N, N (salary)) ;                                                                                                      
      maxsalary = max (maxsalary, salary) ;                                                                                             
      h.replace() ;                                                                                                                     
    end ;                                                                                                                               
  end ;                                                                                                                                 
  set have ;                                                                                                                            
  _iorc_ = h.find() ;                                                                                                                   
  avgsalary = divide (sum, N) ;                                                                                                         
run ;                                                  

Note the advantage of using the DIVIDE function compared to the division operator: If the divisor is missing or zero, the function doesn't generate errors and/or warnings but simply an appropriate missing value. That is, if the divisor is 0, it generates the special missing value .I (standing for +infinity) if the numerator is positive and .M (standing for -infinity) if it is negative; if the divisor is a standard missing value (aka "dot-blank"), a standard missing value is generated. (There're other variants listed in the documentation.)

 

Kind regards

Paul D. 

 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 1880 views
  • 4 likes
  • 4 in conversation