DATA Step, Macro, Functions and more

SUM() in SQL vs. Data step

Accepted Solution Solved
Reply
Contributor
Posts: 33
Accepted Solution

SUM() in SQL vs. Data step

Hey everyone. I'm having an issue with the sum() function in SAS. I'd like to know why the following two codes result in different output:

PROC SQL NOPRINT;

  CREATE TABLE cap1 AS

    SELECT var1                                                ,

           var2                                            ,

           var3                                             ,

           sum(var4) as sum_var4 ,

           sum(var5)  as sum_var5,

           sum(var6)    as sum_var6

    FROM   work.table1

    WHERE  var7 > 0

  GROUP BY var1, var2, var3;

QUIT;

DATA cap;

  SET work.table1(KEEP=var1 var2 var3 var4 var5 var6 var7);

  BY  var1 var2 var3;

  WHERE var7 > 0;

  sum_var4 = sum(var4);

  sum_var5 = sum(var5);

  sum_var6 = sum(var6);

RUN;

Not only would I like to know why it's different, but also which changes need to be made to the DATA step in order to duplicate the SQL output.

Many thanks!


Accepted Solutions
Solution
‎11-13-2012 10:59 AM
Super User
Super User
Posts: 7,076

Re: SUM() in SQL vs. Data step

There are two SUM() functions. The normal SAS function will sum one or more values within a single obsversion:   s1=sum(a,b,c).  The SQL aggregation function SUM() will sum one variable (or expression) across multiple observations.

To summarize across observations without using PROC SQl you should use PROC SUMMARY.  Use the CLASS statement to replicate your GROUP BY clause.

proc summary data=table1 nway ;

  where var7> 0;

  class var1 var2 var3 ;

  var var4 var5 var6;

  outout out=cap1 sum=sum_var4 sum_var5 sum_var6 ;

run;

To sum across observations in a DATA step you can use the sum statement (varname + expression; ) rather than the SUM() function.  Note that the sum statement will automatically retain the values of the target variable across iterations of the data step.

To recreate your GROUP BY clause you will need to sort by your grouping variables first and use BY and FIRST./LAST. processing.

proc sort data=table1 out=cap1 ;

  by var1 - var3;

run;

data cap1 ;

  set cap1 ;

  by var1 - var3 ;

  if first.var3 then call missing(sum_var4,sum_var5,sum_var6);

  sum_var4 + var4;

  sum_var5 + var5;

  sum_var6 + var6 ;

  if last.var3 ;

run;

View solution in original post


All Replies
Super User
Posts: 19,869

Re: SUM() in SQL vs. Data step

sum in a datastep goes over a ROW

sum in SQL is for a COLUMN

So very different results.

How to replicate it? Don't. Use proc means or summary or some other logic instead.

If you HAVE to then you need to retain it and add it as you go along.

Untested:

DATA cap;

  SET work.table1(KEEP=var1 var2 var3 var4 var5 var6 var7);

  BY  var1 var2 var3;

  WHERE var7 > 0;

retain sum_var4 sum_var5 sum_var6;

if first.var3 then do;

     sum_var4=0; sum_var5=0; sum_var6=0;

end;

  sum_var4 = sum(var4, sum_var4);

  sum_var5 = sum(var5, sum_var5);

  sum_var6 = sum(var6, sum_var6);

RUN;

proc means data=have1 noprint;

by var1 var2 var3;

output out=cap sum(var4)=sum_var4 sum(var5)=sum_var5 sum(var6)=sum_var6;

run;

Solution
‎11-13-2012 10:59 AM
Super User
Super User
Posts: 7,076

Re: SUM() in SQL vs. Data step

There are two SUM() functions. The normal SAS function will sum one or more values within a single obsversion:   s1=sum(a,b,c).  The SQL aggregation function SUM() will sum one variable (or expression) across multiple observations.

To summarize across observations without using PROC SQl you should use PROC SUMMARY.  Use the CLASS statement to replicate your GROUP BY clause.

proc summary data=table1 nway ;

  where var7> 0;

  class var1 var2 var3 ;

  var var4 var5 var6;

  outout out=cap1 sum=sum_var4 sum_var5 sum_var6 ;

run;

To sum across observations in a DATA step you can use the sum statement (varname + expression; ) rather than the SUM() function.  Note that the sum statement will automatically retain the values of the target variable across iterations of the data step.

To recreate your GROUP BY clause you will need to sort by your grouping variables first and use BY and FIRST./LAST. processing.

proc sort data=table1 out=cap1 ;

  by var1 - var3;

run;

data cap1 ;

  set cap1 ;

  by var1 - var3 ;

  if first.var3 then call missing(sum_var4,sum_var5,sum_var6);

  sum_var4 + var4;

  sum_var5 + var5;

  sum_var6 + var6 ;

  if last.var3 ;

run;

Contributor
Posts: 33

Re: SUM() in SQL vs. Data step

Thanks a lot guys! Both helpful! Wish I could mark both as correct.

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 3 replies
  • 14751 views
  • 3 likes
  • 3 in conversation