Hey everyone. I'm having an issue with the sum() function in SAS. I'd like to know why the following two codes result in different output:
PROC SQL NOPRINT;
CREATE TABLE cap1 AS
SELECT var1 ,
var2 ,
var3 ,
sum(var4) as sum_var4 ,
sum(var5) as sum_var5,
sum(var6) as sum_var6
FROM work.table1
WHERE var7 > 0
GROUP BY var1, var2, var3;
QUIT;
DATA cap;
SET work.table1(KEEP=var1 var2 var3 var4 var5 var6 var7);
BY var1 var2 var3;
WHERE var7 > 0;
sum_var4 = sum(var4);
sum_var5 = sum(var5);
sum_var6 = sum(var6);
RUN;
Not only would I like to know why it's different, but also which changes need to be made to the DATA step in order to duplicate the SQL output.
Many thanks!
There are two SUM() functions. The normal SAS function will sum one or more values within a single obsversion: s1=sum(a,b,c). The SQL aggregation function SUM() will sum one variable (or expression) across multiple observations.
To summarize across observations without using PROC SQl you should use PROC SUMMARY. Use the CLASS statement to replicate your GROUP BY clause.
proc summary data=table1 nway ;
where var7> 0;
class var1 var2 var3 ;
var var4 var5 var6;
outout out=cap1 sum=sum_var4 sum_var5 sum_var6 ;
run;
To sum across observations in a DATA step you can use the sum statement (varname + expression; ) rather than the SUM() function. Note that the sum statement will automatically retain the values of the target variable across iterations of the data step.
To recreate your GROUP BY clause you will need to sort by your grouping variables first and use BY and FIRST./LAST. processing.
proc sort data=table1 out=cap1 ;
by var1 - var3;
run;
data cap1 ;
set cap1 ;
by var1 - var3 ;
if first.var3 then call missing(sum_var4,sum_var5,sum_var6);
sum_var4 + var4;
sum_var5 + var5;
sum_var6 + var6 ;
if last.var3 ;
run;
sum in a datastep goes over a ROW
sum in SQL is for a COLUMN
So very different results.
How to replicate it? Don't. Use proc means or summary or some other logic instead.
If you HAVE to then you need to retain it and add it as you go along.
Untested:
DATA cap;
SET work.table1(KEEP=var1 var2 var3 var4 var5 var6 var7);
BY var1 var2 var3;
WHERE var7 > 0;
retain sum_var4 sum_var5 sum_var6;
if first.var3 then do;
sum_var4=0; sum_var5=0; sum_var6=0;
end;
sum_var4 = sum(var4, sum_var4);
sum_var5 = sum(var5, sum_var5);
sum_var6 = sum(var6, sum_var6);
RUN;
proc means data=have1 noprint;
by var1 var2 var3;
output out=cap sum(var4)=sum_var4 sum(var5)=sum_var5 sum(var6)=sum_var6;
run;
There are two SUM() functions. The normal SAS function will sum one or more values within a single obsversion: s1=sum(a,b,c). The SQL aggregation function SUM() will sum one variable (or expression) across multiple observations.
To summarize across observations without using PROC SQl you should use PROC SUMMARY. Use the CLASS statement to replicate your GROUP BY clause.
proc summary data=table1 nway ;
where var7> 0;
class var1 var2 var3 ;
var var4 var5 var6;
outout out=cap1 sum=sum_var4 sum_var5 sum_var6 ;
run;
To sum across observations in a DATA step you can use the sum statement (varname + expression; ) rather than the SUM() function. Note that the sum statement will automatically retain the values of the target variable across iterations of the data step.
To recreate your GROUP BY clause you will need to sort by your grouping variables first and use BY and FIRST./LAST. processing.
proc sort data=table1 out=cap1 ;
by var1 - var3;
run;
data cap1 ;
set cap1 ;
by var1 - var3 ;
if first.var3 then call missing(sum_var4,sum_var5,sum_var6);
sum_var4 + var4;
sum_var5 + var5;
sum_var6 + var6 ;
if last.var3 ;
run;
Thanks a lot guys! Both helpful! Wish I could mark both as correct.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.