Hi im calculating average using data step method but unable to do so.
data :
data a;
input product$ sales day$9. week;
cards;
a 100 01-jan-19 1
a 20 01-jan-19 1
b 200 02-jan-19 1
h 100 09-jan-19 2
i 300 09-jan-19 2
;
run;
however i can do it using proc means:
proc means data=a nway;
var sales;
class product;
output out=s mean=avg;
;run;
Results:
product _type_ _freq_ avg
a 1 2 60
b 1 1 200
h 1 1 100
i 1 1 300
I want to get this output using data step way:
proc sort data=a ; by product week; run;
data want;
set a;
by product week;
if first.product then sumsales=0;
sumsales+sales;
if last.product;
avg=mean(sumsales);
run;
the mean function is not working and returning the exact values of sumsales column
Plz help!!
Like. But to emulate the mean stat to a tee, i.e. ignore the missing values:
numsales + N(sales) ;
is more precise than:
numsales+1;
Also, I find that tasks of this sort yield better to the DoW-loop, e.g.:
data want ;
do until (last.product) ;
set a ;
by product ;
numsales = sum (numsales, N(sales)) ;
sumsales = sum (sumsales, sales) ;
end ;
mean = divide (sumsales, numsales) ;
run ;
because it (a) uses the implicit DATA step actions to reinitialize the aggregates before each BY group and (b) uses one tests for LAST.product instead of two for FIRST.product and LAST.product. The implicit initialization of NUMSALES to missing at the top of the step also guards against the division by 0 in case when all SALES in a BY group are missing.
Kind regards
Paul D.
The MEAN function averages variables, not rows. This should do it:
data want;
set a;
by product week;
if first.product then do;
sumsales=0;
numsales = 0;
end;
sumsales+sales;
numsales+1;
if last.product then avg=sumsales/numsales;
run;
Like. But to emulate the mean stat to a tee, i.e. ignore the missing values:
numsales + N(sales) ;
is more precise than:
numsales+1;
Also, I find that tasks of this sort yield better to the DoW-loop, e.g.:
data want ;
do until (last.product) ;
set a ;
by product ;
numsales = sum (numsales, N(sales)) ;
sumsales = sum (sumsales, sales) ;
end ;
mean = divide (sumsales, numsales) ;
run ;
because it (a) uses the implicit DATA step actions to reinitialize the aggregates before each BY group and (b) uses one tests for LAST.product instead of two for FIRST.product and LAST.product. The implicit initialization of NUMSALES to missing at the top of the step also guards against the division by 0 in case when all SALES in a BY group are missing.
Kind regards
Paul D.
@hashman - Agreed, but I'd still use PROC MEANS or SQL unless there was some extra requirement justifying doing it on-the-fly in a DATA step.
'Nuff said! especially since with those, one doesn't have to have missing values explicitly.
Kind regards
Paul D.
data a;
input product$ sales day$9. week;
cards;
a 100 01-jan-19 1
a 20 01-jan-19 1
b 200 02-jan-19 1
h 100 09-jan-19 2
i 300 09-jan-19 2
;
run;
data want;
do _n_=1 by 1 until(last.week);
set a;
by product week;
sum=sum(sales,sum);
end;
avg=sum/_n_;
run;
proc print noobs;run;
although all 3 answers are giving desired output but @hashman code is simple and giving only 1 row per product .
Thanks for the help.
@novinosrin's code is basically the same (sans handling the missing values), and it also gives 1 row per product.
@SASKiwi code will also output 1 row per product if you add:
if last.product ;
before the RUN statement.
Kind regards
Paul D.
Strongly agree with the above: do it in PROC MEANS or PROC SUMMARY instead of writing your own data step code. As we have seen many times in this forum, writing your own code can easily produce wrong results in the presence of missing values (although @hashman has avoided this issue, the answers by @novinosrin and @SASKiwi do not handle missing values properly). PROC MEANS and PROC SUMMARY handle missing values properly, all the time, 24/7. Ice Ice Baby.
As @SASKiwi pointed out, you cannot apply aggregate functions like MEAN across rows in the DATA step. The environment where they naturally work across rows is SQL, so If you want to do that, use it instead; for example:
data have ;
input product :$1. sales ;
cards;
A 100
A .
A 20
B 200
B 300
B .
B 500
H 100
H 200
I 300
;
run ;
proc sql ;
create table want_sql as
select product
, N (sales) as N
, sum (sales) as sum
, mean (sales) as mean
from have group product
;
quit ;
With respect to the aggregate functions, SQL handles missing values as properly as proc MEANS.
Kind regards
Paul D.
Guru @hashman Neat. Would you rank SQL approach slightly higher than datastep if you had SQL pass through ACCESS and also make it the gold standard? The reason being the performance is prolly not compromised either?
Trust me for people like you and @SASKiwi the question might seem obvious having worked and seen tons but for the benefit of the folks(like me 🙂 ) who think they know everything yet only know to use SAS like a playstation/Xbox console and know nothing beyond a few notes will help.
PS Only when you have a moment and when you can. Cheers!
Also I wanted to reach out offline for something else if you remember. I shall touch base on that later. Bye for now!
Disciple@novinosrin:
Naturally, SQL is the choice when getting data from external data bases and/or aggregating it. Normally I'd use explicit pass-through, though I've seen cases when the SAS engine working via implicit pass-through was smarter than me in terms of translating my SAS SQL into the data base specific SQL. I think it's also smart enough to translate simple DATA step logic as well. However, I'm leery of setting any "gold standards" when it comes to programming - there're too many confounding factors and specific circumstances.
Kind regards
Paul D.
p.s. Yes, I do remember; thanks for reminding me.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.