Solved: Re: calculate average in data step

adi121 · Posted 09-28-2019 05:54 PM

Hi im calculating average using data step method but unable to do so.

data :

data a;
input product$ sales day$9. week;
cards;
a 100 01-jan-19 1
a 20 01-jan-19 1
b 200 02-jan-19 1
h 100 09-jan-19 2
i 300 09-jan-19 2
;
run;

however i can do it using proc means:

proc means data=a nway;
var sales;
class product;
output out=s mean=avg;
;run;
Results:
product _type_ _freq_ avg
a 1 2 60
b 1 1 200
h 1 1 100
i 1 1 300

I want to get this output using data step way:

proc sort data=a ; by product week; run;
data want;
set a;
by product week;
if first.product then sumsales=0;
sumsales+sales;

if last.product;
avg=mean(sumsales);
run;

the mean function is not working and returning the exact values of sumsales column

Plz help!!

hashman · Posted 09-28-2019 06:39 PM

@SASKiwi:

Like. But to emulate the mean stat to a tee, i.e. ignore the missing values:

numsales + N(sales) ;

is more precise than:

numsales+1;

Also, I find that tasks of this sort yield better to the DoW-loop, e.g.:

data want ;                              
  do until (last.product) ;              
    set a ;                              
    by product ;                         
    numsales = sum (numsales, N(sales)) ;
    sumsales = sum (sumsales, sales) ;   
  end ;                                  
  mean = divide (sumsales, numsales) ;   
run ;

because it (a) uses the implicit DATA step actions to reinitialize the aggregates before each BY group and (b) uses one tests for LAST.product instead of two for FIRST.product and LAST.product. The implicit initialization of NUMSALES to missing at the top of the step also guards against the division by 0 in case when all SALES in a BY group are missing.

Kind regards

Paul D.

View solution in original post

SASKiwi · Posted 09-28-2019 06:01 PM

The MEAN function averages variables, not rows. This should do it:

data want;
set a;
by product week;
if first.product then do;
  sumsales=0;
  numsales = 0;
end;
sumsales+sales;
numsales+1;
if last.product then avg=sumsales/numsales;
run;

hashman · Posted 09-28-2019 06:39 PM

@SASKiwi:

Like. But to emulate the mean stat to a tee, i.e. ignore the missing values:

numsales + N(sales) ;

is more precise than:

numsales+1;

Also, I find that tasks of this sort yield better to the DoW-loop, e.g.:

data want ;                              
  do until (last.product) ;              
    set a ;                              
    by product ;                         
    numsales = sum (numsales, N(sales)) ;
    sumsales = sum (sumsales, sales) ;   
  end ;                                  
  mean = divide (sumsales, numsales) ;   
run ;

because it (a) uses the implicit DATA step actions to reinitialize the aggregates before each BY group and (b) uses one tests for LAST.product instead of two for FIRST.product and LAST.product. The implicit initialization of NUMSALES to missing at the top of the step also guards against the division by 0 in case when all SALES in a BY group are missing.

Kind regards

Paul D.

SASKiwi · Posted 09-28-2019 06:51 PM

@hashman - Agreed, but I'd still use PROC MEANS or SQL unless there was some extra requirement justifying doing it on-the-fly in a DATA step.

hashman · Posted 09-28-2019 07:56 PM

@SASKiwi:

'Nuff said! especially since with those, one doesn't have to have missing values explicitly.

Kind regards

Paul D.

novinosrin · Posted 09-28-2019 06:26 PM


data a;
input product$ sales day$9. week;
cards;
a 100 01-jan-19 1
a 20 01-jan-19 1
b 200 02-jan-19 1
h 100 09-jan-19 2
i 300 09-jan-19 2
;
run;

data want;
 do _n_=1 by 1 until(last.week);
  set a;
  by product week;
  sum=sum(sales,sum);
 end;
 avg=sum/_n_;
run;

proc print noobs;run;

adi121 · Posted 09-29-2019 01:38 AM

although all 3 answers are giving desired output but @hashman code is simple and giving only 1 row per product .

Thanks for the help.

hashman · Posted 09-29-2019 02:00 AM

@adi121:

@novinosrin's code is basically the same (sans handling the missing values), and it also gives 1 row per product.

@SASKiwi code will also output 1 row per product if you add:

  if last.product ;

before the RUN statement.

Kind regards

Paul D.

PaigeMiller · Posted 09-29-2019 07:59 AM

Strongly agree with the above: do it in PROC MEANS or PROC SUMMARY instead of writing your own data step code. As we have seen many times in this forum, writing your own code can easily produce wrong results in the presence of missing values (although @hashman has avoided this issue, the answers by @novinosrin and @SASKiwi do not handle missing values properly). PROC MEANS and PROC SUMMARY handle missing values properly, all the time, 24/7. Ice Ice Baby.

--
Paige Miller

hashman · Posted 09-29-2019 03:04 PM

@adi121:

As @SASKiwi pointed out, you cannot apply aggregate functions like MEAN across rows in the DATA step. The environment where they naturally work across rows is SQL, so If you want to do that, use it instead; for example:

data have ;                   
  input product :$1. sales ;  
cards;                        
A 100                         
A   .                         
A  20                         
B 200                         
B 300                         
B   .                         
B 500                         
H 100                         
H 200                         
I 300                         
;                             
run ;                         
                              
proc sql ;                    
  create table want_sql as    
  select product              
       , N    (sales) as N    
       , sum  (sales) as sum  
       , mean (sales) as mean 
  from   have group product   
  ;                           
quit ;

With respect to the aggregate functions, SQL handles missing values as properly as proc MEANS.

Kind regards

Paul D.

novinosrin · Posted 09-29-2019 03:14 PM

Guru @hashman Neat. Would you rank SQL approach slightly higher than datastep if you had SQL pass through ACCESS and also make it the gold standard? The reason being the performance is prolly not compromised either?

Trust me for people like you and @SASKiwi the question might seem obvious having worked and seen tons but for the benefit of the folks(like me 🙂 ) who think they know everything yet only know to use SAS like a playstation/Xbox console and know nothing beyond a few notes will help.

PS Only when you have a moment and when you can. Cheers!

Also I wanted to reach out offline for something else if you remember. I shall touch base on that later. Bye for now!

hashman · Posted 09-29-2019 06:37 PM

Disciple@novinosrin:

Naturally, SQL is the choice when getting data from external data bases and/or aggregating it. Normally I'd use explicit pass-through, though I've seen cases when the SAS engine working via implicit pass-through was smarter than me in terms of translating my SAS SQL into the data base specific SQL. I think it's also smart enough to translate simple DATA step logic as well. However, I'm leery of setting any "gold standards" when it comes to programming - there're too many confounding factors and specific circumstances.

Kind regards

Paul D.

p.s. Yes, I do remember; thanks for reminding me.

Classroom Training Available!