BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
adi121
Fluorite | Level 6

Hi im calculating average using data step method but unable to do so.

 

data :

data a;
input product$ sales day$9. week;
cards;
a 100 01-jan-19 1
a 20 01-jan-19 1
b 200 02-jan-19 1
h 100 09-jan-19 2
i 300 09-jan-19 2
;
run;

 

however i can do it using proc means:

proc means data=a nway;
var sales;
class product;
output out=s mean=avg;
;run;
Results:
product _type_ _freq_ avg
a 1 2 60
b 1 1 200
h 1 1 100
i 1 1 300

 

I want to get this output using data step way:

proc sort data=a ; by product week; run;
data want;
set a;
by product week;
if first.product then sumsales=0;
sumsales+sales;

if last.product;
avg=mean(sumsales);
run;

 

the mean function is not working and returning the exact values of sumsales column

 

Plz help!!

 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
hashman
Ammonite | Level 13

@SASKiwi:

Like. But to emulate the mean stat to a tee, i.e. ignore the missing values:

numsales + N(sales) ;

 is more precise than:

numsales+1;

 Also, I find that tasks of this sort yield better to the DoW-loop, e.g.:

data want ;                              
  do until (last.product) ;              
    set a ;                              
    by product ;                         
    numsales = sum (numsales, N(sales)) ;
    sumsales = sum (sumsales, sales) ;   
  end ;                                  
  mean = divide (sumsales, numsales) ;   
run ;                                    

because it (a) uses the implicit DATA step actions to reinitialize the aggregates before each BY group and (b) uses one tests for LAST.product instead of two for FIRST.product and LAST.product. The implicit initialization of NUMSALES to missing at the top of the step also guards against the division by 0 in case when all SALES in a BY group are missing.

 

Kind regards

Paul D.  

View solution in original post

11 REPLIES 11
SASKiwi
PROC Star

The MEAN function averages variables, not rows. This should do it:

data want;
set a;
by product week;
if first.product then do;
  sumsales=0;
  numsales = 0;
end;
sumsales+sales;
numsales+1;
if last.product then avg=sumsales/numsales;
run;
hashman
Ammonite | Level 13

@SASKiwi:

Like. But to emulate the mean stat to a tee, i.e. ignore the missing values:

numsales + N(sales) ;

 is more precise than:

numsales+1;

 Also, I find that tasks of this sort yield better to the DoW-loop, e.g.:

data want ;                              
  do until (last.product) ;              
    set a ;                              
    by product ;                         
    numsales = sum (numsales, N(sales)) ;
    sumsales = sum (sumsales, sales) ;   
  end ;                                  
  mean = divide (sumsales, numsales) ;   
run ;                                    

because it (a) uses the implicit DATA step actions to reinitialize the aggregates before each BY group and (b) uses one tests for LAST.product instead of two for FIRST.product and LAST.product. The implicit initialization of NUMSALES to missing at the top of the step also guards against the division by 0 in case when all SALES in a BY group are missing.

 

Kind regards

Paul D.  

SASKiwi
PROC Star

@hashman - Agreed, but I'd still use PROC MEANS or SQL unless there was some extra requirement justifying doing it on-the-fly in a DATA step.

hashman
Ammonite | Level 13

@SASKiwi:

'Nuff said! especially since with those, one doesn't have to have missing values explicitly.

 

Kind regards

Paul D.  

novinosrin
Tourmaline | Level 20

data a;
input product$ sales day$9. week;
cards;
a 100 01-jan-19 1
a 20 01-jan-19 1
b 200 02-jan-19 1
h 100 09-jan-19 2
i 300 09-jan-19 2
;
run;

data want;
 do _n_=1 by 1 until(last.week);
  set a;
  by product week;
  sum=sum(sales,sum);
 end;
 avg=sum/_n_;
run;

proc print noobs;run;
adi121
Fluorite | Level 6

 

although all 3 answers are giving desired output but @hashman code is simple and giving only 1 row per product .

 

Thanks for the help.

 

hashman
Ammonite | Level 13

@adi121:

@novinosrin's code is basically the same (sans handling the missing values), and it also gives 1 row per product.

@SASKiwi code will also output 1 row per product if you add:

  if last.product ;

before the RUN statement.

 

Kind regards

Paul D.   

PaigeMiller
Diamond | Level 26

Strongly agree with the above: do it in PROC MEANS or PROC SUMMARY instead of writing your own data step code. As we have seen many times in this forum, writing your own code can easily produce wrong results in the presence of missing values (although @hashman has avoided this issue, the answers by @novinosrin and @SASKiwi do not handle missing values properly). PROC MEANS and PROC SUMMARY handle missing values properly, all the time, 24/7. Ice Ice Baby.

--
Paige Miller
hashman
Ammonite | Level 13

@adi121:

As @SASKiwi pointed out, you cannot apply aggregate functions like MEAN across rows in the DATA step. The environment where they naturally work across rows is SQL, so If you want to do that, use it instead; for example:

data have ;                   
  input product :$1. sales ;  
cards;                        
A 100                         
A   .                         
A  20                         
B 200                         
B 300                         
B   .                         
B 500                         
H 100                         
H 200                         
I 300                         
;                             
run ;                         
                              
proc sql ;                    
  create table want_sql as    
  select product              
       , N    (sales) as N    
       , sum  (sales) as sum  
       , mean (sales) as mean 
  from   have group product   
  ;                           
quit ;                        

With respect to the aggregate functions, SQL handles missing values as properly as proc MEANS.

 

Kind regards

Paul D.

   

novinosrin
Tourmaline | Level 20

Guru @hashman  Neat. Would you rank SQL approach slightly higher than datastep if you had SQL pass through ACCESS and also make it the gold standard? The reason being the performance is prolly not compromised either?

 

Trust me for people like you and @SASKiwi  the question might seem obvious having worked and seen tons but for the benefit of the folks(like me 🙂 ) who think they know everything yet only know to use SAS like a playstation/Xbox console and know nothing beyond a few notes will help. 

 

PS Only when you have a moment and when you can. Cheers!

 

Also I wanted to reach out offline for something else if you remember. I shall touch base on that later. Bye for now!

hashman
Ammonite | Level 13

Disciple@novinosrin:

Naturally, SQL is the choice when getting data from external data bases and/or aggregating it. Normally I'd use explicit pass-through, though I've seen cases when the SAS engine working via implicit pass-through was smarter than me in terms of translating my SAS SQL into the data base specific SQL. I think it's also smart enough to translate simple DATA step logic as well. However, I'm leery of setting any "gold standards" when it comes to programming - there're too many confounding factors and specific circumstances. 

 

Kind regards

Paul D.

 

p.s. Yes, I do remember; thanks for reminding me.    

 

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 11 replies
  • 38385 views
  • 3 likes
  • 5 in conversation