I have 18 numerical variables pm25_total2000 to pm25_total2018
Each person have a starting year between 2013 and 2018, we can call that variable "reqyear".
Now I want to calculate mean for each persons 10 years before the starting year.
For example if a person have starting year 2015 I want mean(of pm25_total2006-pm25_total2015)
Or if a person have starting year 2013 I want mean(of pm25_total2004-pm25_total2013)
How to do this?
@TobbeNord wrote:
I have 18 numerical variables pm25_total2000 to pm25_total2018
If those variables are numbered sequential there are 19 variables. If you only have 18 then you need to tell us which one is missing.
If you are going to throw at problem that has 20+ variables you should provide an example data set as making us dummy of that much data is asking a bit.
Here is one way to get started. I am too lazy to make up that many values so I am using 10 "years" and much shorter variable names.
data have; input v2000 - v2009 reqyear; datalines; 1 2 3 4 5 6 7 8 9 10 2005 11 22 33 44 55 66 77 88 99 110 2008 ; data want; set have; array v(2000:2009) v2000-v2009; array t(2000:2009) _temporary_; /* reset as temporary arrays hold values across data step)*/ call missing(of t(*)); /*load temporary array*/ do i=(reqyear-3) to reqyear; t[i]=v[i]; end; Average= mean(of t(*)); drop i; run;
The key to my approach are two arrays. One that hold the full set of values and one that we copy the values to for the mean function to use. If you have not used arrays much you may not recognize the array statement with two values separated by a colon. That is used to indicate the low and high index values instead of the default 1 to number of elements in the array. This is often useful when you have some other variable such as your Reqyear to extract values by index.
The T array is temporary, meaning that the contents are not written to the result data set. If you want to see what they might be remove the _temporary_.
One thing is that while the values of _temporary_ arrays aren't written to the output data set the values are retained. So we need to start fresh with T at each step.
Since I only used 10 variables I am using 3 previous "years" which is seen on the Do I= loop to extract the desired values into the temporary array.
Hi,
while performing the calculation in a datastep, I realized it became quickly diffcult to read, so I did the same using a macro step.
There must be an easier way I can't think of right now
%** Create test data;
DATA have;
length person $10 reqyear pm25_total2000-pm25_total2018 8;
array pm pm25_total2000-pm25_total2018;
do i=1 to 10;
person=cats('Person',put(i,best.));
do over pm;
pm=int(ranuni(0)*100);
end;
reqyear=2000 + floor((1+2018-2000)*rand("uniform"));
output;
end;
drop i;
RUN;
%** 1 - Using datastep;
DATA want;
set _NULL_;
RUN;
DATA _NULL_;
length pmname pm_start pm_end $32;
SET have;
ARRAY PMS pm25_total2000-pm25_total2018;
do i=1 to dim(PMS);
pmname =strip(upcase(vname(PMS[i])));
pm_start =strip(upcase(cats('pm25_total',put(reqyear-9,best.))));
pm_end =strip(upcase(cats('pm25_total',put(reqyear,best.))));
end;
put 'N' 'OTE:' Person= reqyear= pm_start= pm_end=;
if pm_start GE vname(PMS[1]) then do;
call execute('DATA _hlp; LENGTH mean 8 analyzed $70; SET have(FIRSTOBS='||strip(put(_N_,best.))||' OBS='||strip(put(_N_,best.))||');');
call execute('ARRAY pms_mean {*} '||strip(pm_start)||'--'||strip(pm_end)||';');
call execute('mean=mean(of pms_mean[*]); analyzed="mean of '||strip(pm_start)||'--'||strip(pm_end)||'";');
call execute('RUN;');
call execute('DATA want; SET want _hlp; RUN; PROC DATASETS lib=work nolist; delete _:; RUN;QUIT;');
end;
else do;
call execute('DATA _hlp; LENGTH mean 8 analyzed $70; SET have(FIRSTOBS='||strip(put(_N_,best.))||' OBS='||strip(put(_N_,best.))||');');
call execute('mean=.; analyzed="mean since '||strip(pm_start)||' not calculable";');
call execute('RUN;');
call execute('DATA want; SET want _hlp; RUN; PROC DATASETS lib=work nolist; delete _:; RUN;QUIT;');
end;
RUN;
%** 2 - Using Macro;
DATA want2;
set _NULL_;
RUN;
%macro calcMean(inDS=,outDS=,n=,Miss=,startVar=,endVar=);
%if &Miss. ne 1 %then %do;
DATA _hlp;
LENGTH mean 8 analyzed $70;
SET have(FIRSTOBS=&n. OBS=&n.);
ARRAY pms_mean {*} &startVar.--&endVar.;
mean=mean(of pms_mean[*]);
analyzed="Mean of &startVar.--&endVar.";
RUN;
%end;
%else %do;
DATA _hlp;
LENGTH mean 8 analyzed $70;
SET have(FIRSTOBS=&n. OBS=&n.);
mean=.;
analyzed="Mean since &startVar. not calculable";
RUN;
%end;
DATA &outDS.;
SET &outDS. _hlp;
RUN;
PROC DATASETS lib=work nolist; delete _:; RUN;QUIT;
%mend calcMean;
DATA _NULL_;
length pmname pm_start pm_end $32;
SET have;
ARRAY PMS pm25_total2000-pm25_total2018;
do i=1 to dim(PMS);
pmname =strip(upcase(vname(PMS[i])));
pm_start =strip(upcase(cats('pm25_total',put(reqyear-9,best.))));
pm_end =strip(upcase(cats('pm25_total',put(reqyear,best.))));
end;
put 'N' 'OTE:' Person= reqyear= pm_start= pm_end=;
if pm_start GE vname(PMS[1]) then
call execute('%nrstr(%calcMean(inDS=want,outDS=want2,n='||strip(put(_N_,best.))||' ,startVar='||strip(pm_start)||',endVar='||strip(pm_end)||');)');
else call execute('%nrstr(%calcMean(inDS=want,outDS=want2,n='||strip(put(_N_,best.))||',Miss=1,startVar='||strip(pm_start)||',endVar='||strip(pm_end)||');)');
RUN;
- Cheers -
@TobbeNord wrote:
I have 18 numerical variables pm25_total2000 to pm25_total2018
If those variables are numbered sequential there are 19 variables. If you only have 18 then you need to tell us which one is missing.
If you are going to throw at problem that has 20+ variables you should provide an example data set as making us dummy of that much data is asking a bit.
Here is one way to get started. I am too lazy to make up that many values so I am using 10 "years" and much shorter variable names.
data have; input v2000 - v2009 reqyear; datalines; 1 2 3 4 5 6 7 8 9 10 2005 11 22 33 44 55 66 77 88 99 110 2008 ; data want; set have; array v(2000:2009) v2000-v2009; array t(2000:2009) _temporary_; /* reset as temporary arrays hold values across data step)*/ call missing(of t(*)); /*load temporary array*/ do i=(reqyear-3) to reqyear; t[i]=v[i]; end; Average= mean(of t(*)); drop i; run;
The key to my approach are two arrays. One that hold the full set of values and one that we copy the values to for the mean function to use. If you have not used arrays much you may not recognize the array statement with two values separated by a colon. That is used to indicate the low and high index values instead of the default 1 to number of elements in the array. This is often useful when you have some other variable such as your Reqyear to extract values by index.
The T array is temporary, meaning that the contents are not written to the result data set. If you want to see what they might be remove the _temporary_.
One thing is that while the values of _temporary_ arrays aren't written to the output data set the values are retained. So we need to start fresh with T at each step.
Since I only used 10 variables I am using 3 previous "years" which is seen on the Do I= loop to extract the desired values into the temporary array.
data have;
call streaminit(123);
array x{*} pm25_total2000-pm25_total2018;
do id=1 to 10;
start_year=rand('integer',2013,2018);
do i=1 to dim(x);
x{i}=rand('integer',1,10);
end;
output;
end;
drop i;
run;
data want;
set have;
sum=0;
do year=start_year-9 to start_year;
sum+vvaluex(cats('pm25_total',year));
end;
mean=sum/10;
drop sum year;
run;
easy 😅 thanks
- Cheers -
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.