I have a dataset which contains:
Date
GadgetId
SomeMeasurement
I would like to calculate the median of SomeMeasurement for every month whilst considering the retrospective/previous data. Example:
Date | GadgetId | SomeMeasurement |
31-Jan-15 | A1 | 5 |
26-Jan-15 | A1 | 3 |
26-Jan-15 | A1 | 3 |
26-Jan-15 | A1 | 3 |
03-Feb-15 | A1 | 5 |
07-Feb-15 | A1 | 5 |
07-Feb-15 | A1 | 5 |
07-Feb-15 | A1 | 4 |
02-Feb-15 | A1 | 5 |
02-Feb-15 | A1 | 5 |
03-Feb-15 | A1 | 5 |
02-Feb-15 | A1 | 5 |
07-Feb-15 | A1 | 4 |
03-Feb-15 | A1 | 5 |
In month Jan 2015 one would consider the values of this month only to calculate the median. In month Feb 2015 one would consider the values in Jan 2015 and Feb 2015, in Dec 2017 one would consider the data for Dec 2017 and all the previous months etc.
Please note that each dataset contains several GadgetIds, so a BY GadgetId would be required, I suppose. Also, each GadgetId has a different number of samples/dates (some may have only one year's worth of data whereas others may have several years' worth).
PROC EXPAND.
Thanks. I think this is SAS/ETS, which we do not have )-:
How many "and so on"'s are we talking? I mean you could keep all the values in an array for instance then median each row.
data want;
  set have;
  array vals{100} 8;
  retain vals:;
  retain num;
  num=ifn(_n_=1,1,num+1);
  vals{num}=somemeasurement;
  result=median(of vals{*});
run;
That assumes at most 100 observations.
I am not sure I quite see the logic here though: why do a rolling median? Would a monthly or yearly median not be appropriate?
Thanks. It could be 3-4 years' worth of data, so in month 12 of year 4 I have to use the data of all 4 years to get the median. Please also note that I have to use a BY for the different gadgets. Each gadget can have 3-4 years of data, but the amount is dynamic, i.e. it depends on the gadget.
Can you please provide a more complete sample data with gadgets and the rest?
You only included one gadget, he asked for a few.
The second solution I posted deals with BY groups - see the BY and IF FIRST statement that resets things.
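For reference, a minimal sketch of what a BY-group version of the array approach might look like. It is not necessarily the code referred to above. It assumes HAVE contains Date, GadgetId and SomeMeasurement, and that no gadget has more than 10000 rows; the names have_m, month_start, want_monthly and cum_median are made up for illustration.

```sas
/* Sketch only: 10000 is an assumed upper bound on rows per gadget. */
data have_m;
  set have;
  month_start = intnx('month', date, 0);  /* first day of the month */
  format month_start monyy7.;
run;

proc sort data=have_m;
  by gadgetid month_start;
run;

data want_monthly;
  set have_m;
  by gadgetid month_start;
  array vals{10000} _temporary_;          /* temporary arrays are retained automatically */
  if first.gadgetid then do;              /* reset at the start of each gadget */
    call missing(of vals{*});
    num = 0;
  end;
  num + 1;                                /* sum statement, so num is retained */
  vals{num} = SomeMeasurement;
  if last.month_start then do;            /* one output row per gadget and month */
    cum_median = median(of vals{*});      /* MEDIAN ignores the still-missing elements */
    output;
  end;
  keep gadgetid month_start cum_median;
run;
```

On the sample data in the question this should give 3 for Jan 2015 (median of 5, 3, 3, 3) and 5 for Feb 2015 (median of all 14 values).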
@csetzkorn wrote:
Done - sorry if it was not clear enough ...
It helps when you provide complete and comprehensive samples and details.
data have;
input Date : date9. gadgetid $ SomeMeasurement;
format date date9.;
datalines;
31-Jan-15 A1 5
26-Jan-15 A1 3
26-Jan-15 A1 3
26-Jan-15 A1 3
03-Feb-15 A1 5
07-Feb-15 A1 5
07-Feb-15 A1 5
07-Feb-15 A1 4
02-Feb-15 A1 5
02-Feb-15 A1 5
03-Feb-15 A1 5
02-Feb-15 A1 5
07-Feb-15 A1 4
03-Feb-15 A1 5
31-Jan-15 B1 5
26-Jan-15 B1 3
26-Jan-15 B1 3
26-Jan-15 B1 3
03-Feb-15 B1 5
07-Feb-15 B1 5
07-Feb-15 B1 5
07-Feb-15 B1 4
02-Feb-15 B1 5
02-Feb-15 B1 5
03-Feb-15 B1 5
02-Feb-15 B1 5
07-Feb-15 B1 4
03-Feb-15 B1 5
;
run;
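data temp;
set have;
by gadgetid;
_month=intnx('month',date,0); /* first day of the month, so the year is compared as well */
_prev=lag(_month); /* call LAG on every row so its queue stays in step */
if first.gadgetid then grp=1;
else if _month ne _prev then grp+1;
formatted_date=date;
format formatted_date monyy7.;
drop _:;
run;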
data want;
_k=_n_;
_c=0;
array t(20) _temporary_ ;/*array subscript arbitrary,should assign a big one to hold*/
call missing(median,of t(*));
do until(last.gadgetid);
do until(last.grp);
set temp;
by gadgetid grp;
_c+1;
t(_c)=SomeMeasurement;
if last.grp then do; median=median(median,of t(*));output;end;
end;
end;
drop _:;
run;
Slight correction to the data want step (MEDIAN no longer takes its own previous result as an argument, and grp is dropped):
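data want;
  _k=_n_;
  _c=0; /* number of values stored so far for the current gadget */
  array t(20) _temporary_; /* size is arbitrary; it must be at least the largest number of rows for any gadgetid */
  call missing(median,of t(*)); /* start each gadget with an empty array */
  do until(last.gadgetid); /* one DATA step iteration per gadget */
    do until(last.grp); /* one pass per month within the gadget */
      set temp;
      by gadgetid grp;
      _c+1;
      t(_c)=SomeMeasurement;
      /* at the end of each month, take the median of everything stored so far */
      if last.grp then do; median=median(of t(*)); output; end;
    end;
  end;
  drop _: grp;
run;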
@csetzkorn Yes, the bigger subscript makes sure the values (elements) don't go out of range.
For example, if you believe there could be 10000 records per gadgetid, declare the array with at least 10000 elements.
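If you would rather not guess the size, one option is to count the rows per gadgetid first and use the largest count as the array dimension. A minimal sketch, assuming the input table is HAVE; the macro variable name maxn is made up for illustration.

```sas
/* Sketch only: put the largest per-gadget row count into a macro variable */
proc sql noprint;
  select max(n) into :maxn trimmed
  from (select count(*) as n
        from have
        group by gadgetid);
quit;

%put NOTE: largest number of rows for one gadgetid is &maxn;

/* then, in the DATA step: array t(&maxn) _temporary_; */
```

This costs one extra pass over the data, but the array can then no longer run out of range when a gadget has more rows than expected.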
With the temporary array method, size your array at 31 to hold a full month of data. If you have repeated measurements for a month, are they considered the same? I noticed you had two observations for month=1 and one for month=3. If you have a variable number per month, you may want to standardize or aggregate that somehow first.
https://gist.github.com/statgeek/27e23c015eae7953eff2
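data want;
  set sashelp.stocks;
  by stock notsorted;
  array p{0:30} _temporary_; /* circular buffer holding the last 31 values */
  if first.stock then call missing(of p{*}); /* reset the window for each stock */
  p{mod(_n_,31)} = open; /* overwrite the oldest slot */
  moving_median = median(of p{*});
  moving_max = max(of p{*});
run;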
Yes, there could be several values per day, as indicated in the example.
If you have SAS 9.4:
data have;
input Date : date9. gadgetid $ SomeMeasurement;
new_date=intnx('month',date,0);
format date new_date date9.;
datalines;
31-Jan-15 A1 5
26-Jan-15 A1 3
26-Jan-15 A1 3
26-Jan-15 A1 3
03-Feb-15 A1 5
07-Feb-15 A1 5
07-Feb-15 A1 5
07-Feb-15 A1 4
02-Feb-15 A1 5
02-Feb-15 A1 5
03-Feb-15 A1 5
02-Feb-15 A1 5
07-Feb-15 A1 4
03-Feb-15 A1 5
31-Jan-15 B1 5
26-Jan-15 B1 3
26-Jan-15 B1 3
26-Jan-15 B1 3
03-Feb-15 B1 5
07-Feb-15 B1 5
07-Feb-15 B1 5
07-Feb-15 B1 4
02-Feb-15 B1 5
02-Feb-15 B1 5
03-Feb-15 B1 5
02-Feb-15 B1 5
07-Feb-15 B1 4
03-Feb-15 B1 5
;
run;
proc sql;
create table want as
select *,(select median(SomeMeasurement) from have
where gadgetid=a.gadgetid and new_date<=a.new_date) as median
from have as a;
quit;
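The query above returns one row per input row, with the cumulative median repeated on every row of the same gadget and month. If only one row per gadget and month is needed, a minimal sketch of a follow-up step; the output name want_by_month is made up for illustration.

```sas
/* Sketch only: keep the first row of each gadget/month combination. */
/* All rows of a combination carry the same median, so nothing is lost. */
proc sort data=want out=want_by_month(keep=gadgetid new_date median) nodupkey;
  by gadgetid new_date;
run;
```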