I have an unknown number of observations sorted by var1, and I want to find the median of var1 in a data step (without using proc means). I know I can add a variable for _n_, which gives me the observation number. What I can't work out is how to pick the middle observation (if the number of observations is odd) or average the two middle observations (if it is even).
Any help on this would be appreciated.
Why would you want to do this since a number of procs already have the ability to do all of the work?
Anyhow, since you asked, how about the following (using your previous example as the data):
data have;
input var1-var10 col1-col11;
cards;
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
;
data want (keep=variable);
set have;
array vars _all_;
do over vars;
variable=vars;
output;
end;
run;
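/* assumes WANT is already sorted by variable */
data median (keep=median);
  set want end=lastrec nobs=numobs;
  retain low high;
  /* odd count: the single middle observation is both low and high */
  if numobs/2-int(numobs/2) and _n_ eq ceil(numobs/2) then do;
    low=variable;
    high=variable;
  end;
  /* otherwise remember the observations at positions int(n/2) and int(n/2)+1 */
  else if _n_ eq int(numobs/2) then low=variable;
  else if _n_ eq int(numobs/2)+1 then high=variable;
  /* on the last record, average the two retained values */
  if lastrec then do;
    median=sum(low,high)/2;
    output;
  end;
run;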
Another way:
data have;
input var1 @@;
datalines;
4 7 2 6 32 4 5 8 3 7
;
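/* store the positions of the one or two middle observations in macro variables */
data _null_;
  set have nobs=n;
  call symput ("firstobs", ceil(n/2));
  call symput ("obs", ceil(n/2)+(mod(n,2)=0));
  stop;
run;
proc sort data=have; by var1; run;
/* read just those observations from the sorted data and average them */
data want(keep=median);
  set have (firstobs=&firstobs. obs=&firstobs.);
  varx = var1;
  set have (firstobs=&obs. obs=&obs.);
  median = mean(var1,varx);
run;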
PG
PGStats,
I know you know this, but it looks like you're burning a little too much midnight oil. Switch the CALL SYMPUTs to:
call symputx("firstobs", floor(n/2));
call symputx("obs", ceil(n/2));
The tools are good, the details are tricky. Also, can the formula for medians get complex if there can be ties?
Ties are irrelevant. The definition can be found at: Median - Wikipedia, the free encyclopedia
Art,
I'm focusing on the word "usually" at the end of the first paragraph of your link.
Also, on the PCTLDEF option within PROC UNIVARIATE.
You might be right on this, but I'm just not sure yet.
If N=5, I want firstobs=3 and obs=3. If N=4, I want firstobs=2 and obs=3. Hence the expressions I proposed.
Ties are not a problem. Empty datasets are, however.
PG
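For anyone following along, a quick way to check the two index expressions from the CALL SYMPUT statements is to tabulate them for a few small values of N. This is only a sketch: the loop and the variable names n, firstobs and obs inside this step are illustrative and not part of anyone's posted code.

```sas
/* Tabulate the two middle positions for small N (illustrative check only) */
data _null_;
  do n = 1 to 6;
    firstobs = ceil(n/2);                  /* lower middle position */
    obs = ceil(n/2) + (mod(n,2)=0);        /* upper middle position; same as firstobs when n is odd */
    put n= firstobs= obs=;
  end;
run;
```

The log should show the pair 2 and 3 for N=4 and the pair 3 and 3 for N=5, which matches the values PG quotes above.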
PGStats,
You're right, my bad. Where's that coffee?
If the data are sorted you can use direct access to find the one or two obs needed for median.
Nicely done, DN. It can be further simplified as:
proc sort data=sashelp.class(obs=16) out=class;
by age;
run;
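data want;
  /* nobs is set at compile time, so it can be tested before the SET statement runs */
  if mod(nobs, 2) then do point=(1+nobs)/2;   /* odd: one middle observation */
    set class point=point;
    median = age;
  end;
  else do point = nobs/2, 1+nobs/2;           /* even: the two middle observations */
    set class nobs=nobs point=point;
    median + age/2;
  end;
  age = median;
  output; stop;
  keep age ;
  format age 8.2;
run;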
PG
Or if you like compactness:
proc sort data=sashelp.class(obs=16) out=class;
by age;
run;
data want;
do point=(mod(nobs, 2)+nobs)/2, (2-mod(nobs, 2)+nobs)/2 ;
set class nobs=nobs point=point;
median + age/2;
end;
age = median;
output; stop;
keep age ;
format age 8.2;
run;
PG
I am so smart that I have deleted my stupid post this morning.
Maybe I should have done the same.
To atone for my earlier fog, here's my less foggy version of the looping on this one:
do point = ceil(nobs/2), ceil( (nobs+1)/2 );
Nice! It makes PG's code more compact:
data want;
do point = ceil(nobs/2), ceil((nobs+1)/2);
set class nobs=nobs point=point;
median + age/2;
end;
age = median;
output; stop;
keep age ;
format age 8.2;
run;
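As noted earlier in the thread, an empty dataset is the one case these versions do not handle: with nobs=0 the loop would ask for point=0, which is not a valid observation number. A minimal sketch of one possible guard follows, assuming CLASS is the sorted dataset from the PROC SORT step above. The early-exit branch and the log message are my own additions for illustration, not part of the posted code.

```sas
/* Hypothetical guard for an empty input dataset (illustrative only) */
data want;
  if nobs = 0 then do;                     /* nobs is known at compile time */
    put "NOTE: CLASS is empty, median set to missing.";
    median = .;
    output;
    stop;
  end;
  do point = ceil(nobs/2), ceil((nobs+1)/2);
    set class nobs=nobs point=point;
    median + age/2;                        /* sum statement: median starts at 0 */
  end;
  output;
  stop;
  keep median;
run;
```

Whether a missing median or an error is the better outcome for an empty input depends on how the result is used downstream, so treat the branch above as one option rather than the recommended one.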