Solved: How to compute the median of a variable but exclude the observation it...

navyblue · Posted 09-19-2013 06:31 PM

I'm try to compute the difference between each observation's value of a certain variable (A) and the median of the same variable (A) for

its joint industry (X) and performance (Y) decile, where the median excludes the obs itself. I already created deciles for variable Y within each industry X,

but can't figure out how to calculate the median of each decile excluding the obs. Any suggestions are much appreciated!

PGStats · Posted 09-19-2013 10:22 PM

You can try this :

proc sort data=have; by x decileY; run;

proc sql;
create table c as
select x, decileY, count(a) as n
from have
group by x, decileY;
quit;

data want;
merge have c; by x decileY;
medPos = cc + floor((n+1)/2);
if _n_ <= medPos then do;
     medPos = min(cc + n, max(cc + 1, medPos + 1));
     set have(keep=a rename=(a=medA)) point=medPos;
     end;
else do;
     medPos = min(cc + n, max(cc + 1, medPos - 1));
     set have(keep=a rename=(a=medA)) point=medPos;
     end;
medDif = a - medA;
if last.decileY then cc + n;
drop n cc;
run;

The idea is that for observations lower than the true median, your median is the observation just after the true median and for observations greater than the true median, your median is the observation just before the true median.

PG

View solution in original post

PGStats · Posted 09-19-2013 10:22 PM

You can try this :

proc sort data=have; by x decileY; run;

proc sql;
create table c as
select x, decileY, count(a) as n
from have
group by x, decileY;
quit;

data want;
merge have c; by x decileY;
medPos = cc + floor((n+1)/2);
if _n_ <= medPos then do;
     medPos = min(cc + n, max(cc + 1, medPos + 1));
     set have(keep=a rename=(a=medA)) point=medPos;
     end;
else do;
     medPos = min(cc + n, max(cc + 1, medPos - 1));
     set have(keep=a rename=(a=medA)) point=medPos;
     end;
medDif = a - medA;
if last.decileY then cc + n;
drop n cc;
run;

The idea is that for observations lower than the true median, your median is the observation just after the true median and for observations greater than the true median, your median is the observation just before the true median.

PG

navyblue · Posted 09-20-2013 02:21 PM

Thank you so much for the great help! Your code works perfectly! I'm so impressed by your prompt feedback and glad I find this supportive community. Happy Friday!

LLCLOUD · Posted 05-06-2015 09:17 PM

hI

Refer back to the program listed by PG stats

medPos = cc + floor((n+1)/2);

if _n_ <= medPos then do;

medPos = min(cc + n, max(cc + 1, medPos + 1));

set have(keep=a rename=(a=medA)) point=medPos;

end;

else do;

medPos = min(cc + n, max(cc + 1, medPos - 1));

set have(keep=a rename=(a=medA)) point=medPos;

end;

medDif = a - medA;

if last.decileY then cc + n;

drop n cc;

run;

What cc stands for, there is no value or definition in the program. Your help is much appreciated.

Cheers

LEE

PGStats · Posted 05-07-2015 11:15 AM

In the above datastep, cc is a variable created automatically by SAS because of the statement

if last.decileY then cc + n;

cc + n; is a sum statement which creates the accumulator variable cc. It is equivalent to

retain cc 0;

cc = sum(cc, n);

Thus cc is initialized with zero and it automatically retains its value across datastep iterations.

PG

LLCLOUD · Posted 05-07-2015 08:11 PM

Hi PGStats

I blindly used the code you posted, it is running ok. however, it doesnot produce MedA further Medif for each observation. There are quite large observations without MedA. I am confused. Your thoughts and help is much appreciated.

Cheers

li

PGStats · Posted 05-08-2015 10:59 AM

It might help if your initial sort also includes variable a :

proc sort data=have; by x decileY a; run;

If that doesn't solve the problem, please provide a small example.

PG

mkeintz · Posted 09-19-2013 11:19 PM

For each company in a decile/industry group, you want CMED, the "complimentary median" (i.e. median of group members except the company in hand) of variable X.

But of course, you can't directly calculate the CMED the same way you could directly calculate the complimentary mean - after all its a non-parametric measure.

But you are helped by the fact that you're going to get no more than 3 distinct values for CMED over each group. Start out with the MED(X), the "ordinary median" (i.e. including the subject company).

Then all companies with X < MED(X) will have the same CMED (i.e. dropping any of these companies from median calculation is the same as dropping the minimum X). Call this value MEDexMIN(X).

The same principle applies to all companies with X> MED(X). Dropping any of them will produce the same CMED as dropping the maximum X. Call it MEDexMAX(X).

And if the company has X=MED(X)? Then the complimentary median is the mid-point between MEDexMIN(X) and MEDexMAX(X).

Note all of these could have the same value if there are a lot of ties at MED(X)..

Here's a suggested program - I'd consider it a good example of why and how to use the double DOW approach. It assumes your data set HAVE is grouped by decile and industry, and that you have (say) no more than 100 companies in a single decile/industry group:

data want (drop=_:);

array _x {100}; /* Establishes variables _x1 ... _x100 */

do _I = 1 by 1 until (last.industry);

set have;

by decile industry nosorted;

_x{_I}=x;
_xmax=max(_xmax,x); if _xmax=x then _imax=_I;

_xmin=min(_xmin,x); if _xmin=x then _imin=_I;

end;

_med=median(of _x:);

_x{_Imax}=.; _MEDexMAX=median(of _x:); _x{_imax}=_xmax;

_x{_imin}=.; _MEDexMIN=median(of _x:); _x{_imin}=_xmin;

do until (last.industry);

set have;

by decile industry notsorted;

if x=. then CMED=.;

else select (sign(x-_med));

      when (1) cmed = MEDexMAX;
      when (-1) cmed = MEDexMIN;
      when (0) cmed = (MEDexMAX+MEDexMIN)/2;

end;

output;

end;

run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

navyblue · Posted 09-20-2013 02:45 PM

Thank you so much for explaining the logic behind the code and providing an alternative solution for my problem. I feel so lucky to get help from you all, this means so much to a beginner. I did try your code but getting error message as follows:

1185 data want (drop=_:)

1186 array _x (100); /* Establishes variables _x1 ... _x100 */

---

22

ERROR 22-7: Invalid option name 100.

I'll check into this more later when I get a chance. I'm sure your code definitely works perfectly, I just need to find out what I might have typed wrong.

Again, thank you very much!

Tom · Posted 09-20-2013 04:33 PM

Your DATA statement is missing the semi-colon.

navyblue · Posted 09-24-2013 06:36 PM

Good eyes, thanks a lot!

How to compute the median of a variable but exclude the observation itself

Re: How to compute the median of a variable but exclude the observation itself

Re: How to compute the median of a variable but exclude the observation itself

Re: How to compute the median of a variable but exclude the observation itself

Re: How to compute the median of a variable but exclude the observation itself

Re: How to compute the median of a variable but exclude the observation itself

Re: How to compute the median of a variable but exclude the observation itself

Re: How to compute the median of a variable but exclude the observation itself

Re: How to compute the median of a variable but exclude the observation itself

Re: How to compute the median of a variable but exclude the observation itself

Re: How to compute the median of a variable but exclude the observation itself

Re: How to compute the median of a variable but exclude the observation itself

Catch up on SAS Innovate 2026

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away