DATA Step, Macro, Functions and more

How to compute the median of a variable but exclude the observation itself

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 9
Accepted Solution

How to compute the median of a variable but exclude the observation itself

I'm try to compute the difference between each observation's value of a certain variable (A) and the median of the same variable (A) for

its joint industry (X) and performance (Y) decile, where the median excludes the obs itself. I already created deciles for variable Y within each industry X,

but can't figure out how to calculate the median of each decile excluding the obs. Any suggestions are much appreciated!


Accepted Solutions
Solution
‎09-19-2013 10:22 PM
Respected Advisor
Posts: 4,644

Re: How to compute the median of a variable but exclude the observation itself

You can try this :


proc sort data=have; by x decileY; run;

proc sql;
create table c as
select x, decileY, count(a) as n
from have
group by x, decileY;
quit;

data want;
merge have c; by x decileY;
medPos = cc + floor((n+1)/2);
if _n_ <= medPos then do;
     medPos = min(cc + n, max(cc + 1, medPos + 1));
     set have(keep=a rename=(a=medA)) point=medPos;
     end;
else do;
     medPos = min(cc + n, max(cc + 1, medPos - 1));
     set have(keep=a rename=(a=medA)) point=medPos;
     end;
medDif = a - medA;
if last.decileY then cc + n;
drop n cc;
run;

The idea is that for observations lower than the true median, your median is the observation just after the true median and for observations greater than the true median, your median is the observation just before the true median.

PG

PG

View solution in original post


All Replies
Solution
‎09-19-2013 10:22 PM
Respected Advisor
Posts: 4,644

Re: How to compute the median of a variable but exclude the observation itself

You can try this :


proc sort data=have; by x decileY; run;

proc sql;
create table c as
select x, decileY, count(a) as n
from have
group by x, decileY;
quit;

data want;
merge have c; by x decileY;
medPos = cc + floor((n+1)/2);
if _n_ <= medPos then do;
     medPos = min(cc + n, max(cc + 1, medPos + 1));
     set have(keep=a rename=(a=medA)) point=medPos;
     end;
else do;
     medPos = min(cc + n, max(cc + 1, medPos - 1));
     set have(keep=a rename=(a=medA)) point=medPos;
     end;
medDif = a - medA;
if last.decileY then cc + n;
drop n cc;
run;

The idea is that for observations lower than the true median, your median is the observation just after the true median and for observations greater than the true median, your median is the observation just before the true median.

PG

PG
Occasional Contributor
Posts: 9

Re: How to compute the median of a variable but exclude the observation itself

Thank you so much for the great help! Your code works perfectly! I'm so impressed by your prompt feedback and glad I find this supportive community. Happy Friday!

New Contributor
Posts: 2

Re: How to compute the median of a variable but exclude the observation itself

hI

Refer back to the program listed by PG stats

medPos = cc + floor((n+1)/2);

if _n_ <= medPos then do;

     medPos = min(cc + n, max(cc + 1, medPos + 1));

     set have(keep=a rename=(a=medA)) point=medPos;

     end;

else do;

     medPos = min(cc + n, max(cc + 1, medPos - 1));

     set have(keep=a rename=(a=medA)) point=medPos;

     end;

medDif = a - medA;

if last.decileY then cc + n;

drop n cc;

run;


What cc stands for, there is no value or definition in the program. Your help is much appreciated.


Cheers


LEE

Respected Advisor
Posts: 4,644

Re: How to compute the median of a variable but exclude the observation itself

In the above datastep, cc is a variable created automatically by SAS because of the statement

if last.decileY then cc + n;


cc + n; is a sum statement which creates the accumulator variable cc. It is equivalent to


retain cc 0;

cc = sum(cc, n);


Thus cc is initialized with zero and it automatically retains its value across datastep iterations.


PG

PG
New Contributor
Posts: 2

Re: How to compute the median of a variable but exclude the observation itself

Hi PGStats

I blindly used the code you posted, it is running ok. however, it doesnot produce MedA further Medif for each observation. There are quite large observations without MedA. I am confused. Your thoughts and help is much appreciated.

Cheers

li

Respected Advisor
Posts: 4,644

Re: How to compute the median of a variable but exclude the observation itself

It might help if your initial sort also includes variable a :

proc sort data=have; by x decileY a; run;

If that doesn't solve the problem, please provide a small example.

PG

PG
Valued Guide
Posts: 797

Re: How to compute the median of a variable but exclude the observation itself

For each company in a decile/industry group, you want CMED, the "complimentary median" (i.e. median of group members except the company in hand) of variable X.

But of course, you can't directly calculate the CMED the same way you could directly calculate the complimentary mean - after all its a non-parametric measure.

But you are helped by the fact that you're going to get no more than 3 distinct values for CMED over each group.  Start out with the MED(X), the "ordinary median" (i.e. including the subject company).

Then all companies with X <  MED(X) will have the same CMED (i.e. dropping any of these companies from median calculation is the same as dropping the minimum X).  Call this value MEDexMIN(X).

The same principle applies to all companies with X> MED(X).  Dropping any of them will produce the same CMED as dropping the maximum X.  Call it MEDexMAX(X).

And if the company has X=MED(X)?  Then the complimentary median is the mid-point between MEDexMIN(X) and MEDexMAX(X).

Note all of these could have the same value if there are a lot of ties at MED(X)..

Here's a suggested program - I'd consider it a good example of why and how to use the double DOW approach.  It assumes your data set HAVE is grouped by decile and industry, and that you have (say) no more than 100 companies in a single decile/industry group:

data want (drop=_Smiley Happy;

  array _x {100}; /* Establishes variables _x1 ... _x100 */

  do _I = 1 by 1 until (last.industry);

    set have;

    by decile industry nosorted;

    _x{_I}=x;
    _xmax=max(_xmax,x);  if _xmax=x then _imax=_I;

    _xmin=min(_xmin,x);  if _xmin=x then _imin=_I;

  end;

  _med=median(of _xSmiley Happy;

  _x{_Imax}=.;  _MEDexMAX=median(of _xSmiley Happy;  _x{_imax}=_xmax;

  _x{_imin}=.;  _MEDexMIN=median(of _xSmiley Happy;  _x{_imin}=_xmin;

  do until (last.industry);

    set have;

    by decile industry notsorted;

    if x=. then CMED=.;

    else select (sign(x-_med));

      when (1)  cmed = MEDexMAX;
      when (-1) cmed = MEDexMIN;
      when (0)  cmed = (MEDexMAX+MEDexMIN)/2;

    end;

    output;

  end;

run;

Occasional Contributor
Posts: 9

Re: How to compute the median of a variable but exclude the observation itself

Thank you so much for explaining the logic behind the code and providing an alternative solution for my problem. I feel so lucky to get help from you all, this means so much to a beginner. I did try your code but getting error message as follows:

1185  data want (drop=_Smiley Happy

1186    array _x (100); /* Establishes variables _x1 ... _x100 */

                  ---

                  22

ERROR 22-7: Invalid option name 100.

I'll check into this more later when I get a chance. I'm sure your code definitely works perfectly, I just need to find out what I might have typed wrong.

Again, thank you very much!

Super User
Super User
Posts: 6,499

Re: How to compute the median of a variable but exclude the observation itself

Your DATA statement is missing the semi-colon.

Occasional Contributor
Posts: 9

Re: How to compute the median of a variable but exclude the observation itself

Good eyes, thanks a lot!

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 10 replies
  • 1174 views
  • 7 likes
  • 5 in conversation