BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
navyblue
Calcite | Level 5

I'm try to compute the difference between each observation's value of a certain variable (A) and the median of the same variable (A) for

its joint industry (X) and performance (Y) decile, where the median excludes the obs itself. I already created deciles for variable Y within each industry X,

but can't figure out how to calculate the median of each decile excluding the obs. Any suggestions are much appreciated!

1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

You can try this :


proc sort data=have; by x decileY; run;

proc sql;
create table c as
select x, decileY, count(a) as n
from have
group by x, decileY;
quit;

data want;
merge have c; by x decileY;
medPos = cc + floor((n+1)/2);
if _n_ <= medPos then do;
     medPos = min(cc + n, max(cc + 1, medPos + 1));
     set have(keep=a rename=(a=medA)) point=medPos;
     end;
else do;
     medPos = min(cc + n, max(cc + 1, medPos - 1));
     set have(keep=a rename=(a=medA)) point=medPos;
     end;
medDif = a - medA;
if last.decileY then cc + n;
drop n cc;
run;

The idea is that for observations lower than the true median, your median is the observation just after the true median and for observations greater than the true median, your median is the observation just before the true median.

PG

PG

View solution in original post

10 REPLIES 10
PGStats
Opal | Level 21

You can try this :


proc sort data=have; by x decileY; run;

proc sql;
create table c as
select x, decileY, count(a) as n
from have
group by x, decileY;
quit;

data want;
merge have c; by x decileY;
medPos = cc + floor((n+1)/2);
if _n_ <= medPos then do;
     medPos = min(cc + n, max(cc + 1, medPos + 1));
     set have(keep=a rename=(a=medA)) point=medPos;
     end;
else do;
     medPos = min(cc + n, max(cc + 1, medPos - 1));
     set have(keep=a rename=(a=medA)) point=medPos;
     end;
medDif = a - medA;
if last.decileY then cc + n;
drop n cc;
run;

The idea is that for observations lower than the true median, your median is the observation just after the true median and for observations greater than the true median, your median is the observation just before the true median.

PG

PG
navyblue
Calcite | Level 5

Thank you so much for the great help! Your code works perfectly! I'm so impressed by your prompt feedback and glad I find this supportive community. Happy Friday!

LLCLOUD
Calcite | Level 5

hI

Refer back to the program listed by PG stats

medPos = cc + floor((n+1)/2);

if _n_ <= medPos then do;

     medPos = min(cc + n, max(cc + 1, medPos + 1));

     set have(keep=a rename=(a=medA)) point=medPos;

     end;

else do;

     medPos = min(cc + n, max(cc + 1, medPos - 1));

     set have(keep=a rename=(a=medA)) point=medPos;

     end;

medDif = a - medA;

if last.decileY then cc + n;

drop n cc;

run;


What cc stands for, there is no value or definition in the program. Your help is much appreciated.


Cheers


LEE

PGStats
Opal | Level 21

In the above datastep, cc is a variable created automatically by SAS because of the statement

if last.decileY then cc + n;


cc + n; is a sum statement which creates the accumulator variable cc. It is equivalent to


retain cc 0;

cc = sum(cc, n);


Thus cc is initialized with zero and it automatically retains its value across datastep iterations.


PG

PG
LLCLOUD
Calcite | Level 5

Hi PGStats

I blindly used the code you posted, it is running ok. however, it doesnot produce MedA further Medif for each observation. There are quite large observations without MedA. I am confused. Your thoughts and help is much appreciated.

Cheers

li

PGStats
Opal | Level 21

It might help if your initial sort also includes variable a :

proc sort data=have; by x decileY a; run;

If that doesn't solve the problem, please provide a small example.

PG

PG
mkeintz
PROC Star

For each company in a decile/industry group, you want CMED, the "complimentary median" (i.e. median of group members except the company in hand) of variable X.

But of course, you can't directly calculate the CMED the same way you could directly calculate the complimentary mean - after all its a non-parametric measure.

But you are helped by the fact that you're going to get no more than 3 distinct values for CMED over each group.  Start out with the MED(X), the "ordinary median" (i.e. including the subject company).

Then all companies with X <  MED(X) will have the same CMED (i.e. dropping any of these companies from median calculation is the same as dropping the minimum X).  Call this value MEDexMIN(X).

The same principle applies to all companies with X> MED(X).  Dropping any of them will produce the same CMED as dropping the maximum X.  Call it MEDexMAX(X).

And if the company has X=MED(X)?  Then the complimentary median is the mid-point between MEDexMIN(X) and MEDexMAX(X).

Note all of these could have the same value if there are a lot of ties at MED(X)..

Here's a suggested program - I'd consider it a good example of why and how to use the double DOW approach.  It assumes your data set HAVE is grouped by decile and industry, and that you have (say) no more than 100 companies in a single decile/industry group:

data want (drop=_:);

  array _x {100}; /* Establishes variables _x1 ... _x100 */

  do _I = 1 by 1 until (last.industry);

    set have;

    by decile industry nosorted;

    _x{_I}=x;
    _xmax=max(_xmax,x);  if _xmax=x then _imax=_I;

    _xmin=min(_xmin,x);  if _xmin=x then _imin=_I;

  end;

  _med=median(of _x:);

  _x{_Imax}=.;  _MEDexMAX=median(of _x:);  _x{_imax}=_xmax;

  _x{_imin}=.;  _MEDexMIN=median(of _x:);  _x{_imin}=_xmin;

  do until (last.industry);

    set have;

    by decile industry notsorted;

    if x=. then CMED=.;

    else select (sign(x-_med));

      when (1)  cmed = MEDexMAX;
      when (-1) cmed = MEDexMIN;
      when (0)  cmed = (MEDexMAX+MEDexMIN)/2;

    end;

    output;

  end;

run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
navyblue
Calcite | Level 5

Thank you so much for explaining the logic behind the code and providing an alternative solution for my problem. I feel so lucky to get help from you all, this means so much to a beginner. I did try your code but getting error message as follows:

1185  data want (drop=_:)

1186    array _x (100); /* Establishes variables _x1 ... _x100 */

                  ---

                  22

ERROR 22-7: Invalid option name 100.

I'll check into this more later when I get a chance. I'm sure your code definitely works perfectly, I just need to find out what I might have typed wrong.

Again, thank you very much!

Tom
Super User Tom
Super User

Your DATA statement is missing the semi-colon.

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 10 replies
  • 5247 views
  • 7 likes
  • 5 in conversation