Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Home
- /
- Programming
- /
- Programming
- /
- How to compute the median of a variable but exclude the observation it...

Options

- RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

🔒 This topic is **solved** and **locked**.
Need further help from the community? Please
sign in and ask a **new** question.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Posted 09-19-2013 06:31 PM
(3560 views)

I'm try to compute the difference between each observation's value of a certain variable (A) and the median of the same variable (A) for

its joint industry (X) and performance (Y) decile, where the median excludes the obs itself. I already created deciles for variable Y within each industry X,

but can't figure out how to calculate the median of each decile excluding the obs. Any suggestions are much appreciated!

1 ACCEPTED SOLUTION

Accepted Solutions

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

You can try this :

**proc sort data=have; by x decileY; run;**

** **

**proc sql;****create table c as****select x, decileY, count(a) as n****from have****group by x, decileY;****quit;**

** **

**data want;****merge have c; by x decileY;****medPos = cc + floor((n+1)/2);****if _n_ <= medPos then do;**** medPos = min(cc + n, max(cc + 1, medPos + 1));**** set have(keep=a rename=(a=medA)) point=medPos;**** end;****else do;**** medPos = min(cc + n, max(cc + 1, medPos - 1));**** set have(keep=a rename=(a=medA)) point=medPos;**** end;****medDif = a - medA;****if last.decileY then cc + n;****drop n cc;****run;**

The idea is that for observations lower than the true median, your median is the observation just after the true median and for observations greater than the true median, your median is the observation just before the true median.

PG

PG

10 REPLIES 10

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

You can try this :

**proc sort data=have; by x decileY; run;**

** **

**proc sql;****create table c as****select x, decileY, count(a) as n****from have****group by x, decileY;****quit;**

** **

**data want;****merge have c; by x decileY;****medPos = cc + floor((n+1)/2);****if _n_ <= medPos then do;**** medPos = min(cc + n, max(cc + 1, medPos + 1));**** set have(keep=a rename=(a=medA)) point=medPos;**** end;****else do;**** medPos = min(cc + n, max(cc + 1, medPos - 1));**** set have(keep=a rename=(a=medA)) point=medPos;**** end;****medDif = a - medA;****if last.decileY then cc + n;****drop n cc;****run;**

The idea is that for observations lower than the true median, your median is the observation just after the true median and for observations greater than the true median, your median is the observation just before the true median.

PG

PG

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

hI

Refer back to the program listed by PG stats

**medPos = cc + floor((n+1)/2);**

**if _n_ <= medPos then do;**

** medPos = min(cc + n, max(cc + 1, medPos + 1));**

** set have(keep=a rename=(a=medA)) point=medPos;**

** end;**

**else do;**

** medPos = min(cc + n, max(cc + 1, medPos - 1));**

** set have(keep=a rename=(a=medA)) point=medPos;**

** end;**

**medDif = a - medA;**

**if last.decileY then cc + n;**

**drop n cc;**

**run;**

**What cc stands for, there is no value or definition in the program. Your help is much appreciated. **

**Cheers**

**LEE**

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

In the above datastep,** cc** is a variable created automatically by SAS because of the statement

**if last.decileY then cc + n;**

**cc + n; **is a sum statement which creates the accumulator variable **cc**. It is equivalent to

**retain cc 0;**

**cc = sum(cc, n);**

Thus **cc** is initialized with zero and it automatically retains its value across datastep iterations.

PG

PG

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Hi PGStats

I blindly used the code you posted, it is running ok. however, it doesnot produce MedA further Medif for each observation. There are quite large observations without MedA. I am confused. Your thoughts and help is much appreciated.

Cheers

li

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

It might help if your initial sort also includes variable **a **:

**proc sort data=have; by x decileY a; run;**

If that doesn't solve the problem, please provide a small example.

PG

PG

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

For each company in a decile/industry group, you want CMED, the "complimentary median" (i.e. median of group members except the company in hand) of variable X.

But of course, you can't directly calculate the CMED the same way you could directly calculate the complimentary mean - after all its a non-parametric measure.

But you are helped by the fact that you're going to get no more than 3 distinct values for CMED over each group. Start out with the MED(X), the "ordinary median" (i.e. including the subject company).

Then all companies with X < MED(X) will have the same CMED (i.e. dropping any of these companies from median calculation is the same as dropping the minimum X). Call this value MEDexMIN(X).

The same principle applies to all companies with X> MED(X). Dropping any of them will produce the same CMED as dropping the maximum X. Call it MEDexMAX(X).

And if the company has X=MED(X)? Then the complimentary median is the mid-point between MEDexMIN(X) and MEDexMAX(X).

Note all of these could have the same value if there are a lot of ties at MED(X)..

Here's a suggested program - I'd consider it a good example of why and how to use the double DOW approach. It assumes your data set HAVE is grouped by decile and industry, and that you have (say) no more than 100 companies in a single decile/industry group:

data want (drop=_:);

array _x {100}; /* Establishes variables _x1 ... _x100 */

do _I = 1 by 1 until (last.industry);

set have;

by decile industry nosorted;

_x{_I}=x;

_xmax=max(_xmax,x); if _xmax=x then _imax=_I;

_xmin=min(_xmin,x); if _xmin=x then _imin=_I;

end;

_med=median(of _x:);

_x{_Imax}=.; _MEDexMAX=median(of _x:); _x{_imax}=_xmax;

_x{_imin}=.; _MEDexMIN=median(of _x:); _x{_imin}=_xmin;

do until (last.industry);

set have;

by decile industry notsorted;

if x=. then CMED=.;

else select (sign(x-_med));

when (1) cmed = MEDexMAX;

when (-1) cmed = MEDexMIN;

when (0) cmed = (MEDexMAX+MEDexMIN)/2;

end;

output;

end;

run;

--------------------------

The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for

Allow PROC SORT to output multiple datasets

--------------------------

The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for

Allow PROC SORT to output multiple datasets

--------------------------

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Thank you so much for explaining the logic behind the code and providing an alternative solution for my problem. I feel so lucky to get help from you all, this means so much to a beginner. I did try your code but getting error message as follows:

1185 data want (drop=_:)

1186 array _x (100); /* Establishes variables _x1 ... _x100 */

---

22

ERROR 22-7: Invalid option name 100.

I'll check into this more later when I get a chance. I'm sure your code definitely works perfectly, I just need to find out what I might have typed wrong.

Again, thank you very much!

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Your DATA statement is missing the semi-colon.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Good eyes, thanks a lot!

**Available on demand!**

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.