DATA Step, Macro, Functions and more

How to generate a (constant) variable that is the median of another variable ?

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 6
Accepted Solution

How to generate a (constant) variable that is the median of another variable ?

Hello,

I have a (hopefully) very basic problem, but as I'm not an experienced user, i request some help. I have to create a dummy variable, say y, that is 1 is the observation on an underlying variable x is above the median of x and 0 otherwise. I can't genrerate the median as a variable: when i use the median(x) function in a data step to create a new variable, it actually just duplicates the variable x... Can anybody help me ?

Thank you so much in advance,

Jeoffrey


Accepted Solutions
Solution
‎04-03-2013 11:07 AM
Super User
Posts: 17,868

Re: How to generate a (constant) variable that is the median of another variable ?

I agree with LinusH, you're better off using proc summary or means in this case and merging in the results. It's also simpler for beginners Smiley Happy

data example;

input id exp;

datalines;

1 10

2 20

3 30

run;

*calculate the median;

proc means data=example noprint;

output out=median_value median(exp)=exp_median;

run;

data example2;

    set example;

    *add in the median value;

    if _n_=1 then set median_value;

    if exp<exp_median then flag=1;

    else flag=0;

run;

View solution in original post


All Replies
Super User
Posts: 5,257

Re: How to generate a (constant) variable that is the median of another variable ?

Don't really understand what you want to do.

Can you describe it using sample input and output data?

Data never sleeps
Occasional Contributor
Posts: 6

Re: How to generate a (constant) variable that is the median of another variable ?

Thank you for the reply ! Well my data are on my other computer (that is not connected on the internet, but the program goes approximately like this:

data project.cis;

set project.cis;

med_x=median(x);

if x >= med_x then z=1;

else if x<med_x then z=0;

run;

My hope would have been that z = 1  for values of x above the median and 0 otherwise. But the variable med_x in the example above is not constant and takes the same values as x ??? I'm a little lost...

Thank you !

Super User
Posts: 5,257

Re: How to generate a (constant) variable that is the median of another variable ?

Still not getting what your want.

But the median function used in the data step, will pick the median value from a list of values/variables.

In your case, when your list just contain one variable, well, the median on 1 is...1.

My guess is that you probably want to calculate the median of x between rows, then you have to use different techniques. If you need some guidance, again, provide some sample data and desired output.

Data never sleeps
Occasional Contributor
Posts: 6

Re: How to generate a (constant) variable that is the median of another variable ?

Indeed that's exactly what i want: compute a median across rows (not columns!) and compare the value in each row to that overall median Smiley Happy !

Suppose the following data:

data example;
input id exp;

datalines;

1 10

2 20

3 30

run;

The median of 'exp' is obviously 20. What i'd like to get is a new variable that takes value for observation 1 (below the median) and 1 for observations 2 and 3 (at the median and above)... Is this possible ?

Thank you so much for your help !

Jeoffrey

Super User
Posts: 5,257

Re: How to generate a (constant) variable that is the median of another variable ?

Still confused, do you want to put 10 in a variable named bellow, and 20 and 30 into a variable "median_above", sort of?

id     exp   bellow above

1     10     10       .

2     20     .          20

3     30     .          30

Data never sleeps
Occasional Contributor
Posts: 6

Re: How to generate a (constant) variable that is the median of another variable ?

Oh thank you for your kind reply !

Sort of, but not exactly: i just want to create a single binary variable that takes value 1 for observations above the median (i.e. ID = 2 and 3) and 0 for values below that median (i.e. ID = 1).. Well if i could already generate the output you've just shown, i thionk i could find my way Smiley Happy !

Thank you again,

Jeoffrey

Respected Advisor
Posts: 3,124

Re: How to generate a (constant) variable that is the median of another variable ?

I think this is what you are after, make sure your RAM is large enough to hold your array():

data example;

input id exp;

datalines;

1 10

2 20

3 30

4 32

5 21

;

data _null_;

call symputx('nobs',nobs);

stop;

  set example nobs=nobs;

run;

data want;

  do _i=1 to &nobs;

  set example;

array med(&nobs) _temporary_;

med(_i)=exp;

  end;

  _m=median(of med(*));

  do _i=1 to &nobs;

  set example;

  ind=ifn(exp<_m,0,1);

output;

  end;

  /*drop _:;*/

run;

_m is the calculated median.

Haikuo

Occasional Contributor
Posts: 6

Re: How to generate a (constant) variable that is the median of another variable ?


Thank you soooo much ! Code looks nice.. I'll try it see if it works Smiley Happy !

Best regards,

Jeoffrey

Super User
Posts: 5,257

Re: How to generate a (constant) variable that is the median of another variable ?

For calculating medians, you are probably best off using PROC SUMMARY.

The you need to join/merge the result back to your original data, and then do your binary assignment in that step.

Data never sleeps
Respected Advisor
Posts: 3,124

Re: How to generate a (constant) variable that is the median of another variable ?

said: For calculating medians, you are probably best off using PROC SUMMARY.

"best off" is a very bold statement.

Here is what I got from a test:

378 data test;

379 do i=1 to 30000000;

380 output;

381 end;

382 run;

NOTE: The data set WORK.TEST has 30000000 observations and 1 variables.

NOTE: DATA statement used (Total process time):

  real time 1.19 seconds

  cpu time 1.21 seconds

383

384 proc summary data=test print median;

385 var i;

386 run;

WARNING: A shortage of memory has caused the quantile computations to terminate prematurely for

  QMETHOD=OS. Consider using QMETHOD=P2.

NOTE: The affected statistics will be missing from the corresponding classification levels.

NOTE: There were 30000000 observations read from the data set WORK.TEST.

NOTE: PROCEDURE SUMMARY used (Total process time):

  real time 9.48 seconds

  cpu time 13.58 seconds

387

388

389 data _null_;

390 call symputx('nobs',nobs);

391 stop;

392 set test nobs=nobs;

393 run;

NOTE: DATA statement used (Total process time):

  real time 0.00 seconds

  cpu time 0.00 seconds

394

395 data _null_;

396 do _n_=1 to &nobs;

397 set test;

398 array med(&nobs) _temporary_;

399 med(_n_)=i;

400 end;

401 Median=median(of med(*));

402 put Median=;

403 run;

Median=15000000.5

NOTE: There were 30000000 observations read from the data set WORK.TEST.

NOTE: DATA statement used (Total process time):

  real time 4.65 seconds

  cpu time 4.55 seconds

Proc Summary ended up failed, regardless of double amount of time consumed.  If we roll back the total number of obs to the extend that Proc Summary can run:

404 data test;

405 do i=1 to 20000000;

406 output;

407 end;

408 run;

NOTE: The data set WORK.TEST has 20000000 observations and 1 variables.

NOTE: DATA statement used (Total process time):

  real time 0.76 seconds

  cpu time 0.71 seconds

409

410 proc summary data=test print median;

411 var i;

412 run;

NOTE: There were 20000000 observations read from the data set WORK.TEST.

NOTE: PROCEDURE SUMMARY used (Total process time):

  real time 6.56 seconds

  cpu time 9.51 seconds

413

414

415 data _null_;

416 call symputx('nobs',nobs);

417 stop;

418 set test nobs=nobs;

419 run;

NOTE: DATA statement used (Total process time):

  real time 0.01 seconds

  cpu time 0.01 seconds

420

421 data _null_;

422 do _n_=1 to &nobs;

423 set test;

424 array med(&nobs) _temporary_;

425 med(_n_)=i;

426 end;

427 Median=median(of med(*));

428 put Median=;

429 run;

Median=10000000.5

NOTE: There were 20000000 observations read from the data set WORK.TEST.

NOTE: DATA statement used (Total process time):

  real time 2.96 seconds

  cpu time 2.97 seconds

It is less than half of what Array can do in term of performance. Test has been done on Win7, 64bit, with 4G RAM, Intel G850, and SAS 9.3.

Haikuo

Respected Advisor
Posts: 3,777

Re: How to generate a (constant) variable that is the median of another variable ?

PROC STDIZE has fast percentile algorithm that you may find interesting.

Solution
‎04-03-2013 11:07 AM
Super User
Posts: 17,868

Re: How to generate a (constant) variable that is the median of another variable ?

I agree with LinusH, you're better off using proc summary or means in this case and merging in the results. It's also simpler for beginners Smiley Happy

data example;

input id exp;

datalines;

1 10

2 20

3 30

run;

*calculate the median;

proc means data=example noprint;

output out=median_value median(exp)=exp_median;

run;

data example2;

    set example;

    *add in the median value;

    if _n_=1 then set median_value;

    if exp<exp_median then flag=1;

    else flag=0;

run;

Occasional Contributor
Posts: 6

Re: How to generate a (constant) variable that is the median of another variable ?

Thnk you so much, it works perfectly Smiley Happy !!!


☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 13 replies
  • 724 views
  • 3 likes
  • 5 in conversation