BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
JMMK007
Calcite | Level 5

Hello,

I have a (hopefully) very basic problem, but as I'm not an experienced user, i request some help. I have to create a dummy variable, say y, that is 1 is the observation on an underlying variable x is above the median of x and 0 otherwise. I can't genrerate the median as a variable: when i use the median(x) function in a data step to create a new variable, it actually just duplicates the variable x... Can anybody help me ?

Thank you so much in advance,

Jeoffrey

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User

I agree with LinusH, you're better off using proc summary or means in this case and merging in the results. It's also simpler for beginners Smiley Happy

data example;

input id exp;

datalines;

1 10

2 20

3 30

run;

*calculate the median;

proc means data=example noprint;

output out=median_value median(exp)=exp_median;

run;

data example2;

    set example;

    *add in the median value;

    if _n_=1 then set median_value;

    if exp<exp_median then flag=1;

    else flag=0;

run;

View solution in original post

13 REPLIES 13
LinusH
Tourmaline | Level 20

Don't really understand what you want to do.

Can you describe it using sample input and output data?

Data never sleeps
JMMK007
Calcite | Level 5

Thank you for the reply ! Well my data are on my other computer (that is not connected on the internet, but the program goes approximately like this:

data project.cis;

set project.cis;

med_x=median(x);

if x >= med_x then z=1;

else if x<med_x then z=0;

run;

My hope would have been that z = 1  for values of x above the median and 0 otherwise. But the variable med_x in the example above is not constant and takes the same values as x ??? I'm a little lost...

Thank you !

LinusH
Tourmaline | Level 20

Still not getting what your want.

But the median function used in the data step, will pick the median value from a list of values/variables.

In your case, when your list just contain one variable, well, the median on 1 is...1.

My guess is that you probably want to calculate the median of x between rows, then you have to use different techniques. If you need some guidance, again, provide some sample data and desired output.

Data never sleeps
JMMK007
Calcite | Level 5

Indeed that's exactly what i want: compute a median across rows (not columns!) and compare the value in each row to that overall median Smiley Happy !

Suppose the following data:

data example;
input id exp;

datalines;

1 10

2 20

3 30

run;

The median of 'exp' is obviously 20. What i'd like to get is a new variable that takes value for observation 1 (below the median) and 1 for observations 2 and 3 (at the median and above)... Is this possible ?

Thank you so much for your help !

Jeoffrey

LinusH
Tourmaline | Level 20

Still confused, do you want to put 10 in a variable named bellow, and 20 and 30 into a variable "median_above", sort of?

id     exp   bellow above

1     10     10       .

2     20     .          20

3     30     .          30

Data never sleeps
JMMK007
Calcite | Level 5

Oh thank you for your kind reply !

Sort of, but not exactly: i just want to create a single binary variable that takes value 1 for observations above the median (i.e. ID = 2 and 3) and 0 for values below that median (i.e. ID = 1).. Well if i could already generate the output you've just shown, i thionk i could find my way Smiley Happy !

Thank you again,

Jeoffrey

Haikuo
Onyx | Level 15

I think this is what you are after, make sure your RAM is large enough to hold your array():

data example;

input id exp;

datalines;

1 10

2 20

3 30

4 32

5 21

;

data _null_;

call symputx('nobs',nobs);

stop;

  set example nobs=nobs;

run;

data want;

  do _i=1 to &nobs;

  set example;

array med(&nobs) _temporary_;

med(_i)=exp;

  end;

  _m=median(of med(*));

  do _i=1 to &nobs;

  set example;

  ind=ifn(exp<_m,0,1);

output;

  end;

  /*drop _:;*/

run;

_m is the calculated median.

Haikuo

JMMK007
Calcite | Level 5


Thank you soooo much ! Code looks nice.. I'll try it see if it works Smiley Happy !

Best regards,

Jeoffrey

LinusH
Tourmaline | Level 20

For calculating medians, you are probably best off using PROC SUMMARY.

The you need to join/merge the result back to your original data, and then do your binary assignment in that step.

Data never sleeps
Haikuo
Onyx | Level 15

said: For calculating medians, you are probably best off using PROC SUMMARY.

"best off" is a very bold statement.

Here is what I got from a test:

378 data test;

379 do i=1 to 30000000;

380 output;

381 end;

382 run;

NOTE: The data set WORK.TEST has 30000000 observations and 1 variables.

NOTE: DATA statement used (Total process time):

  real time 1.19 seconds

  cpu time 1.21 seconds

383

384 proc summary data=test print median;

385 var i;

386 run;

WARNING: A shortage of memory has caused the quantile computations to terminate prematurely for

  QMETHOD=OS. Consider using QMETHOD=P2.

NOTE: The affected statistics will be missing from the corresponding classification levels.

NOTE: There were 30000000 observations read from the data set WORK.TEST.

NOTE: PROCEDURE SUMMARY used (Total process time):

  real time 9.48 seconds

  cpu time 13.58 seconds

387

388

389 data _null_;

390 call symputx('nobs',nobs);

391 stop;

392 set test nobs=nobs;

393 run;

NOTE: DATA statement used (Total process time):

  real time 0.00 seconds

  cpu time 0.00 seconds

394

395 data _null_;

396 do _n_=1 to &nobs;

397 set test;

398 array med(&nobs) _temporary_;

399 med(_n_)=i;

400 end;

401 Median=median(of med(*));

402 put Median=;

403 run;

Median=15000000.5

NOTE: There were 30000000 observations read from the data set WORK.TEST.

NOTE: DATA statement used (Total process time):

  real time 4.65 seconds

  cpu time 4.55 seconds

Proc Summary ended up failed, regardless of double amount of time consumed.  If we roll back the total number of obs to the extend that Proc Summary can run:

404 data test;

405 do i=1 to 20000000;

406 output;

407 end;

408 run;

NOTE: The data set WORK.TEST has 20000000 observations and 1 variables.

NOTE: DATA statement used (Total process time):

  real time 0.76 seconds

  cpu time 0.71 seconds

409

410 proc summary data=test print median;

411 var i;

412 run;

NOTE: There were 20000000 observations read from the data set WORK.TEST.

NOTE: PROCEDURE SUMMARY used (Total process time):

  real time 6.56 seconds

  cpu time 9.51 seconds

413

414

415 data _null_;

416 call symputx('nobs',nobs);

417 stop;

418 set test nobs=nobs;

419 run;

NOTE: DATA statement used (Total process time):

  real time 0.01 seconds

  cpu time 0.01 seconds

420

421 data _null_;

422 do _n_=1 to &nobs;

423 set test;

424 array med(&nobs) _temporary_;

425 med(_n_)=i;

426 end;

427 Median=median(of med(*));

428 put Median=;

429 run;

Median=10000000.5

NOTE: There were 20000000 observations read from the data set WORK.TEST.

NOTE: DATA statement used (Total process time):

  real time 2.96 seconds

  cpu time 2.97 seconds

It is less than half of what Array can do in term of performance. Test has been done on Win7, 64bit, with 4G RAM, Intel G850, and SAS 9.3.

Haikuo

data_null__
Jade | Level 19

PROC STDIZE has fast percentile algorithm that you may find interesting.

Reeza
Super User

I agree with LinusH, you're better off using proc summary or means in this case and merging in the results. It's also simpler for beginners Smiley Happy

data example;

input id exp;

datalines;

1 10

2 20

3 30

run;

*calculate the median;

proc means data=example noprint;

output out=median_value median(exp)=exp_median;

run;

data example2;

    set example;

    *add in the median value;

    if _n_=1 then set median_value;

    if exp<exp_median then flag=1;

    else flag=0;

run;

JMMK007
Calcite | Level 5

Thnk you so much, it works perfectly Smiley Happy !!!


sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 13 replies
  • 4073 views
  • 3 likes
  • 5 in conversation