turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Find a Community

- Home
- /
- SAS Programming
- /
- Base SAS Programming
- /
- How to generate a (constant) variable that is the ...

Topic Options

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

04-03-2013 08:35 AM

Hello,

I have a (hopefully) very basic problem, but as I'm not an experienced user, i request some help. I have to create a dummy variable, say y, that is 1 is the observation on an underlying variable x is above the median of x and 0 otherwise. I can't genrerate the median as a variable: when i use the median(x) function in a data step to create a new variable, it actually just duplicates the variable x... Can anybody help me ?

Thank you so much in advance,

Jeoffrey

Accepted Solutions

Solution

04-03-2013
11:07 AM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

04-03-2013 11:07 AM

I agree with LinusH, you're better off using proc summary or means in this case and merging in the results. It's also simpler for beginners

data example;

input id exp;

datalines;

1 10

2 20

3 30

run;

*calculate the median;

proc means data=example noprint;

output out=median_value median(exp)=exp_median;

run;

data example2;

set example;

*add in the median value;

if _n_=1 then set median_value;

if exp<exp_median then flag=1;

else flag=0;

run;

All Replies

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

04-03-2013 08:54 AM

Don't really understand what you want to do.

Can you describe it using sample input and output data?

Data never sleeps

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

04-03-2013 09:02 AM

Thank you for the reply ! Well my data are on my other computer (that is not connected on the internet, but the program goes approximately like this:

data project.cis;

set project.cis;

med_x=median(x);

if x >= med_x then z=1;

else if x<med_x then z=0;

run;

My hope would have been that z = 1 for values of x above the median and 0 otherwise. But the variable med_x in the example above is not constant and takes the same values as x ??? I'm a little lost...

Thank you !

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

04-03-2013 09:20 AM

Still not getting what your want.

But the median function used in the data step, will pick the median value from a list of values/variables.

In your case, when your list just contain one variable, well, the median on 1 is...1.

My guess is that you probably want to calculate the median of x between rows, then you have to use different techniques. If you need some guidance, again, provide some sample data and desired output.

Data never sleeps

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

04-03-2013 09:33 AM

Indeed that's exactly what i want: compute a median across rows (not columns!) and compare the value in each row to that overall median !

Suppose the following data:

data example;

input id exp;

datalines;

1 10

2 20

3 30

run;

The median of 'exp' is obviously 20. What i'd like to get is a new variable that takes value for observation 1 (below the median) and 1 for observations 2 and 3 (at the median and above)... Is this possible ?

Thank you so much for your help !

Jeoffrey

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

04-03-2013 09:57 AM

Still confused, do you want to put 10 in a variable named bellow, and 20 and 30 into a variable "median_above", sort of?

id exp bellow above

1 10 10 .

2 20 . 20

3 30 . 30

Data never sleeps

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

04-03-2013 10:02 AM

Oh thank you for your kind reply !

Sort of, but not exactly: i just want to create a single binary variable that takes value 1 for observations above the median (i.e. ID = 2 and 3) and 0 for values below that median (i.e. ID = 1).. Well if i could already generate the output you've just shown, i thionk i could find my way !

Thank you again,

Jeoffrey

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

04-03-2013 10:06 AM

I think this is what you are after, make sure your RAM is large enough to hold your array():

data example;

input id exp;

datalines;

1 10

2 20

3 30

4 32

5 21

;

data _null_;

call symputx('nobs',nobs);

stop;

set example nobs=nobs;

run;

data want;

do _i=1 to &nobs;

set example;

array med(&nobs) _temporary_;

med(_i)=exp;

end;

_m=median(of med(*));

do _i=1 to &nobs;

set example;

ind=ifn(exp<_m,0,1);

output;

end;

/*drop _:;*/

run;

_m is the calculated median.

Haikuo

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

04-03-2013 10:09 AM

Thank you soooo much ! Code looks nice.. I'll try it see if it works !

Best regards,

Jeoffrey

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

04-03-2013 10:14 AM

For calculating medians, you are probably best off using PROC SUMMARY.

The you need to join/merge the result back to your original data, and then do your binary assignment in that step.

Data never sleeps

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

04-03-2013 10:52 AM

said: For calculating medians, you are probably best off using PROC SUMMARY.

"best off" is a very bold statement.

Here is what I got from a test:

378 data test;

379 do i=1 to 30000000;

380 output;

381 end;

382 run;

NOTE: The data set WORK.TEST has 30000000 observations and 1 variables.

NOTE: DATA statement used (Total process time):

real time 1.19 seconds

cpu time 1.21 seconds

383

384 proc summary data=test print median;

385 var i;

386 run;

WARNING: A shortage of memory has caused the quantile computations to terminate prematurely for

QMETHOD=OS. Consider using QMETHOD=P2.

NOTE: The affected statistics will be missing from the corresponding classification levels.

NOTE: There were 30000000 observations read from the data set WORK.TEST.

NOTE: PROCEDURE SUMMARY used (Total process time):

real time 9.48 seconds

cpu time 13.58 seconds

387

388

389 data _null_;

390 call symputx('nobs',nobs);

391 stop;

392 set test nobs=nobs;

393 run;

NOTE: DATA statement used (Total process time):

real time 0.00 seconds

cpu time 0.00 seconds

394

395 data _null_;

396 do _n_=1 to &nobs;

397 set test;

398 array med(&nobs) _temporary_;

399 med(_n_)=i;

400 end;

401 Median=median(of med(*));

402 put Median=;

403 run;

Median=15000000.5

NOTE: There were 30000000 observations read from the data set WORK.TEST.

NOTE: DATA statement used (Total process time):

real time 4.65 seconds

cpu time 4.55 seconds

Proc Summary ended up failed, regardless of double amount of time consumed. If we roll back the total number of obs to the extend that Proc Summary can run:

404 data test;

405 do i=1 to 20000000;

406 output;

407 end;

408 run;

NOTE: The data set WORK.TEST has 20000000 observations and 1 variables.

NOTE: DATA statement used (Total process time):

real time 0.76 seconds

cpu time 0.71 seconds

409

410 proc summary data=test print median;

411 var i;

412 run;

NOTE: There were 20000000 observations read from the data set WORK.TEST.

NOTE: PROCEDURE SUMMARY used (Total process time):

real time 6.56 seconds

cpu time 9.51 seconds

413

414

415 data _null_;

416 call symputx('nobs',nobs);

417 stop;

418 set test nobs=nobs;

419 run;

NOTE: DATA statement used (Total process time):

real time 0.01 seconds

cpu time 0.01 seconds

420

421 data _null_;

422 do _n_=1 to &nobs;

423 set test;

424 array med(&nobs) _temporary_;

425 med(_n_)=i;

426 end;

427 Median=median(of med(*));

428 put Median=;

429 run;

Median=10000000.5

NOTE: There were 20000000 observations read from the data set WORK.TEST.

NOTE: DATA statement used (Total process time):

real time 2.96 seconds

cpu time 2.97 seconds

It is less than half of what Array can do in term of performance. Test has been done on Win7, 64bit, with 4G RAM, Intel G850, and SAS 9.3.

Haikuo

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

04-03-2013 11:12 AM

PROC STDIZE has fast percentile algorithm that you may find interesting.

Solution

04-03-2013
11:07 AM

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

04-03-2013 11:07 AM

I agree with LinusH, you're better off using proc summary or means in this case and merging in the results. It's also simpler for beginners

data example;

input id exp;

datalines;

1 10

2 20

3 30

run;

*calculate the median;

proc means data=example noprint;

output out=median_value median(exp)=exp_median;

run;

data example2;

set example;

*add in the median value;

if _n_=1 then set median_value;

if exp<exp_median then flag=1;

else flag=0;

run;

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

04-05-2013 08:45 AM

Thnk you so much, it works perfectly !!!