Hello,
I have a (hopefully) very basic problem, but as I'm not an experienced user, i request some help. I have to create a dummy variable, say y, that is 1 is the observation on an underlying variable x is above the median of x and 0 otherwise. I can't genrerate the median as a variable: when i use the median(x) function in a data step to create a new variable, it actually just duplicates the variable x... Can anybody help me ?
Thank you so much in advance,
Jeoffrey
I agree with LinusH, you're better off using proc summary or means in this case and merging in the results. It's also simpler for beginners
data example;
input id exp;
datalines;
1 10
2 20
3 30
run;
*calculate the median;
proc means data=example noprint;
output out=median_value median(exp)=exp_median;
run;
data example2;
set example;
*add in the median value;
if _n_=1 then set median_value;
if exp<exp_median then flag=1;
else flag=0;
run;
Don't really understand what you want to do.
Can you describe it using sample input and output data?
Thank you for the reply ! Well my data are on my other computer (that is not connected on the internet, but the program goes approximately like this:
data project.cis;
set project.cis;
med_x=median(x);
if x >= med_x then z=1;
else if x<med_x then z=0;
run;
My hope would have been that z = 1 for values of x above the median and 0 otherwise. But the variable med_x in the example above is not constant and takes the same values as x ??? I'm a little lost...
Thank you !
Still not getting what your want.
But the median function used in the data step, will pick the median value from a list of values/variables.
In your case, when your list just contain one variable, well, the median on 1 is...1.
My guess is that you probably want to calculate the median of x between rows, then you have to use different techniques. If you need some guidance, again, provide some sample data and desired output.
Indeed that's exactly what i want: compute a median across rows (not columns!) and compare the value in each row to that overall median !
Suppose the following data:
data example;
input id exp;
datalines;
1 10
2 20
3 30
run;
The median of 'exp' is obviously 20. What i'd like to get is a new variable that takes value for observation 1 (below the median) and 1 for observations 2 and 3 (at the median and above)... Is this possible ?
Thank you so much for your help !
Jeoffrey
Still confused, do you want to put 10 in a variable named bellow, and 20 and 30 into a variable "median_above", sort of?
id exp bellow above
1 10 10 .
2 20 . 20
3 30 . 30
Oh thank you for your kind reply !
Sort of, but not exactly: i just want to create a single binary variable that takes value 1 for observations above the median (i.e. ID = 2 and 3) and 0 for values below that median (i.e. ID = 1).. Well if i could already generate the output you've just shown, i thionk i could find my way !
Thank you again,
Jeoffrey
I think this is what you are after, make sure your RAM is large enough to hold your array():
data example;
input id exp;
datalines;
1 10
2 20
3 30
4 32
5 21
;
data _null_;
call symputx('nobs',nobs);
stop;
set example nobs=nobs;
run;
data want;
do _i=1 to &nobs;
set example;
array med(&nobs) _temporary_;
med(_i)=exp;
end;
_m=median(of med(*));
do _i=1 to &nobs;
set example;
ind=ifn(exp<_m,0,1);
output;
end;
/*drop _:;*/
run;
_m is the calculated median.
Haikuo
Thank you soooo much ! Code looks nice.. I'll try it see if it works !
Best regards,
Jeoffrey
For calculating medians, you are probably best off using PROC SUMMARY.
The you need to join/merge the result back to your original data, and then do your binary assignment in that step.
said: For calculating medians, you are probably best off using PROC SUMMARY.
"best off" is a very bold statement.
Here is what I got from a test:
378 data test;
379 do i=1 to 30000000;
380 output;
381 end;
382 run;
NOTE: The data set WORK.TEST has 30000000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 1.19 seconds
cpu time 1.21 seconds
383
384 proc summary data=test print median;
385 var i;
386 run;
WARNING: A shortage of memory has caused the quantile computations to terminate prematurely for
QMETHOD=OS. Consider using QMETHOD=P2.
NOTE: The affected statistics will be missing from the corresponding classification levels.
NOTE: There were 30000000 observations read from the data set WORK.TEST.
NOTE: PROCEDURE SUMMARY used (Total process time):
real time 9.48 seconds
cpu time 13.58 seconds
387
388
389 data _null_;
390 call symputx('nobs',nobs);
391 stop;
392 set test nobs=nobs;
393 run;
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
394
395 data _null_;
396 do _n_=1 to &nobs;
397 set test;
398 array med(&nobs) _temporary_;
399 med(_n_)=i;
400 end;
401 Median=median(of med(*));
402 put Median=;
403 run;
Median=15000000.5
NOTE: There were 30000000 observations read from the data set WORK.TEST.
NOTE: DATA statement used (Total process time):
real time 4.65 seconds
cpu time 4.55 seconds
Proc Summary ended up failed, regardless of double amount of time consumed. If we roll back the total number of obs to the extend that Proc Summary can run:
404 data test;
405 do i=1 to 20000000;
406 output;
407 end;
408 run;
NOTE: The data set WORK.TEST has 20000000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 0.76 seconds
cpu time 0.71 seconds
409
410 proc summary data=test print median;
411 var i;
412 run;
NOTE: There were 20000000 observations read from the data set WORK.TEST.
NOTE: PROCEDURE SUMMARY used (Total process time):
real time 6.56 seconds
cpu time 9.51 seconds
413
414
415 data _null_;
416 call symputx('nobs',nobs);
417 stop;
418 set test nobs=nobs;
419 run;
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
420
421 data _null_;
422 do _n_=1 to &nobs;
423 set test;
424 array med(&nobs) _temporary_;
425 med(_n_)=i;
426 end;
427 Median=median(of med(*));
428 put Median=;
429 run;
Median=10000000.5
NOTE: There were 20000000 observations read from the data set WORK.TEST.
NOTE: DATA statement used (Total process time):
real time 2.96 seconds
cpu time 2.97 seconds
It is less than half of what Array can do in term of performance. Test has been done on Win7, 64bit, with 4G RAM, Intel G850, and SAS 9.3.
Haikuo
PROC STDIZE has fast percentile algorithm that you may find interesting.
I agree with LinusH, you're better off using proc summary or means in this case and merging in the results. It's also simpler for beginners
data example;
input id exp;
datalines;
1 10
2 20
3 30
run;
*calculate the median;
proc means data=example noprint;
output out=median_value median(exp)=exp_median;
run;
data example2;
set example;
*add in the median value;
if _n_=1 then set median_value;
if exp<exp_median then flag=1;
else flag=0;
run;
Thnk you so much, it works perfectly !!!
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.