Hello,
I am using proc rank to divide a dataset into 20 buckets. The code I'm using is similar to:
proc rank data=dataset_a out=dataset_a_ranked groups=20 ties=low;
by &segment_var;
var &score_var;
ranks score_bucket;
run;
I would like to take the ranks (or buckets) created on dataset_a and apply them to dataset_b. Is there any way to:
1. Use proc rank to output not only the rank (or bucket) of each observation, but also the minimum and maximum values of that rank (or bucket)? For instance,
rank 0 has a minimum value of 1 and a maximum value of 10
rank 1 has a minimum value of 11 and a maximum value of 19
...
rank n has a minimum value of x and a maximum value of y
2. Use the output from proc rank and apply the same ranks from dataset_a and to a second dataset_b? If I use proc rank on dataset_a and then on dataset_b, I will end up with different ranks (or buckets).
I'm thinking there must be a simple trick to doing this. Thanks for any help you could give.
hi ... one approach, make a format using the"rules" from the first set of ranks ... here's an example ...
* use ONLY MALES ... find ranks for age in sashelp.class
proc rank data=sashelp.class (where=(sex='M')) out=mranks groups=3;
var age;
ranks rage;
run;
* create a format using min and max values of age in each rank for MALES;
proc sql;
create view fmt as
select "ranks" as fmtname, min(age) as start, max(age) as end, catt(rage) as label
from mranks
group by rage;
quit;
proc format cntlin=fmt;
run;
* assign ranks to ONLY FEMALES based on rank rules found for MALES;
data franks;
set sashelp.class;
where sex eq 'F';
rage = input(put(age,ranks.),1.);
run;
* take a look;
proc print data=franks noobs;
var name age rage;
run;
Name Age rage
Joyce 11 0
Jane 12 0
Louise 12 0
Alice 13 1
Barbara 13 1
Carol 14 1
Judy 14 1
Janet 15 2
Mary 15 2
in case there are values in the second data set that are not present in the first data set and cannot be assigned to a rank with the format, an "OTHER" condition can be added to the format that will assign a missing value for the rank in the second data set ...
proc sql;
create view fmt as
select "ranks" as fmtname, min(age) as start, max(age) as end, catt(rage) as label
from mranks
group by rage
outer union corr
select distinct "ranks" as fmtname, 99 as start, 99 as end, ' ' as label, "o" as hlo from mranks
quit;
Mike,
Thank you very much. Your answer is very helpful and with some quick testing, it appears to be what I need. To accommodate 20 buckets, instead of 10 buckets, I modified this line:
From:
rage = input(put(age,ranks.),1.);
To:
rage = input(put(age,ranks.),10.);
Since I am relatively weak in writing sql code, I have an additional question. I need to use a "by" variable in my proc rank. My dataset_a contains 4 different segments and I use a macro variable called &segment_var to delineate them. I then use proc rank and the "by" variable to create 20 ranks (or buckets) for each segment. Is there any way for me to modify your sql code to account for the different ranks within each segment?
Thanks, again.
1. Use proc rank to output not only the rank (or bucket) of each observation, but also the minimum and maximum values of that rank (or bucket)? For instance,
rank 0 has a minimum value of 1 and a maximum value of 10
rank 1 has a minimum value of 11 and a maximum value of 19
...
rank n has a minimum value of x and a maximum value of y
What happens when the item in dataset 2 has a value of 10.5? More generally, the ranks/buckets created from data set one won't come out so nice and clean as your example, it might be more along the lines of
rank 0 has a minimum value of 1 and a maximum value of 7
rank 1 has a minimum value of 14 and a maximum value of 23
Now what do you do when you get a value of 10?
I think your entire concept is flawed, as empirical buckets don't have hard and fast upper and lower limits, they have empirical limits, as I have described, and will result in problems if you try to apply this to a 2nd data set. The idea of creating buckets from continuous variables is also a flawed idea, in my opinion, and you might want to re-think what you are doing with this data. If you explain the long-term analysis goal, as well as what the data is, we might be able to propose more meaningful analysis techniques.
hi ... here's another approach to accommodate PROC RANK that uses a by-groups
you can make one format per by-group in one data set, then use PUTN in a data step to apply the correct format based on the by-variable value in the second data set
* some fake data ... 2 genders, age ranges from 1 to 99, 10 segments;
data fake;
do _n_ = 1 to 10000;
gender = rantbl(999,0.5);
age = ceil(99*ranuni(999));
segment = ceil(10*ranuni(999));
output;
end;
run;
* compute ranks for males (gender = 1) in groups by segment (the by variable);
proc sort data=fake (where=(gender=1) out=males (drop=gender);
by segment;
run;
proc rank data=males out=mranks groups=20;
var age;
ranks rage;
by segment;
run;
* create 10 formats ... one for each value of segment (ranks1_, ranks2_, ... , ranks10_);
proc sql;
create view fmt as
select catt("ranks",segment,"_") as fmtname, min(age) as start, max(age) as end, catt(rage) as label
from mranks
group by segment, rage;
quit;
proc format cntlin=fmt fmtlib;
run;
* apply the 10 formats to the data for females;
data franks (drop=gender);
set fake;
where gender eq 2;
* use PUTN to vary the format that is being used based on the value of the variable segment;
rage = input(putn(age,catt('ranks',segment,'_')),2.);
run;
This works.
Thank you very much for you help, Mike.
hi ... you're welcome ... so was that a correct answer ???
Yes, it was. I tested it earlier this morning.
Please mark you quetion as answered. Thank you!
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.