Help - Using Proc Rank With Two Datasets

josh_wander · Posted 09-18-2012 11:41 AM

Hello,

I am using proc rank to divide a dataset into 20 buckets. The code I'm using is similar to:

proc rank data=dataset_a out=dataset_a_ranked groups=20 ties=low;

by &segment_var;

var &score_var;

ranks score_bucket;

run;

I would like to take the ranks (or buckets) created on dataset_a and apply them to dataset_b. Is there any way to:

1. Use proc rank to output not only the rank (or bucket) of each observation, but also the minimum and maximum values of that rank (or bucket)? For instance,

rank 0 has a minimum value of 1 and a maximum value of 10

rank 1 has a minimum value of 11 and a maximum value of 19

...

rank n has a minimum value of x and a maximum value of y

2. Use the output from proc rank and apply the same ranks from dataset_a and to a second dataset_b? If I use proc rank on dataset_a and then on dataset_b, I will end up with different ranks (or buckets).

I'm thinking there must be a simple trick to doing this. Thanks for any help you could give.

MikeZdeb · Posted 09-18-2012 12:23 PM

hi ... one approach, make a format using the"rules" from the first set of ranks ... here's an example ...

* use ONLY MALES ... find ranks for age in sashelp.class

proc rank data=sashelp.class (where=(sex='M')) out=mranks groups=3;

var age;

ranks rage;

run;

* create a format using min and max values of age in each rank for MALES;

proc sql;

create view fmt as

select "ranks" as fmtname, min(age) as start, max(age) as end, catt(rage) as label

from mranks

group by rage;

quit;

proc format cntlin=fmt;

run;

* assign ranks to ONLY FEMALES based on rank rules found for MALES;

data franks;

set sashelp.class;

where sex eq 'F';

rage = input(put(age,ranks.),1.);

run;

* take a look;

proc print data=franks noobs;

var name age rage;

run;

Name Age rage

Joyce 11 0

Jane 12 0

Louise 12 0

Alice 13 1

Barbara 13 1

Carol 14 1

Judy 14 1

Janet 15 2

Mary 15 2

in case there are values in the second data set that are not present in the first data set and cannot be assigned to a rank with the format, an "OTHER" condition can be added to the format that will assign a missing value for the rank in the second data set ...

proc sql;

create view fmt as

select "ranks" as fmtname, min(age) as start, max(age) as end, catt(rage) as label

from mranks

group by rage

outer union corr

select distinct "ranks" as fmtname, 99 as start, 99 as end, ' ' as label, "o" as hlo from mranks

quit;

josh_wander · Posted 09-18-2012 01:43 PM

Mike,

Thank you very much. Your answer is very helpful and with some quick testing, it appears to be what I need. To accommodate 20 buckets, instead of 10 buckets, I modified this line:

From:

rage = input(put(age,ranks.),1.);

To:

rage = input(put(age,ranks.),10.);

Since I am relatively weak in writing sql code, I have an additional question. I need to use a "by" variable in my proc rank. My dataset_a contains 4 different segments and I use a macro variable called &segment_var to delineate them. I then use proc rank and the "by" variable to create 20 ranks (or buckets) for each segment. Is there any way for me to modify your sql code to account for the different ranks within each segment?

Thanks, again.

PaigeMiller · Posted 09-18-2012 12:59 PM

1. Use proc rank to output not only the rank (or bucket) of each observation, but also the minimum and maximum values of that rank (or bucket)? For instance,

rank 0 has a minimum value of 1 and a maximum value of 10

rank 1 has a minimum value of 11 and a maximum value of 19

...

rank n has a minimum value of x and a maximum value of y

What happens when the item in dataset 2 has a value of 10.5? More generally, the ranks/buckets created from data set one won't come out so nice and clean as your example, it might be more along the lines of

rank 0 has a minimum value of 1 and a maximum value of 7

rank 1 has a minimum value of 14 and a maximum value of 23

Now what do you do when you get a value of 10?

I think your entire concept is flawed, as empirical buckets don't have hard and fast upper and lower limits, they have empirical limits, as I have described, and will result in problems if you try to apply this to a 2nd data set. The idea of creating buckets from continuous variables is also a flawed idea, in my opinion, and you might want to re-think what you are doing with this data. If you explain the long-term analysis goal, as well as what the data is, we might be able to propose more meaningful analysis techniques.

--
Paige Miller

MikeZdeb · Posted 09-18-2012 02:07 PM

hi ... here's another approach to accommodate PROC RANK that uses a by-groups

you can make one format per by-group in one data set, then use PUTN in a data step to apply the correct format based on the by-variable value in the second data set

* some fake data ... 2 genders, age ranges from 1 to 99, 10 segments;

data fake;

do _n_ = 1 to 10000;

gender = rantbl(999,0.5);

age = ceil(99*ranuni(999));

segment = ceil(10*ranuni(999));

output;

end;

run;

* compute ranks for males (gender = 1) in groups by segment (the by variable);

proc sort data=fake (where=(gender=1) out=males (drop=gender);

by segment;

run;

proc rank data=males out=mranks groups=20;

var age;

ranks rage;

by segment;

run;

* create 10 formats ... one for each value of segment (ranks1_, ranks2_, ... , ranks10_);

proc sql;

create view fmt as

select catt("ranks",segment,"_") as fmtname, min(age) as start, max(age) as end, catt(rage) as label

from mranks

group by segment, rage;

quit;

proc format cntlin=fmt fmtlib;

run;

* apply the 10 formats to the data for females;

data franks (drop=gender);

set fake;

where gender eq 2;

* use PUTN to vary the format that is being used based on the value of the variable segment;

rage = input(putn(age,catt('ranks',segment,'_')),2.);

run;

josh_wander · Posted 10-01-2012 01:40 PM

This works.

Thank you very much for you help, Mike.

MikeZdeb · Posted 10-01-2012 02:46 PM

hi ... you're welcome ... so was that a correct answer ???

josh_wander · Posted 10-01-2012 02:55 PM

Yes, it was. I tested it earlier this morning.

Linlin · Posted 10-01-2012 03:17 PM

Please mark you quetion as answered. Thank you!

Help - Using Proc Rank With Two Datasets

Re: Help - Using Proc Rank With Two Datasets

Re: Help - Using Proc Rank With Two Datasets

Re: Help - Using Proc Rank With Two Datasets

Re: Help - Using Proc Rank With Two Datasets

Re: Help - Using Proc Rank With Two Datasets

Re: Help - Using Proc Rank With Two Datasets

Re: Help - Using Proc Rank With Two Datasets

Re: Help - Using Proc Rank With Two Datasets

Register Today!

SAS Training: Just a Click Away