Help using Base SAS procedures

Help - Using Proc Rank With Two Datasets

Reply
Occasional Contributor
Posts: 5

Help - Using Proc Rank With Two Datasets

Hello,

I am using proc rank to divide a dataset into 20 buckets.  The code I'm using is similar to:

proc rank data=dataset_a out=dataset_a_ranked groups=20 ties=low;

     by &segment_var;

     var &score_var;

     ranks score_bucket;

run;

I would like to take the ranks (or buckets) created on dataset_a and apply them to dataset_b.  Is there any way to:

1.  Use proc rank to output not only the rank (or bucket) of each observation, but also the minimum and maximum values of that rank (or bucket)?  For instance,

          rank 0 has a minimum value of 1 and a maximum value of 10

          rank 1 has a minimum value of 11 and a maximum value of 19

          ...

          rank n has a minimum value of x and a maximum value of y

2.  Use the output from proc rank and apply the same ranks from dataset_a and to a second dataset_b?  If I use proc rank on dataset_a and then on dataset_b, I will end up with different ranks (or buckets).

I'm thinking there must be a simple trick to doing this.  Thanks for any help you could give.

Valued Guide
Posts: 765

Re: Help - Using Proc Rank With Two Datasets

hi ... one approach, make a format using the"rules" from the first set of ranks ... here's an example ...

* use ONLY MALES ... find ranks for age in sashelp.class

proc rank data=sashelp.class (where=(sex='M')) out=mranks groups=3;

var age;

ranks rage;

run;

* create a format using min and max values of age in each rank for MALES;

proc sql;

create view fmt as

select "ranks" as fmtname, min(age) as start, max(age) as end, catt(rage) as label

from mranks

group by rage;

quit;

proc format cntlin=fmt;

run;

* assign ranks to ONLY FEMALES based on rank rules found for MALES;

data franks;

set sashelp.class;

where sex eq 'F';

rage = input(put(age,ranks.),1.);

run;

* take a look;

proc print data=franks noobs;

var name age rage;

run;

Name       Age    rage

Joyce       11      0

Jane        12      0

Louise      12      0

Alice       13      1

Barbara     13      1

Carol       14      1

Judy        14      1

Janet       15      2

Mary        15      2

in case there are values in the second data set that are not present in the first data set and cannot be assigned to a rank with the format, an "OTHER" condition can be added to the format that will assign a missing value for the rank in the second data set ...

proc sql;

create view fmt as

select "ranks" as fmtname, min(age) as start, max(age) as end, catt(rage) as label

from mranks

group by rage

outer union corr

select distinct "ranks" as fmtname, 99 as start, 99 as end, ' ' as label, "o" as hlo from mranks

quit;

Occasional Contributor
Posts: 5

Re: Help - Using Proc Rank With Two Datasets

Mike,

Thank you very much.  Your answer is very helpful and with some quick testing, it appears to be what I need.  To accommodate 20 buckets, instead of 10 buckets, I modified this line:

From:

rage = input(put(age,ranks.),1.);

To:

rage = input(put(age,ranks.),10.);

Since I am relatively weak in writing sql code, I have an additional question.  I need to use a "by" variable in my proc rank.  My dataset_a contains 4 different segments and I use a macro variable called &segment_var to delineate them.  I then use proc rank and the "by" variable to create 20 ranks (or buckets) for each segment.  Is there any way for me to modify your sql code to account for the different ranks within each segment?

Thanks, again.

Trusted Advisor
Posts: 1,615

Re: Help - Using Proc Rank With Two Datasets

1.  Use proc rank to output not only the rank (or bucket) of each observation, but also the minimum and maximum values of that rank (or bucket)?  For instance,

          rank 0 has a minimum value of 1 and a maximum value of 10

          rank 1 has a minimum value of 11 and a maximum value of 19

          ...

          rank n has a minimum value of x and a maximum value of y

What happens when the item in dataset 2 has a value of 10.5? More generally, the ranks/buckets created from data set one won't come out so nice and clean as your example, it might be more along the lines of

rank 0 has a minimum value of 1 and a maximum value of 7

rank 1 has a minimum value of 14 and a maximum value of 23

Now what do you do when you get a value of 10?

I think your entire concept is flawed, as empirical buckets don't have hard and fast upper and lower limits, they have empirical limits, as I have described, and will result in problems if you try to apply this to a 2nd data set. The idea of creating buckets from continuous variables is also a flawed idea, in my opinion, and you might want to re-think what you are doing with this data. If you explain the long-term analysis goal, as well as what the data is, we might be able to propose more meaningful analysis techniques.

Valued Guide
Posts: 765

Re: Help - Using Proc Rank With Two Datasets

hi ... here's another approach to accommodate PROC RANK that uses a by-groups

you can make one format per by-group in one data set, then use PUTN in a data step to apply the correct format based on the by-variable value in the second data set

* some fake data ... 2 genders, age ranges from 1 to 99, 10 segments;

data fake;

do _n_ = 1 to 10000;

   gender = rantbl(999,0.5);

   age = ceil(99*ranuni(999));

   segment = ceil(10*ranuni(999));

   output;

end;

run;

* compute ranks for males (gender = 1) in groups by segment (the by variable);

proc sort data=fake (where=(gender=1) out=males (drop=gender);

by segment;

run;

proc rank data=males out=mranks groups=20;

var age;

ranks rage;

by segment;

run;

* create 10 formats ... one for each value of segment (ranks1_, ranks2_, ... , ranks10_);

proc sql;

create view fmt as

select catt("ranks",segment,"_") as fmtname, min(age) as start, max(age) as end, catt(rage) as label

from mranks

group by segment, rage;

quit;

proc format cntlin=fmt fmtlib;

run;

* apply the 10 formats to the data for females;

data franks (drop=gender);

set fake;

where gender eq 2;

* use PUTN to vary the format that is being used based on the value of the variable segment;

rage = input(putn(age,catt('ranks',segment,'_')),2.);

run;

Occasional Contributor
Posts: 5

Re: Help - Using Proc Rank With Two Datasets

This works.

Thank you very much for you help, Mike.

Valued Guide
Posts: 765

Re: Help - Using Proc Rank With Two Datasets

hi ... you're welcome ... so was that a correct answer ???

Occasional Contributor
Posts: 5

Re: Help - Using Proc Rank With Two Datasets

Yes, it was.  I tested it earlier this morning.

Super Contributor
Posts: 1,636

Re: Help - Using Proc Rank With Two Datasets

Please mark you quetion as answered. Thank you!

Ask a Question
Discussion stats
  • 8 replies
  • 585 views
  • 5 likes
  • 4 in conversation