turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Find a Community

- Home
- /
- SAS Programming
- /
- SAS Procedures
- /
- Continuous variable into categories dynamically?

Topic Options

- RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

09-11-2014 03:45 PM

I have continuous data in a variable that looks like the attached text file. Sorry, I cannot paste anything into this forum box. If anyone knows why, please tell me! Too much to do in the html editor

I am dividing it into 20 groups, based on the percentiles. I use proc univariate to get the percentiles:

proc univariate data=d1 ;

var X;

output out=getpercentiles pctlpre=P_ pctlpts=5 to 100 by 5;

run;

I want the resulting percentiles to be the top of the categories. I then write if/then statements to categorize my original data:

if X ge 0 and LTCumExpT2 le 0.019 then bin=1;*note, lowest category starts at 0;

else if X gt 0.019 and X le 0.038 then bin=2;

else if X gt 0.038 and X le 0.078 then bin=3;

else if X gt 0.078 and X le 0.095 then bin=4;

else if X gt 0.095 and X le 0.1265 then bin=5;

else if X gt 0.1265 and X le 0.166 then bin=6;

else if X gt 0.166 and X le 0.254 then bin=7;

....

else if X gt 8.088 and X le 34.435 then bin=20;

Is there a way to get a new variable "bin" with a number from 1 to 20 in it more dynamically? Rather than having to create all those if then statements? I have searched the forums, and most people say to use a format, but that can't be done here, with the gt/le operators, as far as I can tell (trust me, I tried :smileygrin.

If it was just the one variable I had to do this for, I wouldn't mind. But I have to do it for about 20 of them, and was hoping to make it more painless.

Thank you,

Nicole

Accepted Solutions

Solution

09-11-2014
05:14 PM

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

09-11-2014 05:14 PM

You have all the pieces you need. There is no need to type, just put the pieces together.

You may need to tweak this a little bit, but here is the idea (using the names for data sets and variables that appear in your example):

data want;

if _n_=1 then set getpercentiles;

set d1;

if x > P_20 then bin=20;

else if x > P_19 then bin=19;

else if x > P_18 then bin=18;

...

else if x > P_2 then bin=2;

else bin=1;

drop P_:;

run;

You may need to reverse the order, or change the comparisons slightly. But all the pieces are there and you only have to assemble them into the same data set.

If you have to do this for 20 variables, you may want to use different prefixes in the output data sets for each. But first, figure out how to do it for one variable without typing in the cutoffs.

This is not a comment about which solution is best, only an answer to your original question.

Good luck.

All Replies

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

09-11-2014 04:25 PM

I think PROC RANK with the GROUPS= option can do what you want.

As a side issue, I question the value of performing a statistical analysis on percentiles of continuous data, when you could perform an equivalent analysis on the raw continuous data.

--

Paige Miller

Paige Miller

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

09-11-2014 05:04 PM

You can paste data with a browser other than IE.

You can do this with a format, but I agree with Paige, Proc Rank is better option.

For format creation see this paper:

Solution

09-11-2014
05:14 PM

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

09-11-2014 05:14 PM

You have all the pieces you need. There is no need to type, just put the pieces together.

You may need to tweak this a little bit, but here is the idea (using the names for data sets and variables that appear in your example):

data want;

if _n_=1 then set getpercentiles;

set d1;

if x > P_20 then bin=20;

else if x > P_19 then bin=19;

else if x > P_18 then bin=18;

...

else if x > P_2 then bin=2;

else bin=1;

drop P_:;

run;

You may need to reverse the order, or change the comparisons slightly. But all the pieces are there and you only have to assemble them into the same data set.

If you have to do this for 20 variables, you may want to use different prefixes in the output data sets for each. But first, figure out how to do it for one variable without typing in the cutoffs.

This is not a comment about which solution is best, only an answer to your original question.

Good luck.

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to Astounding

09-12-2014 09:46 AM

Astounding, thank you, that was the solution I needed. Now I can even make a macro to handle it easily. I have only recently become acquainted with the if _n_=1 kind of statement, but I see much future use for it.

Paige, I had tried proc rank with the group option. However, it seemed that proc rank wanted to force a certain number of observations into each group, rather than assign the group (bin) based on the value of my continuous variable. And I agree that this seems like a somewhat backwards/extraneous was of analyzing a variable, however, we are working with a group that does not "believe" our models, and we have to break the math down into it's component steps.

Reeza, I had tried proc format, but with groupings like > x- <=y I got errors in proc format. It did not seem to like the > in front of the range. Maybe I could have changed my logic to make it work.

Thank you all.

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

09-12-2014 12:35 PM

I sympathize with your problems regarding the categorization of continuous variables. For my own peace of mind, I would analyze both ways--a great way to make a problem (=adverse event rate) "disappear" is to categorize the continuous covariates or continuous response variables.

Steve Denham