How to categorize data points that fall on a category boundary

Regular Contributor
Posts: 216

Hi

I am not sure where this question belongs, but I was wondering whether there are any rules for categorizing numbers that fall exactly on a numerical category boundary. I inherited code in which a consultant created the data set below. It sorts the whole-number values of the durmo_cat variable into categories so that a value equal to one of the boundary markers lands in the category above it. In other words, if durmo_cat equals exactly 2, the value is classified as a 3.

Are these rules just created by the developer or is there something more involved?

Paul

data s1;
    set workpl;
    ***duration intervals***;
    if durmo_cat lt 1 then d12=1;
    if 1 le durmo_cat lt 2 then d12=2;
    if 2 le durmo_cat lt 3 then d12=3;
    if 3 le durmo_cat lt 4 then d12=4;
    if 4 le durmo_cat lt 5 then d12=5;
    if 5 le durmo_cat lt 6 then d12=6;
    if 6 le durmo_cat lt 12 then d12=7;
    if 12 le durmo_cat lt 18 then d12=8;
    if 18 le durmo_cat lt 24 then d12=9;
    if 24 le durmo_cat lt 30 then d12=10;
    if 30 le durmo_cat lt 36 then d12=11;
    if 36 le durmo_cat lt 42 then d12=12;
    if 42 le durmo_cat lt 48 then d12=13;
    if 48 le durmo_cat lt 54 then d12=14;
    if 54 le durmo_cat lt 60 then d12=15;
    if 60 le durmo_cat lt 66 then d12=16;
    if 66 le durmo_cat lt 72 then d12=17;
    if durmo_cat ge 72 then d12=18;
run;


Accepted Solutions
Solution
‎06-10-2013 06:44 AM
PROC Star
Posts: 1,558

Re: How to categorize data points that fall on a category boundary

D12 is just a bin number; there is no reason its value should match that of DURMO_CAT.

Personally, I would take the bins' cut-off values out of the data step code by using a format:

D12 = put(DURMO_CAT, d12bin.);

This also makes D12 a string, which is also better imho, but you can always convert it back to a numeric if you really need one.
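
For anyone who wants to try the format approach, here is a minimal sketch. The format name d12bin comes from the post above, but its definition is not shown in the thread, so the ranges below are an assumption copied from the cut-offs in the original IF statements. The variable d12_num is a hypothetical name for the numeric copy. Note that LOW in a numeric format does not include missing values, so unlike the original code this would not put missing durmo_cat values into bin 1.

```sas
/* Assumed definition of d12bin, mirroring the cut-offs in the original code. */
/* "-<" excludes the upper endpoint, so a value of exactly 2 lands in bin 3.  */
proc format;
    value d12bin
        low -<  1  = '1'
          1 -<  2  = '2'
          2 -<  3  = '3'
          3 -<  4  = '4'
          4 -<  5  = '5'
          5 -<  6  = '6'
          6 -< 12  = '7'
         12 -< 18  = '8'
         18 -< 24  = '9'
         24 -< 30  = '10'
         30 -< 36  = '11'
         36 -< 42  = '12'
         42 -< 48  = '13'
         48 -< 54  = '14'
         54 -< 60  = '15'
         60 -< 66  = '16'
         66 -< 72  = '17'
         72 - high = '18';
run;

data s1;
    set workpl;
    length d12 $2;
    d12 = put(durmo_cat, d12bin.);   /* character bin label */
    d12_num = input(d12, 2.);        /* hypothetical numeric copy, only if needed */
run;
```

With the cut-offs in one place, changing a boundary means editing the format rather than the data step.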


All Replies
Respected Advisor
Posts: 4,641

Re: How to categorize data points that fall on a category boundary

I don't see anything wrong with this way of creating categories. As of d12 = 7 the category number becomes unrelated to the original durmo_cat value anyway. However, every IF except the first one could be replaced by ELSE IF, which would be a bit more efficient because SAS stops testing conditions once one of them is true.

PG
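
A sketch of what the ELSE IF version might look like, keeping the conditions from the original code unchanged. The data set and variable names are the ones from the question; nothing else is assumed.

```sas
data s1;
    set workpl;
    /* duration intervals: only the first matching branch is evaluated */
    if durmo_cat lt 1 then d12=1;
    else if 1 le durmo_cat lt 2 then d12=2;
    else if 2 le durmo_cat lt 3 then d12=3;
    else if 3 le durmo_cat lt 4 then d12=4;
    else if 4 le durmo_cat lt 5 then d12=5;
    else if 5 le durmo_cat lt 6 then d12=6;
    else if 6 le durmo_cat lt 12 then d12=7;
    else if 12 le durmo_cat lt 18 then d12=8;
    else if 18 le durmo_cat lt 24 then d12=9;
    else if 24 le durmo_cat lt 30 then d12=10;
    else if 30 le durmo_cat lt 36 then d12=11;
    else if 36 le durmo_cat lt 42 then d12=12;
    else if 42 le durmo_cat lt 48 then d12=13;
    else if 48 le durmo_cat lt 54 then d12=14;
    else if 54 le durmo_cat lt 60 then d12=15;
    else if 60 le durmo_cat lt 66 then d12=16;
    else if 66 le durmo_cat lt 72 then d12=17;
    else if durmo_cat ge 72 then d12=18;
run;
```

The bins assigned are the same as in the original; the only difference is that the remaining conditions are skipped once a match is found.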
Occasional Contributor
Posts: 7

Re: How to categorize data points that fall on a category boundary

I wouldn't say it becomes unrelated as of d12=7. It's no longer equal to the upper bound, but it's still related in the sense that the higher durmo_cat is, the higher d12 is.

Regular Contributor
Posts: 216

Re: How to categorize data points that fall on a category boundary

Thank you both. But do you know of any statistical or other rules that give guidance on grouping whole-number values into whole-number categories when one of the values falls exactly on a category boundary?

For example, if I have a value of 3 and categories of 1-2, 3-4, 5-6. Is there any reason why 3 would not fall in 3-4?

Paul

Super User
Posts: 5,079

Re: How to categorize data points that fall on a category boundary

Previous posters already provided the right answers.  One thing to add ... this code takes bad data and groups it into category 1.  For example, negative numbers and missing values both fall into category 1.

Building upon Chris's suggestion, perhaps it would help you to think of the bins as being labeled "A", "B", "C" ... instead of 1, 2, 3 ...

They are ordered categories, and do not represent numerical amounts.
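
If the bad data should not end up in category 1, one possible guard is sketched below. It assumes that negative durations are invalid, which is a guess about this data, and it assumes the s1 data set from the original code already exists. The names s1_checked and bad_durmo are hypothetical.

```sas
data s1_checked;
    set s1;
    /* 1 when durmo_cat is missing or negative, 0 otherwise */
    bad_durmo = missing(durmo_cat) or durmo_cat lt 0;
    /* clear the bin for flagged rows instead of leaving them in category 1 */
    if bad_durmo then d12 = .;
run;
```

Keeping the flag as its own variable makes it easy to count or list the affected rows before deciding what to do with them.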

Occasional Contributor
Posts: 7

Re: How to categorize data points that fall on a category boundary

The consultant's code does not represent some convention or best practice. If I had to guess, I'd say there's another variable in the data that's related to durmo_cat, and this is the consultant's attempt to make that relationship linear or otherwise improve it for his purpose.   
