Optimize the classification of data into binomial categories

Occasional Contributor
Posts: 15


Hello,

I am hoping someone might be able to give me some advice on how to write code to solve my problem.

I have a set of data that I want to reclassify into 3 categories.

The data is a list of estimated flow of traffic for different road segments.

I want to classify this data into 3 binomial categories: low traffic, med traffic, high traffic.

Then I will use these 3 dummy variables in a regression model to predict locations of known flow.

So it will be set up as follows:

knownflow= low traffic + med traffic + high traffic

What I am trying to do is figure out a way to optimize the classification of the estimated flow data (that match up with the known flow data) to maximize the predictability of the model.

I am using this regression model as a way to validate the classification of the data into the three groups.

I was hoping someone might have a suggestion for how to set up a macro to run numerous regression models using different classifications of the estimated flow data

OR

A way to optimize the classification of the data first and then run a model.

Thank you for the help!

Scott

Super User
Posts: 10,508

Re: Optimize the classification of data into binomial categories

First note: three categories are not binomial. By definition, binomial means two categories.

What will your rule be for considering the classifications "optimized"?

My first cut at this problem would be to look at the 1st and 3rd quartiles, classifying values below Q1 as low, values above Q3 as high, and everything in between as medium. Proc means or proc summary will give you these if you ask for the Q1 and Q3 statistics.
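The quartile rule above can be sketched as follows. This is only an illustration of the logic, written in Python rather than SAS; the function name `classify_by_quartiles` and the `est_flow` values are made up for the example, and the cut points may differ slightly from what proc means reports because percentile definitions vary between tools.

```python
from statistics import quantiles

def classify_by_quartiles(flows):
    """Label each flow 'low', 'med', or 'high' using Q1 and Q3 as cut points."""
    # quantiles(..., n=4) returns the three quartile cut points: Q1, median, Q3
    q1, _median, q3 = quantiles(flows, n=4, method="inclusive")
    labels = []
    for x in flows:
        if x < q1:
            labels.append("low")
        elif x > q3:
            labels.append("high")
        else:
            labels.append("med")
    return q1, q3, labels

# Hypothetical estimated flows, for illustration only
est_flow = [120, 340, 560, 80, 910, 450, 300, 720, 150, 1000, 610, 260]
q1, q3, labels = classify_by_quartiles(est_flow)
print(q1, q3, labels)
```

By construction, roughly a quarter of the segments end up in each of the low and high groups and about half in the medium group.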


Occasional Contributor
Posts: 15

Re: Optimize the classification of data into binomial categories

Sorry about the confusion. I will set each of the variables up as a dummy variable.

I have already run a proc means as you suggested and used the 1st and 3rd quartiles.

I think the best way to determine an optimized classification is to set up a macro that systematically tries different percentile cut points to classify the data into the three variables, and then uses those variables as the independent variables in a regression.

The resulting adjusted R-squared value for each model will be used to determine which classification works best for matching the categorized estimated flow data to the actual flow data.
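A rough sketch of that search, again in Python rather than as a SAS macro, might look like the following. It assumes the model has an intercept and uses one category as the reference level (three dummies plus an intercept would be perfectly collinear), in which case the fitted value for each segment is simply its group's mean known flow. The function names, the choice of deciles as candidate cut points, and the data are all hypothetical.

```python
from itertools import combinations
from statistics import mean, quantiles

def adjusted_r2(known, groups, n_predictors=2):
    """Adjusted R^2 for a regression of known flow on category dummies.

    With only category dummies (one level as reference), the OLS fitted
    value for each observation is the mean of its group.
    """
    n = len(known)
    grand = mean(known)
    sst = sum((y - grand) ** 2 for y in known)
    by_group = {}
    for y, g in zip(known, groups):
        by_group.setdefault(g, []).append(y)
    sse = sum((y - mean(ys)) ** 2 for ys in by_group.values() for y in ys)
    r2 = 1 - sse / sst
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)

def search_cut_points(est, known, n_bins=10):
    """Try every pair of percentile cut points; return (best adj R^2, lower, upper)."""
    cuts = quantiles(est, n=n_bins, method="inclusive")
    best = None
    for lo, hi in combinations(cuts, 2):
        groups = ["low" if x < lo else "high" if x > hi else "med" for x in est]
        if len(set(groups)) < 3:
            continue  # skip splits that leave a category empty
        score = adjusted_r2(known, groups)
        if best is None or score > best[0]:
            best = (score, lo, hi)
    return best

# Hypothetical paired data: estimated flow and known flow per road segment
est = [10, 12, 14, 16, 50, 52, 54, 56, 90, 92, 94, 96]
known = [5, 6, 5, 6, 20, 21, 20, 21, 40, 41, 40, 41]
score, lo, hi = search_cut_points(est, known)
print(score, lo, hi)
```

Because every candidate model here has the same number of observations and the same number of predictors, ranking by adjusted R-squared gives the same order as ranking by plain R-squared. Choosing cut points on the same data used to score them can also overstate the fit, so holding out some segments for a final check may be worthwhile.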
