Solved: using one-hot encoding for multinary class data to create two-class classification problem

rbettinger · Posted 06-25-2025 03:33 PM

I want to convert a dataset containing a class variable with > 2 levels into a dataset containing a class variable with only 2 levels. For example,

data two_class ;
set SASHELP.iris ;

class1 = upcase( species ) = 'IRIS SETOSA' ;
class2 = upcase( species ) = 'IRIS VERSICOLOR' ;
class3 = upcase( species ) = 'IRIS VIRGINICA' ;
run ;

proc logistic data=two_class ;
model class1( event='1' ) = SepalWidth SepalLength PetalWidth PetalLength ;
run ;

/* same code but different model statement */
model class2( event='1' ) = < same 4 variables > ;

/* and for class3 */
model class3( event='1' ) = < same 4 variables > ;

Is this good practice? Is there a better way of finding the one-vs-the-rest power of a single species?

Thanks for your suggestions,

Ross

eduardo_silva · Posted 06-25-2025 05:34 PM

It really depends on what you plan to do with the data.

You could group the categories based on your subject matter expertise, for example.

You might choose the most frequent category as one group and combine all the others into a second group.

You can look for similarities between each category and other variables, which seems to be what you're trying to do.

rbettinger · Posted 06-25-2025 06:43 PM

Thank you for a prompt reply, Eduardo. I am trying to compare the performance of the indicated class (Class1, Class2, Class3) with all of the other classes that are "noise" compared to the "signal" that the indicated class represents. There are > 2 classes and I want to represent one of them as the target value, e.g., (Class1 = ( species = 1 )), and the data for the other two species is then grouped into "noise" so that the logistic regression algorithm will find the patterns in the "signal" and extract information from the "noise" of the other two classes.

StatDave · Posted 06-25-2025 07:08 PM

You'll need to clarify your goal. What do you mean by "finding the one-vs-the-rest power of a single species" expressed in terms of a statistical model or test? If you want to test IRIS vs the others with respect to the multivariate means across the 4 variables, then an example of exactly that is shown in the example titled "Multivariate Analysis of Variance" in the PROC GLM documentation. Note that this reverses the model by making SPECIES the predictor of the 4 variables.

Note that you could write a single model with SPECIES as the response variable to deal with all of the class levels simultaneously (model species=SepalWidth SepalLength PetalWidth PetalLength / link=glogit;) rather than create binary response variables and do separate analyses. And you could get predicted probabilities from this single model for each observation being in each of the species (output out=preds predprobs=individual;), but what exactly would you want to test with this?
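Put together as a complete step, the statements quoted above might look like the following. This is only a sketch that assembles the MODEL and OUTPUT statements from the post. The data set name preds and the closing PROC FREQ cross-tabulation are illustrative additions. Because the iris species are nearly separable on these four measurements, the log may contain separation warnings.

```sas
/* Sketch: one generalized-logit model for all three species at once */
proc logistic data=sashelp.iris;
   model species = SepalWidth SepalLength PetalWidth PetalLength / link=glogit;
   /* PREDPROBS=INDIVIDUAL adds per-level probabilities and the predicted level _INTO_ */
   output out=preds predprobs=individual;
run;

/* Actual-by-predicted table (preds is an assumed data set name) */
proc freq data=preds;
   tables species*_INTO_ / norow nocol nopercent;
run;
```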

rbettinger · Posted 06-25-2025 07:49 PM

Thank you, StatDave for asking me to refine my question.

My goal is to compare the performance of a classifier algorithm to classification by logistic regression in the multinary case. When I have a class variable that has n categories, the classification matrix will have n rows and columns. When n = 2, we have the usual classification matrix, and we have the usual TP, FP, FN, TN frequency counts and corresponding statistics like accuracy, precision, recall, and specificity. But when n > 2, life becomes more interesting. I don't know of any corresponding statistics for n > 2 as for when n = 2, so if I compute classification matrices for, e.g., class1 vs aggregated (class2 and class3), I have reduced the multinary problem to n = 2 and can compute the usual stats. I can do the same thing for class 2 vs aggregated (class1 and class3), etc. In this context, "aggregation" means that frequency counts for class1 are compared to frequency counts for class2 and class3 grouped into a single category, so we have the expression "class1 vs the rest" to describe the n=2 classification matrix created.
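The "class k vs the rest" reduction described above can be written out explicitly. The notation here is an assumption made for illustration and does not come from the thread: C is the n x n classification matrix, C_{ij} counts observations with actual class i and predicted class j, and N is the total count.

```latex
\begin{aligned}
TP_k &= C_{kk}, &
FN_k &= \sum_{j \ne k} C_{kj}, \\
FP_k &= \sum_{i \ne k} C_{ik}, &
TN_k &= N - TP_k - FN_k - FP_k, \\
\text{recall}_k &= \frac{TP_k}{TP_k + FN_k}, &
\text{precision}_k &= \frac{TP_k}{TP_k + FP_k}, \\
\text{specificity}_k &= \frac{TN_k}{TN_k + FP_k}, &
\text{accuracy}_k &= \frac{TP_k + TN_k}{N}.
\end{aligned}
```

Each class k therefore gets its own 2x2 table built by collapsing the other n - 1 rows and columns, which is the aggregation the post describes.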

If I have muddled the statistical waters with the word "power", I apologize. There is no power of test involved here. I just want to simplify the problem to n=2 using logistic regression to summarize the relationship between the designated class and any other classes.

StatDave · Posted 06-26-2025 01:41 PM

Okay, but rather than what you've suggested, consider fitting the GLOGIT model to your nominal, multinomial response and producing the two-way table of actual by predicted values. You can then simply collapse rows and columns of the table as needed for each level to produce the usual statistics you can get with a 2x2 table.

See the example using the Crops data in this note which fits the GLOGIT model and obtains the actual by predicted table. The statements below repeat that and then compute, for the Clover level, the sensitivity (recall), specificity, positive predictive value (precision), and negative predictive value using the SENSPEC option in PROC FREQ. You can do similarly for each of the other Crop levels. Using this approach, you could compute other statistics, like accuracy, as shown in this note.

The _INTO_ variable in the PREDS data set contains the predicted Crop values from the fitted model. The first FREQ step shows the full predicted by actual table. PROC FORMAT creates a format that groups all of the levels that are not Clover together so that the second FREQ step with the FORMAT statement collapses the table into a 2x2 table for Clover vs not Clover and produces the statistics.

proc logistic data=Crops;
  model Crop=x1-x4 / link=glogit;
  output out=preds predprobs=individual;
  run;
proc freq data=preds;
  table _INTO_*crop / out=CellCounts;
  run;
proc format; 
  value $notcl 'Clover'='Clover' other='NotClover';  
  run;
proc freq data=preds;
  format _INTO_ crop $notcl.;
  table _INTO_*crop / senspec;
  run;
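As a cross-check on the SENSPEC output, the same one-vs-rest counts can be computed directly from the CellCounts data set created by the first FREQ step. The following is a minimal sketch, not part of the original answer: the macro variable level and the column names TP, FP, FN, TN, Recall, Precision, and Specificity are hypothetical names chosen for illustration.

```sas
/* Sketch: one-vs-rest counts for a single level, from the OUT= cell counts */
%let level = Clover;   /* hypothetical macro variable: the level treated as "positive" */

proc sql;
   select sum(count * (_INTO_ =  "&level" and Crop =  "&level")) as TP,
          sum(count * (_INTO_ =  "&level" and Crop ne "&level")) as FP,
          sum(count * (_INTO_ ne "&level" and Crop =  "&level")) as FN,
          sum(count * (_INTO_ ne "&level" and Crop ne "&level")) as TN,
          calculated TP / (calculated TP + calculated FN) as Recall,
          calculated TP / (calculated TP + calculated FP) as Precision,
          calculated TN / (calculated TN + calculated FP) as Specificity
   from CellCounts;
quit;
```

Changing the value of level and rerunning the step gives the corresponding statistics for each of the other Crop levels. Cells with zero counts are absent from an OUT= data set by default, which does not affect these sums.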

ballardw · Posted 06-25-2025 07:44 PM

One way to create groups for analysis is a custom format. There are advantages to using a format if the "class" or "group" is based on a single variable. The first is that you do not have to create any other variables, so there is no concern about which data set holds which recoded variable.

It is a bit hard to tell if this suggestion is appropriate because the code you show throws errors, at least in my version of SAS:

29   proc logistic data=two_class ;
30   model class1( event='1' ) = SepalWidth SepalLength PetalWidth
30 ! PetalLength ;
31   run ;

ERROR: All observations have the same response.  No statistics are
       computed.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 150 observations read from the data set WORK.TWO_CLASS.
NOTE: PROCEDURE LOGISTIC used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

If you are trying to model combinations of the characteristics to predict species I'm moderately sure that Proc Logistic isn't the right one.
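The "All observations have the same response" error is consistent with class1 being 0 on every row: in SASHELP.IRIS the Species values are stored as Setosa, Versicolor, and Virginica, without the "Iris" prefix, so a comparison against 'IRIS SETOSA' is never true. A sketch of the indicator step using the stored values follows. The PROC FREQ check is an illustrative addition.

```sas
/* Sketch: one-vs-rest indicators built from the stored Species values */
data two_class;
   set sashelp.iris;
   class1 = (upcase(species) = 'SETOSA');
   class2 = (upcase(species) = 'VERSICOLOR');
   class3 = (upcase(species) = 'VIRGINICA');
run;

/* Quick check that each indicator takes both values before modeling */
proc freq data=two_class;
   tables species*class1 species*class2 species*class3 / norow nocol nopercent;
run;
```

With these indicators the original MODEL statements should no longer fail on a constant response, although Setosa vs. the rest may still trigger a separation warning.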

A format approach might look like:

proc format ;
value  $setosa
'Setosa' = 'Iris Setosa'
other = 'Others'
;
run;

proc logistic data=sashelp.iris;
   model species = SepalWidth SepalLength PetalWidth PetalLength ;  
   format species $setosa.;
run;

Which doesn't throw an error but does have separation of data issues.

A drawback to formats is creating the formats and making sure they are available in the current session (run the Proc Format code or create format catalogs in a permanent library and in the format search path).

Other advantages of formats include:

1) Time. Especially if you have very large data sets, it can take time (and storage space) to add the additional variables to data set.

2) Ease of changing ranges of definitions, especially for numeric values. (Ranges built with < in Proc Format are very problematic for general character values but may work for some fixed-length, single-case values.) Note in the example the keyword OTHER, which assigns all values not explicitly listed to a single response.

3) For a fair number of examples the code for Proc Format may be simpler than data step code.

4) Specially structured data sets can be used to create formats.

One limitation is that only a single variable can be involved. If missing values are to be excluded and not treated as the "Other" category, an appropriate value clause such as . = 'Missing' or ' '='Missing' may be needed.

Relying on the stored catalogs for use and not maintaining the code to create the formats can be problematic for moving to different versions of SAS. Best is to use the Proc Format CNTLOUT= option to create data sets that can be used to recreate the formats and keep the location handy.
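A minimal sketch of that round trip, assuming the $setosa format from the earlier example already exists in the WORK catalog. The data set name fmtdefs is hypothetical, and in practice it would be written to a permanent library rather than WORK.

```sas
/* Sketch: export a format definition to a data set ... */
proc format library=work cntlout=work.fmtdefs;
   select $setosa;          /* limit the export to this one format */
run;

/* ... and rebuild the format from that data set later, or in another SAS version */
proc format library=work cntlin=work.fmtdefs;
run;
```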

An example of the data set to create three formats, one for each of the species:

proc sql;
   create table temp as
   select distinct(species)
   from sashelp.iris
   ;
quit;

data iriscntrl;
   set temp;
   by species notsorted;
   fmtname= strip(species);    /* check documentation for rules of format names;
                                  making these may be the hardest part of an automated process
                               */
   type='C';
   start =species;
   label =catx(' ','Iris',species);
   output;
   if last.species then do;
      call missing (start);
      label='Other';
      hlo='O'; /* HLO carries additional information about how a range is used; capital O marks
                  the OTHER range */
      output;
   end;
run;

proc format cntlin=iriscntrl;
run;

More complicated code is possible. In some cases you would want to sort the control set by the format name to make sure all the start values are together in the set. Otherwise the Proc format results may be odd or fail.
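Once the CNTLIN= step has run, each generated format can be applied without creating new variables. The sketch below assumes the control data set produced formats named $Setosa, $Versicolor, and $Virginica, which follows from FMTNAME being set to the species value.

```sas
/* Sketch: apply one of the generated one-vs-rest formats */
proc freq data=sashelp.iris;
   tables species;
   format species $versicolor.;   /* groups rows as 'Iris Versicolor' vs 'Other' */
run;
```

Swapping in $setosa. or $virginica. gives the other two groupings from the same data set.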

rbettinger · Posted 06-26-2025 01:39 PM

Here, in three tables, is what I am trying to do:

These tables were created by my classifier. I want to compare the classifier performance to the logistic regression performance, so I thought that by computing LR results for "1 species vs the other two", I can form a meaningful comparison. Am I explaining myself clearly?

StatDave · Posted 06-26-2025 01:48 PM

In addition to getting the statistics as I just showed, if you also want the AUC, there is an analog of this for the multinomial case that you can obtain using the MultAUC macro that is available in this note.

StatDave · Posted 06-26-2025 02:18 PM

As with AUC, a single overall measure of accuracy can be computed rather obviously as the trace of the full classification matrix - that is, the sum of the main diagonal cell counts as a proportion of the total. For the Crops example, and using the same method illustrated in this note that I referred to:

data acc; 
   set cellcounts;
   acc = (_into_ = crop); 
   run;
proc freq;
   weight count;
   tables acc / binomial(level="1");
   exact binomial;
   run;
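In symbols, with C as the n x n classification matrix of cell counts and N as the total count (notation assumed here for illustration), the overall accuracy described above is:

```latex
\text{accuracy} \;=\; \frac{\operatorname{tr}(C)}{N} \;=\; \frac{1}{N}\sum_{k=1}^{n} C_{kk}
```

The ACC variable in the data step flags exactly the diagonal cells, so the binomial proportion for ACC = 1, weighted by COUNT, is this ratio.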

rbettinger · Posted 06-29-2025 12:44 PM

Thank you, StatDave, for putting so much time and effort into producing a complete answer to my question. Ross
