The short answer is that you can get essentially the same results, but you need to structure your data somewhat differently. Instead of having only one row of data for each observation in your original data, you will have two rows of data: one row containing the number of events and one row containing the number of non-events. I'll explain this in more detail below.
The scenario you are describing involves using the events/trials syntax of the LOGISTIC (or GENMOD) procedure. Consider the first example data set in the documentation for the GENMOD procedure, which includes a character input variable (drug), an interval input variable (x), the number of events (r), and the number of trials (n) for each observation.
/*** BEGIN GENMOD DOCUMENTATION EXCERPT ***/
The following DATA step creates the data set:
data drug;
   input drug$ x r n @@;
   datalines;
A  .1   1  10   A  .23  2  12   A  .67  1   9
B  .2   3  13   B  .3   4  15   B  .45  5  16   B  .78  5  13
C  .04  0  10   C  .15  0  11   C  .56  1  12   C  .7   2  12
D  .34  5  10   D  .6   5   9   D  .7   8  10
E  .2  12  20   E  .34 15  20   E  .56 13  15   E  .8  17  20
;
/*** END GENMOD DOCUMENTATION EXCERPT ***/
which generates data that looks like the following:
drug | x | r | n |
A | 0.1 | 1 | 10 |
A | 0.23 | 2 | 12 |
A | 0.67 | 1 | 9 |
B | 0.2 | 3 | 13 |
B | 0.3 | 4 | 15 |
B | 0.45 | 5 | 16 |
B | 0.78 | 5 | 13 |
C | 0.04 | 0 | 10 |
C | 0.15 | 0 | 11 |
C | 0.56 | 1 | 12 |
C | 0.7 | 2 | 12 |
D | 0.34 | 5 | 10 |
D | 0.6 | 5 | 9 |
D | 0.7 | 8 | 10 |
E | 0.2 | 12 | 20 |
E | 0.34 | 15 | 20 |
E | 0.56 | 13 | 15 |
E | 0.8 | 17 | 20 |
You could use the events/trials syntax with the LOGISTIC procedure to analyze this data set as follows:
proc logistic data=drug;
   class drug;
   model r/n = x drug;
run;
To analyze this same data set in SAS Enterprise Miner, you would need to take the following steps:
1. Create a new data set containing two rows for each row in the original data set (e.g. append the data set to itself).
2. Replace the value of r (# of events) in each duplicate row with the number of nonevents (n - r)
3. Rename the column r to freq (or just specify it as a frequency variable in SAS Enterprise Miner)
4. Add another column named target which contains a 1 if the row contains the number of events and a 0 if the row contains the number of nonevents.
5. Drop the column n containing the total number of trials (it is not used directly in the analysis).
6. Add the modified data set as an Input Data Source in SAS Enterprise Miner, being sure to specify the following variable information:
drug = nominal input variable role
x = interval input variable role
freq = frequency variable role
target = target variable role (1 if an event, 0 if a non-event)
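For what it's worth, steps 1 through 5 could be done in a single DATA step. This is only a sketch: the output data set name drug_em and the flag from_first are names I made up for illustration, while drug, x, r, and n come from the example data set above.

data drug_em;                          /* drug_em is a hypothetical name */
   /* Step 1: append the data set to itself; from_first is 1 for rows */
   /* read from the first copy and 0 for rows read from the second    */
   set drug(in=from_first) drug;
   if from_first then do;
      target = 1;                      /* Step 4: row holds the events */
      freq   = r;                      /* Step 3: event count as frequency */
   end;
   else do;
      target = 0;                      /* Step 4: row holds the non-events */
      freq   = n - r;                  /* Step 2: non-events = trials - events */
   end;
   drop r n;                           /* Step 5: only freq and target are kept */
run;

Because the first copy is read completely before the second, all of the target = 1 rows come first, followed by the target = 0 rows. The from_first flag is a temporary variable and is not written to drug_em.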
The data used in SAS Enterprise Miner should appear as follows:
drug | x | freq | target |
A | 0.1 | 1 | 1 |
A | 0.23 | 2 | 1 |
A | 0.67 | 1 | 1 |
B | 0.2 | 3 | 1 |
B | 0.3 | 4 | 1 |
B | 0.45 | 5 | 1 |
B | 0.78 | 5 | 1 |
C | 0.04 | 0 | 1 |
C | 0.15 | 0 | 1 |
C | 0.56 | 1 | 1 |
C | 0.7 | 2 | 1 |
D | 0.34 | 5 | 1 |
D | 0.6 | 5 | 1 |
D | 0.7 | 8 | 1 |
E | 0.2 | 12 | 1 |
E | 0.34 | 15 | 1 |
E | 0.56 | 13 | 1 |
E | 0.8 | 17 | 1 |
A | 0.1 | 9 | 0 |
A | 0.23 | 10 | 0 |
A | 0.67 | 8 | 0 |
B | 0.2 | 10 | 0 |
B | 0.3 | 11 | 0 |
B | 0.45 | 11 | 0 |
B | 0.78 | 8 | 0 |
C | 0.04 | 10 | 0 |
C | 0.15 | 11 | 0 |
C | 0.56 | 11 | 0 |
C | 0.7 | 10 | 0 |
D | 0.34 | 5 | 0 |
D | 0.6 | 4 | 0 |
D | 0.7 | 2 | 0 |
E | 0.2 | 8 | 0 |
E | 0.34 | 5 | 0 |
E | 0.56 | 2 | 0 |
E | 0.8 | 3 | 0 |
If you connect the newly created data source described above to a Regression node and run the flow, you will get approximately (within rounding error) the same results that you would have obtained using the LOGISTIC procedure.
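If you want to sanity-check the restructured data before moving to SAS Enterprise Miner, one option is to fit it in the LOGISTIC procedure with a FREQ statement and compare the parameter estimates with the events/trials run. This is a minimal sketch, assuming the restructured data set has been saved as drug_em (a hypothetical name) with the freq and target columns described above:

/* Assumes drug_em already exists with columns drug, x, freq, target */
proc logistic data=drug_em;
   class drug;
   freq freq;                          /* each row counts freq times */
   model target(event='1') = x drug;   /* model the probability that target = 1 */
run;

Rows with a frequency of 0 (the two drug C rows with no events) are simply not used by the FREQ statement, which is fine because they contribute nothing to the likelihood.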
I hope this helps!
Doug