I am trying to fit a regression using indicator variables. It's not working and I don't understand why.
The code I used is as follows:
data indicatorVariableNCBirth;
set NCBirth;
if momrace = 'white' then white=4;
else white = 0;
if momrace = 'hispanic' then hispanic=3;
else hispanic = 0;
if momrace = 'black' then black = 2;
else black = 0;
if momrace = 'other' then other=1;
else other=0;
run;
Proc print data=indicatorvariablencbirth;
run;
proc reg data=indicatorvariablencbirth;
model Birthweightoz = momrace;
run;
Here is the error message I receive after trying to run the regression model:
See my comments on your code.
data indicatorVariableNCBirth;
set NCBirth;
/* When creating indicator variables, it's best to use 1/0, not 5/0. Change these to 1/0 binary coding. */
/* Additionally, if your categorical variable has N levels, you need N-1 indicator variables to represent the variable. Including all N is known as overparameterization. */
if momrace = 'white' then white=4;
else white = 0;
if momrace = 'hispanic' then hispanic=3;
else hispanic = 0;
if momrace = 'black' then black = 2;
else black = 0;
if momrace = 'other' then other=1;
else other=0;
run;
proc reg data=indicatorvariablencbirth;
model Birthweightoz = momrace; /* Change this to include N-1 of your indicator variables that are coded 0/1. Then you'll get estimates. */
run;
Given your data structure, I would also look at boxplots of the weight by race to visualize the comparison.
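To make the comments above concrete, here is a minimal sketch of what the 1/0 coding with N-1 indicators might look like. It assumes momrace takes exactly the four lower-case values used in the original code ('white', 'hispanic', 'black', 'other'); the choice of 'other' as the reference level is arbitrary and only for illustration.

```sas
/* Sketch only: assumes momrace holds exactly 'white', 'hispanic', 'black', 'other'. */
data indicatorVariableNCBirth;
    set NCBirth;
    /* A comparison in SAS evaluates to 1 (true) or 0 (false). */
    white    = (momrace = 'white');
    hispanic = (momrace = 'hispanic');
    black    = (momrace = 'black');
    /* No indicator for 'other': it is the reference level (all three indicators = 0). */
run;

/* Four levels, so three indicators go on the MODEL statement. */
proc reg data=indicatorVariableNCBirth;
    model Birthweightoz = white hispanic black;
run;
quit;

/* Side-by-side boxplots of birth weight by race. */
proc sgplot data=NCBirth;
    vbox Birthweightoz / category=momrace;
run;
```

With this coding the intercept estimates the mean birth weight for the reference group, and each slope estimates the difference between that group's mean and the reference mean. Note that rows with a missing or differently spelled momrace would also get all zeros, so it is worth checking the values first.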
Why are you modeling the variable that you've created indicators for? Should you be using the new indicator variables instead?
Your first piece of code is entirely separate from your second. They don't reference the same data set and aren't connected in any way.
I created a subset of the original data set to include the indicator variables. When I fit the regression model to the created indicator variables it doesn't work. I don't know what I am doing wrong, but my output either gives me errors or looks off.
I used the following code and got a weird output that is wrong:
proc reg data=indicatorvariablencbirth;
model Birthweightoz = white hispanic black other;
run;
Explain how your data is structured, ideally provide sample data.
Then show what your model should be mathematically and we can help with the code.
What type of output are you expecting to get?
That's my problem. I am not entirely sure what the final regression line is meant to look like, but the output I am getting is straight lines. For all I know it could be correct, but I am getting vertical lines.
I'm trying to attach the write-up of the output to help explain my confusion.
What's your basic model?
Birthweight = B1*white + B2*asian + B3*other
I honestly don't know what you mean when you say basic model. But here is a portion of the original data that might help answer my question, because I don't know what I am doing wrong.
I really do appreciate all of the help in trying to figure this out, thank you.
Below is the code I used to import the CSV data:
FILENAME CSV "/folders/myfolders/3064data/NCbirths_RaceStudy.csv" TERMSTR=CRLF;
PROC IMPORT DATAFILE=CSV
OUT=NCBirth
DBMS=CSV
REPLACE;
RUN;
/** Print the results. **/
PROC PRINT DATA=NCBirth (obs=100); RUN;
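One possible way to describe how the imported data is structured is sketched below. It only assumes the data set and variable names already used in this thread (NCBirth, momrace, Birthweightoz).

```sas
/* Variable names, types (character vs. numeric), and lengths. */
proc contents data=NCBirth;
run;

/* Distinct values of momrace and their counts, including missing values. */
proc freq data=NCBirth;
    tables momrace / missing;
run;

/* Birth weight summary within each level of momrace. */
proc means data=NCBirth n mean std min max;
    class momrace;
    var Birthweightoz;
run;
```

The PROC FREQ table shows the exact spelling and case of each momrace value, which matters because character comparisons such as momrace = 'white' are case-sensitive.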
What question are you trying to answer? Do you have a hypothesis?
Are you familiar with linear regression?
Data is great, but you have to know what you want out of it as well.
All I want to do is produce a parameter estimates output table from which I can gather more information on the data.
I want to see if I can accurately use 'momrace' to predict birthweights while using the indicator variables.
The specific error you are receiving occurs because the variable MOMRACE is character, as evidenced by your code:
if momrace = 'white' then white=4;
else white = 0;
PROC REG requires the variables on the MODEL statement to be numeric.
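As an aside, and not something suggested in this thread: if hand-built indicators are not a requirement, a procedure with a CLASS statement can take the character variable directly. The sketch below uses PROC GLM as an alternative to PROC REG, with the same data set and variable names as above.

```sas
/* Alternative sketch: PROC GLM builds the indicator columns from the CLASS variable. */
proc glm data=NCBirth;
    class momrace;                               /* character variable is allowed here */
    model Birthweightoz = momrace / solution;    /* SOLUTION prints the parameter estimates */
run;
quit;
```

In the SOLUTION table one level has its estimate set to zero and acts as the reference; by default this is the last level in sorted order, so the estimates will match a PROC REG fit only if the same reference level is used there.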
What do you mean when you say N-1 statement? Do you mean adding an additional variable to include the number of races minus one?
I don't really know what you mean by adding that statement or how to do that.
I do understand that taking the 0/5 out of the equation and having a simple binary is more appropriate.
Perhaps reading some linear regression tutorials would be helpful.
http://www.ats.ucla.edu/stat/sas/webbooks/reg/chapter3/sasreg3.htm
As well, the SAS Statistical e-course, which covers linear regression, is free.
I figured it out from your previous post! Thank you so much for the help. I was truly lost and really needed it.