BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
glee217
Calcite | Level 5

Hi. I have 4 explanatory variables and 3 of them are categorical, only one of them are numerical variable.

I know if i want to use PROC REG or PROC SGSCATTER  function, i need to change my numerical variables into categorical.

This is my code:


data forreg; set new; if Stage1 = A Then StageA=1 ; Else StageA=0; if Stage1 = B Then StageB=1 ; Else StageB=0; if Stage1 = C Then StageC=1; Else StageC=0; run;

i tried only for Stage1 as an example. But sadly, my code is not working 😕

 

Here's explanation about my variables:

My response variable is 'Exam' (out of 100)

and 

'Assign' : the student’s Assignment mark (out of 20)

'Stage1': the student’s grade at Stage 1 Statistics: A, B or C

'Grade': What grade the student obtained: A, B, C or D

'Pass':Did the student pass the course or not: No or Yes

 

And here's my error:

data forreg;
37 set new;
38 if Stage1 = A Then StageA=1 ; Else StageA=0;
39 if Stage1 = B Then StageB=1 ; Else StageB=0;
40 if Stage1 = C Then StageC=1; Else StageC=0;
41 run;

NOTE: Character values have been converted to numeric values at the places given by: (Line):(Column).
39:6 40:6
NOTE: Variable A is uninitialized.
NOTE: Invalid numeric data, STAGE1='C' , at line 39 column 6.
NOTE: Invalid numeric data, STAGE1='C' , at line 40 column 6.
GRADE=D PASS=No EXAM=35 DEGREE=Other GENDER=Male ATTEND=No ASSIGN=8 TEST=8.2 B=4 C=1 MC=15 COLOUR=Blue STAGE1=C YEARSSIN=3
REPEAT=No A= StageA=0 StageB=0 StageC=0 _ERROR_=1 _N_=1
NOTE: Invalid numeric data, STAGE1='C' , at line 39 column 6.
NOTE: Invalid numeric data, STAGE1='C' , at line 40 column 6.
GRADE=D PASS=No EXAM=25 DEGREE=BA GENDER=Male ATTEND=No ASSIGN=8.8 TEST=10.9 B=3 C=0 MC=11 COLOUR=Yellow STAGE1=C
YEARSSIN=1.5 REPEAT=No A= StageA=0 StageB=0 StageC=0 _ERROR_=1 _N_=2
NOTE: Invalid numeric data, STAGE1='B' , at line 39 column 6.
NOTE: Invalid numeric data, STAGE1='B' , at line 40 column 6.
GRADE=C PASS=Yes EXAM=52 DEGREE=Other GENDER=Female ATTEND=No ASSIGN=12 TEST=15.5 B=8 C=14 MC=15 COLOUR=Yellow STAGE1=B
YEARSSIN=0 REPEAT=No A= StageA=0 StageB=0 StageC=0 _ERROR_=1 _N_=3
NOTE: Invalid numeric data, STAGE1='A' , at line 39 column 6.
NOTE: Invalid numeric data, STAGE1='A' , at line 40 column 6.
GRADE=A PASS=Yes EXAM=81 DEGREE=BCom GENDER=Male ATTEND=No ASSIGN=16.8 TEST=17.3 B=13 C=16 MC=26 COLOUR=Yellow STAGE1=A
YEARSSIN=2.5 REPEAT=No A= StageA=0 StageB=0 StageC=0 _ERROR_=1 _N_=4
NOTE: Invalid numeric data, STAGE1='A' , at line 39 column 6.
NOTE: Invalid numeric data, STAGE1='A' , at line 40 column 6.
GRADE=A PASS=Yes EXAM=77 DEGREE=BSc GENDER=Female ATTEND=No ASSIGN=16.4 TEST=12.7 B=15 C=18 MC=22 COLOUR=Green STAGE1=A
YEARSSIN=0.5 REPEAT=No A= StageA=0 StageB=0 StageC=0 _ERROR_=1 _N_=5
NOTE: Invalid numeric data, STAGE1='C' , at line 39 column 6.
NOTE: Invalid numeric data, STAGE1='C' , at line 40 column 6.
GRADE=D PASS=No EXAM=32 DEGREE=BSc GENDER=Female ATTEND=No ASSIGN=8 TEST=10 B=8 C=4 MC=10 COLOUR=Blue STAGE1=C YEARSSIN=1
REPEAT=No A= StageA=0 StageB=0 StageC=0 _ERROR_=1 _N_=6
NOTE: Invalid numeric data, STAGE1='C' , at line 39 column 6.
NOTE: Invalid numeric data, STAGE1='C' , at line 40 column 6.
GRADE=C PASS=Yes EXAM=46 DEGREE=BSc GENDER=Female ATTEND=No ASSIGN=10.8 TEST=12.7 B=14 C=10 MC=11 COLOUR=Blue STAGE1=C
YEARSSIN=0.5 REPEAT=No A= StageA=0 StageB=0 StageC=0 _ERROR_=1 _N_=7
NOTE: Invalid numeric data, STAGE1='A' , at line 39 column 6.
NOTE: Invalid numeric data, STAGE1='A' , at line 40 column 6.
GRADE=A PASS=Yes EXAM=89 DEGREE=BSc GENDER=Female ATTEND=No ASSIGN=20 TEST=18.2 B=16 C=15 MC=29 COLOUR=Yellow STAGE1=A
YEARSSIN=1 REPEAT=No A= StageA=0 StageB=0 StageC=0 _ERROR_=1 _N_=8
NOTE: Invalid numeric data, STAGE1='C' , at line 39 column 6.
NOTE: Invalid numeric data, STAGE1='C' , at line 40 column 6.
GRADE=D PASS=No EXAM=36 DEGREE=BSc GENDER=Female ATTEND=No ASSIGN=12.8 TEST=12.7 B=5 C=7 MC=12 COLOUR=Blue STAGE1=C
YEARSSIN=2 REPEAT=No A= StageA=0 StageB=0 StageC=0 _ERROR_=1 _N_=9
NOTE: Invalid numeric data, STAGE1='B' , at line 39 column 6.
NOTE: Invalid numeric data, STAGE1='B' , at line 40 column 6.
GRADE=B PASS=Yes EXAM=57 DEGREE=BCom GENDER=Female ATTEND=No ASSIGN=17.6 TEST=14.5 B=10 C=13 MC=17 COLOUR=Green STAGE1=B
YEARSSIN=2 REPEAT=No A= StageA=0 StageB=0 StageC=0 _ERROR_=1 _N_=10
NOTE: Invalid numeric data, STAGE1='C' , at line 39 column 6.
NOTE: Invalid numeric data, STAGE1='C' , at line 40 column 6.
GRADE=D PASS=No EXAM=35 DEGREE=BSc GENDER=Male ATTEND=No ASSIGN=14 TEST=4.5 B=3 C=10 MC=11 COLOUR=Green STAGE1=C YEARSSIN=1
REPEAT=No A= StageA=0 StageB=0 StageC=0 _ERROR_=1 _N_=11
NOTE: Invalid numeric data, STAGE1='B' , at line 39 column 6.
NOTE: Invalid numeric data, STAGE1='B' , at line 40 column 6.
GRADE=C PASS=Yes EXAM=44 DEGREE=BSc GENDER=Female ATTEND=No ASSIGN=14.8 TEST=9.1 B=8 C=10 MC=13 COLOUR=Green STAGE1=B
YEARSSIN=1.5 REPEAT=No A= StageA=0 StageB=0 StageC=0 _ERROR_=1 _N_=12
NOTE: Invalid numeric data, STAGE1='B' , at line 39 column 6.
NOTE: Invalid numeric data, STAGE1='B' , at line 40 column 6.
GRADE=C PASS=Yes EXAM=52 DEGREE=BSc GENDER=Female ATTEND=No ASSIGN=14.4 TEST=6.4 B=7 C=15 MC=15 COLOUR=Pink STAGE1=B
YEARSSIN=1.5 REPEAT=No A= StageA=0 StageB=0 StageC=0 _ERROR_=1 _N_=13
NOTE: Invalid numeric data, STAGE1='C' , at line 39 column 6.
NOTE: Invalid numeric data, STAGE1='C' , at line 40 column 6.
GRADE=D PASS=No EXAM=32 DEGREE=BA GENDER=Female ATTEND=No ASSIGN=9.6 TEST=13.6 B=4 C=12 MC=8 COLOUR=Green STAGE1=C
YEARSSIN=2.5 REPEAT=No A= StageA=0 StageB=0 StageC=0 _ERROR_=1 _N_=14
NOTE: Invalid numeric data, STAGE1='C' , at line 39 column 6.
NOTE: Invalid numeric data, STAGE1='C' , at line 40 column 6.
GRADE=C PASS=Yes EXAM=42 DEGREE=BSc GENDER=Male ATTEND=No ASSIGN=11.2 TEST=13.6 B=4 C=10 MC=14 COLOUR=Pink STAGE1=C
YEARSSIN=0.5 REPEAT=No A= StageA=0 StageB=0 StageC=0 _ERROR_=1 _N_=15
NOTE: Invalid numeric data, STAGE1='B' , at line 39 column 6.
NOTE: Invalid numeric data, STAGE1='B' , at line 40 column 6.
GRADE=C PASS=Yes EXAM=54 DEGREE=BSc GENDER=Female ATTEND=No ASSIGN=14.8 TEST=10.9 B=9 C=13 MC=16 COLOUR=Blue STAGE1=B
YEARSSIN=3 REPEAT=No A= StageA=0 StageB=0 StageC=0 _ERROR_=1 _N_=16
NOTE: Invalid numeric data, STAGE1='A' , at line 39 column 6.
NOTE: Invalid numeric data, STAGE1='A' , at line 40 column 6.
GRADE=A PASS=Yes EXAM=72 DEGREE=BCom GENDER=Male ATTEND=No ASSIGN=17.2 TEST=15.5 B=11 C=17 MC=22 COLOUR=Yellow STAGE1=A
YEARSSIN=2 REPEAT=No A= StageA=0 StageB=0 StageC=0 _ERROR_=1 _N_=17
NOTE: Invalid numeric data, STAGE1='A' , at line 39 column 6.
NOTE: Invalid numeric data, STAGE1='A' , at line 40 column 6.
GRADE=A PASS=Yes EXAM=72 DEGREE=Other GENDER=Male ATTEND=No ASSIGN=16.4 TEST=16.4 B=15 C=9 MC=24 COLOUR=Yellow STAGE1=A
YEARSSIN=1 REPEAT=No A= StageA=0 StageB=0 StageC=0 _ERROR_=1 _N_=18
NOTE: Invalid numeric data, STAGE1='C' , at line 39 column 6.
NOTE: Invalid numeric data, STAGE1='C' , at line 40 column 6.
GRADE=D PASS=No EXAM=41 DEGREE=BSc GENDER=Male ATTEND=No ASSIGN=5.2 TEST=4.5 B=9 C=4 MC=14 COLOUR=Green STAGE1=C YEARSSIN=4.5
REPEAT=No A= StageA=0 StageB=0 StageC=0 _ERROR_=1 _N_=19
NOTE: Invalid numeric data, STAGE1='A' , at line 39 column 6.
NOTE: Invalid numeric data, STAGE1='A' , at line 40 column 6.
WARNING: Limit set by ERRORS= option reached. Further errors of this type will not be printed.
GRADE=D PASS=No EXAM=24 DEGREE=BSc GENDER=Male ATTEND=No ASSIGN=2.8 TEST=14.5 B=0 C=0 MC=12 COLOUR=Yellow STAGE1=A
YEARSSIN=1.5 REPEAT=No A= StageA=0 StageB=0 StageC=0 _ERROR_=1 _N_=20
NOTE: There were 146 observations read from the data set WORK.NEW.
NOTE: The data set WORK.FORREG has 146 observations and 19 variables.
NOTE: DATA statement used (Total process time):
real time 0.20 seconds
cpu time 0.18 seconds

 

Would you help me kindly please?

 

I also attached the original file of image.캡처.JPG

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Astounding
PROC Star

In terms of your syntax, you start out by incorrectly omitting quotes:

 

if Stage1=A then StageA=1; else StageA=0;

 

The right way:

 

if Stage1='A' then StageA=1; else StageA=0;

 

In terms of your plan, the other comments you have received are right on the money.  It is rare that you need to create your own dummy variables.  Most regression procedures support a CLASS statement, which creates dummy variables as needed.  That also avoids the mistake of overspecifying.  For example, if your CLASS variable can take on values of "A", "B", or "C" only, your analysis  should specify just two dummy variables, not three.

View solution in original post

9 REPLIES 9
WarrenKuhfeld
Rhodochrosite | Level 12

https://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_transreg_se...

 

proc transreg with the DESIGN option gives you control over the names, labels, and coding.  GLIMMIX has options as well.  GLMMOD was designed to give you the GLM internal coding, so it does not give you much control.

Astounding
PROC Star

In terms of your syntax, you start out by incorrectly omitting quotes:

 

if Stage1=A then StageA=1; else StageA=0;

 

The right way:

 

if Stage1='A' then StageA=1; else StageA=0;

 

In terms of your plan, the other comments you have received are right on the money.  It is rare that you need to create your own dummy variables.  Most regression procedures support a CLASS statement, which creates dummy variables as needed.  That also avoids the mistake of overspecifying.  For example, if your CLASS variable can take on values of "A", "B", or "C" only, your analysis  should specify just two dummy variables, not three.

WarrenKuhfeld
Rhodochrosite | Level 12

I must admit my bias.  I wrote transreg.  For that reason and others, I think glmmod is never the answer to questions about how to code.  It is the answer to the question: how do I get the coded variables that glm creates? glmmod was never designed for general purpose coding.  In modeling, people often want to code in different ways.  That is why I added so many options to transreg.  Later, that is why so many options were added to the CLASS and EFFECT statements in other procedures. 

 

Creating three binary variables for a CLASS variable with three values is not a mistake.  It is an option. It is the way you specify a cell means model, which has an implicit intercept  It is not the way you specify a reference cell model. It is how you specify a separate slopes and intercepts model. It is not the way to specify a model with an intercept, slope, changes in intercepts, and changes in slopes.

 

In GLM, the less than full rank model was a way that could handle different patterns of BY groups, missing data, invalid weights and freqs, and anything else that was thrown at it.  One binary variable gets created for each level, and a sequential sweep decides which ones are needed. Again, it was not a mistake. @Astounding is correct though: pick a model and a procedure.  Don't use a DATA step.  If you do use a DATA step, do it the easy way.  There is no reason to have an IF, ELSE, and two assignment statements for each level.  Variable = Boolean-expression; makes a binary variable:

 

StageA = Stage1='A';

Rick_SAS
SAS Super FREQ

For an overview of creating dummy variables in SAS, see "Four ways to create a design matrix in SAS."  As others have said, however, many regression procedures support a CLASS statement, so you might not need to create dummy variables explicitly.

rangarat
Calcite | Level 5

I tried this with my data, and my program's running smoothly, but when running frequencies of the original variable and the new dummies, I see that I have a ton more missing data with the dummy variables.  Why is that?

Astounding
PROC Star

You tried "this"?  What did you try?  Please post the log from what you tried, and explain which variables are missing more often than you would expect.

rangarat
Calcite | Level 5

I tried using the code you've mentioned above with my data, and after creating the dummy variables from the raw variable for which I needed dummy variables and after recoding the dummy variables to the newly specified one in my data dictionary, I have 717 missing data versus the 90 from the raw variable. Here's an example:

739 PROC FREQ DATA = INPHAASE.STAGINGPROCEDURE;
740 TABLE STAGING_PROCEDURE_TYPE;
741 RUN;

NOTE: There were 734 observations read from the data set INPHAASE.STAGINGPROCEDURE.
NOTE: PROCEDURE FREQ used (Total process time):
real time 1.54 seconds
cpu time 0.07 seconds


742 DATA INPHAASE.DUMMYPROCSTAGETYPE;
743 SET INPHAASE.STAGINGPROCEDURE;
744 IF STAGING_PROCEDURE_TYPE = "1" THEN STAGING_PROCEDURE_TYPE_1 = 1; ELSE
744! STAGING_PROCEDURE_TYPE_1 = 0;
745 IF STAGING_PROCEDURE_TYPE = "2" THEN STAGING_PROCEDURE_TYPE_2 = 1; ELSE
745! STAGING_PROCEDURE_TYPE_2 = 0;
746 IF STAGING_PROCEDURE_TYPE = "3" THEN STAGING_PROCEDURE_TYPE_3 = 1; ELSE
746! STAGING_PROCEDURE_TYPE_3 = 0;
747 IF STAGING_PROCEDURE_TYPE = "4" THEN STAGING_PROCEDURE_TYPE_4 = 1; ELSE
747! STAGING_PROCEDURE_TYPE_4 = 0;
748 IF STAGING_PROCEDURE_TYPE = "5" THEN STAGING_PROCEDURE_TYPE_5 = 1; ELSE
748! STAGING_PROCEDURE_TYPE_5 = 0;
749 IF STAGING_PROCEDURE_TYPE = "6" THEN STAGING_PROCEDURE_TYPE_6 = 1; ELSE
749! STAGING_PROCEDURE_TYPE_6 = 0;
750 IF STAGING_PROCEDURE_TYPE = "7" THEN STAGING_PROCEDURE_TYPE_7 = 1; ELSE
750! STAGING_PROCEDURE_TYPE_7 = 0;
751 IF STAGING_PROCEDURE_TYPE = "8" THEN STAGING_PROCEDURE_TYPE_8 = 1; ELSE
751! STAGING_PROCEDURE_TYPE_8 = 0;
752 IF STAGING_PROCEDURE_TYPE = "99" THEN STAGING_PROCEDURE_TYPE_99 = 1; ELSE
752! STAGING_PROCEDURE_TYPE_99 = 0;
753 RUN;

NOTE: There were 734 observations read from the data set INPHAASE.STAGINGPROCEDURE.
NOTE: The data set INPHAASE.DUMMYPROCSTAGETYPE has 734 observations and 291 variables.
NOTE: DATA statement used (Total process time):
real time 2.27 seconds
cpu time 0.18 seconds


754 PROC TABULATE DATA = INPHAASE.DUMMYPROCSTAGETYPE;
755 VAR STAGING_PROCEDURE_TYPE_1-STAGING_PROCEDURE_TYPE_8 STAGING_PROCEDURE_TYPE_99
755! STUDY_ID;
756 TABLE STAGING_PROCEDURE_TYPE_1-STAGING_PROCEDURE_TYPE_8
756! STAGING_PROCEDURE_TYPE_99 / BOX = "Staging Procedure Type";
757 LABEL STAGING_PROCEDURE_TYPE_1 = "TAH"
758 STAGING_PROCEDURE_TYPE_2 = "BSO"
759 STAGING_PROCEDURE_TYPE_3 = "LND"
760 STAGING_PROCEDURE_TYPE_4 = "Omentectomy"
761 STAGING_PROCEDURE_TYPE_5 = "TRS"
762 STAGING_PROCEDURE_TYPE_6 = "USO"
763 STAGING_PROCEDURE_TYPE_7 = "Exploratory Laparotomy/Biopsy"
764 STAGING_PROCEDURE_TYPE_8 = "Bowel Resection"
765 STAGING_PROCEDURE_TYPE_99 = "Unknown";
766 RUN;

NOTE: There were 734 observations read from the data set INPHAASE.DUMMYPROCSTAGETYPE.
NOTE: PROCEDURE TABULATE used (Total process time):
real time 0.79 seconds
cpu time 0.06 seconds


767 PROC FREQ DATA = INPHAASE.DUMMYPROCSTAGETYPE;
768 TABLES STAGING_PROCEDURE_TYPE_1 STAGING_PROCEDURE_TYPE_2 STAGING_PROCEDURE_TYPE_3
768! STAGING_PROCEDURE_TYPE_4 STAGING_PROCEDURE_TYPE_5
769 STAGING_PROCEDURE_TYPE_6 STAGING_PROCEDURE_TYPE_7 STAGING_PROCEDURE_TYPE_8
769! STAGING_PROCEDURE_TYPE_99;
770 RUN;

NOTE: There were 734 observations read from the data set INPHAASE.DUMMYPROCSTAGETYPE.
NOTE: PROCEDURE FREQ used (Total process time):
real time 0.06 seconds
cpu time 0.04 seconds


771 DATA INPHAASE.PROCSTAGETYPE;
772 SET INPHAASE.DUMMYPROCSTAGETYPE;
773 LENGTH PROCSTAGETYPE $100;
774 IF STAGING_PROCEDURE_TYPE_1 = 1 THEN PROCSTAGETYPE = "TAH";
775 IF STAGING_PROCEDURE_TYPE_2 = 1 THEN PROCSTAGETYPE = "BSO";
776 IF STAGING_PROCEDURE_TYPE_3 = 1 THEN PROCSTAGETYPE = "LND";
777 IF STAGING_PROCEDURE_TYPE_4 = 1 THEN PROCSTAGETYPE = "Omentectomy";
778 IF STAGING_PROCEDURE_TYPE_5 = 1 THEN PROCSTAGETYPE = "TRS";
779 IF STAGING_PROCEDURE_TYPE_6 = 1 THEN PROCSTAGETYPE = "USO";
780 IF STAGING_PROCEDURE_TYPE_7 = 1 THEN PROCSTAGETYPE = "Exploratory Laparotomy/Biopsy";
781 IF STAGING_PROCEDURE_TYPE_8 = 1 THEN PROCSTAGETYPE = "Bowel Resection";
782 IF STAGING_PROCEDURE_TYPE_99 = 1 THEN PROCSTAGETYPE = "Unknown";
783 RUN;

NOTE: There were 734 observations read from the data set INPHAASE.DUMMYPROCSTAGETYPE.
NOTE: The data set INPHAASE.PROCSTAGETYPE has 734 observations and 292 variables.
NOTE: DATA statement used (Total process time):
real time 2.14 seconds
cpu time 0.21 seconds


784 PROC FREQ DATA = INPHAASE.PROCSTAGETYPE;
785 TABLE PROCSTAGETYPE PROCSTAGETYPE*PROCSTAGE / NOROW NOPERCENT;
786 RUN;

NOTE: There were 734 observations read from the data set INPHAASE.PROCSTAGETYPE.
NOTE: PROCEDURE FREQ used (Total process time):
real time 0.92 seconds
cpu time 0.03 seconds


Astounding
PROC Star

So which variables are missing more often than you expect?

 

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 15418 views
  • 5 likes
  • 6 in conversation