proc logistic where all dependent variable observations have same response

stat12000 · Posted 08-05-2018 10:46 AM

All

I am trying to verify Microsoft R Client output from a logistic regression model with SAS. The dependent variable (yBinom2) has all values == 0 intentionally (this is realistic in my area of work -- e.g., no technologist sees red blood cells in a normal urine sample). Simulated data are below and attached (delimiter = "|"). When I run the regression model in R, estimation completes (code and output below for logit link function). When run in SAS with the logit link function, I receive the error message "All observations have the same response."

I am most interested in the predicted probabilities for each sample id. Does anyone know why the SAS solution will not estimate? Are there options in SAS to handle the scenario where the dependent variable has all 0's or 1's? Thank you in advance.

simulated data set:

sampleNo	y	x	yBinom1	yBinom2	yBinom3
1	6.68	6.92	1	0	1
2	6.75	6.66	0	0	1
3	6.85	6.93	1	0	1
4	6.86	6.98	1	0	1
5	6.67	6.95	1	0	1
6	6.96	6.69	1	0	1
7	6.91	6.55	0	0	1
8	6.82	6.87	1	0	1
9	6.55	6.81	0	0	1
10	6.87	6.75	1	0	1
11	6.52	6.94	0	0	1
12	6.59	6.79	0	0	1
13	6.6	6.87	1	0	1
14	6.56	6.68	0	0	1
15	6.65	6.53	1	0	1
16	6.68	6.88	0	0	1
17	6.64	6.91	0	0	1
18	6.96	6.99	0	0	1
19	6.9	6.83	1	0	1
20	6.6	6.91	0	0	1

R code:

A <- read.csv("C:/Users/BodnarJ/Desktop/functionalRequirement4_9_X/dataSim_FR_4_9_X.csv", sep = "|", header = TRUE, colClasses="character")

for(j in 2:ncol(A)){ A[,j] <- as.numeric(A[,j]) }

######################################################################
######################################################################
# binomial logistic regression model

vv <- A
logitMod <- glm( yBinom2 ~ x , data=vv , family=binomial(link="logit"))
predicted <- plogis(predict(logitMod, vv)) # predicted scores
vv$prob <- predicted
vv$probFlag <- ifelse(vv$prob > 0.5 , 1 , 0)
vv$resid <- logitMod$residuals

print( vv , row.names = F)

R Output:

sampleNo y x yBinom1 yBinom2 yBinom3 prob probFlag resid

1 6.68 6.92 1 0 1 7.884924e-12 0 -1
2 6.75 6.66 0 0 1 7.884924e-12 0 -1
3 6.85 6.93 1 0 1 7.884924e-12 0 -1
4 6.86 6.98 1 0 1 7.884924e-12 0 -1
5 6.67 6.95 1 0 1 7.884924e-12 0 -1
6 6.96 6.69 1 0 1 7.884924e-12 0 -1
7 6.91 6.55 0 0 1 7.884924e-12 0 -1
8 6.82 6.87 1 0 1 7.884924e-12 0 -1
9 6.55 6.81 0 0 1 7.884924e-12 0 -1
10 6.87 6.75 1 0 1 7.884924e-12 0 -1
11 6.52 6.94 0 0 1 7.884924e-12 0 -1
12 6.59 6.79 0 0 1 7.884924e-12 0 -1
13 6.60 6.87 1 0 1 7.884924e-12 0 -1
14 6.56 6.68 0 0 1 7.884924e-12 0 -1
15 6.65 6.53 1 0 1 7.884924e-12 0 -1
16 6.68 6.88 0 0 1 7.884924e-12 0 -1
17 6.64 6.91 0 0 1 7.884924e-12 0 -1
18 6.96 6.99 0 0 1 7.884924e-12 0 -1
19 6.90 6.83 1 0 1 7.884924e-12 0 -1
20 6.60 6.91 0 0 1 7.884924e-12 0 -1

SAS Code:

proc import datafile="C:\Users\BodnarJ\Desktop\functionalRequirement4_9_X\dataSim_FR_4_9_X.csv" dbms=csv out=work.anaDS replace;

delimiter="|";
getnames=yes;
guessingrows=7000;
run;

proc logistic data = anaDS ;
model yBinom2 = x / LINK = logit;
run;

SAS Log:

748 proc import datafile="C:\Users\BodnarJ\Desktop\functionalRequirement4_9_X\dataSim_FR_4_9_X.csv"
748! dbms=csv out=work.anaDS replace;
749 delimiter="|";
750 getnames=yes;
751 guessingrows=7000;
752 run;

753 /**********************************************************************
754 * PRODUCT: SAS
755 * VERSION: 9.4
756 * CREATOR: External File Interface
757 * DATE: 05AUG18
758 * DESC: Generated SAS Datastep Code
759 * TEMPLATE SOURCE: (None Specified.)
760 ***********************************************************************/
761 data WORK.ANADS ;
762 %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
763 infile 'C:\Users\BodnarJ\Desktop\functionalRequirement4_9_X\dataSim_FR_4_9_X.csv' delimiter =
763! '|' MISSOVER DSD lrecl=32767 firstobs=2 ;
764 informat sampleNo best32. ;
765 informat y best32. ;
766 informat x best32. ;
767 informat yBinom1 best32. ;
768 informat yBinom2 best32. ;
769 informat yBinom3 best32. ;
770 format sampleNo best12. ;
771 format y best12. ;
772 format x best12. ;
773 format yBinom1 best12. ;
774 format yBinom2 best12. ;
775 format yBinom3 best12. ;
776 input
777 sampleNo
778 y
779 x
780 yBinom1
781 yBinom2
782 yBinom3
783 ;
784 if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
785 run;

NOTE: The infile 'C:\Users\BodnarJ\Desktop\functionalRequirement4_9_X\dataSim_FR_4_9_X.csv' is:
Filename=C:\Users\BodnarJ\Desktop\functionalRequirement4_9_X\dataSim_FR_4_9_X.csv,
RECFM=V,LRECL=32767,File Size (bytes)=438,
Last Modified=05Aug2018:08:39:09,
Create Time=05Aug2018:08:26:59

NOTE: 20 records were read from the infile
'C:\Users\BodnarJ\Desktop\functionalRequirement4_9_X\dataSim_FR_4_9_X.csv'.
The minimum record length was 17.
The maximum record length was 18.
NOTE: The data set WORK.ANADS has 20 observations and 6 variables.
NOTE: DATA statement used (Total process time):
real time 0.05 seconds
cpu time 0.03 seconds

20 rows created in WORK.ANADS from
C:\Users\BodnarJ\Desktop\functionalRequirement4_9_X\dataSim_FR_4_9_X.csv.

NOTE: WORK.ANADS data set was successfully created.
NOTE: The data set WORK.ANADS has 20 observations and 6 variables.
NOTE: PROCEDURE IMPORT used (Total process time):
real time 0.13 seconds
cpu time 0.06 seconds

786
787 proc logistic data = anaDS ;
788 model yBinom2 = x / LINK = logit;
789 run;

ERROR: All observations have the same response. No statistics are computed.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 20 observations read from the data set WORK.ANADS.
NOTE: PROCEDURE LOGISTIC used (Total process time):
real time 0.02 seconds
cpu time 0.01 seconds

PaigeMiller · Posted 08-05-2018 10:48 AM

Seems to me you are asking the wrong people; SAS is giving the correct answer. You need to ask the R gurus how R can fit a model in the situation where all values of the dependent variable are constant.
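One way to see why a procedure might refuse to fit here is to write out the likelihood. The derivation below is a sketch for the intercept-only logit model with every response equal to 0; the symbols (beta_0 for the intercept, p for the event probability, n for the number of observations) are notation introduced for illustration and do not come from the thread.

```latex
% Intercept-only logistic model, all y_i = 0, with p = e^{\beta_0} / (1 + e^{\beta_0})
\ell(\beta_0) = \sum_{i=1}^{n} \log(1 - p) = -n \log\bigl(1 + e^{\beta_0}\bigr)

% The score is strictly negative for every finite intercept:
\frac{d\ell}{d\beta_0} = -\,n\,\frac{e^{\beta_0}}{1 + e^{\beta_0}} = -\,n\,p < 0

% Hence \ell increases without attaining its supremum as \beta_0 \to -\infty:
\sup_{\beta_0} \ell(\beta_0) = 0, \qquad \text{approached only as } p \to 0
```

Under these assumptions no finite maximum likelihood estimate exists, so any finite number a program reports reflects where its iterations stopped rather than a maximizer.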

--
Paige Miller

PGStats · Posted 08-05-2018 06:08 PM

You could estimate proportion confidence limits for the intercept only model, as done with proc freq. But for models involving explanatory variables, any estimate you may get will depend heavily on pretty strong assumptions. Proc glimmix will give you estimates, but check how they depend on convergence criteria:

data test;
do x = 1 to 20;
    event = 0;
    output;
    end;
run;

/* decent estimates */
proc freq data=test;
table event / binomial(cl=exact);
run;

/* Frivolous estimates */
proc glimmix data=test;
model event = / dist=binomial link=logit s cl;
run;

proc glimmix data=test;
model event = / dist=binomial link=logit s cl;
nloptions absgconv=0.000001;
run;
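The same dependence on convergence criteria can be checked on the R side. The following is a minimal sketch, assuming base R only; the x values are copied from the simulated data in the original post, and the helper name fit_with_tolerance is hypothetical. It refits the all-zero response with two different values of the glm.control epsilon setting and compares the predicted probabilities.

```r
# x values copied from the simulated data set in the original post
x <- c(6.92, 6.66, 6.93, 6.98, 6.95, 6.69, 6.55, 6.87, 6.81, 6.75,
       6.94, 6.79, 6.87, 6.68, 6.53, 6.88, 6.91, 6.99, 6.83, 6.91)
dat <- data.frame(x = x, yBinom2 = rep(0, length(x)))

# Hypothetical helper: fit the logit model with a given convergence
# tolerance and return the predicted probabilities on the response scale.
fit_with_tolerance <- function(eps) {
  fit <- suppressWarnings(
    glm(yBinom2 ~ x, data = dat, family = binomial(link = "logit"),
        control = glm.control(epsilon = eps, maxit = 100))
  )
  predict(fit, newdata = dat, type = "response")
}

p_loose <- fit_with_tolerance(1e-4)  # looser stopping rule
p_tight <- fit_with_tolerance(1e-8)  # R's default tolerance

# The two runs stop at different points on the way toward zero
print(range(p_loose))
print(range(p_tight))
```

If the two ranges differ by orders of magnitude, the reported probabilities are a product of the stopping rule rather than of the data, which is the same caution raised above for proc glimmix.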

PG

stat12000 · Posted 08-05-2018 08:45 PM

Thank you.

PGStats · Posted 08-06-2018 12:31 AM

Illustration of a possible scenario with all events=0

data test;
mu = 500;
sigma = 75;
do x = 1 to 2000;
    p = logistic( (x-mu)/sigma );
    if x <= 20 then event = 0; else call missing(event);
    output;
    end;
label p="true probability" ;
keep x p event;
run;

proc freq data=test;
table event / binomial(cl=exact);
ods output BinomialCLs=testCL;
run;

data testgraph;
if _n_=1 then set testCL;
set test;
if event = 0 then do;
    lowerLimit = 1-upperCL;
    upperLimit = 1-lowerCL;
    end;
keep x p event lowerLimit upperLimit;
run;

ods listing style=journal;
proc sgplot data=testgraph;
band x=x lower=lowerLimit upper=upperLimit / legendlabel="95% confidence band";
series x=x y=p;
scatter x=x y=event;
xaxis type=log;
yaxis label="event probability";
run;
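For readers verifying from R, the exact (Clopper-Pearson) limits that proc freq reports can be approximated with base R's binom.test. This is a sketch assuming 0 events out of 20 observations, as in the example above; the variable names are illustrative only.

```r
n <- 20       # number of observations
events <- 0   # number of observed events (all responses are 0)

# Exact (Clopper-Pearson) 95% interval for the event probability
ci <- binom.test(events, n, conf.level = 0.95)$conf.int
lower <- ci[1]
upper <- ci[2]

# With zero events the upper limit has a closed form: 1 - (alpha/2)^(1/n)
closed_form_upper <- 1 - 0.025^(1 / n)

print(c(lower = lower, upper = upper, closed_form = closed_form_upper))
```

The interval is for the event probability directly, so no 1 minus limit conversion is needed as in the data step above, where proc freq reports limits for the proportion of event = 0.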

PG

Reeza · Posted 08-05-2018 08:39 PM

I don't think you can build a statistically valid model with all responses 1 or 0, simply because it means you don't need to predict anything. If you predict that all are 1 or 0 then you're good to go, why bother with a model at all?

stat12000 · Posted 08-05-2018 08:50 PM

So the larger piece of the puzzle is that I am building this type of model that will loop through say 30 parameters. Some of these parameters have all response values = 0, some have all response values = 1, and some have all responses values with a mixture of 0's and 1's. All of these data patterns are clinically expected.
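A loop over parameters with mixed data patterns could branch on whether the response is constant. The sketch below is one possible arrangement, not a validated or regulator-approved method: it assumes base R, uses the simulated columns from the original post, and the function name summarise_response is hypothetical. Constant responses get the observed proportion with an exact binomial interval; mixed responses get a logit fit with Wald limits back-transformed from the link scale.

```r
# Data copied from the simulated data set in the original post
x <- c(6.92, 6.66, 6.93, 6.98, 6.95, 6.69, 6.55, 6.87, 6.81, 6.75,
       6.94, 6.79, 6.87, 6.68, 6.53, 6.88, 6.91, 6.99, 6.83, 6.91)
yBinom1 <- c(1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0)
dat <- data.frame(x, yBinom1, yBinom2 = 0, yBinom3 = 1)

# Hypothetical helper: one row per sample with a probability and limits
summarise_response <- function(y, x) {
  if (length(unique(y)) == 1) {
    # Constant response: no logit fit, report exact binomial limits
    ci <- binom.test(sum(y), length(y))$conf.int
    data.frame(method = "exact binomial", prob = mean(y),
               lower = rep(ci[1], length(y)), upper = rep(ci[2], length(y)))
  } else {
    # Mixed response: ordinary logistic regression with Wald limits
    fit <- glm(y ~ x, family = binomial(link = "logit"))
    pr <- predict(fit, type = "link", se.fit = TRUE)
    data.frame(method = "glm", prob = plogis(pr$fit),
               lower = plogis(pr$fit - 1.96 * pr$se.fit),
               upper = plogis(pr$fit + 1.96 * pr$se.fit))
  }
}

responses <- c("yBinom1", "yBinom2", "yBinom3")
results <- lapply(responses, function(v) summarise_response(dat[[v]], dat$x))
names(results) <- responses

# First row of each result, to show which branch was taken
print(lapply(results, head, 1))
```

With this layout the all-0 and all-1 parameters never reach the iterative fit, so their reported limits do not depend on a convergence tolerance.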

Our other statistician tried the SAS proc logistic path to verify my results using R and her code crashed because of the all 0's and all 1's. Mine did not crash.

So R does this estimation (somehow), and the probabilities are pretty similar to PROC GLIMMIX which is giving probabilities around 4e-8. R is giving probabilities about 7e-12. So clearly these probabilities allow the same conclusion to be reached.

Thank you for your expertise.

Reeza · Posted 08-05-2018 09:04 PM

So R does this estimation (somehow), and the probabilities are pretty similar to PROC GLIMMIX which is giving probabilities around 4e-8. R is giving probabilities about 7e-12. So clearly these probabilities allow the same conclusion to be reached.

Those are 0.

Whether you are confident enough in the 'somehow' when you have to explain it to someone else is really all that matters. I would also expect any of those variables to be excluded (or to fall out with a selection algorithm) from a final model when a full model is fit.

stat12000 · Posted 08-05-2018 09:19 PM

Exclusion of such parameters is not acceptable by the FDA. For such models we are essentially trying to demonstrate that probabilities are high when expected and low also when expected. I clearly understand that the true probability is zero, but a predicted probability cannot be zero; likewise for the predicted probability asymptote for all 1's. I equate this with computation of the statistical power of an effect size. Such a probability has range space 0 < power < 1, exclusive of 0 and 1.

Another need for this type of analysis pertains to the concept of analyte carryover where you want to show the likelihood of detecting elements (RBCs, WBCs, etc.) in high concentration (abnormal) samples is very high, and then the likelihood of detecting elements (RBCs, WBCs, etc.) in low concentration (normal) samples is very low.

PaigeMiller · Posted 08-06-2018 08:54 AM

@stat12000 wrote:

Exclusion of such parameters is not acceptable by the FDA. For such models we are essentially trying to demonstrate that probabilities are high when expected and low also when expected. I clearly understand that the true probability is zero, but a predicted probability cannot be zero; likewise for the predicted probability asymptote for all 1's. I equate this with computation of the statistical power of an effect size. Such a probability has range space 0 < power < 1, exclusive of 0 and 1.

While I have no experience with the FDA, let me say that modeling does not always lead to truth. The truth is, if your data is all zeros, then the probability of zero is 1, regardless of the fact that your modeling method doesn't get that number. In essence, when your data is all zeros, you have the wrong modeling method.

I clearly understand that the true probability is zero, but a predicted probability cannot be zero; likewise for the predicted probability asymptote for all 1's. I equate this with computation of the statistical power of an effect size. Such a probability has range space 0 < power < 1, exclusive of 0 and 1.

Your modeling method fails when there are all zeros in the Y variable. So don't use it.

--
Paige Miller
