Re: problem with constants and variables

deleted_user · Posted 10-14-2008 12:41 PM

consider the following code

http://rafb.net/p/b6EolO67.html

it produces

Obs b
1 2
2 0
3 4

no problem whatsoever.

now consider this code

http://rafb.net/p/RQIrjE82.html

this code gives me some error saying "Missing values were generated as a result of performing an operation on missing values". it works when i write t = 2, but not with t = intercept. why?

all i want to do is to multiply all the observations in x1 with the intercept (a constant) produced by the logistic procedure.

why doesn't it work?

sbb · Posted 10-14-2008 03:37 PM

Suggest that you consider adding a DATA step PUTLOG statement to list your variables in the last DATA step -- this info should help explain the message you are getting from SAS. Focus on the intercept variable values with each DATA step iteration, using the PUTLOG below:

PUTLOG _ALL_;

Scott Barry
SBBWorks, Inc.

Cynthia_sas · Posted 10-14-2008 11:08 PM

Hi:
Looking at your code, I'm confused by several things. So here are some comments that may help you get pointed in a direction. First, when I see code I MUST put clear step boundaries to separate the steps so I can check the SAS log between steps...the RUN; in caps is my addition to the code to separate the step boundaries and the comments with numbers are my annotation. Otherwise, this is the program from your second link:
[pre]
** 1) Create data set A with variables X1, X2, X3;
data a;
input x1 x2 x3;
cards;
1 0 1
0 1 1
1 1 0
;
RUN;

** 2) Create data set B that is an exact copy of A;
data b; set a;
RUN;

** 3) run PROC LOGISTIC step using data set A as input;
** create an output data set est_y1 from the LOGISTIC step;
proc logistic data=a outest=est_y1;
model x1 = x2 x3/clparm=pl clodds=pl selection=backward;
RUN;

** 4) Create data set X from est_y1 and A. This step has problems in that;
** variable B is not getting created as you wish;
data x; set est_y1 a;
t = intercept;
b = x1*t;
RUN;

** 5) print data set X and show only var B;
proc print data = x;
var b;
run;
[/pre]

I would say that you have a program composed of 5 steps. I did not see any issues with Step 1 or Step 2. I do not understand the need for step 2 at all, since you never use data set B again.

Step 3 is a Proc Logistic step and although it seems to work, I do get warnings in the log when I run your code:
[pre]
275 proc logistic data=a outest=est_y1;
276 model x1 = x2 x3/clparm=pl clodds=pl selection=backward;
277 RUN;

NOTE: PROC LOGISTIC is modeling the probability that x1=0. One way to change this to model the probability that x1=1 is to specify
the response variable option EVENT='1'.
WARNING: There is a complete separation of data points in Step 0. The maximum likelihood estimate does not exist.
WARNING: The LOGISTIC procedure continues in spite of the above warning. Results shown are based on the last maximum likelihood
iteration. Validity of the model fit is questionable.
WARNING: There is possibly a quasicomplete separation of data points in step 1. The maximum likelihood estimate may not exist.
WARNING: The LOGISTIC procedure continues in spite of the above warning. Results shown are based on the last maximum likelihood
iteration. Validity of the model fit is questionable.
[/pre]

However, assuming that you are comfortable with the warning, the output data set has one observation and looks like this:
[pre]
Obs _LINK_ _TYPE_ _STATUS_ _NAME_ Intercept x2 x3 _LNLIKE_

1 LOGIT PARMS 0 Converged x1 -0.69315 . . -1.90954

[/pre]

It looks like in your step 4, you want to create the variable T from the INTERCEPT variable in the outest dataset EST_Y1. The issue comes in your step 4. If you look in the log after your version of step 4 runs:
[pre]
279 ** 4) Create data set X from est_y1 and A. This step has problems in that;
280 ** variable B is not getting created as you wish;
281 data x; set est_y1 a;
282 t = intercept;
283 b = x1*t;
284 RUN;

NOTE: Missing values were generated as a result of performing an operation on missing values.
Each place is given by: (Number of times) at (Line):(Column).
4 at 283:7
NOTE: There were 1 observations read from the data set WORK.EST_Y1.
NOTE: There were 3 observations read from the data set WORK.A.
NOTE: The data set WORK.X has 4 observations and 11 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

[/pre]

You see that missing values were created 4 times. If you follow Scott's advice and use the PUTLOG statement by modifying the program above -- you see this (only PUTLOG & PUTLOG output shown):
[pre]

PUTLOG _n_= intercept= x1= x2= x3= t= b=;

_N_=1 Intercept=-0.693147181 x1=. x2=. x3=. t=-0.693147181 b=.
_N_=2 Intercept=. x1=1 x2=0 x3=1 t=. b=.
_N_=3 Intercept=. x1=0 x2=1 x3=1 t=. b=.
_N_=4 Intercept=. x1=1 x2=1 x3=0 t=. b=.

[/pre]

What's happening is that when the data set being read is the EST_Y1 data set, then you have a value for INTERCEPT; however, when the program switches to read from data set A, then INTERCEPT is set to missing (just as X1, X2 and X3 are missing when reading from EST_Y1, but have values when reading from A).

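The way the data step works is that the "buffer" area that holds the variables is reset to missing at certain points: variables created by assignment (such as T and B) are set to missing at the top of every iteration, and when a SET statement that lists several data sets finishes one data set and starts reading the next, the variables read from the data sets are set to missing as well. So what the PUTLOG reveals is that the NOTE in the log is exactly correct...when you multiply any number by a missing value, the result is missing.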
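To see which input data set supplies each row, one option is the IN= data set option on the SET statement. The sketch below is only an illustration: EST is a hypothetical one-row stand-in for EST_Y1 with the intercept value typed in by hand, and FROM_EST and FROM_A are made-up flag names.
[pre]
data est;                      * hypothetical stand-in for EST_Y1;
   intercept = -0.693147181;
run;

data a;
   input x1 x2 x3;
   cards;
1 0 1
0 1 1
1 1 0
;
run;

data x;
   * the same concatenating SET as in step 4, plus IN= flags;
   set est(in=from_est) a(in=from_a);
   t = intercept;
   b = x1*t;                   * missing whenever x1 or t is missing;
   putlog _n_= from_est= from_a= intercept= x1= t= b=;
run;
[/pre]

On the one row where FROM_EST=1, INTERCEPT has a value but X1 is missing; on the three rows where FROM_A=1, it is the other way around. Either way one operand is missing, so B is missing on all four rows.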

The reason that it works when you use a constant such as T=2; is that the assignment statement is EXECUTED for every iteration through the data step. So at the "top" of each iteration, T is missing, but once the assignment statement has run, T is 2 again. You can prove this to yourself by changing your first program to be as shown below:
[pre]
data b;
set a;
PUTLOG 'Top of DATA step:';
PUTLOG _ALL_;
t = 2;
PUTLOG 'Before Calc B';
PUTLOG _ALL_;

b = x1*t;
PUTLOG 'Bottom of DATA step:';
PUTLOG _ALL_;

RUN;

Top of DATA step:
x1=1 t=. b=. _ERROR_=0 _N_=1
Before Calc B
x1=1 t=2 b=. _ERROR_=0 _N_=1
Bottom of DATA step:
x1=1 t=2 b=2 _ERROR_=0 _N_=1
Top of DATA step:
x1=0 t=. b=. _ERROR_=0 _N_=2
Before Calc B
x1=0 t=2 b=. _ERROR_=0 _N_=2
Bottom of DATA step:
x1=0 t=2 b=0 _ERROR_=0 _N_=2
Top of DATA step:
x1=2 t=. b=. _ERROR_=0 _N_=3
Before Calc B
x1=2 t=2 b=. _ERROR_=0 _N_=3
Bottom of DATA step:
x1=2 t=2 b=4 _ERROR_=0 _N_=3
[/pre]

So, in order to fix your step 4, you have to do 2 things:
1) you need to read EST_Y1 only once, when _N_ = 1, and RETAIN the INTERCEPT value
and
2) then you need a separate SET statement for data set A.
Something like what's shown below:
[pre]
data x;
retain intercept;
if _n_ = 1 then set est_y1;
set a;
t = intercept;
b = x1*t;
PUTLOG _n_= intercept= x1= x2= x3= t= b=;
RUN;

_N_=1 intercept=-0.693147181 x1=1 x2=0 x3=1 t=-0.693147181 b=-0.693147181
_N_=2 intercept=-0.693147181 x1=0 x2=1 x3=1 t=-0.693147181 b=0
_N_=3 intercept=-0.693147181 x1=1 x2=1 x3=0 t=-0.693147181 b=-0.693147181
NOTE: There were 1 observations read from the data set WORK.EST_Y1.
NOTE: There were 3 observations read from the data set WORK.A.
NOTE: The data set WORK.X has 3 observations and 11 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds

[/pre]

My suggestion is that you read the documentation on how the SET statement works and how the RETAIN statement works. You might also want to search the SAS documentation for a description of how the "PROGRAM DATA VECTOR" or PDV operates -- that's the buffer that gets reinitialized for every iteration through the data step program.
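As a variation on the fix above, and only a sketch: the SAS documentation describes variables read with a SET statement as automatically retained, so the step can also be written without the RETAIN statement and without the extra variable T. The KEEP= data set option is an addition here, meant to stop the other OUTEST variables from being carried into X. EST is a hypothetical stand-in for EST_Y1, with the intercept typed in by hand.
[pre]
data est;                      * hypothetical stand-in for EST_Y1;
   intercept = -0.693147181;
run;

data a;
   input x1 x2 x3;
   cards;
1 0 1
0 1 1
1 1 0
;
run;

data x;
   * read the single estimates row once; SET variables keep their values;
   if _n_ = 1 then set est(keep=intercept);
   set a;                      * one row of A per iteration;
   b = x1*intercept;
   putlog _n_= intercept= x1= b=;
run;
[/pre]

With the same intercept as in the log above, B should come out as -0.693147181, 0 and -0.693147181 for the three rows of A.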

I'd also suggest that you drop step 2 completely from your code -- it's probably just a leftover from your previous program.

cynthia

deleted_user · Posted 10-15-2008 07:40 PM

Thank you Cynthia for your thorough explanation!
