george7899
Fluorite | Level 6

I have just started learning SAS. Suppose I have a data set that contains the variables Rstudent, Cooksdlabel, Dffitsout, etc. The goal is to flag the influential observations. Below is part of the code.

 

...

L1   if (ABS(Rstudent)>3) or (Cooksdlabel ne ' ') or Dffitsout then flag=1;
L2   array dfbetas{*} _dfbetasout: ;
L3   do i=2 to dim(dfbetas);
L4   if dfbetas{i} then flag=1;
L5   end;

...

 

I know L1 adds a new column flag to the existing data set if the conditions are met. However, why not use (Dffitsout ne ' ')?

Please explain L2 to L4.

 

Thanks

Accepted Solution
Tom
Super User

It would be easier with the definitions of the variables that are being referenced.  For now let's assume that the variables are of the type that is implied by how they are used.

L1   if (ABS(Rstudent)>3) or (Cooksdlabel ne ' ') or Dffitsout then flag=1;

This references 4 variables (including FLAG).  All but the second one are numeric.  It will set FLAG to 1 if the absolute value of RSTUDENT is larger than 3 OR COOKSDLABEL is not empty OR DFFITSOUT is neither missing nor zero.

L2   array dfbetas{*} _dfbetasout: ;

This one defines an array named DFBETAS that can be used to reference every variable whose name starts with _DFBETASOUT (that the compiler knows about at this point in compiling the data step).  The dimension of the array will be determined by the number of variables found.  The TYPE of the array will be determined by the TYPE of the variables found.  All of the variables must be of the same type.  The variables will appear in the array in the order in which they were defined in the data step, not in alphabetical order, unless they were originally defined in that order.
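A small sketch of how the name-prefix list behaves. The data set HAVE, its variable names, and its values are made up for illustration and are not the poster's data. Note that the ARRAY statement comes after the SET statement, so the compiler already knows the variables when it resolves the prefix.

```sas
/* Hypothetical data: three numeric variables sharing the _dfbetasout prefix */
data have;
  input id _dfbetasout1 _dfbetasout2 _dfbetasout3;
  datalines;
1 0 0 1
2 0 0 0
;
run;

data _null_;
  set have;
  /* The colon picks up every variable known so far whose name starts with _dfbetasout */
  array dfbetas{*} _dfbetasout: ;
  if _n_ = 1 then do;
    n = dim(dfbetas);   /* number of variables the prefix matched */
    put 'Number of elements: ' n;
  end;
run;
```

With these three made-up variables the log should report 3 elements. If the ARRAY statement were placed before SET, the prefix would have nothing to match at that point.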

L3   do i=2 to dim(dfbetas);

This will start a DO loop that iterates I from 2 to the number of elements in the array just defined.  It is not clear why it skips the first element of the array.

L4   if dfbetas{i} then flag=1;

So if the currently indexed variable in DFBETAS is neither missing nor zero then FLAG is set to one.  Note that since this statement treats the variables in DFBETAS as numbers, it implies that all of the _DFBETASOUT... variables should be numeric, something we could not tell from the ARRAY statement alone.

 

So the effect of the DO loop in L3 to L5 is to set FLAG=1 if any of the elements of DFBETAS (except the first one) has a value that is neither zero nor missing.

 

It is still a mystery why it skips testing the first _DFBETASOUT... variable.  And since it leaves the order of the array elements up to the order in which the variables are defined in the data step, we cannot tell from this snippet which of the unknown number of variables it skips by starting the DO loop index at 2 instead of 1.
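One way to find out which variable sits in which position is to print the names with the VNAME function. This is only a sketch: HAVE and its variables are placeholders invented for the example, so substitute the real data set.

```sas
/* Placeholder data so the sketch runs on its own */
data have;
  input _dfbetasout1 _dfbetasout2 _dfbetasout3;
  datalines;
0 1 0
;
run;

data _null_;
  set have(obs=1);              /* one observation is enough to list the names */
  array dfbetas{*} _dfbetasout: ;
  length name $32;
  do i = 1 to dim(dfbetas);
    name = vname(dfbetas{i});   /* name of the variable behind element i */
    put i= name=;
  end;
run;
```

The log then shows which variable is element 1, that is, the one the original loop skips.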

Kurt_Bremser
Super User

What type is Dffitsout? The way the code is written I assume it is numeric. Any numeric value that is not zero or missing is considered true.

The following lines define an array over all variables whose names start with _dfbetasout and run a check over that array similar to the check for Dffitsout.

george7899
Fluorite | Level 6

Thanks for your quick reply!

 

1) Yes, Dffitsout is numeric. So Dffitsout and Dffitsout ne ' ' are the same thing then.

2) _DFBETASOUT: is also numeric. So if dfbeta{i} then flag=1 is actually the same thing as if dfbeta{i} ne ' ' then flag=1, which means adding a flag if _DFBETASOUT: has a value. 

 

Kurt_Bremser, please correct me if I am wrong.

 

Patrick
Opal | Level 21

"2) _DFBETASOUT: is also numeric. So if dfbeta{i} then flag=1 is actually the same thing as if dfbeta{i} ne ' ' then flag=1"

 

If a variable is numeric, a missing value is represented as a dot (.) and not as a blank in quotes (that is for character variables).

The syntax for a numeric variable should be: if dfbeta{i} ne . then flag=1

 

For a logical test: if dfbeta{i} then....

The test will be FALSE for missing AND for a value of zero. For this reason the test is not the same as if dfbeta{i} ne .

To get the same result as the logical test, you would need to code something like if dfbeta{i} not in (.,0) then flag=1
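The three variants can be compared side by side with a small test. This is a sketch using made-up values, and the data set and variable names are invented for the example.

```sas
data truth_test;
  length plain_test not_missing not_in $5;
  do x = ., 0, 1;
    /* plain truth test: false for missing and for zero */
    if x then plain_test = 'True'; else plain_test = 'False';
    /* explicit missing check: false only for missing */
    if x ne . then not_missing = 'True'; else not_missing = 'False';
    /* excludes both missing and zero */
    if x not in (., 0) then not_in = 'True'; else not_in = 'False';
    output;
  end;
run;

proc print data=truth_test;
run;
```

For these three values, PLAIN_TEST and NOT_IN should agree on every row, while NOT_MISSING differs from them only when x is 0.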

Kurt_Bremser
Super User

@george7899, you are wrong. Comparing a numeric variable with ' ' forces an automatic type conversion: the blank string is converted to a numeric missing value, so the comparison tests only for missing, while the plain truth test is also false for zero. Never be sloppy, always treat numbers as numbers and strings as strings.

george7899
Fluorite | Level 6

Thank you all. The array part is clear. _DFBETASOUT1 refers to the intercept, and _DFBETASOUT2 to _DFBETASOUT8 refer to the 7 predictors. My understanding is that in the influence analysis of a multiple regression, the intercept is normally skipped and we only focus on the predictors. I suspect that is why the loop uses do i=2 (instead of do i=1).

 

if (ABS(Rstudent)>3) or (Cooksdlabel ne ' ') or Dffitsout then flag=1

Rstudent, Cooksdlabel and Dffitsout are all numeric, and the goal here is to set flag to 1 for observations whose values are not missing or 0. I was confused by OR (Cooksdlabel ne ' ') since that tests whether Cooksdlabel is not equal to an empty string. I think using OR Cooksdlabel would be clearer (just like OR Dffitsout).

* I guess OR (Cooksdlabel ne ' ') is valid since the log doesn't complain??

* I guess OR Dffitsout can be similarly written as OR (Dffitsout ne ' ')?

As Kurt Bremser said, it is better to treat numbers as numbers and strings as strings.

Tom
Super User

(Cooksdlabel ne ' ')

Is not the same condition as

(Cooksdlabel)

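The first one will cause SAS to convert the space into a number, which will be missing.  So it is TRUE whenever the variable is not missing, and that includes zero.

The second one will be true when the value is not missing and also not zero.  So they will disagree when the value is zero and agree for every other value.

 

Try your own little test.  Note that TEST1 below checks the equality x=' ', which is the exact opposite of (x ne ' ').

data test;
do x=.,0,1;
  length test1-test2 $5;
  /* ' ' is converted to a numeric missing value, so this is true only when x is missing */
  if x=' ' then test1='True'; else test1='False';
  /* plain truth test: true only when x is neither missing nor zero */
  if x then test2='True'; else test2='False';
  output;
end;
run;
proc print;
run;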
Obs    x    test1    test2

 1     .    True     False
 2     0    False    False
 3     1    False    True
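To mirror the NE comparison from the original L1 directly, the same test can be run with NE in place of the equality. This is a sketch along the same lines, and the data set and variable names are invented for the example.

```sas
data test_ne;
do x=.,0,1;
  length ne_blank truthy $5;
  /* ' ' is converted to a numeric missing value before the comparison */
  if x ne ' ' then ne_blank='True'; else ne_blank='False';
  /* plain truth test: false for missing and for zero */
  if x then truthy='True'; else truthy='False';
  output;
end;
run;
proc print data=test_ne;
run;
```

Here the two columns should differ only on the row where x is 0: the NE comparison is true there (zero is not missing) while the plain truth test is false.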

 

george7899
Fluorite | Level 6

Thanks guys. Your replies are very helpful. 
