Solved: Fitting without outlier but calculate residual for outlier

leonzheng · Posted 08-15-2018 12:21 PM

I have a Data set and need to do some operations

Data set:

PARM x y

3 x0 y0

3.1 x1 y1

2.9 x2 y2

5 x3 y3

First I run a condition filter and get result like this:

PARM x y Filter

3 x0 y0 pass

3.1 x1 y1 pass

2.9 x2 y2 pass

5 x3 y3 fail

Then I need to run a 4th-polynomial fit with points that pass the filter, below is my code:

proc glm data = wafdata noprint;
class x y;
model PARM = x
y
x * x
x * y
y * y
x * x * x
x * x * y
x * y * y
y * y * y
x * x * x * x
x * x * x * y
x * x * y * y
x * y * y * y
y * y * y * y /p;
output out = poly p = prediction r = residual;
ods graphics off;
run;

Then I need to get

PARM x y Filter Residual

3 x0 y0 pass xxx

3.1 x1 y1 pass xxx

2.9 x2 y2 pass xxx

5 x3 y3 fail xxx (this data point is not included in fitting but have residual calculation as well)

The problem I cannot solve is that I need to get residual from this fit for all data point, including points that fail the filter, so I cannot just delete the filter fail points and run fit.

Last time someone told me try use score but I cannot figure out how, is there anyone can help me on this problem in detail, using score or any other method?

I hope I describe clear enough.

Thanks!

PaigeMiller · Posted 08-15-2018 12:55 PM

Create your data set like this

PARM x y Filter original_parm

3 x0 y0 pass 3

3.1 x1 y1 pass 3.1

2.9 x2 y2 pass 2.9

. x3 y3 fail 5

Then the fail observation will not be used in the model fit, but it will produce a predicted value. Then it is a simple calculation to use the predicted value and original_parm to get the residual.

--
Paige Miller

View solution in original post

PaigeMiller · Posted 08-15-2018 12:55 PM

Create your data set like this

PARM x y Filter original_parm

3 x0 y0 pass 3

3.1 x1 y1 pass 3.1

2.9 x2 y2 pass 2.9

. x3 y3 fail 5

Then the fail observation will not be used in the model fit, but it will produce a predicted value. Then it is a simple calculation to use the predicted value and original_parm to get the residual.

--
Paige Miller

leonzheng · Posted 08-15-2018 01:12 PM

I tried your way but it seems not work, predictions for PARM = '.' points are also '.'

PaigeMiller · Posted 08-15-2018 03:02 PM

@leonzheng wrote:
I tried your way but it seems not work, predictions for PARM = '.' points are also '.'

I don't know what this means. Show me the results you are getting.

--
Paige Miller

leonzheng · Posted 08-15-2018 01:23 PM

I there any requirement for "by" in fitting in your way? I changed "by" group and it works... both "by" groups are character bu the working one is simpler (not working one is included in working one)

Reeza · Posted 08-15-2018 03:16 PM

What does a higher order for a categorical variable imply? Since it's 0/1 coding and squaring/cubing etc just result in similar values, ie -1 squared = 1, or -1 cubed = -1. I don't think that makes a lot of sense from a practical perspective. From a technical perspective it's easy enough to implement.

You use a BY statement when you're developing multiple regression models for different groups, for example if you were running this model for two or three different regions or countries you would want a separate model for each country.

SAS cannot calculate a residual with this method, because it does not have the data. However the method is mostly correct, you just need to add a second step to calculate it. Here's an example that works.

data heart;
set sashelp.heart;
where status='Dead';
Answer=AgeAtDeath;
if _n_ in (5, 10, 15) then ageAtDeath=.; *set to missing;
run;

proc glm data=heart;
class bp_status weight_status;
model  ageatDeath =  bp_status
                         weight_status
                         bp_status * bp_status
                         bp_status * weight_status
                         weight_status * weight_status
                         bp_status * bp_status * bp_status
                         bp_status * bp_status * weight_status
                         bp_status * weight_status * weight_status
                         weight_status * weight_status * weight_status
                         bp_status * bp_status * bp_status * bp_status
                         bp_status * bp_status * bp_status * weight_status
                         bp_status * bp_status * weight_status * weight_status
                         bp_status * weight_status * weight_status * weight_status
                         weight_status * weight_status * weight_status * weight_status /p;
output out = poly p = prediction r = residual; ;
run;quit;

data poly;
set poly;

if missing(residual) then residual = Answer - prediction;
run;

@leonzheng wrote:
I there any requirement for "by" in fitting in your way? I changed "by" group and it works... both "by" groups are character bu the working one is simpler (not working one is included in working one)

leonzheng · Posted 08-15-2018 04:37 PM

My actual dataset is much more conplicated, there are many other columns beside the ones I listed, for more detail in the "by" group, let me describe as below:

PARM x y Field ParmName F_ParmName Filter

3 x0 y0 1A AA 1A_AA pass

3.1 x1 y1 1A BB 1A_BB pass

2.9 x2 y2 1B AA 1B_AA pass

5 x3 y3 1B BB 1B_BB fail

Field and ParmName is independent, I contact them as a new column F_ParmName, PARM is the actual value of each ParmName. The dataset is much larger so there are much more than one row of data for each category as I showed. I used by Field fitting, PaigeMiller's way is working but fitting result is wired since by group is not right, but when I try by F_ParmName fitting, the PARM '.' row will have prediction '.' result.

Reeza · Posted 08-15-2018 04:39 PM

The By group isn't relevant to your original question.

If your model isn't working, that's a different issue. From your initial question, does my response work and do what you expected?

If so, how is your current situation different from what you explained? The number of variables or BY statements shouldn't affect the solution I proposed.

@leonzheng wrote:

My actual dataset is much more conplicated, there are many other columns beside the ones I listed, for more detail in the "by" group, let me describe as below:

PARM x y Field ParmName F_ParmName Filter

3 x0 y0 1A AA 1A_AA pass

3.1 x1 y1 1A BB 1A_BB pass

2.9 x2 y2 1B AA 1B_AA pass

5 x3 y3 1B BB 1B_BB fail

Field and ParmName is independent, I contact them as a new column F_ParmName, PARM is the actual value of each ParmName. The dataset is much larger so there are much more than one row of data for each category as I showed. I used by Field fitting, PaigeMiller's way is working but fitting result is wired since by group is not right, but when I try by F_ParmName fitting, the PARM '.' row will have prediction '.' result.

leonzheng · Posted 08-15-2018 05:47 PM

I also thought By statement should not affect the solution, however,

case 1. use your solution with by Field

fitting works, have prediction and residual but fitting result is wired.

case 2. use your solution with by F_ParmName or ParmName

fitting works, but missing part has no prediction, thus cannot calculate residual.

PaigeMiller · Posted 08-15-2018 06:16 PM

@leonzheng wrote:

My actual dataset is much more conplicated, there are many other columns beside the ones I listed, for more detail in the "by" group, let me describe as below:

PARM x y Field ParmName F_ParmName Filter

3 x0 y0 1A AA 1A_AA pass

3.1 x1 y1 1A BB 1A_BB pass

2.9 x2 y2 1B AA 1B_AA pass

5 x3 y3 1B BB 1B_BB fail

Field and ParmName is independent, I contact them as a new column F_ParmName, PARM is the actual value of each ParmName. The dataset is much larger so there are much more than one row of data for each category as I showed. I used by Field fitting, PaigeMiller's way is working but fitting result is wired since by group is not right, but when I try by F_ParmName fitting, the PARM '.' row will have prediction '.' result.

If you are getting a missing value as a prediction, then either you have done something wrong, or your data is causing the missing value. But to be definitive, we need to see the code you used to implement the trick I showed in Message 2 above plus the code to fit the model, and we need to see your data, at least enough of the data to see the variables used the model. Please share this with us.

--
Paige Miller

leonzheng · Posted 08-15-2018 06:45 PM

Here is my code:

data wafdata;
set wafdata;
PARM_fit = PARM;
if Tukey1 = 'fail' then PARM = .;
run;

proc glm data = wafdata noprint;
by F_ParmName;
class x y;
model PARM = x
y
x * x
x * y
y * y
x * x * x
x * x * y
x * y * y
y * y * y
x * x * x * x
x * x * x * y
x * x * y * y
x * y * y * y
y * y * y * y /p;
output out = poly p = prediction r = residual;
ods graphics off;
run;

data poly;
set poly;
if missing(residual) then residual = PARM_fit - prediction;
run;

Part of table wafdata is as this, also attached a xlsx file for whole data

PARMNAME PARM x y Field F_ParmName Filter
MR5 5.702313 -11.475 9.28 10A 10A_MR5 pass
MR5 5.701248 -9.945 9.28 10A 10A_MR5 pass
MR5 5.687179 -8.415 9.28 11A 11A_MR5 pass
MR5 -7777.7 -6.885 9.28 11A 11A_MR5 fail
MR5 5.632577 -5.355 9.28 11A 11A_MR5 pass
MR5 5.627558 -3.825 9.28 11A 11A_MR5 pass
MR5 5.653947 -2.295 9.28 12A 12A_MR5 pass
MR5 5.648395 -0.765 9.28 12A 12A_MR5 pass
MR5 5.740413 -22.185 8.7 14A 14A_MR5 pass
MR5 5.70026 -20.655 8.7 14A 14A_MR5 pass
MR5 5.715317 -19.125 8.7 14A 14A_MR5 pass
MRSUB 7295.616 18.36 39.73 11A 11A_MRSUB pass
MRSUB 7404.475 19.89 39.73 11A 11A_MRSUB pass
MRSUB 7431.816 21.42 39.73 11A 11A_MRSUB pass
MRSUB 1500000000 0.765 51.91 13A 13A_MRSUB fail
MRSUB 7061.483 2.295 51.91 13A 13A_MRSUB pass
MRSUB 7110.674 3.825 51.91 13A 13A_MRSUB pass
MRSUB 7063.209 5.355 51.91 14A 14A_MRSUB pass
MRSUB 7122.559 6.885 51.91 14A 14A_MRSUB pass
MRSUB 7108.894 8.415 51.91 14A 14A_MRSUB pass

Thank you!

PaigeMiller · Posted 08-15-2018 07:08 PM

Remove the CLASS statement.

--
Paige Miller

leonzheng · Posted 08-15-2018 07:11 PM

It worked! Thank you, could you tell me why I cannot have class statement?

PaigeMiller · Posted 08-16-2018 06:43 AM

Your variables X and Y are continuous. They are not class (or categorical) variables. So the putting these variables in the CLASS statement is wrong here.

--
Paige Miller

leonzheng · Posted 08-16-2018 12:37 PM

Thanks!

Fitting without outlier but calculate residual for outlier

Re: Fitting without outlier but calculate residual for outlier

Re: Fitting without outlier but calculate residual for outlier

Re: Fitting without outlier but calculate residual for outlier

Re: Fitting without outlier but calculate residual for outlier

Re: Fitting without outlier but calculate residual for outlier

Re: Fitting without outlier but calculate residual for outlier

Re: Fitting without outlier but calculate residual for outlier

Re: Fitting without outlier but calculate residual for outlier

Re: Fitting without outlier but calculate residual for outlier

Re: Fitting without outlier but calculate residual for outlier

Re: Fitting without outlier but calculate residual for outlier

Re: Fitting without outlier but calculate residual for outlier

Re: Fitting without outlier but calculate residual for outlier

Re: Fitting without outlier but calculate residual for outlier

Re: Fitting without outlier but calculate residual for outlier

Registration is open

SAS Training: Just a Click Away