outliers

knighsson · Posted 05-20-2021 10:39 AM

Hello,

I have a question about finding outliers. Below is my SAS code.

ods graphics on;
PROC SURVEYREG DATA= nh.diet_240 nomcar; 
STRATA sdmvstra;  
CLUSTER sdmvpsu;  
CLASS  age RIAGENDR PIR SDDSRVYR RIDRETH1 BMI t240; 
WEIGHT glucwt4yr;
DOMAIN eligible;
model lbxgh= age RIAGENDR PIR SDDSRVYR RIDRETH1 BMI EIEER  t240/adjrsq clparm solution vadjust=none;
lsmeans t240/ lines adjust=tukey;
output out=a1c predicted= predicted residual = residual ;
run;
quit;
ods graphics off;


proc univariate data=a1c normal;
var residual;
qqplot residual/normal(L=1 mu=est sigma=est);
run;

the log:

My original dataset has only 20470 observations, so why is the count doubled to 40940 here?

The output:

How can I find an observation from its "Obs" value? For example, how would I know the "age" for Obs=4243?

Thank you!

PaigeMiller · Posted 05-20-2021 10:51 AM

My original dataset has only 20470 observations, so why is the count doubled to 40940 here?

I don't think we here across the internet can answer that; you have to dig into your data set and the code that created it and figure it out. I do know that if SAS says the data set has 40940 observations, then I believe SAS. In particular, you haven't shown us anything that leads me to believe you have only 20470 observations.

As for identifying outliers from the Extreme Observations table, you can use the ID statement in PROC UNIVARIATE. It adds the value of a variable you choose (for example, date, patient ID, or similar) beside each extreme observation in that output, making the outlier easier to identify.
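As a minimal sketch of the ID statement, the step below reruns PROC UNIVARIATE on the a1c data set from the question with age as the ID variable, so that each row of the Extreme Observations table shows the age next to the residual. Choosing age is only an illustration taken from the question; a unique respondent identifier, if the data set has one, would pin down the row more precisely. The PROC PRINT step is an alternative that displays a single row by its observation number, here 4243 as in the question. That number refers to the row position in a1c, which is not necessarily the row position in the original input data set.

```sas
/* Label the extreme residuals with age (illustrative choice of ID variable). */
proc univariate data=a1c normal;
   var residual;
   id age;
   qqplot residual / normal(L=1 mu=est sigma=est);
run;

/* Alternative: print one row of a1c by its observation number. */
proc print data=a1c(firstobs=4243 obs=4243);
   var age residual;
run;
```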

--
Paige Miller

knighsson · Posted 05-20-2021 02:56 PM

Thank you for your response.

First, I ran the original dataset again, and you can see that it has 20470 observations.

the code:

data nh.diet_240;
set nh.diet_240;
run;

ods graphics on;
PROC SURVEYREG DATA= nh.diet_240 nomcar; 
STRATA sdmvstra;  
CLUSTER sdmvpsu;  
CLASS  age RIAGENDR PIR SDDSRVYR RIDRETH1 BMI t240; 
WEIGHT glucwt4yr;
DOMAIN eligible;
model lbxgh= age RIAGENDR PIR SDDSRVYR RIDRETH1 BMI EIEER  t240/adjrsq clparm solution vadjust=none;
lsmeans t240/ lines adjust=tukey;
output out=a1c predicted= predicted residual = residual ;
run;
quit;
ods graphics off;

proc univariate data=a1c normal;
var residual;
qqplot residual/normal(L=1 mu=est sigma=est);
run;

the log:

In addition, if you look at the extreme observations, you can see that the two largest values are identical and the next two largest values are identical. The same holds for the two smallest values and the next two smallest. So I wonder whether my code is right for generating the residual plot.

Thank you!

PaigeMiller · Posted 05-21-2021 06:57 AM

I am not too familiar with PROC SURVEYREG, but could the reason the output data set has more observations than you expect be due to your use of the DOMAIN statement?

Have you actually looked in the output data set to see what is happening in there?
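One way to look inside the output data set is sketched below. It assumes only the names already used in this thread (nh.diet_240, a1c, eligible, residual) and does not presume what the cause of the extra rows is. The first step compares row counts, the second lists the variables in a1c so you can spot anything the procedure added, and the third counts rows for each value of the DOMAIN variable.

```sas
/* Compare the number of rows in the input and output data sets. */
proc sql;
   select count(*) as n_input  from nh.diet_240;
   select count(*) as n_output from a1c;
quit;

/* List the variables in a1c in creation order; look for any
   variable that was not in nh.diet_240. */
proc contents data=a1c varnum;
run;

/* Count rows per value of the DOMAIN variable, including missing. */
proc freq data=a1c;
   tables eligible / missing;
run;

/* Eyeball the first rows to see whether records repeat. */
proc print data=a1c(obs=20);
run;
```

If the same record shows up more than once, comparing the repeated rows column by column should show which variable distinguishes them.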

If the values are identical and appear in pairs, then any residual plot would be fine.

--
Paige Miller
