knighsson
Obsidian | Level 7

Hello,

 

I have a question about finding outliers. My SAS code is below.

 

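ods graphics on;
/* Survey-weighted regression; NOMCAR treats missing values as not missing completely at random */
PROC SURVEYREG DATA= nh.diet_240 nomcar; 
STRATA sdmvstra;   /* design strata */
CLUSTER sdmvpsu;   /* primary sampling units */
CLASS  age RIAGENDR PIR SDDSRVYR RIDRETH1 BMI t240; 
WEIGHT glucwt4yr;  /* sampling weight */
DOMAIN eligible;   /* requests a separate analysis for each level of eligible */
model lbxgh= age RIAGENDR PIR SDDSRVYR RIDRETH1 BMI EIEER  t240/adjrsq clparm solution vadjust=none;
lsmeans t240/ lines adjust=tukey;
/* write predicted values and residuals to data set a1c */
output out=a1c predicted= predicted residual = residual ;
run;
quit;
ods graphics off;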


proc univariate data=a1c normal;
var residual;
qqplot residual/normal(L=1 mu=est sigma=est);
run;

the log:

Capture.PNG

My original dataset has only 20470 observations. Why is the count doubled to 40940 here?

 

The output:

Capture.PNG

How can I find an observation from its "Obs" number? For example, how would I know the "age" for "obs=4243"?

 

Thank you!

PaigeMiller
Diamond | Level 26

My original dataset has only 20470 observations. Why is the count doubled to 40940 here?

 

I don't think we, here across the internet, can answer that; you have to dig into your data set and the code that created it and figure it out. I do know that if SAS says the data set has 40940 observations, then I believe SAS. In particular, you haven't shown us anything that leads me to believe you have only 20470 observations.

 

As far as identifying outliers from the Extreme Observations table, you can, in PROC UNIVARIATE, use the ID statement which then replaces the observation number in that output with the value of some variable (for example, date, or patient ID, or similar), making the outlier easier to identify.
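
A minimal sketch of that suggestion, assuming the output data set a1c carries a respondent identifier. SEQN is used here only as an assumed ID variable (it is the usual NHANES respondent number, but it may not be present in nh.diet_240). The second step shows one way to look up a single row by its observation number in a1c, using the FIRSTOBS= and OBS= data set options.

```sas
/* SEQN is an assumed ID variable; replace it with whatever identifies a row in your data */
proc univariate data=a1c;
   var residual;
   id seqn;   /* Extreme Observations table now lists SEQN next to each value */
run;

/* Print only observation 4243 of a1c to see its age and residual */
proc print data=a1c(firstobs=4243 obs=4243);
   var age predicted residual;
run;
```

Note that the observation number refers to the row position in a1c, which is not necessarily the same row position as in nh.diet_240 if the two data sets differ in length.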

--
Paige Miller
knighsson
Obsidian | Level 7

Thank you for your response.

First, I ran the step that creates the original dataset again, and you can see in the log that it has 20470 observations.

the code:

data nh.diet_240;
set nh.diet_240;
run;

ods graphics on;
PROC SURVEYREG DATA= nh.diet_240 nomcar; 
STRATA sdmvstra;  
CLUSTER sdmvpsu;  
CLASS  age RIAGENDR PIR SDDSRVYR RIDRETH1 BMI t240; 
WEIGHT glucwt4yr;
DOMAIN eligible;
model lbxgh= age RIAGENDR PIR SDDSRVYR RIDRETH1 BMI EIEER  t240/adjrsq clparm solution vadjust=none;
lsmeans t240/ lines adjust=tukey;
output out=a1c predicted= predicted residual = residual ;
run;
quit;
ods graphics off;

proc univariate data=a1c normal;
var residual;
qqplot residual/normal(L=1 mu=est sigma=est);
run;

the log:

Capture.PNG

 

In addition, if you look at the Extreme Observations table, the two largest values are identical, and so are the next two largest. The same holds for the two smallest values and the next two smallest. So I wonder whether my code for generating the residual plot is correct.

 

Thank you!

PaigeMiller
Diamond | Level 26

I am not too familiar with PROC SURVEYREG, but could the reason the output data set has more observations than you expect be due to your use of the DOMAIN statement?

 

Have you actually looked in the output data set to see what is happening in there?
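
One way to look, sketched under the assumption that a1c contains a respondent identifier (SEQN here is an assumed name, not something shown in the thread): list the variables in the output data set, compare the row count with the number of distinct IDs, and print a few rows sorted by ID so that any repeated rows sit next to each other.

```sas
/* List the variables in the output data set, including any added by the procedure */
proc contents data=a1c;
run;

/* Compare total rows with distinct respondents; SEQN is an assumed ID variable */
proc sql;
   select count(*)             as n_rows,
          count(distinct seqn) as n_ids
   from a1c;
quit;

/* Sort a copy by ID and print the first rows to see whether each ID appears twice */
proc sort data=a1c out=a1c_sorted;
   by seqn;
run;

proc print data=a1c_sorted(obs=10);
run;
```

If n_rows is twice n_ids, comparing the two rows for one ID should show which variable distinguishes them.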

 

If the values are identical and appear in pairs, then any residual plot would be fine.

--
Paige Miller
