hagml
Calcite | Level 5

Has anyone else ever faced this situation?

I'm using **proc gee** to get a pooled estimate. For the stratified analysis, instead of sorting with proc sort and using a BY statement on the stratification variable, I use where variable=0 and where variable=1.

Now, the weird thing is that the coefficient estimates depend on the sort order. If I sort my data by ID before fitting the pooled model, I get one set of estimates; if I don't sort at all, I get another set (in which case the pooled estimate doesn't lie between the two stratified estimates); and if I sort by my stratification variable, I get yet another set.

I have never heard that the data need to be sorted before running proc gee for pooled estimates. So why does the pooled estimate fall outside the range of the stratified estimates when the dataset is in random order, and why do I get different estimates depending on whether I sort by ID or by sex (my stratification variable)?

Pooled estimate:

proc gee data=data;
   class x1 sex x2;
   model y = x1 sex x2 x3;
   repeated subject=x1 / type=un;
run;

Stratified estimates:

proc gee data=data;
   where sex=1;
   class x1 x2;
   model y = x1 x2 x3;
   repeated subject=x1 / type=un;
run;

proc gee data=data;
   where sex=0;
   class x1 x2;
   model y = x1 x2 x3;
   repeated subject=x1 / type=un;
run;

My expected outcome:

(beta estimates for x2 and x3 in one stratum, sex=0 or sex=1) < (beta estimates for x2 and x3 in the pooled analysis) < (beta estimates for x2 and x3 in the other stratum)

My remedy: proc sort data=data; by ID; run; (and, in a separate attempt, by sex)

This gives completely different estimates, yet still not the expected outcome.

8 REPLIES 8
ballardw
Super User

I don't see any Sort code.

 

To get "expected" results you need to provide 1) the data set and 2) what the expected result may be.

 

Depending on the actual underlying algorithms, some change could be expected from different orders of the data, as rounding and internal summary steps could yield different results. The question is how much, and is there a practical difference? A difference of $1 when discussing values in the $1,000,000,000 range is not likely important, but if all of the values are less than $10 it would likely be a practical difference.

 

In one shop I worked with we had some model software where we changed the order of the variables on the MODEL statement equivalent (not SAS so different code). The result could vary quite a bit depending on the order the variables appeared. So if we got a "large" difference in result that model was deemed unusable even though some order of the variables would yield extremely good diagnostic values.

hagml
Calcite | Level 5

Thank you.

About the data: unfortunately I can't share it, and with hypothetical data the issue I am facing might not be reproducible.

But the differences are quite dramatic, e.g. positive estimates become negative, or instead of 0.22 I get 0.77.

Quentin
Super User


Try creating a simulated dataset that replicates the problem.  Then please post the code to create the simulated dataset, and the PROC GEE code that shows the surprising result.  So sort the data one way, run PROC GEE, sort a different way, and run PROC GEE again.  That way you would be providing a fully reproducible example of the problem, which people can use to test and explore.  If sort order matters, I would think you could show it fairly easily.
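A minimal sketch of what such a simulated dataset could look like, offered only as a starting point. Everything in it is hypothetical: the dataset name sim, the variable time, the coefficients, and the 15% missingness in y. Unlike the code in the question, it uses ID as the subject, which is an assumption about the intended model. Whether it reproduces the problem is not known.

```sas
/* Hypothetical simulated data: 150 subjects, 6 records each */
data sim;
   call streaminit(2024);
   do ID = 1 to 150;
      sex = rand('bernoulli', 0.5);
      u   = rand('normal', 0, 1);               /* subject-level effect */
      do time = 1 to 6;
         x2 = rand('bernoulli', 0.4);
         x3 = rand('normal', 50, 10);
         y  = 1 + 0.5*sex + 0.3*x2 + 0.02*x3 + u + rand('normal', 0, 1);
         if rand('uniform') < 0.15 then y = .;  /* some missing outcomes */
         output;
      end;
   end;
   drop u;
run;

/* Two copies of the same rows in different orders */
proc sort data=sim out=sim_by_id;  by ID time; run;
proc sort data=sim out=sim_by_sex; by sex x3;  run;

/* Same model on each copy; compare the parameter estimates */
proc gee data=sim_by_id;
   class ID sex x2;
   model y = sex x2 x3;
   repeated subject=ID / type=un;
run;

proc gee data=sim_by_sex;
   class ID sex x2;
   model y = sex x2 x3;
   repeated subject=ID / type=un;
run;
```

If the two fits disagree, the posted code is a reproducible example; if they agree, the simulation parameters can be adjusted to look more like the real data.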

 

Also, in your real code, do you perhaps have order=data specified somewhere?  In that case, the GEE step would use the order of the data to determine which category is used as the reference category for the CLASS variables.  But if that's the issue, it should be pretty obvious, as it would affect the parameter estimates but not the overall model statistics.  (I assume, I haven't used PROC GEE).
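One way to rule this out is to pin the reference level explicitly, so that it cannot depend on row order. The sketch below assumes that the CLASS statement in PROC GEE accepts the REF= variable option the way PROC GENMOD does, and that sex is coded 0/1 as in the question; both should be checked against the documentation and the data.

```sas
proc gee data=data;
   /* ref='0' is an assumption: it names the formatted value of the
      level that should serve as the reference category for sex */
   class x1 sex(ref='0') x2;
   model y = x1 sex x2 x3;
   repeated subject=x1 / type=un;
run;
```

The "Class Level Information" table in the output lists the levels in the order the procedure uses them, so comparing that table across runs shows whether the coding changed with the sort order.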

 

I tried an example I stole from the docs ( https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/statug/statug_code_geeex1.htm ) , but couldn't make sort order change the results.

 

data Resp;
   input Center ID Treatment $ Sex $ Age Baseline Visit1-Visit4;
   datalines;
1  1 P M 46 0 0 0 0 0
1  2 P M 28 0 0 0 0 0
1  3 A M 23 1 1 1 1 1
1  4 P M 44 1 1 1 1 0
1  5 P F 13 1 1 1 1 1
1  6 A M 34 0 0 0 0 0
1  7 P M 43 0 1 0 1 1
1  8 A M 28 0 0 0 0 0
1  9 A M 31 1 1 1 1 1
1 10 P M 37 1 0 1 1 0
1 11 A M 30 1 1 1 1 1
1 12 A M 14 0 1 1 1 0
1 13 P M 23 1 1 0 0 0
1 14 P M 30 0 0 0 0 0
1 15 P M 20 1 1 1 1 1
1 16 A M 22 0 0 0 0 1
1 17 P M 25 0 0 0 0 0
1 18 A F 47 0 0 1 1 1
1 19 P F 31 0 0 0 0 0
1 20 A M 20 1 1 0 1 0
1 21 A M 26 0 1 0 1 0
1 22 A M 46 1 1 1 1 1
1 23 A M 32 1 1 1 1 1
1 24 A M 48 0 1 0 0 0
1 25 P F 35 0 0 0 0 0
1 26 A M 26 0 0 0 0 0
1 27 P M 23 1 1 0 1 1
1 28 P F 36 0 1 1 0 0
1 29 P M 19 0 1 1 0 0
1 30 A M 28 0 0 0 0 0
1 31 P M 37 0 0 0 0 0
1 32 A M 23 0 1 1 1 1
1 33 A M 30 1 1 1 1 0
1 34 P M 15 0 0 1 1 0
1 35 A M 26 0 0 0 1 0
1 36 P F 45 0 0 0 0 0
1 37 A M 31 0 0 1 0 0
1 38 A M 50 0 0 0 0 0
1 39 P M 28 0 0 0 0 0
1 40 P M 26 0 0 0 0 0
1 41 P M 14 0 0 0 0 1
1 42 A M 31 0 0 1 0 0
1 43 P M 13 1 1 1 1 1
1 44 P M 27 0 0 0 0 0
1 45 P M 26 0 1 0 1 1
1 46 P M 49 0 0 0 0 0
1 47 P M 63 0 0 0 0 0
1 48 A M 57 1 1 1 1 1
1 49 P M 27 1 1 1 1 1
1 50 A M 22 0 0 1 1 1
1 51 A M 15 0 0 1 1 1
1 52 P M 43 0 0 0 1 0
1 53 A F 32 0 0 0 1 0
1 54 A M 11 1 1 1 1 0
1 55 P M 24 1 1 1 1 1
1 56 A M 25 0 1 1 0 1
2  1 P F 39 0 0 0 0 0
2  2 A M 25 0 0 1 1 1
2  3 A M 58 1 1 1 1 1
2  4 P F 51 1 1 0 1 1
2  5 P F 32 1 0 0 1 1
2  6 P M 45 1 1 0 0 0
2  7 P F 44 1 1 1 1 1
2  8 P F 48 0 0 0 0 0
2  9 A M 26 0 1 1 1 1
2 10 A M 14 0 1 1 1 1
2 11 P F 48 0 0 0 0 0
2 12 A M 13 1 1 1 1 1
2 13 P M 20 0 1 1 1 1
2 14 A M 37 1 1 0 0 1
2 15 A M 25 1 1 1 1 1
2 16 A M 20 0 0 0 0 0
2 17 P F 58 0 1 0 0 0
2 18 P M 38 1 1 0 0 0
2 19 A M 55 1 1 1 1 1
2 20 A M 24 1 1 1 1 1
2 21 P F 36 1 1 0 0 1
2 22 P M 36 0 1 1 1 1
2 23 A F 60 1 1 1 1 1
2 24 P M 15 1 0 0 1 1
2 25 A M 25 1 1 1 1 0
2 26 A M 35 1 1 1 1 1
2 27 A M 19 1 1 0 1 1
2 28 P F 31 1 1 1 1 1
2 29 A M 21 1 1 1 1 1
2 30 A F 37 0 1 1 1 1
2 31 P M 52 0 1 1 1 1
2 32 A M 55 0 0 1 1 0
2 33 P M 19 1 0 0 1 1
2 34 P M 20 1 0 1 1 1
2 35 P M 42 1 0 0 0 0
2 36 A M 41 1 1 1 1 1
2 37 A M 52 0 0 0 0 0
2 38 P F 47 0 1 1 0 1
2 39 P M 11 1 1 1 1 1
2 40 P M 14 0 0 0 1 0
2 41 P M 15 1 1 1 1 1
2 42 P M 66 1 1 1 1 1
2 43 A M 34 0 1 1 0 1
2 44 P M 43 0 0 0 0 0
2 45 P M 33 1 1 1 0 1
2 46 P M 48 1 1 0 0 0
2 47 A M 20 0 1 1 1 1
2 48 P F 39 1 0 1 0 0
2 49 A M 28 0 1 0 0 0
2 50 P F 38 0 0 0 0 0
2 51 A M 43 1 1 1 1 0
2 52 A F 39 0 1 1 1 1
2 53 A M 68 0 1 1 1 1
2 54 A F 63 1 1 1 1 1
2 55 A M 31 1 1 1 1 1
;

data Resp;
   set Resp;
   Visit=1;  Outcome=Visit1;  output;
   Visit=2;  Outcome=Visit2;  output;
   Visit=3;  Outcome=Visit3;  output;
   Visit=4;  Outcome=Visit4;  output;
run;

proc sort data=Resp ;
  by ID Visit ;
run ;

proc gee data=Resp descend;
   class ID Treatment Center Sex Baseline;
   model Outcome=Treatment Center Sex Age Baseline /
         dist=bin link=logit;
   repeated subject=ID(Center) / corr=exch corrw;
run;

proc sort data=Resp ;
  by age ;
run ;

proc gee data=Resp descend;
   class ID Treatment Center Sex Baseline;
   model Outcome=Treatment Center Sex Age Baseline /
         dist=bin link=logit;
   repeated subject=ID(Center) / corr=exch corrw;
run;
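A possible variation on the example above. It uses corr=exch, where every pair of records within a subject shares one correlation, so the order of records within a subject should not matter. The question uses type=un, where each pair of positions has its own correlation parameter. The sketch below switches to type=un and names the variable that indexes the repeated measurements with within=, which, as I read the documentation for the REPEATED statement, makes the position of each record explicit instead of inferred from row order. The ODS table name GEEEmpPEst and the dataset names est_sorted and est_scrambled are assumptions; ods trace on will show the actual table names. An unstructured matrix may also fail to converge on binary data this small.

```sas
/* Fit 1: rows ordered by subject and visit */
proc sort data=Resp; by Center ID Visit; run;

ods output GEEEmpPEst=est_sorted;     /* table name assumed */
proc gee data=Resp descend;
   class ID Treatment Center Sex Baseline Visit;
   model Outcome=Treatment Center Sex Age Baseline /
         dist=bin link=logit;
   repeated subject=ID(Center) / within=Visit type=un;
run;

/* Fit 2: same model after scrambling the row order */
proc sort data=Resp; by Age descending Visit; run;

ods output GEEEmpPEst=est_scrambled;  /* table name assumed */
proc gee data=Resp descend;
   class ID Treatment Center Sex Baseline Visit;
   model Outcome=Treatment Center Sex Age Baseline /
         dist=bin link=logit;
   repeated subject=ID(Center) / within=Visit type=un;
run;

/* Report any estimate that differs by more than 1e-6 */
proc compare base=est_sorted compare=est_scrambled
             criterion=1e-6 method=absolute;
run;
```

Rerunning both fits without within=Visit (and without Visit in the CLASS statement) would show whether row order alone changes the type=un estimates.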
mkeintz
PROC Star

In each of your stratified estimates, you filter on a single value of SEX.  Yet in the corresponding PROC GEE code you include SEX in a CLASS statement and a MODEL statement.  Why?  Does the inclusion of this unnecessary and unhelpful predictor impact the GEE algorithm?  As I have no GEE experience, I can't offer an answer.

 

 

hagml
Calcite | Level 5

Thank you for the comment.

 

It was a typo in the body of the question which is now modified.

Rick_SAS
SAS Super FREQ

Run

proc freq data=data;
tables sex / missprint;
run;

Does the output show any missing values for the SEX variable? Or any values other than 0/1? If so, the calls to PROC GEE are using different observations.
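The same check can be extended to the other model variables. A small sketch, assuming y and x3 are numeric as the MODEL statement in the question suggests:

```sas
/* Non-missing and missing counts of y and x3 within each level of sex;
   the MISSING option keeps rows where sex itself is missing */
proc means data=data n nmiss;
   class sex / missing;
   var y x3;
run;

/* Cross-tabulate the class variables, showing missing levels */
proc freq data=data;
   tables sex*x2 / missing;
run;
```

Each PROC GEE run also prints the number of observations read and used. If the counts used for sex=0 and sex=1 do not add up to the count used in the pooled model, the three fits are not based on the same rows.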

 

 

hagml
Calcite | Level 5
Thanks much, Rick. We do have 2 missing values in a total sample of 900. But this made me wonder whether missing data in the outcome (of which we have a lot) can also affect the estimates. I also tried proc genmod, because proc gee was giving me an error about the Hessian matrix (the error again pointed to the missing data). With proc genmod I don't get that error, but the estimates are again (!) completely different from those of proc gee, and the issue in the question still remains.
Rick_SAS
SAS Super FREQ

You can read about how PROC GEE handles missing values in the response by looking at the doc: SAS Help Center: Weighted Generalized Estimating Equations under the MAR Assumption

 

If you prefer experiments to theory, you can also run an experiment: Use a DATA step to set about 20-30 values of the response variables to missing and rerun the analysis. Study the output to see how the statistics change.
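A sketch of that experiment using the Resp data posted earlier in this thread. The dataset name RespMiss, the seed, and the 6% rate are arbitrary choices; with 444 rows, 6% works out to roughly 27 missing responses on average, and the exact count varies with the seed.

```sas
/* Set a random subset of the responses to missing */
data RespMiss;
   set Resp;
   if _n_ = 1 then call streaminit(2025);
   if rand('uniform') < 0.06 then Outcome = .;
run;

/* Refit the same model and compare with the complete-data fit */
proc gee data=RespMiss descend;
   class ID Treatment Center Sex Baseline;
   model Outcome=Treatment Center Sex Age Baseline /
         dist=bin link=logit;
   repeated subject=ID(Center) / corr=exch corrw;
run;
```

Repeating this under different sort orders would show whether missing responses and row order interact.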
