Re: What is the OBSMARGINS dataset format for PROC GENMOD?

pblls · Posted 01-18-2022 05:01 AM

Hi all,

I'm trying to specify coefficients for LSMEANS class levels in PROC GENMOD, but am struggling to get this working. Consider the following example:

data cohorts;
   fixed = 1;
   cohort = 1; numer = 10; ldenom = log(100); output;
   cohort = 2; numer =  9; ldenom = log(100); output;
   cohort = 3; numer =  1; ldenom =   log(2); output;
run;

proc genmod data=cohorts;
   class cohort fixed;
   model numer = fixed cohort / dist=poisson offset=ldenom;
   lsmeans fixed / e diff cl;
run;

I would like the LSMEANS coefficients (E option) for cohort to be 0.495, 0.495 and 0.01 respectively, in proportion to the denom values (100, 100 and 2), but they default to 0.333 each. I see that OBSMARGINS=<OM-data-set> should let me specify this, and I also see that this dataset should contain 'all model variables except the dependent one' (fixed and cohort in my case), but I don't see how to specify the value of the coefficient itself. Using OBSMARGINS without naming a dataset doesn't change anything when the input has only one aggregated record per cohort, which is unfortunately all I have.

This feels like it should be easy, but I'm completely failing to find any example or specification for this OM dataset format. Any suggestions?
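
For reference, the target coefficients are just each cohort's share of the total denominator: 100/202 ≈ 0.495 for cohorts 1 and 2, and 2/202 ≈ 0.0099 for cohort 3. A minimal sketch of that arithmetic follows; the dataset name denoms and the variables denom and coef are made up for illustration and are not part of the model above.

```sas
/* Hypothetical helper dataset: one row per cohort with its raw denominator */
data denoms;
   input cohort denom;
   datalines;
1 100
2 100
3 2
;
run;

/* Each cohort's share of the total denominator (PROC SQL remerges the sum) */
proc sql;
   select cohort, denom, denom / sum(denom) as coef format=6.4
   from denoms;
quit;
```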

pblls · Posted 01-18-2022 05:32 AM

After some further testing it seems like it's just the number of observations in the OM dataset that matters, i.e. the following addition should be correct:

data obsm;
   fixed = 1;
   cohort = 1; do i = 1 to 100; output; end;
   cohort = 2; do i = 1 to 100; output; end;
   cohort = 3; do i = 1 to   2; output; end;
   drop i;
run;

proc genmod data=cohorts;
   ...
   lsmeans fixed / ... om=obsm;
run;

The problem is now that this gives a segfault in PROC GENMOD on our central SAS installation, so I guess we'll be reaching out to Tech Support.
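
Independently of the crash, the margins carried by the OM dataset can be checked by tabulating it. This is only a sketch of such a check, assuming the obsm dataset built above; the percentages for cohort should come out at roughly 49.5%, 49.5% and 1%.

```sas
/* Tabulate the OM dataset: the Percent column shows the relative
   weight that each cohort level contributes to the margins */
proc freq data=obsm;
   tables cohort / nocum;
run;
```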

Rick_SAS · Posted 01-18-2022 06:19 AM

What version of SAS are you running? Submit

%put &=SYSVLONG4;

and send back the line that appears in the log that looks something like this:

121 %put &=SYSVLONG4;
SYSVLONG4=9.04.01M6P11072018

When I run your program on SAS 9.4m7, it does not crash and it gives an answer, so perhaps this issue has been resolved in a more recent version of SAS.

In your program, the FIXED variable has only one level. That is a degenerate situation. Please try your program when the CLASS variables have more than one level. For example, you might try running the following:

data cohorts;
   do fixed = 1 to 2;
      cohort = 1; numer = 10; ldenom = log(100); output;
      cohort = 2; numer =  9; ldenom = log(100); output;
      cohort = 3; numer =  1; ldenom =   log(2); output;
   end;
run;

proc genmod data=cohorts;
   class cohort fixed;
   model numer = fixed cohort / dist=poisson offset=ldenom;
   lsmeans fixed / e diff cl;
run;

data obsm;
   do fixed = 1 to 2;
      cohort = 1; do i = 1 to 100; output; end;
      cohort = 2; do i = 1 to 100; output; end;
      cohort = 3; do i = 1 to   2; output; end;
   end;
   drop i;
run;

proc genmod data=cohorts;
   class cohort fixed;
   model numer = fixed cohort / dist=poisson offset=ldenom;
   lsmeans fixed /  e diff cl om=obsm;
run;

pblls · Posted 01-18-2022 09:13 AM

We're also on 9.4M7:

SYSVLONG4=9.04.01M7P08052020

My example and your extension both work on a second, local installation of SAS (but not on the server). However, if I try to run it with some actual data (a few thousand observations in the obsm dataset), I get a segfault in that local environment as well. Both installations are running the same SAS version.

This will probably not be fixed with just some additional SAS code, so I'll try to get some technical support.

To continue in the spirit of this thread, is there a way to specify the margins without creating that number of observations? Ideally I would like to do something like this and not need a huge number of records:

data obsm;
   do fixed = 1, 2;
      cohort = 1; _MARGIN_ = 100; output;
      cohort = 2; _MARGIN_ = 100; output;
      cohort = 3; _MARGIN_ =   2; output;
   end;
run;

StatDave · Posted 01-18-2022 12:06 PM

It's not at all clear what your ultimate goal is with this, but it appears that you simply have an observed proportion for each of three groups. If the goal is to compare those group proportions, this can be done by fitting an appropriate model and using the NLMeans macro to make pairwise comparisons. Since the data are just counts from an aggregated binary response, the appropriate model is a logistic model. The following code fits the model and then does the comparisons with the NLMeans macro following the discussion in this note:

data cohorts;
   input cohort num den;
   datalines;
1 10 100
2 9 100
3 1 2
;

proc logistic data=cohorts;
   class cohort / param=glm;
   model num/den = cohort;
   lsmeans cohort / ilink e diff cl;
   ods output coef=c;
   store log;
run;

%nlmeans(instore=log, coef=c, link=logit)

pblls · Posted 01-19-2022 06:41 AM

Ah, yes, I didn't really go into the 'why' because unfortunately we're mostly stuck with this method for historical reasons. The goal is to provide confidence intervals for an event rate, and without the covariate the model matches very closely with the method of Ulm (10.1093/oxfordjournals.aje.a115507) which was used on the pool of cohorts.

The data are event counts with follow-up time in the denominator. While I only have aggregates across cohorts, individuals may have counts >1, so I'm not sure logistic regression is appropriate here.

StatDave · Posted 01-19-2022 09:37 AM

If you counted the number of events that occur in each cohort (the numerator) out of a total number of observed individuals in each cohort (the denominator), then at the individual level the response is binomial and the logistic model is appropriate. If, for some reason, you want to assume that the cohorts have the same event probability, then you could simply remove COHORT from the model and estimate the common event probability:

proc logistic data=cohorts;
   model num/den = ;
   estimate 'pr' intercept 1 / ilink;
run;