sorting in proc genmod

03-04-2013 06:51 PM

We ran into a weird issue when using PROC GENMOD.

The code is straightforward. We have repeated-measures data for hospitals, so we use hsp_ID as the subject in the REPEATED statement. totnum is the total numerator and totdenom is the total denominator for each hospital by quarter. This step is variable selection: we add one candidate variable at a time and check its p-value. If the p-value is greater than 0.1 we remove the variable; if it is less than 0.1 we keep it in the model.

Before running this code, we sort the data by hospital ID. But when two people ran the same code, they got different output. We eventually figured out that one person had sorted the data by hospital ID and quarter, while the other had sorted by hospital ID and status (another variable in the dataset).

proc genmod data=dsn;

class hsp_ID &indvars.;

model totnum/totdenom=&indvars./dist=binomial link=logit;

repeated subject=hsp_ID/type=AR corrw;

run;

Sometimes both p-values are greater than 0.1, or both are less than 0.1, but sometimes one is greater than 0.1 and the other is less. So the same dataset and the same code give different output depending on how the data are sorted.

Grateful for any thoughts or suggestions!

Kui

Accepted Solutions

03-05-2013 11:32 AM

I don't use GENMOD much, preferring GLIMMIX. In GLIMMIX, you would specify a repeated measures model of this sort as something like:

proc glimmix data=dsn;

class hsp_ID &indvars.;

model totnum/totdenom=&indvars./dist=binomial link=logit;

random quarter /residual subject=hsp_ID type=AR(1); /* For a GEE type model; for a true GLMM with a repeated structure for this distribution, drop 'residual' */

run;

Here I assume that quarter is in the list of &indvars. This approach tells me that I should sort my data by subject and then by the indexing variable (here, quarter). Because of this schema, I haven't seen the problem you found with GENMOD. Because the GENMOD code above does not specify the indexing variable (the REPEATED statement's WITHIN= option exists for that purpose), I too worry that if the data are not sorted in a way that reflects the repeated structure, the algorithm may produce a discrepancy.
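For comparison, a minimal sketch of how the original GENMOD call might name the time index explicitly through the WITHIN= option of the REPEATED statement. The variable `qtr_index` is hypothetical: it is assumed to be a CLASS copy of the sequential quarter counter that is not already in &indvars. Whether AR(1) is the right working correlation for these data is a separate question.

```sas
proc genmod data=dsn;
    /* qtr_index (hypothetical name) is a copy of the sequential quarter  */
    /* counter; the within-subject effect must appear in CLASS.           */
    class hsp_ID qtr_index &indvars.;
    model totnum/totdenom = &indvars. / dist=binomial link=logit;
    /* WITHIN= identifies each row's time point, so the AR(1) working     */
    /* correlation is not tied to the row order within a hospital.        */
    repeated subject=hsp_ID / within=qtr_index type=AR(1) corrw;
run;
```

With the time index supplied this way, the two sort orders described in the question would be expected to give the same fit, though that is worth confirming on the actual data.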

Steve Denham

All Replies

03-05-2013 08:56 AM

Not surprising behavior. These communities are filled with posts pointing out the dangers of model building using such stepwise methods, and they get even worse when the data are not normally distributed. My recommendation is to do something (almost anything) different for the model building.

If you MUST do this, then sorting by hospital ID and then quarter preserves the repeated nature correctly, and so would be a better choice.
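A minimal sketch of that sort step, assuming the sequential quarter variable is named `quarter`; the output dataset name `dsn_sorted` is made up for illustration.

```sas
/* Put rows in time order within each hospital before fitting. */
/* Assumes quarter is the sequential quarter counter.          */
proc sort data=dsn out=dsn_sorted;
    by hsp_ID quarter;
run;
```

The model step would then read `data=dsn_sorted`, so everyone running the code starts from the same row order.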

Steve Denham

Posted in reply to SteveDenham

03-05-2013 11:22 AM

Thanks, Steve, for your suggestions!

I am still confused about why the same dataset, with only the sort order changed, gives different output. It is really dangerous, and I have somewhat lost confidence in using the procedure.

Posted in reply to SteveDenham

03-05-2013 11:52 AM

I am grateful for your time and help.

We have two types of "quarter" in the list. One takes the values 1, 2, 3 and 4 and accounts for seasonality. The other is a continuous quarter counter, 1, 2, ..., 22, covering the 22 quarters in our data. Both are candidate variables for the model. The outcome is the rate (totnum/totdenom), which changes by quarter.

Which is more appropriate to use as the indexing variable? And how should I understand putting quarter in the RANDOM statement?

Thanks,

Kui

03-05-2013 12:43 PM

The second is the repeated measure. I would call the first "season" for obvious reasons (not wanting to confuse the two types of quarter).

Now, shifting gears a bit--why go through the model building effort in this way? What is the ultimate objective? If you want to develop a predictive model, backward stepwise is almost sure to result in a model that has inadequate predictive ability. See Cassell and Flom http://www.nesug.org/Proceedings/nesug09/sa/SA01.pdf, or http://www.denversug.org/presentations/2010CODay/StopStepPresntn.pdf, or Frank Harrell's *Regression Modeling Strategies* (2001) at http://biostat.mc.vanderbilt.edu/twiki/pub/Main/RmS/rms.pdf.

Steve Denham

03-19-2013 02:45 PM

An AR(k) structure makes the ordering of your observations within a cluster matter; that is the problem.

Are you sure you need to impose this structure on your covariance matrix?

Posted in reply to oloolo

03-19-2013 02:54 PM

Thanks! I changed it to default.

03-19-2013 03:00 PM

Well, the default working correlation in GENMOD is independent (TYPE=IND), which, like exchangeable (the same as compound symmetry), does not depend on the order of observations within a subject, so sorting will not make a difference. However, if this is truly repeated in time (such as season) then the ordering is important: (spring, summer, fall, winter) is not the same as (summer, spring, winter, fall).
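To make the contrast concrete, here are the standard forms of the two working correlation structures, written for observations $j$ and $k$ on the same subject $i$. The notation ($Y_{ij}$, $\rho$) is introduced here for illustration and does not appear in the GENMOD output.

```latex
% Exchangeable (compound symmetry): every pair within a subject shares one correlation
\operatorname{Corr}(Y_{ij}, Y_{ik}) = \rho, \qquad j \neq k

% AR(1): correlation decays with the distance between positions j and k
\operatorname{Corr}(Y_{ij}, Y_{ik}) = \rho^{\,|j-k|}
```

In the exchangeable form, relabeling the positions changes nothing. In the AR(1) form, $j$ and $k$ are positions within the subject, so when those positions are taken from the row order, re-sorting the rows changes which pairs are treated as adjacent.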

Steve Denham

Posted in reply to SteveDenham

03-22-2013 12:20 PM

Thanks so much for your time, help and suggestions!

When I applied GENMOD to compare the slope change for the measures SI3 and SI10, I ran into another issue I cannot understand.

The rate is the outcome variable; we use numerator/denominator to represent it.

We have a set of independent variables to build the model.

The interaction term measure*time is the output we are most interested in, since we want to know which measure's rate changes faster. The estimate for time is the slope (rate of change) for the reference group (SI10), and the estimate for measure*time is the difference in slope between SI3 and the reference group.

In the output, the estimate for SI10 is 0.231 and the estimate for SI3 is 0.140, which means the rate for SI10 changes more quickly than that for SI3. Is it because we use the logit link? Does going from 0.98 to 0.99 need a bigger value (slope) to make it happen? Does that mean I cannot use a simple linear plot to visualize this?
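One way to examine the logit question is to compute how far the same 0.01 step in the rate moves on the logit scale at different starting points. The sketch below is only an illustration with made-up rates; the dataset and variable names (`logit_demo`, `p_from`, `p_to`) are hypothetical and not taken from the model above.

```sas
data logit_demo;
    input p_from p_to;
    /* change on the logit (log-odds) scale, the scale of the GENMOD estimates */
    logit_change = log(p_to/(1 - p_to)) - log(p_from/(1 - p_from));
    /* change on the probability (rate) scale */
    prob_change = p_to - p_from;
    datalines;
0.50 0.51
0.98 0.99
;
run;

proc print data=logit_demo noobs;
run;
```

Both rows have the same change in the rate, but the logit change is roughly 0.04 for the first and roughly 0.70 for the second. A larger slope on the logit scale therefore does not by itself mean a faster change in the rate, which is why plots are usually drawn from predicted rates rather than from the raw coefficients.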

Thanks again!

Kui