turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Find a Community

- Home
- /
- Analytics
- /
- Stat Procs
- /
- Statistical help for running poisson regression on...

Topic Options

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

04-26-2017 01:33 PM

Hello,

My name is Brandon and this is my first time posting on here. I am currently trying to overall determine if increased temperatures are leading to increased hospital admissions. I am running my code using SAS software version 9.4 (TS1M4).

Attached will be the file I am working with, but a little more about what my objective is and my background.

I am trying to create a logistic model which will show the equation for modeling temperatures (the predictor variable) to the number of total hospital admissions (the response variable). Because I have daily counts in my data, years 2009-2015 for every day of the year with such a large sample size, I am assuming that my distribution is poisson (large n, small p). I don't have much background in coding, and this is my first time using SAS, but I know that I am trying to run this using the categorical variable, county (which I believe means it will be reported as a class variable), and using different months.

I would love assistance in getting the correct output and code so that I can determine if increased temperatures, using average temperatures and max temperatures, are leading to increased hospital admissions in total. Then from there I believe I can investigate if the effect modification from each age and two particular diseases if I have the code for one model. (I believe it is just a matter of changing what the predictor and response variables).

Anyways, here is some of the code I have used thus far, but I get really large scaling estimates so I am not sure if I am running this procedure correctly or not. In addition, I get extremely large Wald Chi-Square values (ranging from 0.11 to 5,000,000).

========================================================================================================

proc import datafile = '\\files\users\bwoudhuysen\Desktop\EDdata.XLSX'

dbms = XLSX

replace

out = churchill;

sheet = "Churchill";

run;

proc import datafile = '\\files\users\bwoudhuysen\Desktop\EDdata.XLSX'

dbms = XLSX

replace

out = clark;

sheet = "Clark";

run;

proc import datafile = '\\files\users\bwoudhuysen\Desktop\EDdata.XLSX'

dbms = XLSX

replace

out = douglas;

sheet = "Douglas";

run;

proc import datafile = '\\files\users\bwoudhuysen\Desktop\EDdata.XLSX'

dbms = XLSX

replace

out = elko;

sheet = "Elko";

run;

proc import datafile = '\\files\users\bwoudhuysen\Desktop\EDdata.XLSX'

dbms = XLSX

replace

out = esmeralda;

sheet = "Esmeralda";

run;

proc import datafile = '\\files\users\bwoudhuysen\Desktop\EDdata.XLSX'

dbms = XLSX

replace

out = eureka;

sheet = "Eureka";

run;

proc import datafile = '\\files\users\bwoudhuysen\Desktop\EDdata.XLSX'

dbms = XLSX

replace

out = humboldt;

sheet = "Humboldt";

run;

proc import datafile = '\\files\users\bwoudhuysen\Desktop\EDdata.XLSX'

dbms = XLSX

replace

out = lander;

sheet = "Lander";

run;

proc import datafile = '\\files\users\bwoudhuysen\Desktop\EDdata.XLSX'

dbms = XLSX

replace

out = lincoln;

sheet = "Lincoln";

run;

proc import datafile = '\\files\users\bwoudhuysen\Desktop\EDdata.XLSX'

dbms = XLSX

replace

out = lyon;

sheet = "Lyon";

run;

proc import datafile = '\\files\users\bwoudhuysen\Desktop\EDdata.XLSX'

dbms = XLSX

replace

out = mineral;

sheet = "Mineral";

run;

proc import datafile = '\\files\users\bwoudhuysen\Desktop\EDdata.XLSX'

dbms = XLSX

replace

out = nye;

sheet = "Nye";

run;

proc import datafile = '\\files\users\bwoudhuysen\Desktop\EDdata.XLSX'

dbms = XLSX

replace

out = pershing;

sheet = "Pershing";

run;

proc import datafile = '\\files\users\bwoudhuysen\Desktop\EDdata.XLSX'

dbms = XLSX

replace

out = storey;

sheet = "Storey";

run;

proc import datafile = '\\files\users\bwoudhuysen\Desktop\EDdata.XLSX'

dbms = XLSX

replace

out = washoe;

sheet = "Washoe";

run;

proc import datafile = '\\files\users\bwoudhuysen\Desktop\EDdata.XLSX'

dbms = XLSX

replace

out = white_pine;

sheet = "White_Pine";

run;

proc import datafile = '\\files\users\bwoudhuysen\Desktop\EDdata.XLSX'

dbms = XLSX

replace

out = carson;

sheet = "Carson";

run;

data all; set churchill clark douglas elko esmeralda eureka humboldt lander lincoln lyon mineral nye pershing storey washoe white_pine carson;

run;

/* Add the appropriate variable to the data set */

data all;

set all;

lnPop = log(PopulationSize);

month=month(date); /* If you want to analyze by month . . . */

year=year(date); /* . . . or year . . . */

monyr=year*100 + month; /* . . . or month/year . . . */

if 4<=month<=8 then AprAug=1; /* . . . or something custom. */

else AprAug=0;

run;

/* Taking the month variable as an example, you can add it as a

categorical covariate by adding it to both the CLASS and MODEL Statements */

proc genmod data=all descending;

class county(param=ref ref ='Clark') month(param=ref ref = '3');

model TotalAdmissions = avgtemp county month / dist = poisson TYPE3

link = log

offset=lnPop

PSCALE

/* Or maybe use DSCALE? */

;

run;

/* You could also perform completely separate regressions for each period

by putting the variable in the BY Statement. However, to do that, the

data set must first be sorted.

*/

proc sort data=all;

by month;

proc genmod data=all;

by month;

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

04-27-2017 02:49 AM

Using Poisson regression on such count data is ok, but of some other reason than what you suggest. If you want to estimate the rate of event, then the likelihood function is the same as if the count data were Poisson distributed. This works even if n is not big and p (or the rate) is not small.

Secondly, it may be wise to aggregate your data before using proc genmod. Genmod is not smart enough to aggregate the data before it starts estimation, so if you have many records is faster to do the aggregate first. Just use proc means (with nway option) and summarize days and counts. Then put log(total_days) into the offset.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

04-27-2017 01:28 PM

This makes sense to aggregate first.

Excuse me for being new, but the what would the code for this look like?

I am guessing,

==================================================================================

data all; set churchill clark douglas elko esmeralda eureka humboldt lander lincoln lyon mineral nye pershing storey washoe white_pine carson;

run;

proc means data = all nway;

run;

proc sort data=all;

by month;

/* Add the appropriate variable to the data set */

data all;

set all;

TotalDays = log(43452);

month=month(date); /* If you want to analyze by month . . . */

year=year(date); /* . . . or year . . . */

monyr=year*100 + month; /* . . . or month/year . . . */

if 4<=month<=8 then AprAug=1; /* . . . or something custom. */

else AprAug=0;

run;

proc genmod data=all descending;

class county(param=ref ref ='Clark') month(param=ref ref = '3');

model TotalAdmissions = avgtemp county month / dist = poisson TYPE3

link = log

offset=TotalDays

DSCALE

;

run;

=====================================================================================

Unfortunately, I am still getting the same scaling issue with my models.

I am unsure if I am just doing this incorrectly or what, but thank you so much for your thoughts.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

04-28-2017 03:37 AM

You can aggregate your data with proc means, and then run the GENMOD procedure:

```
proc means data=all noprint nway;
var populationsize totaladmissions;
class county month avgtemp ;
output out=summary sum=populationsize totaladmissions;
run;
data summary;
set summary;
lnpop=log(populationsize);
run;
proc genmod data=summary descending;
class county(param=ref ref ='Clark') month(param=ref ref = '3');
model TotalAdmissions = avgtemp county month / dist = poisson TYPE3
link = log
offset=lnPop;
run;
```

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

04-28-2017 05:54 AM

Wouldn't I still need to use a scaling parameter? In this case, I applied the

DSCALE (after the offset or I could use PSCALE)

option and I get reasonable chi-square values, but my scaling estimate is 46.2103, but isn't this the case of overdispersion?

I feel like we are close I am just worried that the scale means there is an existence of overdispersion.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

04-28-2017 08:21 AM

You shouldn't use the scale options in such model, not even if you observe a high degree of overdispersion. This is because "overdispersion" is a measure of model fit, which is a check of if data has the Poisson-distribution. But, in this case, Poisson regression is used just because the likelihood function to estimate the rate of events is similar to the likelihood function if you regard data as Poisson distributed. You can therefore here use poisson regression to make interference (estimation and hypothesis testing), but not for something like checking distribution of data. An other way to see that it would be wrong is that the overdispersion statistic would rely much on how much you aggregate data before estimation of parameters, which then would lead to that the degree of aggregation would affect standard errors and confidence intervals - and that would not make sense.

If you actually had Poisson distributed count data then it would make sense to check for overdispersion and other kind of model validation.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

04-28-2017 08:53 AM

Okay. I think I am understanding.

So for interpretation purposes given the following output (in the attachments)

would I just say that all the counties were significant in having an effect for modeling temperature and hospital admissions.

More notably, the overall effect is given by

Y = -9.9652 + 0.0385X

where Y is equal to the total number of hospital admissions, and X is the temperatue.

Thus, with increase in temperature there is a positive association with the outcome of total hospital admissions?

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

04-28-2017 09:06 AM

All counties (apart from the reference counti) are significant from the reference counti. Further, Counti is overall significant (the type 3 test).

With the equation "Y = -9.9652 + 0.0385X" you estimate the log(rate) in the reference counti, x is avg-temperature. You should add the estimate of the counti difference if you want the estimate the log(rate) in some other counti.

Yes, avg temperature seem to be significant associated with increase in rate.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

04-28-2017 09:17 AM

Oh okay.

So the log(Totaladmission) = -9.9652 + 0.0385(avg temperature), but what do you mean estimate for county difference?

Like specify what each county is predicting for it?

And also thank you so much for everything you have helped me with so far! It has been super helpful.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

04-28-2017 11:25 AM

In the class statement you specified a reference level (or otherwise per default it would select one itself). Íf you dont include counti in the equation for your estimation then it is the estimate for the reference level. The estimates for the other counties than the reference level is the difference relative to the reference level. So, if you would estimate the level for some other level than the reference level then you should include the estimated difference in the equation.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

04-29-2017 03:40 PM

Hi Jacob,

So I took a step back and want to know what and why I am aggregating. It looks like for each county I combined all Januaries for the same average monthly temperature, and so no longer have a daily dataset. I have a monthly dataset and perhaps can answer the question if average monthly temperature is a predictor, but not on daily emergency visits.

Because of the huge scaling I am worried that poisson is now apoor choice (with scaling ~45). So maybe I am just not sure what is up here now, and confusing myself.

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

05-01-2017 03:22 PM - edited 05-01-2017 03:22 PM

Again, I know you said scaling isn't an issue, but I am just worried that it is a potential issue with stating that my poisson regression is not a good model, but I don't know what would be a better model.

Thanks,

- Mark as New
- Bookmark
- Subscribe
- Subscribe to RSS Feed
- Highlight
- Email to a Friend
- Report Inappropriate Content

05-02-2017 03:09 AM - edited 05-02-2017 03:10 AM

Think about if assumptions for Poisson regression is fulfilled. That is, persons should be independent from each other, and the rate should be piecewise constant. Further, for log-linear Poisson regression (the one with log as link-function), you also have to assume the multiplicative structure of the rate of events. But, you can not use deviance to justify Poisson regression as "Poisson regression" in this case is not used to fit Poisson distributed data, but only as a trick to maximize the right likelihood function.