- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I am at a bit of a loss. Or perhaps I am once again making things more difficult than they have to be as I am rather good at that.
I'm working with a data set with over 400k observations.
I have a binary variable to indicate whether or not an organization participated (1) in a program or did not (0).
I have another variable which indicates whether an incident happened (1) or did not (0).
I have each of these repeated for 2016-2021.
proc freq data = HAVE;
tables PROGRAM * Incident;
by year;
run;
I can manually calculate the rate per 10,000 and graph that using the code above, but I am looking for a way to calculate the rate within SAS so that I could then determine if there is a difference in the slopes between those that participated in the program and those that did not with regard to incidents. And I'm not sure precisely how to go about that part of things within SAS.
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
This reads your data and plots the rates per 10k:
data a; input year a b c d;
yearindx=year-2015;
program=1; NumInc=a; rateden=log(c/10000); rate=a/(c/10000); output;
program=0; NumInc=b; rateden=log(d/10000); rate=b/(d/10000); output;
datalines;
2016 33 52 21903 48969
2017 23 37 21637 47737
2018 26 38 21608 47675
2019 30 43 21156 46878
2020 24 56 20109 45540
2021 16 46 19134 43691
;
proc sgplot;
scatter y=rate x=year / group=program;
series y=rate x=year / group=program;
run;
Because the trends are pretty clearly nonlinear, it would be a mistake to just fit lines to the two program groups and compare their slopes. These curves appear approximately cubic. To compare the trends over time, you can follow the approach discussed and illustrated with a rates model in this note, but modified to allow for cubic fits. Note that it is better to not use the actual year values when modeling because they are so large.
proc genmod;
class program;
model NumInc = program|yearindx|yearindx|yearindx / dist=poisson noint offset=rateden;
effectplot / moff obs;
run;
The results show that the program trends don't differ significantly - the differences between programs on the linear, quadratic, and cubic components (program*yearindx, program*yearindx*yearindx, and program*yearindx*yearindx*yearindx) are not significant.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Rate per 10,000 ... what? Show an example of your data. I assume that for each observation in your data you have values for variables indicating organization, year, program participation or not, and incident occurred or not. What are the observations - are there multiple observations for each organization and/or year? The answer will probably involve a modeling approach along the lines of what is shown in this note if you really have rates rather than proportions.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
So, it's per total visits.
Data would be...
year Participant Incident Visit
2016 1 1 1
2016 1 0 1
2016 0 0 1
2016 0 1 1
2016 1 0 1
2016 0 1 1
2016 1 1 1
2016 0 0 1
2016 1 1 1
output looks like this when I run the table:
YEAR | Program and Incident | Not Program and Incident | Program and Total Visits | Not Program and Total Visits |
2016 | 33 | 52 | 21903 | 48969 |
2017 | 23 | 37 | 21637 | 47737 |
2018 | 26 | 38 | 21608 | 47675 |
2019 | 30 | 43 | 21156 | 46878 |
2020 | 24 | 56 | 20109 | 45540 |
2021 | 16 | 46 | 19134 | 43691 |
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
When you have 1/0 coded variables then the MEAN of the variables is in effect the percent of 1s in the values.
Percent when multiplied by 100 is "rate per hundred". Multiplied by 10,000 it would be rate per 10,000.
So if you do something like:
proc means data=have nway; class year program; var incident;
output out=summary mean=; run;
You will have the incident percentage per level of the Program variable. Then use the Summary data set created to make the per 10,000 instead of per 100. IF any of your Visit values is ever other than 1 you would use that variable as a FREQ variable to indicate that observation represents more than one visit.
I am confused about your proc freq code with variables Program and Incident and your example data with Participant and Incident though. There seems to be a disconnect in your description and or the "rates" you might be talking about.
It is always a very good idea to very clearly state what the numerator and denominator of a rate should be.
Also confused about "slopes" as there isn't any obvious to me.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I'll give that code a shot.
referring to the slopes, this is a graph of the rates when I calculated them manually:
I am trying to figure out how to determine if there is a statistically significant difference between the trend lines.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
This reads your data and plots the rates per 10k:
data a; input year a b c d;
yearindx=year-2015;
program=1; NumInc=a; rateden=log(c/10000); rate=a/(c/10000); output;
program=0; NumInc=b; rateden=log(d/10000); rate=b/(d/10000); output;
datalines;
2016 33 52 21903 48969
2017 23 37 21637 47737
2018 26 38 21608 47675
2019 30 43 21156 46878
2020 24 56 20109 45540
2021 16 46 19134 43691
;
proc sgplot;
scatter y=rate x=year / group=program;
series y=rate x=year / group=program;
run;
Because the trends are pretty clearly nonlinear, it would be a mistake to just fit lines to the two program groups and compare their slopes. These curves appear approximately cubic. To compare the trends over time, you can follow the approach discussed and illustrated with a rates model in this note, but modified to allow for cubic fits. Note that it is better to not use the actual year values when modeling because they are so large.
proc genmod;
class program;
model NumInc = program|yearindx|yearindx|yearindx / dist=poisson noint offset=rateden;
effectplot / moff obs;
run;
The results show that the program trends don't differ significantly - the differences between programs on the linear, quadratic, and cubic components (program*yearindx, program*yearindx*yearindx, and program*yearindx*yearindx*yearindx) are not significant.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
That is was precisely what I was struggling to do. Everything I had tried up to now got me close but didn't seem exactly right.
I can't thank you enough mate, now I get to learn about cubic relationships and I really appreciate the note you attached.