BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
DanielQuay
Quartz | Level 8

I am at a bit of a loss.  Or perhaps I am once again making things more difficult than they have to be as I am rather good at that.

I'm working with a data set with over 400k observations.

I have a binary variable to indicate whether or not an organization participated (1) in a program or did not (0).

I have another variable which indicates whether an incident happened (1) or did not (0).

I have each of these repeated for 2016-2021.

 

proc freq data = HAVE;
tables PROGRAM * Incident;
by year;
run;

 

I can manually calculate the rate per 10,000 and graph that using the code above, but I am looking for a way to calculate the rate within SAS so that I could then determine if there is a difference in the slopes between those that participated in the program and those that did not with regard to incidents.  And I'm not sure precisely how to go about that part of things within SAS.

1 ACCEPTED SOLUTION

Accepted Solutions
StatDave
SAS Super FREQ

This reads your data and plots the rates per 10k:

 

data a; input year a b c d; 
yearindx=year-2015;
program=1; NumInc=a; rateden=log(c/10000); rate=a/(c/10000); output;
program=0; NumInc=b; rateden=log(d/10000); rate=b/(d/10000); output;
datalines;
2016    33      52      21903   48969
2017    23      37      21637   47737
2018    26      38      21608   47675
2019    30      43      21156   46878
2020    24      56      20109   45540
2021    16      46      19134   43691
;
proc sgplot;
scatter y=rate x=year / group=program;
series y=rate x=year / group=program;
run;

Because the trends are pretty clearly nonlinear, it would be a mistake to just fit lines to the two program groups and compare their slopes. These curves appear approximately cubic. To compare the trends over time, you can follow the approach discussed and illustrated with a rates model in this note, but modified to allow for cubic fits. Note that it is better to not use the actual year values when modeling because they are so large.

proc genmod;
class program;
model NumInc = program|yearindx|yearindx|yearindx / dist=poisson noint offset=rateden;
effectplot / moff obs;
run;

The results show that the program trends don't differ significantly - the differences between programs on the linear, quadratic, and cubic components (program*yearindx, program*yearindx*yearindx, and program*yearindx*yearindx*yearindx) are not significant.

 

View solution in original post

6 REPLIES 6
StatDave
SAS Super FREQ

Rate per 10,000 ... what? Show an example of your data. I assume that for each observation in your data you have values for variables indicating organization, year, program participation or not, and incident occurred or not. What are the observations - are there multiple observations for each organization and/or year? The answer will probably involve a modeling approach along the lines of what is shown in this note if you really have rates rather than proportions. 

DanielQuay
Quartz | Level 8

So, it's per total visits.

Data would be...

 

year     Participant     Incident     Visit

2016          1                         1                1

2016          1                         0                1

2016          0                         0                1

2016          0                         1                1

2016          1                         0                1

2016          0                         1                1

2016          1                         1                1

2016          0                        0                1

2016          1                         1                1

 

output looks like this when I run the table:

YEARProgram and IncidentNot Program and IncidentProgram and Total VisitsNot Program and Total Visits
201633522190348969
201723372163747737
201826382160847675
201930432115646878
202024562010945540
202116461913443691
ballardw
Super User

When you have 1/0 coded variables then the MEAN of the variables is in effect the percent of 1s in the values.

Percent when multiplied by 100 is "rate per hundred". Multiplied by 10,000 it would be rate per 10,000.

 

So if you do something like:

proc means data=have nway;
   class year program;
   var incident;
output out=summary mean=; run;

You will have the incident percentage per level of the Program variable. Then use the Summary data set created to make the per 10,000 instead of per 100. IF any of your Visit values is ever other than 1 you would use that variable as a FREQ variable to indicate that observation represents more than one visit.

 

I am confused about your proc freq code with variables Program and Incident and your example data with Participant and Incident though. There seems to be a disconnect in your description and or the "rates" you might be talking about.

It is always a very good idea to very clearly state what the numerator and denominator of a rate should be.

 

Also confused about "slopes" as there isn't any obvious to me.

 

 

 

DanielQuay
Quartz | Level 8

I'll give that code a shot. 

referring to the slopes, this is a graph of the rates when I calculated them manually:

DanielQuay_0-1691187398124.png

I am trying to figure out how to determine if there is a statistically significant difference between the trend lines.

StatDave
SAS Super FREQ

This reads your data and plots the rates per 10k:

 

data a; input year a b c d; 
yearindx=year-2015;
program=1; NumInc=a; rateden=log(c/10000); rate=a/(c/10000); output;
program=0; NumInc=b; rateden=log(d/10000); rate=b/(d/10000); output;
datalines;
2016    33      52      21903   48969
2017    23      37      21637   47737
2018    26      38      21608   47675
2019    30      43      21156   46878
2020    24      56      20109   45540
2021    16      46      19134   43691
;
proc sgplot;
scatter y=rate x=year / group=program;
series y=rate x=year / group=program;
run;

Because the trends are pretty clearly nonlinear, it would be a mistake to just fit lines to the two program groups and compare their slopes. These curves appear approximately cubic. To compare the trends over time, you can follow the approach discussed and illustrated with a rates model in this note, but modified to allow for cubic fits. Note that it is better to not use the actual year values when modeling because they are so large.

proc genmod;
class program;
model NumInc = program|yearindx|yearindx|yearindx / dist=poisson noint offset=rateden;
effectplot / moff obs;
run;

The results show that the program trends don't differ significantly - the differences between programs on the linear, quadratic, and cubic components (program*yearindx, program*yearindx*yearindx, and program*yearindx*yearindx*yearindx) are not significant.

 

DanielQuay
Quartz | Level 8

That is was precisely what I was struggling to do.  Everything I had tried up to now got me close but didn't seem exactly right.
I can't thank you enough mate, now I get to learn about cubic relationships and I really appreciate the note you attached.

 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 6 replies
  • 1101 views
  • 1 like
  • 3 in conversation