Statistical Procedures

DanielQuay · Posted 08-04-2023 02:37 PM

I am at a bit of a loss. Or perhaps I am once again making things more difficult than they have to be as I am rather good at that.

I'm working with a data set with over 400k observations.

I have a binary variable to indicate whether or not an organization participated (1) in a program or did not (0).

I have another variable which indicates whether an incident happened (1) or did not (0).

I have each of these repeated for 2016-2021.

proc freq data = HAVE;
tables PROGRAM * Incident;
by year;
run;

I can manually calculate the rate per 10,000 and graph that using the code above, but I am looking for a way to calculate the rate within SAS so that I could then determine if there is a difference in the slopes between those that participated in the program and those that did not with regard to incidents. And I'm not sure precisely how to go about that part of things within SAS.

StatDave · Posted 08-04-2023 07:13 PM

This reads your data and plots the rates per 10k:

data a; input year a b c d; 
yearindx=year-2015;
program=1; NumInc=a; rateden=log(c/10000); rate=a/(c/10000); output;
program=0; NumInc=b; rateden=log(d/10000); rate=b/(d/10000); output;
datalines;
2016    33      52      21903   48969
2017    23      37      21637   47737
2018    26      38      21608   47675
2019    30      43      21156   46878
2020    24      56      20109   45540
2021    16      46      19134   43691
;
proc sgplot;
scatter y=rate x=year / group=program;
series y=rate x=year / group=program;
run;

Because the trends are pretty clearly nonlinear, it would be a mistake to just fit lines to the two program groups and compare their slopes. These curves appear approximately cubic. To compare the trends over time, you can follow the approach discussed and illustrated with a rates model in this note, but modified to allow for cubic fits. Note that it is better to not use the actual year values when modeling because they are so large.

proc genmod;
class program;
model NumInc = program|yearindx|yearindx|yearindx / dist=poisson noint offset=rateden;
effectplot / moff obs;
run;

The results show that the program trends don't differ significantly - the differences between programs on the linear, quadratic, and cubic components (program*yearindx, program*yearindx*yearindx, and program*yearindx*yearindx*yearindx) are not significant.

View solution in original post

StatDave · Posted 08-04-2023 04:12 PM

Rate per 10,000 ... what? Show an example of your data. I assume that for each observation in your data you have values for variables indicating organization, year, program participation or not, and incident occurred or not. What are the observations - are there multiple observations for each organization and/or year? The answer will probably involve a modeling approach along the lines of what is shown in this note if you really have rates rather than proportions.

DanielQuay · Posted 08-04-2023 04:28 PM

So, it's per total visits.

Data would be...

year Participant Incident Visit

2016 1 1 1

2016 1 0 1

2016 0 0 1

2016 0 1 1

2016 1 0 1

2016 0 1 1

2016 1 1 1

2016 0 0 1

2016 1 1 1

output looks like this when I run the table:

YEAR	Program and Incident	Not Program and Incident	Program and Total Visits	Not Program and Total Visits
2016	33	52	21903	48969
2017	23	37	21637	47737
2018	26	38	21608	47675
2019	30	43	21156	46878
2020	24	56	20109	45540
2021	16	46	19134	43691

ballardw · Posted 08-04-2023 06:11 PM

When you have 1/0 coded variables then the MEAN of the variables is in effect the percent of 1s in the values.

Percent when multiplied by 100 is "rate per hundred". Multiplied by 10,000 it would be rate per 10,000.

So if you do something like:

proc means data=have nway;
   class year program;
   var incident;
   output out=summary mean=;
run;

You will have the incident percentage per level of the Program variable. Then use the Summary data set created to make the per 10,000 instead of per 100. IF any of your Visit values is ever other than 1 you would use that variable as a FREQ variable to indicate that observation represents more than one visit.

I am confused about your proc freq code with variables Program and Incident and your example data with Participant and Incident though. There seems to be a disconnect in your description and or the "rates" you might be talking about.

It is always a very good idea to very clearly state what the numerator and denominator of a rate should be.

Also confused about "slopes" as there isn't any obvious to me.

DanielQuay · Posted 08-04-2023 06:18 PM

I'll give that code a shot.

referring to the slopes, this is a graph of the rates when I calculated them manually:

I am trying to figure out how to determine if there is a statistically significant difference between the trend lines.

StatDave · Posted 08-04-2023 07:13 PM

This reads your data and plots the rates per 10k:

data a; input year a b c d; 
yearindx=year-2015;
program=1; NumInc=a; rateden=log(c/10000); rate=a/(c/10000); output;
program=0; NumInc=b; rateden=log(d/10000); rate=b/(d/10000); output;
datalines;
2016    33      52      21903   48969
2017    23      37      21637   47737
2018    26      38      21608   47675
2019    30      43      21156   46878
2020    24      56      20109   45540
2021    16      46      19134   43691
;
proc sgplot;
scatter y=rate x=year / group=program;
series y=rate x=year / group=program;
run;

Because the trends are pretty clearly nonlinear, it would be a mistake to just fit lines to the two program groups and compare their slopes. These curves appear approximately cubic. To compare the trends over time, you can follow the approach discussed and illustrated with a rates model in this note, but modified to allow for cubic fits. Note that it is better to not use the actual year values when modeling because they are so large.

proc genmod;
class program;
model NumInc = program|yearindx|yearindx|yearindx / dist=poisson noint offset=rateden;
effectplot / moff obs;
run;

The results show that the program trends don't differ significantly - the differences between programs on the linear, quadratic, and cubic components (program*yearindx, program*yearindx*yearindx, and program*yearindx*yearindx*yearindx) are not significant.

DanielQuay · Posted 08-08-2023 09:27 AM

That is was precisely what I was struggling to do. Everything I had tried up to now got me close but didn't seem exactly right.
I can't thank you enough mate, now I get to learn about cubic relationships and I really appreciate the note you attached.

Statistical Procedures

Creating and Comparing Rates between Groups

Re: Creating and Comparing Rates between Groups

Re: Creating and Comparing Rates between Groups

Re: Creating and Comparing Rates between Groups

Re: Creating and Comparing Rates between Groups

Re: Creating and Comparing Rates between Groups

Re: Creating and Comparing Rates between Groups

Re: Creating and Comparing Rates between Groups

Follow Us

What is...

Statistical Procedures

Our biggest data and AI event of the year.

Follow Us

What is...