svh
Lapis Lazuli | Level 10

I am interested in conducting a data simulation that will help me understand the sample size for a study with a repeated measures design in which the subjects are measured at two time points. (I've been working with the book Simulating Data in SAS, but I'm not that familiar with PROC IML, so I have a learning curve ahead of me.)

 

The subjects are administered a survey in which the ordinal outcomes are highly right skewed, and I am trying to understand whether I have enough subjects to detect an effect of an intervention. In my code below, I'm simulating two years of data from a discrete distribution based on the distribution from 2019. I am looking to measure change in 2020 in a group of matched individuals. In the following code, the simulation builds two independent data sets from which I randomly sample (with replacement) to get a sample size of 10 subjects per year (this is the size of the group I'm studying; my full study will have many groups, but I'm trying to simulate a study with one level of group for now). 

 

I don't think it's correct to just concatenate the data sets and assign the integers 1-10 in each year because it is probably more likely for an individual to move to an adjacent category and not three or four points away. How does a data simulation take this into account with a discrete distribution? Does anyone know of SAS white papers that get to this topic?

 

data TimeFirst (keep=Y Group Year);
call streaminit(4321);
array p[7] (0.6 0.35 0.01 0.01 0.01 0.01 0.01); /* probabilities for current distribution */
do i = 1 to 100000;
Y = rand("Table", of p[*]);
Group = 'Rural';
Year = '2019';
output;
end;
run;

data TimeSecond (keep=Y Group Year);
call streaminit(4321);
array p[7] (0.5 0.45 0.01 0.01 0.01 0.01 0.01); /* probabilities based on assumption of effect of intervention */
do i = 1 to 100000;
Y = rand("Table", of p[*]); 
Group = 'Rural';
Year = '2020';
output;
end;
run;

data All;
   set Time:;
   run;
proc sort data=All;
   by Group Year;
   run;
proc surveyselect data=All out=All_sample method=urs n=10;
   strata Group Year; /*I am sampling from my simulations*/
   run;
/*The problem here is that I need subjects to be matched on a variable like ID, but this produces a sample data set of independent observations. My goal is to be able to run an analysis of the form:
proc mixed data=all_sample;
   class year ID;
   model Y = year;
   repeated  Year / Subject = ID type = cs;
   run;
   
*/
Accepted Solution
PGStats
Opal | Level 21

You can make your 2020 dataset conditional on the 2019 data, this way:

 


data all;
call streaminit(4321);
array p[7,7]  _temporary_ /* conditional probabilities for second year response */
	(0.5  0.2  0.1  0.05 0.05 0.05 0.05
	 0.1  0.5  0.1  0.1  0.1  0.05 0.05
	 0.1  0.1  0.5  0.1  0.1  0.05 0.05
	 0.05 0.1  0.1  0.5  0.1  0.1  0.05
	 0.05 0.05 0.1  0.1  0.5  0.1  0.1
	 0.05 0.05 0.1  0.1  0.1  0.5  0.1
	 0.05 0.05 0.05 0.05 0.1  0.2  0.5
	 );
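set timeFirst;   /* read each simulated 2019 response */
output;          /* write the 2019 record */
/* draw the 2020 response from the row of p selected by the 2019 value */
Y = rand("Table", p[y,1], p[y,2], p[y,3], p[y,4], p[y,5], p[y,6], p[y,7]);
year = '2020';
output;          /* write the matched 2020 record */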
run;
PG

svh
Lapis Lazuli | Level 10

I think I see what's happening here--the matrix is a set of conditional probabilities, which I would alter based on what I think the change could be at the next time point. 
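
As a rough sketch of how the conditional approach could feed the matched analysis from the original question, the step below generates both years for each subject and carries an ID along. The data set name Matched, the ID variable, and the choice of 10 subjects are assumptions made for illustration; the 2019 probabilities and the transition matrix are copied from the posts above. Note that the 2020 marginal distribution is implied by the 2019 distribution and the matrix: with these values, the probability of category 1 in 2020 works out to about 0.34 rather than the 0.5 assumed in the question, so the rows would need tuning to reflect the intended effect.

```sas
/* Sketch only: matched 2019/2020 responses for one group of 10 subjects. */
/* Matched, ID, and n = 10 are illustrative assumptions.                  */
data Matched (keep=ID Group Year Y);
   call streaminit(4321);
   array p0[7] (0.6 0.35 0.01 0.01 0.01 0.01 0.01); /* 2019 distribution */
   array p[7,7] _temporary_   /* row = 2019 value, column = 2020 value */
      (0.5  0.2  0.1  0.05 0.05 0.05 0.05
       0.1  0.5  0.1  0.1  0.1  0.05 0.05
       0.1  0.1  0.5  0.1  0.1  0.05 0.05
       0.05 0.1  0.1  0.5  0.1  0.1  0.05
       0.05 0.05 0.1  0.1  0.5  0.1  0.1
       0.05 0.05 0.1  0.1  0.1  0.5  0.1
       0.05 0.05 0.05 0.05 0.1  0.2  0.5);
   Group = 'Rural';
   do ID = 1 to 10;
      Year = '2019';
      Y = rand("Table", of p0[*]);          /* first-year response */
      output;
      Year = '2020';
      /* second-year response, conditional on the first-year value */
      Y = rand("Table", p[Y,1], p[Y,2], p[Y,3], p[Y,4], p[Y,5], p[Y,6], p[Y,7]);
      output;
   end;
run;

/* The analysis from the question, now with a matching variable */
proc mixed data=Matched;
   class Year ID;
   model Y = Year;
   repeated Year / subject=ID type=cs;
run;
```

Because each subject contributes one row per year with the same ID, the REPEATED statement can treat the two responses as correlated, which the independently sampled data sets could not provide.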

Rick_SAS
SAS Super FREQ

I don't fully understand your design, but as PGStats says, you might want to generate both years at once. First, generate the 2019 value for a subject, then generate the 2020 value based on a random deviation (and treatment group effect?) from the 2019 value.

 

My advice is to write out the mixed model that you are trying to fit. You then will simulate from that model.

 

For an example of this kind of simulation study for power, see Psioda (2012). The paper is very good except for pp. 9-10. You can also ignore the NumIterPer parameter in his study and just set it equal to the Iterations parameter. The added complexity isn't worth the time savings.
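
A minimal sketch of what such a power simulation might look like, assuming the "random deviation" is a shift of -1, 0, or +1 category from the 2019 value (clamped to the 1-7 scale). The shift probabilities, the macro variables NumSamples and N, and the data set names Sim, Tests, and Reject are all hypothetical placeholders, not values from the thread or from the Psioda paper. The idea is to generate many simulated studies, fit the model to each one with a BY statement, and estimate power as the proportion of studies in which the Year effect is significant.

```sas
%let NumSamples = 1000;   /* assumed number of simulated studies */
%let N = 10;              /* subjects per simulated study        */

data Sim (keep=SampleID ID Year Y);
   call streaminit(4321);
   array p0[7] (0.6 0.35 0.01 0.01 0.01 0.01 0.01); /* 2019 distribution */
   array d[3] (0.15 0.60 0.25);   /* assumed P(shift = -1, 0, +1) */
   do SampleID = 1 to &NumSamples;
      do ID = 1 to &N;
         Year = '2019';
         Y = rand("Table", of p0[*]);
         output;
         Year = '2020';
         shift = rand("Table", of d[*]) - 2;   /* -1, 0, or +1 */
         Y = min(7, max(1, Y + shift));        /* keep Y within 1..7 */
         output;
      end;
   end;
run;

ods exclude all;          /* suppress printed output for each BY group */
proc mixed data=Sim;
   by SampleID;
   class Year ID;
   model Y = Year;
   repeated Year / subject=ID type=cs;
   ods output Tests3=Tests;   /* Type 3 test of the Year effect */
run;
ods exclude none;

data Reject;
   set Tests;
   Reject = (. < ProbF < 0.05);   /* a missing p-value counts as no rejection */
run;

proc means data=Reject mean;      /* mean of Reject = estimated power */
   var Reject;
run;
```

With only 10 subjects and a coarse ordinal outcome, some simulated studies may fail to converge or produce a missing test; the sketch counts those as non-rejections, which is one of several possible conventions.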

svh
Lapis Lazuli | Level 10

In my design, I actually am testing for change in an outcome variable over time (the outcome is the perception of mistreatment in the educational setting). However, I have a random effect because participants are clustered in educational programs. I'm first trying to wrap my mind around how to conduct the simulation with one group--the reality is that mistreatment can vary across programs due to various reasons, so I will need to scale up to simulating the random effect of program. 

Rick_SAS
SAS Super FREQ

For a random effect, sample the effect from N(0, sigma) once for each cluster. That value is used for all measurements within the cluster. 
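
One way this could look in a DATA step is sketched below, under the assumption that the ordinal response is obtained by cutting a latent normal variable at fixed thresholds. That latent-variable model is a choice made here for illustration, not something specified in the thread. The values of sigmaProg, sigmaSubj, and beta2020, the number of programs, and the data set name SimPrograms are hypothetical. The cut points are approximately the standard normal quantiles of the cumulative 2019 probabilities (0.6, 0.95, 0.96, ...), so they match the 2019 distribution only when the random effects are zero; once the random effects add variance, the marginal distribution will differ and the cut points would need recalibrating.

```sas
%let NumPrograms = 20;    /* assumed number of programs (clusters) */
%let NPerProgram = 10;    /* assumed subjects per program          */

data SimPrograms (keep=Program ID Year Y);
   call streaminit(4321);
   array cut[6] _temporary_ (0.25 1.64 1.75 1.88 2.05 2.33); /* illustrative thresholds */
   sigmaProg = 0.5;   /* assumed SD of the program random effect */
   sigmaSubj = 1.0;   /* assumed SD of the subject random effect */
   beta2020  = 0.3;   /* assumed shift in the latent mean in 2020 */
   do Program = 1 to &NumPrograms;
      u = rand("Normal", 0, sigmaProg);      /* drawn once per program */
      do Subj = 1 to &NPerProgram;
         ID = (Program - 1) * &NPerProgram + Subj;
         v = rand("Normal", 0, sigmaSubj);   /* drawn once per subject */
         do Year = 2019, 2020;
            /* same u for everyone in the program, same v for both years */
            latent = u + v + beta2020 * (Year = 2020) + rand("Normal");
            Y = 1;
            do k = 1 to dim(cut);            /* count thresholds exceeded */
               if latent > cut[k] then Y = Y + 1;
            end;
            output;
         end;
      end;
   end;
run;
```

The key point from the reply is the placement of the two RAND calls for u and v: each sits outside the loop over the units that share it, so the value is reused for all measurements within the cluster.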
