Dear SAS community,
I am currently trying to demonstrate the adverse effects of cluster sampling when the data within each cluster are correlated.
To do this, I want to generate a population of 500 clusters (say, city districts), where each cluster has 1000 correlated observations (say, the income of each inhabitant).
In a second step, I want to draw samples from this population (simple random sample, cluster sample) and compare their characteristics.
I am struggling to create correlated data within the clusters.
So far I have played around with the RAND function to create clusters and data points (code below).
Any help on how to pre-define rho within the clusters would be very much appreciated.
I hope this question is not too basic. Thanks in advance!
/* Step 1: Generate a data set that contains 500 clusters with each having 1000 inhabitants */
%let N = 1000;          /* cluster size (inhabitants per cluster) */
%let NumSamples = 500;  /* number of clusters */
data LOR;
   call streaminit(123);
   do SampleID = 1 to &NumSamples;   /* ID variable for each LOR */
      do IND = 1 to &N;
         tetha = 1000+SampleID*10;   /* Average Income */
         Lampda = 100;               /* Std. Dev in Cluster */
         INCOME_SPE = rand("Normal",tetha,lampda);
         output;
      end;
   end;
run;
I guess you could start with this:
/* Step 1: Generate a data set that contains 500 clusters with each having 1000 inhabitants */
%let N = 1000;          /* Cluster size */
%let NumSamples = 500;  /* Number of clusters */
%let Rho = 0.1;         /* Intracluster correlation coefficient */
%let Std = 100;         /* Std Dev within clusters */
%let Income = 1000;     /* Average income */
data LOR;
   call streaminit(123);
   /* Between-cluster std dev implied by the target Rho and the within-cluster Std */
   betweenStd = sqrt( (&Std)**2 * &Rho / (1-&Rho) );
   do SampleID = 1 to &NumSamples;   /* ID variable for each LOR */
      clusterIncome = rand("NORMAL", &Income, betweenStd);
      do IND = 1 to &N;
         INCOME_SPE = rand("Normal", clusterIncome, &Std);
         output;
      end;
   end;
run;
/* Estimate variance components */
proc varcomp data=lor;
   class sampleId;
   model income_spe = sampleId;
   ods output Estimates=varEst;
run;
/* Estimate intracluster correlation coefficient */
data rhoEstimate;
   set varEst(where=(VarComp="Var(SampleID)") rename=(income_spe=betweenVar));
   set varEst(where=(VarComp="Var(Error)") rename=(income_spe=withinVar));
   rhoEst = betweenVar / (betweenVar + withinVar);
   label
      betweenVar = "Variance between clusters"
      withinVar = "Variance within clusters"
      rhoEst = "Intracluster correlation estimate";
   drop varcomp;
run;
proc print data=rhoEstimate label noobs; run;
   Variance     Variance    Intracluster
    between       within     correlation
   clusters     clusters        estimate

     1080.5       9982.8        0.097663
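For readers wondering where the betweenStd expression comes from, here is a short derivation. It assumes the usual one-way random effects model, which is what the data step above appears to simulate. The symbols $\mu$, $u_i$, $e_{ij}$, $\sigma_b$ and $\sigma_w$ are my own notation, not names from the code.

```latex
% Assumed model: observation j in cluster i
y_{ij} = \mu + u_i + e_{ij}, \qquad
u_i \sim N(0, \sigma_b^2), \quad e_{ij} \sim N(0, \sigma_w^2)

% Intracluster correlation: correlation between two different
% observations j \neq k from the same cluster i
\rho = \operatorname{Corr}(y_{ij}, y_{ik})
     = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_w^2}

% Solving for the between-cluster variance
\sigma_b^2 = \sigma_w^2 \, \frac{\rho}{1 - \rho}

% With \sigma_w = 100 and \rho = 0.1
\sigma_b^2 = 10000 \cdot \frac{0.1}{0.9} \approx 1111.1, \qquad
\sigma_b \approx 33.3
```

The printed estimates (1080.5 between, 9982.8 within) are close to these target values of 1111.1 and 10000, which is what you would expect from a single simulated population.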
Correlation is between two variables. Are you referring to correlation or intra class correlation (ICC)?
I'm not familiar with correlation within a single variable, how are you defining that or calculating it?
PS. This may very well be due to inexperience on my part rather than anything incorrect you're doing or saying.
Thank you so much, this is indeed very helpful for the univariate case!
Sorry for the late reaction, but this really is a very nice piece for understanding the logic of the ICC.
Great work.
You need to specify the mean vectors and covariance for each cluster, as well as the relative proportion of observations to draw from each cluster. See "Simulate multivariate clusters in SAS" for an explanation, example, and SAS/IML code.
Thanks also to you, Rick, for your kind help. I am just about to get into the multivariate simulation. Very good blog entry for this.
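For the second step in the original question (drawing a simple random sample and a cluster sample and comparing them), something along these lines might serve as a starting point. It is only a sketch: it assumes the LOR data set and the &N macro variable from the code earlier in the thread, and a SAS/STAT release in which PROC SURVEYSELECT supports the CLUSTER statement. The names nClusters, nObs, srsSample and clusterSample and the seed value are made up for illustration.

```sas
/* Step 2 (sketch): draw an SRS and a cluster sample with the same number of individuals */
%let nClusters = 5;                      /* hypothetical: number of clusters to draw */
%let nObs = %eval(&nClusters * &N);      /* same number of individuals for the SRS */

/* Simple random sample of individuals */
proc surveyselect data=LOR out=srsSample method=srs
                  sampsize=&nObs seed=2024;
run;

/* Cluster sample: select whole clusters, keep all their members */
proc surveyselect data=LOR out=clusterSample method=srs
                  sampsize=&nClusters seed=2024;
   cluster SampleID;
run;

/* Mean and standard error under simple random sampling */
proc surveymeans data=srsSample mean stderr;
   var INCOME_SPE;
run;

/* Mean and standard error that account for the clustering */
proc surveymeans data=clusterSample mean stderr;
   cluster SampleID;
   var INCOME_SPE;
run;
```

With equal cluster sizes m, the usual design-effect approximation is 1 + (m - 1) * rho, which for m = 1000 and rho = 0.1 gives about 100.9. The cluster sample's standard error should therefore come out roughly ten times larger than that of the simple random sample, although with only a handful of clusters the estimate itself will be noisy.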