topic Re: Simulating clustered data with known correlation / Mixture of Normal Distributions in Statistical Procedures

Simulating clustered data with known correlation / Mixture of Normal Distributions

jrock — Sun, 10 Sep 2017 12:27:26 GMT

Dear SAS community,
I am currently trying to show the adverse effects of cluster sampling when the data within each cluster are correlated.

For this I want to generate a population of 500 clusters (say, city districts), where each cluster has 1000 correlated observations (say, the incomes of its inhabitants).
In a second step I want to draw samples from this population (simple random sample, cluster sample) and compare their characteristics.

I am struggling to create correlated data within the clusters.

So far I have played around with the RAND function to create clusters and data points (code below).

Any help on how to pre-define rho within the clusters would be very much appreciated.
I hope this question is not too basic. Thanks in advance!

/* Step 1: Generate a data set that contains 500 clusters with each having 1000 inhabitants */
%let N = 1000; /* cluster size */
%let NumSamples = 500; /* number of clusters */

data LOR;
call streaminit(123);
do SampleID = 1 to &NumSamples; /* ID variable for each LOR */
    do IND = 1 to &N;
        tetha = 1000+SampleID*10; /* Average Income */
        Lampda = 100; /* Std. Dev in Cluster */
        INCOME_SPE = rand("Normal",tetha,lampda);
        output;
    end; /* end of inhabitant loop */
end; /* end of cluster loop */
run;

Re: Simulating clustered data with known correlation / Mixture of Normal Distributions

Reeza — Sun, 10 Sep 2017 19:25:25 GMT

Correlation is defined between two variables. Are you referring to ordinary correlation or to the intraclass correlation (ICC)?

I'm not familiar with correlation within a single variable. How are you defining or calculating it?

PS. This may very well be due to inexperience on my part rather than anything incorrect you're doing or saying.

Re: Simulating clustered data with known correlation / Mixture of Normal Distributions

PGStats — Mon, 11 Sep 2017 04:46:09 GMT

I guess you could start with this:

/* Step 1: Generate a data set that contains 500 clusters with each having 1000 inhabitants */
%let N = 1000; /* Cluster size */
%let NumSamples = 500; /* Number of clusters */
%let Rho = 0.1; /* Intracluster correlation coefficient */
%let Std = 100; /* Std Dev within clusters */
%let Income = 1000; /* Average income */

data LOR;
call streaminit(123);
betweenStd = sqrt( (&Std)**2 * &Rho / (1-&Rho) );
do SampleID = 1 to &NumSamples; /* ID variable for each LOR */
    clusterIncome = rand("NORMAL", &Income, betweenStd);     
    do IND = 1 to &N;
        INCOME_SPE = rand("Normal", clusterIncome, &Std);
        output;
        end;
    end;
run;
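
For reference, the betweenStd line follows from the usual definition of the intracluster correlation in a one-way random-effects model, assuming the cluster means and the within-cluster deviations are independent. The symbols \sigma_b (between-cluster standard deviation) and \sigma_w (within-cluster standard deviation, &Std in the code) are notation introduced here for illustration:

```latex
\begin{align*}
\rho &= \frac{\sigma_b^2}{\sigma_b^2 + \sigma_w^2}
\quad\Longrightarrow\quad
\sigma_b^2 = \sigma_w^2 \,\frac{\rho}{1-\rho} \\[4pt]
\sigma_w = 100,\ \rho = 0.1:\quad
\sigma_b^2 &= 10000 \cdot \frac{0.1}{0.9} \approx 1111.1,
\qquad \sigma_b \approx 33.3
\end{align*}
```

So with these settings the simulation targets a between-cluster variance of about 1111 and a within-cluster variance of 10000.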

/* Estimate variance components */
proc varcomp data=lor;
class sampleId;
model income_spe = sampleId;
ods output Estimates=varEst;
run;

/* Estimate intracluster correlation coefficient */
data rhoEstimate;
set varEst(where=(VarComp="Var(SampleID)") rename=income_spe=betweenVar);
set varEst(where=(VarComp="Var(Error)") rename=income_spe=withinVar);
rhoEst = betweenVar / (betweenVar + withinVar);
label  
    betweenVar="Variance between clusters"
    withinVar="Variance within clusters"
    rhoEst = "Intracluster correlation estimate";
drop varcomp;
run;

proc print data=rhoEstimate label noobs; run;

                        Variance      Variance    Intracluster
                         between        within     correlation
                        clusters      clusters      estimate

                          1080.5        9982.8      0.097663
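
For the second step described in the question (drawing a simple random sample and a cluster sample and comparing them), a minimal sketch could look like the following. It assumes the LOR data set created above. The sample sizes, the seeds, and the data set names srsSample and clusterSample are illustrative choices and not part of the original question. Both designs are equal-probability, so no weights are used for the mean, and no finite population correction is applied.

```sas
/* (a) Simple random sample of 10,000 inhabitants from the whole population */
proc surveyselect data=LOR out=srsSample method=srs sampsize=10000 seed=123;
run;

/* (b) Cluster sample: 10 whole clusters, i.e. also 10,000 inhabitants */
proc surveyselect data=LOR out=clusterSample method=srs sampsize=10 seed=456;
   samplingunit SampleID;   /* select entire clusters instead of single rows */
run;

/* Mean income and its standard error under simple random sampling */
proc surveymeans data=srsSample mean stderr;
   var INCOME_SPE;
run;

/* Same estimate for the cluster sample; the CLUSTER statement tells the
   procedure that observations within a SampleID are not independent */
proc surveymeans data=clusterSample mean stderr;
   cluster SampleID;
   var INCOME_SPE;
run;
```

As a rough guide, the usual design-effect approximation 1 + (m - 1)*rho with m = 1000 and rho = 0.1 is about 100, so the standard error from the cluster sample should come out roughly ten times larger than the one from the simple random sample, even though both samples contain the same number of inhabitants.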

Re: Simulating clustered data with known correlation / Mixture of Normal Distributions

Rick_SAS — Wed, 13 Sep 2017 10:04:58 GMT

You need to specify the mean vectors and covariance for each cluster, as well as the relative proportion of observations to draw from each cluster. See "Simulate multivariate clusters in SAS" for an explanation, example, and SAS/IML code.
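
As a rough illustration of that recipe (not the code from the linked article), a two-variable mixture could be sketched as below. It requires SAS/IML. The variable names Income and Age, the three clusters, and all means, covariances, and proportions are made-up values for the example.

```sas
proc iml;
call randseed(123);

NumObs = 1000;                 /* total number of observations (illustrative) */
prob   = {0.5 0.3 0.2};        /* relative proportion of each cluster */
mu     = {1000 40,             /* row i = mean vector (Income, Age) of cluster i */
          1500 45,
          2200 50};
/* one 2x2 covariance matrix per cluster, stacked vertically */
Cov = {10000  300,   300 25} //
      {22500 -200,  -200 16} //
      {40000    0,     0 36};

N = RandMultinomial(1, NumObs, prob);   /* random cluster sizes that sum to NumObs */

x       = j(NumObs, 2, .);     /* simulated (Income, Age) pairs */
Cluster = j(NumObs, 1, .);     /* cluster membership */
b = 1;                         /* first row of the current cluster */
do i = 1 to ncol(prob);
   if N[i] > 0 then do;
      e = b + N[i] - 1;                          /* last row of the current cluster */
      S = Cov[(2*i-1):(2*i), ];                  /* covariance matrix of cluster i */
      x[b:e, ] = RandNormal(N[i], mu[i, ], S);   /* multivariate normal draws */
      Cluster[b:e] = i;
      b = e + 1;
   end;
end;

/* write the simulated mixture to a data set */
Z = Cluster || x;
create Mixture from Z[colname={"Cluster" "Income" "Age"}];
append from Z;
close Mixture;
quit;
```

Each cluster gets its own correlation structure through its covariance matrix, which is the multivariate counterpart of fixing rho in the univariate example.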

Re: Simulating clustered data with known correlation / Mixture of Normal Distributions

jrock — Sat, 16 Sep 2017 01:23:33 GMT

Thank you so much, this is indeed very helpful for the univariate case!
Sorry for the late reaction, but this really is a very nice piece for understanding the logic of the ICC.
Great work.

Re: Simulating clustered data with known correlation / Mixture of Normal Distributions

jrock — Sat, 16 Sep 2017 01:26:54 GMT

Thanks also to you, Rick, for your kind help. I am just about to get into the multivariate simulation, and the blog entry is very good for this.