Solved: Re: Simulating clustered data with known correlation / Mixture of Norm...

jrock · Posted 09-10-2017 08:27 AM

Dear SAS community,
I am currently trying to demonstrate the adverse effects of cluster sampling when the data within each cluster are correlated.

For this, I want to generate a population with 500 clusters (say, city districts), where each cluster has 1000 correlated observations (say, the income of each inhabitant).
In a second step, I want to draw samples from this population (simple random sample, cluster sample) and compare their characteristics.

I am struggling to create correlated data within the clusters.

So far I have played around with the RAND function to create clusters and data points (code attached).

Any help on how to predefine rho within the clusters would be very much appreciated.
I hope this question is not too basic. Thanks in advance!

/* Step 1: Generate a data set that contains 500 clusters with each having 1000 inhabitants */
%let N = 1000; /* sample size */
%let NumSamples = 500; /* number of samples */

data LOR;
call streaminit(123);
do SampleID = 1 to &NumSamples; /* ID variable for each LOR */
    do IND = 1 to &N;
        theta = 1000 + SampleID*10; /* Average income */
        lambda = 100; /* Std. Dev in cluster */
        INCOME_SPE = rand("Normal", theta, lambda);
        output;
    end;
end;
run;

PGStats · Posted 09-11-2017 12:46 AM

I guess you could start with this:

/* Step 1: Generate a data set that contains 500 clusters with each having 1000 inhabitants */
%let N = 1000; /* Cluster size */
%let NumSamples = 500; /* Number of samples */
%let Rho = 0.1; /* Intracluster correlation coefficient */
%let Std = 100; /* Std Dev within clusters */
%let Income = 1000; /* Average income */

data LOR;
call streaminit(123);
betweenStd = sqrt( (&Std)**2 * &Rho / (1-&Rho) );
do SampleID = 1 to &NumSamples; /* ID variable for each LOR */
    clusterIncome = rand("NORMAL", &Income, betweenStd);     
    do IND = 1 to &N;
        INCOME_SPE = rand("Normal", clusterIncome, &Std);
        output;
        end;
    end;
run;
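
The `betweenStd` line can be read as coming from a random-intercept model. That model is an assumption inferred from the data step, not something stated in the post: each cluster draws its own mean income, and individuals scatter around that mean. The symbols $\mu$, $u_i$, $e_{ij}$, $\sigma_b$ and $\sigma_w$ are notation introduced here for illustration only.

```latex
\begin{align*}
y_{ij} &= \mu + u_i + e_{ij}, \qquad u_i \sim N(0, \sigma_b^2), \quad e_{ij} \sim N(0, \sigma_w^2) \\
\rho &= \operatorname{Corr}(y_{ij}, y_{ik}) = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_w^2} \quad (j \neq k) \\
\sigma_b^2 &= \sigma_w^2 \, \frac{\rho}{1 - \rho}
\quad\Longrightarrow\quad
\sigma_b = \sqrt{100^2 \cdot \frac{0.1}{0.9}} \approx 33.3
\end{align*}
```

Here $\sigma_w$ corresponds to `&Std`, $\rho$ to `&Rho`, and $\sigma_b$ to `betweenStd`.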

/* Estimate variance components */
proc varcomp data=lor;
class sampleId;
model income_spe = sampleId;
ods output Estimates=varEst;
run;

/* Estimate intracluster correlation coefficient */
data rhoEstimate;
set varEst(where=(VarComp="Var(SampleID)") rename=(income_spe=betweenVar));
set varEst(where=(VarComp="Var(Error)") rename=(income_spe=withinVar));
rhoEst = betweenVar / (betweenVar + withinVar);
label  
    betweenVar="Variance between clusters"
    withinVar="Variance within clusters"
    rhoEst = "Intracluster correlation estimate";
drop varcomp;
run;

proc print data=rhoEstimate label noobs; run;

                        Variance      Variance    Intracluster
                         between        within     correlation
                        clusters      clusters      estimate

                          1080.5        9982.8      0.097663
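
As a quick sanity check, the printed estimates can be compared with the values the simulation was set up to produce. The target variances below are computed from `&Std = 100` and `&Rho = 0.1`. The estimated ratio is recomputed from the rounded figures in the table, so it can differ from the printed value in the last digits.

```latex
\begin{align*}
\text{target: } & \sigma_b^2 = 100^2 \cdot \tfrac{0.1}{0.9} \approx 1111.1, \quad \sigma_w^2 = 100^2 = 10000, \quad \rho = 0.1 \\
\text{estimated: } & \hat\rho = \frac{1080.5}{1080.5 + 9982.8} = \frac{1080.5}{11063.3} \approx 0.0977
\end{align*}
```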

PG
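
For the second step mentioned in the question (drawing a simple random sample and a cluster sample from `LOR` and comparing them), one possible sketch uses PROC SURVEYSELECT and PROC SURVEYMEANS. This is not part of the accepted answer. The sample sizes, the seed, and the output data set names `SRS` and `ClusterSample` are arbitrary choices for illustration, and the SAMPLINGUNIT statement assumes a SAS/STAT release that supports it.

```sas
/* Sketch only: sample sizes, seed and output names are illustrative assumptions */

/* Simple random sample of 5000 inhabitants from the whole population */
proc surveyselect data=LOR out=SRS method=srs sampsize=5000 seed=2017;
run;

/* Cluster sample: 5 whole clusters, i.e. 5 x 1000 = 5000 inhabitants */
proc surveyselect data=LOR out=ClusterSample method=srs sampsize=5 seed=2017;
    samplingunit SampleID;   /* select clusters rather than individuals */
run;

/* Mean income and its standard error under each design */
proc surveymeans data=SRS mean stderr;
    var INCOME_SPE;
run;

proc surveymeans data=ClusterSample mean stderr;
    cluster SampleID;        /* account for the clustering in the variance */
    var INCOME_SPE;
run;
```

Both samples contain the same number of inhabitants, so the difference in standard errors reflects the design. With equal cluster sizes m = 1000 and rho = 0.1, the usual design-effect approximation 1 + (m - 1) * rho gives about 100.9, so the cluster-sample standard error would be expected to be roughly ten times the simple-random-sample one, ignoring finite population corrections.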

Reeza · Posted 09-10-2017 03:25 PM

Correlation is between two variables. Are you referring to correlation or intraclass correlation (ICC)?

I'm not familiar with correlation within a single variable. How are you defining or calculating it?

PS. This may very well be due to inexperience on my part rather than anything incorrect you're doing or saying.

jrock · Posted 09-15-2017 09:23 PM

Thank you so much, this is indeed very helpful for the univariate case!
Sorry for the late reaction, but this really is a very nice piece for understanding the logic of the ICC.
Great work.

Rick_SAS · Posted 09-13-2017 06:04 AM

You need to specify the mean vectors and covariance for each cluster, as well as the relative proportion of observations to draw from each cluster. See "Simulate multivariate clusters in SAS" for an explanation, example, and SAS/IML code.

jrock · Posted 09-15-2017 09:26 PM

Thanks also to you, Rick, for your kind help. I am just about to get into the multivariate simulation. Very good blog entry for this.
