jrock
Calcite | Level 5

Dear SAS community,
I am currently trying to demonstrate the adverse effects of cluster sampling when the data within each cluster are correlated.

For this I want to generate a population of 500 clusters (say, city districts), where each cluster has 1000 correlated observations (say, the income of each inhabitant).
In a second step I want to draw samples (a simple random sample and a cluster sample) from this population and compare their characteristics.

I am struggling to create correlated data within the clusters.

So far I have played around with the RAND function to create clusters and data points (code attached).

Any help on how to predefine rho within the clusters would be very much appreciated.
I hope this question is not too basic. Thanks in advance!

/* Step 1: Generate a data set that contains 500 clusters with each having 1000 inhabitants */
%let N = 1000; /* sample size */
%let NumSamples = 500; /* number of samples */

data LOR;
call streaminit(123);
do SampleID = 1 to &NumSamples; /* ID variable for each LOR */
    do IND = 1 to &N;
        tetha = 1000+SampleID*10; /* Average Income */
        Lampda = 100; /* Std. Dev in Cluster */
        INCOME_SPE = rand("Normal",tetha,lampda);
        output;
        end;
    end;
run;

 

Accepted Solutions
PGStats
Opal | Level 21

I guess you could start with this:

 

/* Step 1: Generate a data set that contains 500 clusters with each having 1000 inhabitants */
%let N = 1000; /* Cluster size */
%let NumSamples = 500; /* Number of clusters */
%let Rho = 0.1; /* Intracluster correlation coefficient */
%let Std = 100; /* Std Dev within clusters */
%let Income = 1000; /* Average income */

data LOR;
call streaminit(123);
betweenStd = sqrt( (&Std)**2 * &Rho / (1-&Rho) );
do SampleID = 1 to &NumSamples; /* ID variable for each LOR */
    clusterIncome = rand("NORMAL", &Income, betweenStd);     
    do IND = 1 to &N;
        INCOME_SPE = rand("Normal", clusterIncome, &Std);
        output;
        end;
    end;
run;

/* Estimate variance components */
proc varcomp data=lor;
class sampleId;
model income_spe = sampleId;
ods output Estimates=varEst;
run;

/* Estimate intracluster correlation coefficient */
data rhoEstimate;
set varEst(where=(VarComp="Var(SampleID)") rename=income_spe=betweenVar);
set varEst(where=(VarComp="Var(Error)") rename=income_spe=withinVar);
rhoEst = betweenVar / (betweenVar + withinVar);
label  
    betweenVar="Variance between clusters"
    withinVar="Variance within clusters"
    rhoEst = "Intracluster correlation estimate";
drop varcomp;
run;

proc print data=rhoEstimate label noobs; run;

 

                        Variance      Variance    Intracluster
                         between        within     correlation
                        clusters      clusters      estimate

                          1080.5        9982.8      0.097663
PG
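
For the second step described in the question (drawing a simple random sample and a cluster sample from LOR and comparing them), one possible sketch uses PROC SURVEYSELECT and PROC SURVEYMEANS. The sample sizes (10 clusters, hence 10,000 inhabitants), the seed, and the data set names SRS and ClusterSample are assumptions made here for illustration; none of them come from the original posts. The sketch also assumes a SAS/STAT release that supports the CLUSTER statement in PROC SURVEYSELECT, and it reuses the macro variable N defined above.

```sas
/* Step 2 (sketch): draw two samples with the same number of inhabitants */
%let nClus = 10;                    /* assumed number of sampled clusters   */
%let nObs  = %eval(&nClus * &N);    /* matching number of inhabitants (SRS) */

/* Simple random sample of individual inhabitants */
proc surveyselect data=LOR out=SRS method=srs sampsize=&nObs seed=456;
run;

/* Cluster sample: select whole clusters and keep all their inhabitants */
proc surveyselect data=LOR out=ClusterSample method=srs sampsize=&nClus seed=456;
    cluster SampleID;
run;

/* Mean income and its standard error under each design */
proc surveymeans data=SRS mean stderr;
    var INCOME_SPE;
run;

proc surveymeans data=ClusterSample mean stderr;
    cluster SampleID;   /* tell the procedure that SampleID is the sampling unit */
    var INCOME_SPE;
run;
```

As a rough guide to what to expect, the usual design-effect approximation for equal cluster sizes is deff ≈ 1 + (m - 1)·rho. With m = 1000 and rho = 0.1 that is about 100.9, so the standard error from the cluster sample should be roughly ten times the one from the simple random sample of the same size.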


5 REPLIES
Reeza
Super User

Correlation is between two variables. Are you referring to correlation or intra class correlation (ICC)?

 

I'm not familiar with correlation within a single variable, how are you defining that or calculating it?

 

PS. This may very well be due to inexperience on my part rather than anything incorrect you're doing or saying. 
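
For reference, the intraclass correlation of a single variable can be written down for the one-way random-effects model that PGStats's DATA step simulates. This is a sketch of the standard definition; the symbols $\mu$, $u_i$, $e_{ij}$, $\sigma_b$ and $\sigma_w$ are introduced here for illustration and correspond to `&Income`, the cluster deviation behind `clusterIncome`, the individual deviation, `betweenStd` and `&Std` in that code.

```latex
\begin{aligned}
y_{ij} &= \mu + u_i + e_{ij},
  \qquad u_i \sim N(0, \sigma_b^2), \quad e_{ij} \sim N(0, \sigma_w^2) \ \text{independent} \\[4pt]
\operatorname{Cov}(y_{ij}, y_{ik}) &= \sigma_b^2 \quad (j \neq k),
  \qquad \operatorname{Var}(y_{ij}) = \sigma_b^2 + \sigma_w^2 \\[4pt]
\rho &= \operatorname{Corr}(y_{ij}, y_{ik}) = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_w^2}
  \quad\Longleftrightarrow\quad
  \sigma_b = \sigma_w \sqrt{\frac{\rho}{1 - \rho}}
\end{aligned}
```

So the correlation is between the incomes of two different inhabitants of the same cluster, who share the same $u_i$. With $\sigma_w = 100$ and $\rho = 0.1$ the target between-cluster variance is $10000 \cdot 0.1 / 0.9 \approx 1111.1$, which is in line with the estimate of 1080.5 reported by PROC VARCOMP.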

 

 

jrock
Calcite | Level 5

Thank you so much, this is indeed very helpful for the univariate case!
Sorry for the late reaction, but this really is a very nice piece for understanding the logic of the ICC.
Great work.

Rick_SAS
SAS Super FREQ

You need to specify the mean vectors and covariance for each cluster, as well as the relative proportion of observations to draw from each cluster. See "Simulate multivariate clusters in SAS" for an explanation, example, and SAS/IML code. 
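
A minimal SAS/IML sketch of that recipe, assuming three clusters and two variables. The proportions, mean vectors, covariance matrices, the variable names Income and Age, and the data set name MVClusters are all invented here for illustration and are not taken from the blog post.

```sas
proc iml;
call randseed(123);

/* Assumed mixing proportions and total number of observations */
p    = {0.5 0.3 0.2};
nObs = 1000;
n    = RandMultinomial(1, nObs, p);   /* 1 x 3 vector of cluster sizes */

/* Assumed mean vector and covariance matrix for each cluster */
mu1 = {1000 40};  S1 = {10000 300, 300 25};
mu2 = {1500 35};  S2 = { 8000 200, 200 16};
mu3 = {2500 50};  S3 = {20000 500, 500 36};

/* Draw each cluster from its own bivariate normal and stack the rows */
x = RandNormal(n[1], mu1, S1) //
    RandNormal(n[2], mu2, S2) //
    RandNormal(n[3], mu3, S3);

/* Cluster ID column that matches the stacked rows */
cluster = j(n[1], 1, 1) // j(n[2], 1, 2) // j(n[3], 1, 3);

/* Write the simulated data to a SAS data set */
out = cluster || x;
create MVClusters from out[colname={"Cluster" "Income" "Age"}];
append from out;
close MVClusters;
quit;
```

Each covariance matrix has to be symmetric and positive definite; the off-diagonal entries control how strongly the two variables move together within a cluster.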

 

 

jrock
Calcite | Level 5

Thanks also to you, Rick, for your kind help. I am just about to get into the multivariate simulation. Very good blog entry for this.
