slegleye
Fluorite | Level 6

Hello users.

 

I have a dataset with three variables: one continuous (P) and two dichotomous (E and Y). How can I simulate a new variable (continuous or dichotomous) that has specified correlations with P, E, and Y?

 

Does anyone know how to do this?

 

Best, 

sbxkoenk
SAS Super FREQ

Hi,

 

( Moved this post from 'Programming' to 'Statistical Procedures' board )

Start here :

 

 

 

Maybe @Rick_SAS wants to add something?

 

Thanks,

Koen

slegleye
Fluorite | Level 6

Thank you for your quick answer.

Unfortunately, all these programs simulate every variable. My problem is that I would like to simulate only one variable, with given correlations to actual variables that already exist in a dataset.

 

Do you know a way to do it?

 

Best, 

sbxkoenk
SAS Super FREQ

Hello,

 

In that case it becomes a kind of optimization.

Do you have SAS Optimization in SAS VIYA 3.x or 4 or SAS/OR (Operations Research) in SAS 9.4?

 

It's possible with PROC OPTMODEL, like in this blog :

Creating Synthetic Data with SAS/OR
By Jared Erickson on Operations Research with SAS May 17, 2017
https://blogs.sas.com/content/operations/2017/05/17/creating-synthetic-data-sasor/

 

Ciao,

Koen

slegleye
Fluorite | Level 6

Dear Koen. 

 

Thank you for your answer. Unfortunately, I do not have SAS VIYA 3.x or 4 or SAS/OR (Operations Research) in SAS 9.4. 

In addition, I am not sure that I understand correctly the code you mention. 

 

Sorry,

Rick_SAS
SAS Super FREQ

For one variable, you can do it. The geometry and SAS code for creating a variable that has a specified correlation with another variable are shown in the article "Find a vector that has a specified correlation with another vector" on The DO Loop blog.

 

For multiple variables, you cannot always find a vector that has the specified correlations. The set of possible correlations with a set {x1, x2, x3} is determined by the geometry of those vectors and the correlations between the variables. The fact that two of your variables are dichotomous (0/1) further restricts the possible correlations that an arbitrary vector can make with the set {x1, x2, x3}.

 

For example, if x1 and x2 are highly correlated, you cannot find a vector that is highly correlated with x1 but is uncorrelated with x2.  Similarly, if x1 and x2 are uncorrelated, you cannot find a vector that is highly correlated with x1 and with x2. So to even get started on this problem we would need to know the correlation matrix for the x_i.
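
To make the geometric restriction concrete, here is a minimal SAS/IML sketch. It is not the code from the blog article; it is one way to extend the projection idea to three variables, under stated assumptions. If R is the correlation matrix of x1-x3 and rho is the vector of target correlations, a vector with exactly those sample correlations exists only when q = rho`*inv(R)*rho is at most 1. When it is, a continuous vector can be built from the standardized x's plus orthogonal noise. The toy data, the seed, and the target values are placeholders, and the result is continuous, so a dichotomous variable would need a different approach.

```sas
proc iml;
call randseed(12345);
n = 100;

/* Toy stand-ins for the real data (assumption): two 0/1 columns, one continuous */
x1 = j(n, 1, .);  call randgen(x1, "Bernoulli", 0.4);
x2 = j(n, 1, .);  call randgen(x2, "Bernoulli", 0.5);
x3 = j(n, 1, .);  call randgen(x3, "Normal", 65, 4);
X  = x1 || x2 || x3;

rho = {0.04, -0.5, 0.2};      /* hypothetical target correlations with x1, x2, x3 */

R = corr(X);                  /* correlations among the existing variables */
b = solve(R, rho);            /* b = inv(R)*rho */
q = rho` * b;                 /* quadratic form rho`*inv(R)*rho */
print q[label="q (targets are feasible only if q <= 1)"];

if q <= 1 then do;
   /* Center each column and scale it to unit length, so that Z`*Z = R */
   Z = X - mean(X);
   Z = Z / sqrt(Z[##, ]);

   /* Random noise, made orthogonal to the intercept and to Z, then unit length */
   e = j(n, 1, .);  call randgen(e, "Normal");
   A = j(n, 1, 1) || Z;
   e = e - A * solve(A`*A, A`*e);
   e = e / sqrt(ssq(e));

   /* x4 has mean 0, unit length, and Z`*x4 = rho */
   x4 = Z*b + sqrt(1 - q)*e;

   C = corr(X || x4);
   print (C[4, 1:3])[label="Achieved correlations with x4" colname={"x1" "x2" "x3"}];
end;
else print "No vector has these correlations with x1-x3.";
quit;
```

Changing the seed gives a different x4 with the same sample correlations, because only the orthogonal noise component changes.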

 

So, I ask you to explain more about the source of these variables and the correlations. How do you know that there is a solution for the correlations that you are using? Are you starting from real data? Do you have an empirical correlation from some data that contains an actual Y variable? Are you trying to simulate new Y_i that are related to the x_i in a way that is similar to Y's relationship with the x_i?

 

slegleye
Fluorite | Level 6

Dear Rick. 

 

Thank you for your interest.

In fact, I intend to test the robustness of a causal estimate: does E, the binary exposure, really cause Y, the binary outcome? The data come from a real data set, a survey on youth (n=21000). I have E, Y, and Ps, a propensity score that models E conditional on many covariates X. E and Y are binary, but Ps is continuous. The correlations between E, Y and Ps are given by the data.

The covariates X are also observed in the survey, but what about an unobserved covariate U? It is still possible that my results based on Ps and E are biased because Ps does not include U. 

 

I want to simulate a U with given correlations with E and Y, but with zero correlation with Ps. 

 

Best, 

MichaelL_SAS
SAS Employee

I think this approach, where you simulate only the values of an unmeasured confounder based on the observed data, is likely to run into issues. Namely, if you simulate the U values based on the observed exposure E and outcome Y, you might be able to create the desired correlations, but the causal relationships that produce them are unlikely to correspond to U being an unmeasured confounder. For U to be an unmeasured confounder, it would have to be a common cause of E and Y. However, in your simulation E and Y are already known, so U cannot have a truly causal effect on them. The correlation would therefore come from E or Y affecting U, in which case the causal relationships are the reverse of what you want, and U would not be an unmeasured confounder.

 

I think it is fair to say that how to best perform sensitivity analyses for the effect of unmeasured confounding in observational studies is not a settled question. There are a wide variety of methods discussed in the literature. I think in the case of a binary outcome with the effect measured on the relative risk scale, the E-value as described by VanderWeele and Ding might be the most commonly suggested approach. I believe the appendix to their original paper had example SAS code for the computation of E-values and they have since made a web-app for computing E-values. There are also approaches that are specific to methods like propensity score matching, there is an example of this in the PROC PSMATCH documentation. There are also other approaches where the measured confounders are used to provide some basis for judging what the effect of an unmeasured confounder might be by seeing the effect of omitting each of the measured confounders from the adjustment set. 
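
For reference, the E-value for a risk ratio has a simple closed form: RR + sqrt(RR*(RR - 1)) when RR >= 1, with RR replaced by 1/RR when it is below 1. The DATA step below is only a sketch of that formula. It is not the code from the paper's appendix, and the risk ratios are made-up values, not estimates from this thread.

```sas
/* E-value for a point estimate on the risk-ratio scale (illustrative values only) */
data evalue;
   input rr;
   rr_star = ifn(rr < 1, 1/rr, rr);                  /* fold protective RRs above 1 */
   evalue  = rr_star + sqrt(rr_star*(rr_star - 1));  /* closed-form E-value */
   datalines;
1.5
2.0
0.5
;
run;

proc print data=evalue noobs;
run;
```

The same formula can be applied to the confidence limit closest to 1 to judge how much unmeasured confounding would be needed to move the interval to include the null.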

slegleye
Fluorite | Level 6

Dear Michael (I hope that is the correct spelling).

 

Thanks for your answer. Your remark about the very nature of U in my simulation task is interesting and points to the difficulty of the problem. My intention really is to simulate an unobserved confounder of E and Y. My understanding of the situation is this: if I simulate a U with chosen correlations with E and Y but no correlation with the propensity score Ps, it would have exactly the observed properties that I would see if U were a true confounder that generates E and Y (but not Ps). With such a U, I would be able to compute a causal effect of E on Y (conditional on U and on the covariates X that compose Ps). By varying the correlations between U and E and between U and Y, I would be able to determine the correlations that are sufficient to explain away the effect of E on Y (the effect estimated without U, that is, only on observables). The fact that U is not correlated with Ps ensures that U is the extra unmeasured variable that is sufficient to do it. 

 

I do not see which properties of a genuine confounder of E and Y I would miss with this method.

 

I agree that the literature on sensitivity is abundant and proposes various approaches. I know the meaning and the computation of the E-value by VanderWeele and Ding. But the E-value is a simplification in the sense that it relies on a U with equal correlations (risk-ratios) with E and Y. I would like to make the correlations between U and E and U and Y independent.

 

Best, 

 

another propensity score with all the information: the covariates X that compose Ps and U. The derived estimate of the effect of E on Y would be the "true" causal effect of E on Y. 

MichaelL_SAS
SAS Employee

Sorry for the delay in responding. 

 

I think the issue I see with the simulation approach is maybe best described with some of the notation from causal diagrams. Given that E and Y values are set, if you are simulating a value of U with the desired correlations given the observed data, the causal structure would likely be one where E->U<-Y, which would make U a collider on a pathway between E and Y. For U to be a common cause of E and Y (and therefore a confounder) you would need the direction of those arrows to reverse and have E<-U->Y, something that I don't think is really possible given the fixed values of E and Y. Note that the documentation for the CAUSALGRAPH procedure provides some more details on graphical causal models, and there is this 2019 SGF paper that also discusses the collider issue in example 2. 

 

In the case where U is a collider, comparing effect estimates that do and do not include it in the adjustment set studies the effect of inappropriately adjusting for a collider (doing so opens up a non-causal pathway between E and Y) instead of the effect of not adjusting for an unmeasured confounder (which leaves a non-causal pathway unblocked). In a sense, the different assumptions about U result in analyses that are mirror images of one another: one assumes your current adjustment set is correct and would be made incorrect by incorporating U, whereas the other assumes your current adjustment set is incorrect and would be made correct by incorporating U. 
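
A small simulation can illustrate the two structures. The sketch below is not based on the survey data. All coefficients, the sample size, and the variable names are arbitrary choices for illustration. In the first half, U causes both E and Y and E has no effect on Y. In the second half, E and Y are independent and both cause U.

```sas
data sim;
   call streaminit(2024);
   do i = 1 to 20000;
      /* Confounder structure E <- U -> Y: no direct effect of E on Y */
      u_conf = rand("Bernoulli", 0.5);
      e1 = rand("Bernoulli", 1/(1 + exp(-(-1 + 2*u_conf))));
      y1 = rand("Bernoulli", 1/(1 + exp(-(-1 + 2*u_conf))));

      /* Collider structure E -> U <- Y: E and Y generated independently */
      e2 = rand("Bernoulli", 0.5);
      y2 = rand("Bernoulli", 0.5);
      u_coll = rand("Bernoulli", 1/(1 + exp(-(-2 + 2*e2 + 2*y2))));
      output;
   end;
   drop i;
run;

/* Confounder: crude model vs. model adjusted for U */
proc logistic data=sim; model y1(event='1') = e1;        run;
proc logistic data=sim; model y1(event='1') = e1 u_conf; run;

/* Collider: crude model vs. model adjusted for U */
proc logistic data=sim; model y2(event='1') = e2;        run;
proc logistic data=sim; model y2(event='1') = e2 u_coll; run;
```

Under these assumptions, the coefficient of e1 should be clearly positive in the crude model and close to zero after adjusting for u_conf, whereas the coefficient of e2 should be close to zero in the crude model and move away from zero after adjusting for u_coll.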

 

slegleye
Fluorite | Level 6

Thank you for your answer. 

You get the point: it is all about the assumption regarding the causal role of U on E and Y.

I explicitly want to simulate a U that causes E and Y (as given in the dataset) but independently of the observed covariates X, and not a collider. If U is a collider, then my current estimation that ignores U is correct; but if U is a confounder, it is not. 

In the simulation I cannot impose a direction (a causality) but only a correlation structure between U, E and Y (and X). If there is causality (U-->E and U-->Y), then I would observe the correlation structure that I simulate. More precisely, I intend to simulate all that is needed to get my current estimation right without X but false if U is a confounder, and to estimate the amount of correlation (causal role) of U with E and Y that would produce a null causal effect of E on Y when U is taken into account.

 

To be honest, I studied causal diagrams and Pearl's theory quite in detail but what I want to do is not in the textbooks I know. But I think it is relevant. 

 

Best, 

Ksharp
Super User
Maybe the OP wants to simulate data that conform to the correlation coefficients in a published paper.
Ksharp
Super User

Here is a way to do it with a genetic algorithm.

As Rick said, it is not guaranteed to find a solution.

And if you have a lot of data, it could take a long time to get a result.

 

/*
x1 is a binary variable,
x2 is a binary variable,
x3 is a continuous variable.

Need to create a new variable x4 whose correlation with x1 is 0.04, with x2 is -0.5, and with x3 is 0.2.
*/

%let corr_x4=  0.04    -0.5  0.2;  




data have(keep=x1 x2 x3);
set sashelp.heart(keep=status sex height obs=100);
x1=ifn(status='Dead',1,0);
x2=ifn(sex='Male',1,0);
rename height=x3;
run;

proc iml;
use have nobs nobs;
read all var {x1 x2 x3};
close;

start function(x) global(x1 ,x2 ,x3,corr_x4);
 all=x1||x2||x3||t(x); 
 corr=corr(all);
 sse=ssq(corr[4,1:3]-corr_x4) ;
 return (sse);
finish;

corr_x4={&corr_x4.};

bounds=j(2,nobs,-1000);
bounds[2,]=1000 ;    

id=gasetup(1,nobs,123456789);
call gasetobj(id,0,"function");
call gasetsel(id,10,1,.95);
call gainit(id,10000,bounds);


niter =  100 ;
do i = 1 to niter;
 call garegen(id);
 call gagetval(value, id);
end;
call gagetmem(mem, value, id, 1);

x4=t(mem);

create want var {x1 x2 x3 x4};
append;
close;

print value[l = "Min value (SSE): the closer to zero, the better"] ;
call gaend(id);
quit;


proc corr data=want pearson;
var x1 x2 x3 x4;
run;


 
