02-11-2014 05:42 AM
Hi, I have to solve a problem which is a little bit confusing for me. Is there anybody here who can help me? I have to correlate two binary data sets.
The problem is:
We have 100 independent variables which have only two situations: On or Off.
We have 1000 dependent variables which also have only two situations: On or Off.
In the first experiment we see dependent variables based on our given independent setup.
In a second experiment we change the independent variables to another setup and different dependent variables go on.
Which model can be used to predict the outcome of the dependent variables based on a third given independent variable configuration and the results from the two previous experiments?
02-11-2014 08:51 AM
PROC CATMOD could certainly be used in this situation, but I am very skeptical that there's a good answer here with 100 independent binary variables. When you have lots of independent variables, there's bound to be some correlation between them, and this greatly inhibits your ability to get a good predictive model. That is true even with continuous variables; with binary variables, I think the situation would be worse. Furthermore, I can't imagine how 100 independent binary variables could do a good job of predicting 1000 dependent binary variables. Who knows, maybe your data has such strong relationships that such a prediction would actually work, but as I said, I am very skeptical.
To move forward, it would certainly be a good idea if you explained this experiment in more detail.
02-11-2014 09:56 AM
Which kind of statistical modeling should I implement? Is PROC CATMOD a classification model? I want to use R to implement the model. The problem is that there are many more dependent binary variables than independent variables.
02-11-2014 02:19 PM
Which kind of statistical modeling should I implement?
My point is that the design of this study may prevent you from finding a well-fitting model, and you might want to reconsider the design.
I want to use R to implement the model.
Whether you use R or SAS, the model seems to be much less important at this time than getting the design right.
But why are you asking about this in a SAS forum?
02-12-2014 09:06 AM
100 independent binary variables cannot possibly span the space of interest unless you have 2**100 data points in your design, and if you have fewer points, there is a major likelihood that your independent variables will be correlated with one another, thus dramatically increasing the mean square error of your parameter estimates.
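To put rough numbers on that, here is a minimal Python sketch (standard library only). It is an illustration under assumptions, not an analysis of the actual study: the factor settings are drawn at random, and the helper names `pearson` and `max_abs_correlation` are made up for this example. A full factorial in 100 two-level factors needs 2**100 runs (about 1.27e30), and with only a few runs some pairs of factor columns end up strongly correlated. With just two runs, any two factors that both change between the runs are perfectly correlated or anti-correlated.

```python
import itertools
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences.

    Returns 0.0 when either sequence is constant, where the correlation
    is undefined (a simplification chosen for this sketch).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def max_abs_correlation(n_runs, n_factors, seed=0):
    """Largest |correlation| between any two randomly set On/Off factors."""
    rng = random.Random(seed)
    # One column of 0/1 settings per factor, one entry per experimental run.
    columns = [[rng.randint(0, 1) for _ in range(n_runs)]
               for _ in range(n_factors)]
    return max(abs(pearson(a, b))
               for a, b in itertools.combinations(columns, 2))


# Number of runs in a full factorial design with 100 two-level factors.
full_factorial_runs = 2 ** 100

if __name__ == "__main__":
    print("full factorial runs:", full_factorial_runs)
    for n_runs in (2, 20, 200):
        print(n_runs, "runs -> max |r| =",
              round(max_abs_correlation(n_runs, 100), 3))
```

The worst-case pairwise correlation generally shrinks as runs are added, but random settings will not drive it to zero; that takes a deliberately constructed design.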
As an alternative, you would need some sort of major fractional factorial design just to ensure your estimates are balanced and not correlated with each other.
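As one possible illustration of such a design (my choice of construction, not necessarily the one you would use in practice), the Python sketch below builds a two-level design from a Sylvester-type Hadamard matrix. The function name `sylvester_design` is hypothetical. For 100 factors it gives 128 runs in which every factor is On in exactly half the runs and every pair of factor columns is orthogonal. Note the assumption built into this kind of saturated design: main effects are uncorrelated with each other, but they are aliased with two-factor interactions.

```python
def sylvester_design(n_factors):
    """Balanced, pairwise-orthogonal two-level design.

    Entry (run, col) of a Sylvester Hadamard matrix of order 2**m is
    (-1) ** popcount(run & col). Column 0 is all +1, so it is skipped;
    columns 1..n_factors are used as factor settings (+1 = On, -1 = Off).
    """
    m = 1
    while 2 ** m - 1 < n_factors:
        m += 1
    n_runs = 2 ** m
    return [
        [1 if bin(run & col).count("1") % 2 == 0 else -1
         for col in range(1, n_factors + 1)]
        for run in range(n_runs)
    ]


if __name__ == "__main__":
    design = sylvester_design(100)
    print("runs:", len(design), "factors:", len(design[0]))
    # First run of the design written as On/Off settings.
    print(["On" if level == 1 else "Off" for level in design[0]][:10])
```

Real software for design of experiments offers many other fractional factorial and Plackett-Burman designs; the point here is only that 128 planned runs can give uncorrelated main-effect estimates, which 128 arbitrary runs generally cannot.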
But why do you need 100 independent binary variables? And can you really vary 100 independent binary variables in your study?
And how do you expect 100 independent binary variables to predict 1000 dependent binary variables? Are these dependent variables all highly correlated with one another? If so, then this might work, but do you know if the dependent variables are correlated with each other? (An example of non-binary dependent variables that are highly correlated is spectra, and so in that case you could possibly predict 1000 dependent variables using 100 independent variables.)
But anyway, without more details, it seems like your study is: collect huge amounts of data, throw it into SAS and see what the results are; I think there are better ways to go about this.