SASMiner
Calcite | Level 5

I am new to this thread and was hoping someone could help with the following problem.

I have a multivariate dataset where each of the 100 variables is measured in the same unit.

My intention is to run PROC PRINCOMP or PROC FACTOR to create 100 independent variables, on which I believe I can then run univariate ANOVAs to find out which other variables in my dataset best discriminate this information.

If I run PROC FACTOR I can see how much of the total variation each factor explains. However, when I use PROC SCORE to create factor scores, each of the factor scores is standardised to variance 1. This means that in the multiple univariate ANOVAs I am doing, each factor score has the same weight even though they account for different proportions of the variance explained. It appears to me that if I multiply the factor scores by the square root of the eigenvalue, it would give me what I require: independent variables whose variance reflects the variation they explain. Is this correct?

Alternatively, I can run PRINCOMP, and I think this would give the same answer as above after the multiplication.

What strikes me about the above is that I have to write code to do this after the factor analysis. Is there a simple way to get non-standardised scores? Perhaps, because it is not there as an option, it is inappropriate to do this.
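[To make the scaling question concrete, here is a minimal numpy sketch on made-up data, rather than SAS, since the algebra is what matters: standardised component scores multiplied by the square root of the eigenvalue recover scores whose variance equals the eigenvalue, i.e. scores weighted by the variation they explain.]

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))  # correlated toy data
Xc = X - X.mean(axis=0)

# eigendecomposition of the covariance matrix, sorted by eigenvalue
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

raw_scores = Xc @ eigvecs                   # variance of column j = eigenvalue j
std_scores = raw_scores / np.sqrt(eigvals)  # standardised: every column has variance 1
rescaled = std_scores * np.sqrt(eigvals)    # multiply back by sqrt(eigenvalue)

print(np.allclose(raw_scores.var(axis=0, ddof=1), eigvals))  # True
print(np.allclose(rescaled, raw_scores))                     # True
```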

Also, PRINCOMP does not do rotations. In this example the dataset is only 100 variables, but what if it was much more than this and I wanted to extract only a few components, but for those to maximise the variance explained? PROC FACTOR can do this, but PRINCOMP does not have the varimax option. Do I have no choice but to do it in PROC FACTOR and adjust the scores as I have detailed?

1 ACCEPTED SOLUTION

Accepted Solutions
SteveDenham
Jade | Level 19

And to follow on, PLS will give the approximate proportion of the response variance explained by each of the factors/components, which I believe is another one of your requests. Take a look at Example 70.3 in the PROC PLS documentation, where 3 dependent variables are predicted from 30 independent measures. There is also complete dependency between the three dependent variables (tot = tyr + try).

Steve Denham


10 REPLIES
SteveDenham
Jade | Level 19

I am having difficulty comprehending the first task.  You say you have 100 variables, and you want to create 100 independent variables and then analyze those.  Why?  I have always thought of principal components/factor analysis as a variable reduction mechanism where you started with a large number of variables, and reduced them to a much smaller number of independent (principal components) or nearly independent (factors) variables that explained the bulk of the variability in the original data.

So moving on to the end, check out PROC PLS which gives a variety of methods of fitting a response variable or variables to a large number of predictor variables.  It strikes me as a much better way of working on this, plus it uses cross-validation in fitting.  At least give the documentation and some of the examples a look through to see if it might not be a better approach than testing and testing and testing and testing...

Steve Denham

SASMiner
Calcite | Level 5

Thanks for taking time to respond to this.

I will look at PLS but I would like to pursue the differences between FACTOR and PRINCOMP routines so would still welcome any advice on the differences.

The reason I am running the analysis is to create independent variables so that I can 'add up' the separate ANOVAs for all 100 independent variables, to find out, for example, how much more variation GENDER explains than AGE. I think I can't do this on the original 100 variables as there are correlations, so I have to do this first stage. I agree PRINCOMP is mainly used as a data reduction technique, and my last paragraph asks how I can use something like 'varimax' with it.

SteveDenham
Jade | Level 19

But the independent variables are not going to be nicely defined like GENDER or AGE.  They will be linear combinations of the 100 observed variables.  And even if they were completely independent, you can't "add up" the separate ANOVAs into anything meaningful. Scaling issues and interactions that are completely ignored come into play.

PROC PLS will give the proportion of variance explained by each of the "underlying" variables.  Check out Example 70.3 Choosing a PLS Model by Test Set Validation, in which three response variables are predicted from three or four factors calculated from 30 input variables.

Steve Denham


SASMiner
Calcite | Level 5

Apologies for the delay in replying to this but I have not been around.

Perhaps a hypothetical example may help to explain what I am trying to achieve.

Suppose I am trying to see which of sex, age and other demographics are important in determining income across some professions.

If I have something like:

Salary (£000s):

Company   Director   Manager   Deputy Manager
1         112        60        30
2         234        70        35
3         321        30        15
4         450        26        13
5         130        16        8

These are all measured on the same basis (i.e. pounds), and I want to give weight to the size of salary.

Using a factor analysis would remove correlated variables, so effectively the Deputy Manager column would be dropped, as it is perfectly correlated with Manager (always half).

Then I could run ANOVAs on Director and Manager to see how much sex explains of Var(Director) and Var(Manager), and add these:

(SS Sex(Director) + SS Sex(Manager)) / (SS Total(Director) + SS Total(Manager))

Is this not effectively what a MANOVA does?

Would PROC PLS still be more suitable than this?
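[The pooled sums-of-squares ratio above can be sketched in numpy on hypothetical salary data (not SAS, but the arithmetic is the same). Note the pooled proportion is only interpretable when the responses are uncorrelated and on the same scale, which is the setup described.]

```python
import numpy as np

# hypothetical data: two salary responses (£000s), grouped by sex
rng = np.random.default_rng(3)
sex = np.repeat([0, 1], 20)
director = 100 + 10 * sex + rng.normal(0, 5, 40)
manager = 50 + 2 * sex + rng.normal(0, 5, 40)

def one_way_ss(y, group):
    """Between-group and total sums of squares for a one-way ANOVA."""
    grand = y.mean()
    ss_total = ((y - grand) ** 2).sum()
    ss_between = sum(len(y[group == g]) * (y[group == g].mean() - grand) ** 2
                     for g in np.unique(group))
    return ss_between, ss_total

ssb_d, sst_d = one_way_ss(director, sex)
ssb_m, sst_m = one_way_ss(manager, sex)

# the proposed pooled proportion of variation explained by sex
pooled = (ssb_d + ssb_m) / (sst_d + sst_m)
print(pooled)
```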

PaigeMiller
Diamond | Level 26

The whole idea of using either FACTOR analysis or PRINCIPAL COMPONENTS analysis to determine "independent" predictor variables, and then try to use them to predict a response ... well, it's an old idea, and it probably worked well in some circumstances, and didn't work well in other circumstances, but there are better ideas now, specifically PARTIAL LEAST SQUARES (also called PLS).

But beyond that, I don't really know what your goals are. And I am having trouble with your simplified example; it seems you have simplified it way too much, so that you have a single predictor variable called sex. In any event, if you have only a single predictor variable, neither PLS nor FACTOR nor PCA will really be of use. You have morphed from multivariate X variables to multivariate Y variables.

You have spoken several times about determining how much a predictor explains of the response, and the problem is that when your original predictors (not the PCA or FACTOR or PLS predictors) are correlated, there really is no answer to this question. No answer from a logical point of view, ignoring any statistical methods; and similarly, statistical methods cannot answer this question either.

--
Paige Miller
SASMiner
Calcite | Level 5

Perhaps what I am trying to discover benefits from more explanation.

I have more variables than sex (e.g. age, qualifications etc).

If I had just the Company Director data, I could run an ANOVA and might find that, for separate main effects, age explains 50% of the variation, sex 10% and qualifications 10%, so I could give age, say, 5 times more importance in whatever I am doing.

If I include Manager, how do the calculations go? Can I just formulate something with sums of squares, as in my previous reply?

Now if I add in Deputy Manager, in some ways I am double counting, as there is perfect correlation here. In some ways it is redundant information. As proof, if I did separate ANOVAs for Deputy Manager and Manager, I would get the same percentage accounted for by sex, for example.

PaigeMiller
Diamond | Level 26

Perhaps it would benefit me if we can state clearly which are the independent (predictor or X) variables and which are the dependent (response or Y) variables.

If I understand you properly, then sex, age, qualifications and etc. are the independent (predictor or X) variables, and the salary for director, manager and deputy manager are the dependent (response or Y) variables.

Now, if this is correct (please confirm), then we have a situation where we have both multivariate X and multivariate Y variables. (If this is not correct, then would you please state what is correct?)

I believe that ANOVA won't help you here, and that your idea of saying age affects 50% of the variation in salary, sex affects 10% of the variation in salary, and so on, is problematic. I don't believe you can obtain such a breakdown unless you have an orthogonal set of independent (predictor or X) variables. The reason I say this is that your independent (predictor or X) variables are correlated with one another. And in a situation where these are correlated, you cannot give an independent estimate (that is, one uncorrelated with other similar estimates) of how much age affects the variation in salary.
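[A small numpy sketch of this point, on made-up data: when two predictors are correlated, the variance "explained by" x1 depends on whether x2 is already in the model, so there is no unique decomposition to add up.]

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)   # x2 correlated with x1
y = x1 + x2 + rng.normal(size=n)

def r2(y, predictors):
    """R-squared from an ordinary least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

r2_x1_alone = r2(y, [x1])                      # x1 entered first
r2_x1_after = r2(y, [x1, x2]) - r2(y, [x2])    # x1 entered after x2

# the two "shares of variance for x1" are very different numbers
print(r2_x1_alone, r2_x1_after)
```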

Which is why both Steve and I keep recommending partial least squares analysis. It was designed to provide better estimates in the case where your independent (predictor or X) variables are correlated with one another. PLS will also handle the case where your deputy manager's salary is perfectly correlated with the manager's salary. PLS can analyze cases where you have multivariate X variables that are correlated with one another, and you also have multivariate Y variables that are correlated with one another.

I'm still not clear on your end goals. You state "... so I could give age, say 5 times more importance in whatever I am doing." Are you talking about some mathematical or statistical procedure where you give age 5 times the importance? Or are you talking about a decision making issue, where math and statistics are not used? Can you elaborate?

--
Paige Miller
SteveDenham
Jade | Level 19

And to follow on, PLS will give the approximate proportion of the response variance explained by each of the factors/components, which I believe is another one of your requests. Take a look at Example 70.3 in the PROC PLS documentation, where 3 dependent variables are predicted from 30 independent measures. There is also complete dependency between the three dependent variables (tot = tyr + try).

Steve Denham

SASMiner
Calcite | Level 5

Thanks for the response (and to Steve).

I thought that factor analysis would give me an orthogonal set of data on which I could do the multivariate analysis.

However, from your reply and Steve Denham's subsequent reply I definitely need to look at PLS and will do so.

[I can't elaborate too much on my project and the reason for assigning importance, except to say that if I was, for example, weighting across multiple repeated surveys, it would be more important to do so for one variable (whether by 5 or sqrt(5)), as this is where the variability can be seen and so will give better estimates.]

PaigeMiller
Diamond | Level 26

I thought that factor analysis would give me an orthogonal set of data on which I could do the multivariate analysis.

Yes, it will give orthogonal data (new predictors) on which you could do the analysis.

Our point is that there are better techniques. Why? Because both Factor Analysis and PCA determine the orthogonal vectors without regard to their ability to predict Y; you can see examples in the literature, or conjure up your own, where the first (and second and third and ...) vectors are nearly uncorrelated with the Y variables and thus are not predictive of Y. PLS avoids this possibility, if there are vectors of the predictor variables that are "predictive" of Y.

PLS also gives you orthogonal linear combinations of your X variables, but does so in a way that makes them "predictive" of Y.

And while I realize you can't explain your project in detail, your example with salary reverted to using the individual predictors by themselves, throwing us off the trail. This is not what PLS or Factor Analysis or PCA do. They create orthogonal linear combinations of your predictors, and so your new orthogonal predictors are a weighted sum of ALL of your predictors. It is not clear from your writing that you understand this aspect of what these procedures do.

--
Paige Miller


Discussion stats
  • 10 replies
  • 1877 views
  • 6 likes
  • 3 in conversation