ChuksManuel
Pyrite | Level 9

Hello statisticians,

I currently have two datasets.

In the first dataset, I carried out a PCA and I want to retain 3 principal components. This dataset has 750 people (subject IDs).

The second dataset has 200 people (subject IDs) and contains my outcomes of interest.

I would normally merge the two datasets by subject ID. The problem is that the final dataset would contain only 200 subjects (since the second dataset has 200 people), and I am afraid that redoing the PCA on it would change the principal components I originally got, which look pretty good.

My question therefore is: is there a way I can retain the values of the 3 principal components from the first dataset, so that when I merge the two datasets I can still use those values in my analysis?

I would appreciate any response.


9 REPLIES
PaigeMiller
Diamond | Level 26

It's not clear to me what statistical goal you are trying to achieve ... I understand it has something to do with merging results, but I'd appreciate more explanation of the statistical analysis that you'd like to perform once this merge happens.

--
Paige Miller
ChuksManuel
Pyrite | Level 9

Thanks Paige.

I intend to merge 2 datasets by subject ID, which is unique. One has 750 observations and the second has 200 observations.

I did a PCA on the first dataset (diet data) and it gave me three components. The second dataset has all my variables of interest.

My worry is that if I merge the two datasets by subject ID, the merged dataset will have about 200 observations. If I then did a PCA on those 200 observations, my components would change, or I might not see any pattern at all, or the variance for the first component would decrease.

My question is: is there a way I can retain the values of the PCA obtained from the first dataset (which has 750 people) for use in the merged dataset (which will have 200 people)?

PaigeMiller
Diamond | Level 26

@ChuksManuel wrote:

My question is: Is there a way i can retain the values of the PCA gotten from the first data (which has 750 people) for use in the merged dataset (which will have 200 people)?

 


It's still not clear what the "use" is that you intend. Describe in detail what you want to do once the "merge" happens.

--
Paige Miller
ChuksManuel
Pyrite | Level 9

Thanks once again.

Here's what I want to do:

Dataset 1 = 750 people. The PCA showed three major components, and I want to use the values of these components in my analysis.

Dataset 2 = 200 people. It has the variables that I want to regress against the components obtained from dataset 1.

Simple regression model: Y = Beta * X

So here:

Y (outcome variable in dataset 2) = Beta * PCA1 (where PCA1 is the first component obtained from dataset 1), and I will repeat the same thing for PCA2 and PCA3.

The problem: merging dataset 1 and dataset 2 by subject ID will reduce the sample size. So if I had to redo the PCA on the merged dataset (the dataset that combines dataset 1 and dataset 2 by subject ID), the variance of the components in this merged dataset would decrease.

Question: how do I keep the components obtained from dataset 1, so that I can use them in the merged dataset for the regression analysis?

PaigeMiller
Diamond | Level 26

So, if I am understanding, data set 2 with the 200 observations has both a Y variable (not used in the original PCA) and all of the original variables (not used in the original PCA). You want to apply the PCA vectors from data set 1 to the data in data set 2, and then apply the regression coefficient obtained from the first data set to predict Y using the x-variables in the second data set and the PCA vectors from data set 1.

 

Does that match what you want to do?

 

It seems as if your explanation has three predictions of Y, one prediction using PCA1, then another prediction using PCA2 and another using PCA3. That doesn't seem right to me. It seems that you ought to have one prediction of Y using PCA1 PCA2 and PCA3 together. Can you clarify this point?

--
Paige Miller
ChuksManuel
Pyrite | Level 9

Thanks. That's exactly what I want to do.

Each component (PCA1, PCA2 and PCA3) is different. For example, PCA1 reflects a carb-based diet, PCA2 a veggie-based diet and PCA3 a protein-based diet. So I want to test each component against the Y variable in dataset 2. But my worry is that merging the datasets automatically reduces the sample size, and conducting another PCA on the merged data shows no clear dietary pattern.

That is why I'm asking if there's a way I can retain the values of the components obtained from the unmerged dataset (dataset 1) for use in the final analysis (on the merged dataset).

PaigeMiller
Diamond | Level 26

Yes, you can retain the PCA vectors and apply them to the data in the second data set. However, you would have to center and scale the 2nd data set exactly as you centered and scaled the first data set (which may have been the default scaling). For example, if data set 1 has variables X1-X10, and you subtracted the mean of 55 from X1 and then divided by the standard deviation of 20, then you must use a mean of 55 and a standard deviation of 20 for X1 in the 2nd data set. I don't know how good a programmer you are, but this can be done by running PROC STDIZE twice: the output statistics of the first run (on data set 1) are used to center and scale data set 2 in the second run.

 

The general method of applying PCA vectors to a new data set is given here:

https://communities.sas.com/t5/SAS-Procedures/Applying-Results-of-Principal-Component-Analysis-on-Ne...

 

This link does not deal with the issue of centering and scaling the 2nd data set properly; you will have to do that before this method will give meaningful results.
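As a rough sketch of the two PROC STDIZE runs followed by scoring, something along these lines could work. The data set names (diet750, diet200), the variable list x1-x10 and every output name are placeholders I made up for illustration, so adapt them to your own data and treat this as a starting point rather than a tested solution.

```sas
/* Hypothetical names: diet750 = data set 1, diet200 = data set 2, x1-x10 = diet variables */

/* Run 1: center and scale data set 1, saving the means and standard deviations used */
proc stdize data=diet750 method=std out=diet750_std outstat=diet750_stats;
   var x1-x10;
run;

/* Run 2: center and scale data set 2 with the statistics saved from data set 1 */
proc stdize data=diet200 method=in(diet750_stats) out=diet200_std;
   var x1-x10;
run;

/* PCA on data set 1; OUTSTAT= stores the eigenvectors as _TYPE_='SCORE' rows */
proc princomp data=diet750_std n=3 outstat=pca_stats noprint;
   var x1-x10;
run;

/* Apply the data set 1 eigenvectors to data set 2; creates Prin1-Prin3 */
proc score data=diet200_std score=pca_stats out=diet200_scores;
   var x1-x10;
run;
```

As far as I know, PROC SCORE also standardizes the incoming data itself when the SCORE= data set carries _TYPE_='MEAN' and _TYPE_='STD' rows, which the PROC PRINCOMP OUTSTAT= data set does. Here those rows describe the already standardized data set 1 (means of about 0, standard deviations of 1), so that step should change nothing, but it is worth checking the PROC SCORE documentation for your release.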

 

Now, I hate to bring this up, but when you do PCA, you don't necessarily get vectors that are good predictors. That is because PCA does not use prediction of a Y-variable as its criterion for finding the vectors. The vectors you get from PCA could be poor predictors, even if they have clear meaning in your mind; or they could be very good predictors. I never like using PCA to predict Y for this reason, and I always advise against it. You could use PLS, which provides vectors that are good predictors (if the data will allow), but these may or may not have the interpretations that you found. So I guess this all boils down to ... do you want meaningful vectors that may or may not predict Y well, or do you want perhaps less meaningful vectors that predict Y well?
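If you want to try the PLS route, a minimal sketch might look like the following. The data set merged200 (the 200 subjects with both the outcome and the diet variables), the outcome y and the variables x1-x10 are assumed names, and three factors is just a choice to mirror the three principal components.

```sas
/* Hypothetical names: merged200 holds y and x1-x10 for the 200 subjects */
proc pls data=merged200 method=pls nfac=3;
   model y = x1-x10;
   /* Save the X-scores (xscr1-xscr3) and the predicted outcome */
   output out=pls_out xscore=xscr predicted=y_hat;
run;
```

Note that these factors are extracted from the 200 subjects only, since PLS needs Y to find them, so they cannot draw on the other 550 subjects the way the PCA vectors do.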

--
Paige Miller
ChuksManuel
Pyrite | Level 9

That was helpful

PGStats
Opal | Level 21

If the subject IDs are the same in both datasets for the 200 subjects that they have in common, merging both datasets will keep all 750 subjects with the PCs, plus the outcome variables for the 200 common subjects. The remaining 550 subjects will get missing values for the outcome variables.

 

Something like

 

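/* MERGE with BY requires both data sets to be sorted by the key */
proc sort data=A; by id; run;
proc sort data=B; by id; run;

/* Keeps every id found in either data set; B's variables are missing where B has no match */
data A_and_B;
merge A B; by id;
run;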

will do the trick.
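Once merged, the regressions can be run directly on A_and_B. This is only a sketch: it assumes the component scores are named Prin1-Prin3 (the PROC PRINCOMP default) and uses a hypothetical outcome variable y.

```sas
/* Assumes A_and_B contains Prin1-Prin3 and a hypothetical outcome y.
   PROC REG drops observations with a missing y, so only the
   200 subjects with outcomes enter the fit. */
proc reg data=A_and_B;
   joint: model y = Prin1 Prin2 Prin3;   /* one model with all three components */
   pc1:   model y = Prin1;               /* or one component at a time */
run;
quit;
```

The components are uncorrelated over the 750 subjects but not necessarily over the 200 used in the regression, so the joint and one-at-a-time estimates may differ somewhat.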

PG
