There is an alternative method of getting PCA vectors and scores from huge data sets. I tested it a long time ago, and as I recall it was much faster than PCA when you needed only a few vectors and had large data sets.
So, assuming my memory is correct, you could try this to get PCA results from PROC PLS. The advantage is that PLS doesn't have to invert matrices and doesn't have to compute eigenvalues/eigenvectors from the entire correlation matrix, which is 5359x5359.
proc pls data=beta_diversity nfac=2 details;
ods output xloadings=xloadings
model _numeric_ = _numeric_;
output out=_scores_ xscore=prin;
run;
The idea here is that if you fit a PLS model where the x-variables are the same variables as the y-variables, you get PCA!!! (Raise your hand if you knew that).
Hello @PaigeMiller ,
My hand is NOT raised. Thanks for the tip!
If PROC PRINCOMP is slow, you can also try :
Thanks,
Koen
@kellychan84 when you get an error, you need to SHOW US the log.
We need to see the ENTIRE log for this PROC, all of it for this PROC, every single line for this PROC, from the first line where the log shows PROC PLS all the way down to the last NOTE beneath the log.
Copy the log as text and then paste it into the window that appears when you click on the </> icon.
1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
NOTE: ODS statements in the SAS Studio environment may disable some output features.
71
72 data beta_diversity;
73 length treatment $20;
74 Infile "/home/u39233094/sasuser.v94/Thesis/CSV file/Cecal beta-diversity for SAS PCA delete.csv" dlm="," dsd truncover
74 ! lrecl=1000000 firstobs=2;
75 input treatment$ ASV1-ASV5359;
76 run;
NOTE: The infile "/home/u39233094/sasuser.v94/Thesis/CSV file/Cecal beta-diversity for SAS PCA delete.csv" is:
  Filename=/home/u39233094/sasuser.v94/Thesis/CSV file/Cecal beta-diversity for SAS PCA delete.csv,
  Owner Name=u39233094,Group Name=oda,
  Access Permission=-rw-r--r--,
  Last Modified=10Feb2022:15:51:11,
  File Size (bytes)=411643
NOTE: 34 records were read from the infile "/home/u39233094/sasuser.v94/Thesis/CSV file/Cecal beta-diversity for SAS PCA delete.csv".
  The minimum record length was 10740.
  The maximum record length was 11170.
NOTE: The data set WORK.BETA_DIVERSITY has 34 observations and 5360 variables.
NOTE: DATA statement used (Total process time):
  real time 0.02 seconds
  user cpu time 0.02 seconds
  system cpu time 0.00 seconds
  memory 7997.96k
  OS Memory 35880.00k
  Timestamp 02/11/2022 09:21:48 PM
  Step Count 31 Switch Count 2
  Page Faults 0
  Page Reclaims 1436
  Page Swaps 0
  Voluntary Context Switches 19
  Involuntary Context Switches 0
  Block Input Operations 0
  Block Output Operations 4360
77 proc pls data=beta_diversity nfac=2 details;
78 ods output xloadings=xloadings
79 model _numeric_ = _numeric_;
80 output out=_scores_ xscore=prin;
81 run;
ERROR: No MODEL specified.
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK._SCORES_ may be incomplete. When this step was stopped there were 0 observations and 0 variables.
WARNING: Data set WORK._SCORES_ was not replaced because this step was stopped.
NOTE: PROCEDURE PLS used (Total process time):
  real time 0.00 seconds
  user cpu time 0.00 seconds
  system cpu time 0.00 seconds
  memory 4996.03k
  OS Memory 34372.00k
  Timestamp 02/11/2022 09:21:48 PM
  Step Count 32 Switch Count 1
  Page Faults 0
  Page Reclaims 521
  Page Swaps 0
  Voluntary Context Switches 6
  Involuntary Context Switches 0
  Block Input Operations 0
  Block Output Operations 8
WARNING: Output '_numeric_' was not created. Make sure that the output object name, label, or path is spelled correctly. Also, verify that the appropriate procedure options are used to produce the requested output object. For example, verify that the NOPRINT option is not used.
WARNING: Output 'model' was not created. Make sure that the output object name, label, or path is spelled correctly. Also, verify that the appropriate procedure options are used to produce the requested output object. For example, verify that the NOPRINT option is not used.
WARNING: Output 'xloadings' was not created. Make sure that the output object name, label, or path is spelled correctly. Also, verify that the appropriate procedure options are used to produce the requested output object. For example, verify that the NOPRINT option is not used.
82
83 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
94
User: u39233094
data beta_diversity;
length treatment $20;
Infile "/home/u39233094/sasuser.v94/Thesis/CSV file/Cecal beta-diversity for SAS PCA delete.csv" dlm="," dsd truncover lrecl=1000000 firstobs=2;
input treatment$ ASV1-ASV5359;
run;
proc pls data=beta_diversity nfac=2 details;
ods output xloadings=xloadings
model _numeric_ = _numeric_;
output out=_scores_ xscore=prin;
run;
Looks like I made a typographical error.
There should be a semicolon at the end of the line
ods output xloadings=xloadings
Hello Paige,
It finally works!! It takes 15 min to finish the program. But there is a warning here. I don't know whether it matters or not. Please see the log attached below.
1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
NOTE: ODS statements in the SAS Studio environment may disable some output features.
71
72 data beta_diversity;
73 length treatment $20;
74 Infile "/home/u39233094/sasuser.v94/Thesis/CSV file/Cecal beta-diversity for SAS PCA delete.csv" dlm="," dsd truncover
74 ! lrecl=1000000 firstobs=2;
75 input treatment$ ASV1-ASV5359;
76 run;
NOTE: The infile "/home/u39233094/sasuser.v94/Thesis/CSV file/Cecal beta-diversity for SAS PCA delete.csv" is:
  Filename=/home/u39233094/sasuser.v94/Thesis/CSV file/Cecal beta-diversity for SAS PCA delete.csv,
  Owner Name=u39233094,Group Name=oda,
  Access Permission=-rw-r--r--,
  Last Modified=10Feb2022:15:51:11,
  File Size (bytes)=411643
NOTE: 34 records were read from the infile "/home/u39233094/sasuser.v94/Thesis/CSV file/Cecal beta-diversity for SAS PCA delete.csv".
  The minimum record length was 10740.
  The maximum record length was 11170.
NOTE: The data set WORK.BETA_DIVERSITY has 34 observations and 5360 variables.
NOTE: DATA statement used (Total process time):
  real time 0.08 seconds
  user cpu time 0.02 seconds
  system cpu time 0.01 seconds
  memory 8001.78k
  OS Memory 35368.00k
  Timestamp 02/14/2022 02:51:14 PM
  Step Count 24 Switch Count 2
  Page Faults 0
  Page Reclaims 1666
  Page Swaps 0
  Voluntary Context Switches 24
  Involuntary Context Switches 0
  Block Input Operations 808
  Block Output Operations 4360
77 proc pls data=beta_diversity nfac=2 details;
78 ods output xloadings=xloadings;
79 model _numeric_ = _numeric_;
80 output out=_scores_ xscore=prin;
81 run;
WARNING: Iteration limit reached without convergence.
NOTE: The data set WORK.XLOADINGS has 2 observations and 5360 variables.
NOTE: There were 34 observations read from the data set WORK.BETA_DIVERSITY.
NOTE: The data set WORK._SCORES_ has 34 observations and 5362 variables.
NOTE: PROCEDURE PLS used (Total process time):
  real time 14:46.99
  user cpu time 14:40.87
  system cpu time 2.20 seconds
  memory 955894.12k
  OS Memory 1091020.00k
  Timestamp 02/14/2022 03:06:01 PM
  Step Count 25 Switch Count 10
  Page Faults 0
  Page Reclaims 708366
  Page Swaps 0
  Voluntary Context Switches 33275
  Involuntary Context Switches 1277
  Block Input Operations 0
  Block Output Operations 29576
82
83
84 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
95
One more question: could you please also teach me how to plot my PCA1 and PCA2 after this proc pls? Thank you very much!
One more thing: there are still only 34 observations in the log, and I don't know why.
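On the plotting question, here is a minimal sketch rather than a definitive answer. It assumes the _scores_ data set was created by the OUTPUT statement above with XSCORE=prin, so that the two score variables are named prin1 and prin2 (the log reports 5362 variables in _SCORES_, which fits two added columns). The axis labels are illustrative only.

```sas
/* Scatter plot of the first two score vectors, colored by treatment. */
/* Assumption: _scores_ contains prin1, prin2 and the treatment column. */
proc sgplot data=_scores_;
  scatter x=prin1 y=prin2 / group=treatment;
  xaxis label="Component 1 (prin1)";
  yaxis label="Component 2 (prin2)";
run;
```

If the treatment groups overlap heavily, adding a DATALABEL= option to the SCATTER statement is one way to identify individual samples.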
Take care @kellychan84 : your analysis is done on 34 observations only. That's NOT what you want!
Regarding :
WARNING: Iteration limit reached without convergence.
There is a MAXITER sub-option to increase the number of iterations.
(The default algorithm is NIPALS and 200 is the default number of iterations.)
Cheers,
Koen
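A hedged sketch of how the MAXITER sub-option mentioned above could be set. As far as I understand the PROC PLS syntax, MAXITER= is nested inside ALGORITHM=NIPALS, which is itself nested inside METHOD=PLS; the value 1000 is an arbitrary choice for illustration, not a recommendation.

```sas
/* Same PROC PLS call as before, with a higher NIPALS iteration limit. */
/* Assumption: 1000 is an arbitrary illustrative value (default is 200). */
proc pls data=beta_diversity nfac=2 details
         method=pls(algorithm=nipals(maxiter=1000));
  ods output xloadings=xloadings;
  model _numeric_ = _numeric_;
  output out=_scores_ xscore=prin;
run;
```

If the warning persists after raising the limit, that suggests looking at the data rather than raising the limit further.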
So it seems to be reading the data. It read 5,360 variables. The longest line was only 11,170 bytes. Are your variables all only one character long?
NOTE: 34 records were read from the infile "/home/u39233094/sasuser.v94/Thesis/CSV file/Cecal beta-diversity for SAS PCA delete.csv".
  The minimum record length was 10740.
  The maximum record length was 11170.
NOTE: The data set WORK.BETA_DIVERSITY has 34 observations and 5360 variables.
But it only found 34 observations, not the 182K you mentioned before. Did you truncate the text file in some way? Note that you could also use the OBS= option on the INFILE statement to read only some of the lines, or use the OBS= data set option when passing the data to a procedure so that it uses only some of the observations.
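To illustrate the two OBS= variants just mentioned, here is a small sketch. The data set name beta_small, the cutoff of 10 rows, and the use of PROC MEANS on ASV1-ASV5 are arbitrary choices made for the example.

```sas
/* Variant 1: OBS= on INFILE stops reading at physical line 11, */
/* i.e. the header (skipped by FIRSTOBS=2) plus 10 data lines.  */
data beta_small;  /* hypothetical name for the reduced data set */
  length treatment $20;
  infile "/home/u39233094/sasuser.v94/Thesis/CSV file/Cecal beta-diversity for SAS PCA delete.csv"
         dlm="," dsd truncover lrecl=1000000 firstobs=2 obs=11;
  input treatment $ ASV1-ASV5359;
run;

/* Variant 2: OBS= as a data set option limits what a procedure sees. */
proc means data=beta_diversity(obs=10) n mean;
  var ASV1-ASV5;
run;
```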
SAS did not see your MODEL statement because you forgot to end the ODS statement with a semicolon, so the MODEL keyword became part of the ODS statement instead of starting a separate statement. Line breaks and extra white space mean nothing in SAS code; you have to tell SAS where each statement ends by using a semicolon.
proc pls data=beta_diversity nfac=2 details;
ods output xloadings=xloadings;
model _numeric_ = _numeric_;
output out=_scores_ xscore=prin;
run;
If you are really trying to use 34 observations to derive insights into over 5,000 variables, you do not have enough data for your analysis to have any meaning. Perhaps that is why it doesn't end?
@Tom wrote:
If you are really trying to use 34 observations to derive insights into over 5,000 variables, you do not have enough data for your analysis to have any meaning. Perhaps that is why it doesn't end?
Great catch. This certainly would be a meaningless analysis. However, the algorithm ought to finish really really quickly with just 34 observations.
Hello Tom,
I don't know why it shows only 34 observations. After I added the "dsd truncover lrecl=1000000" options, the column count finally increased from 5000 to exactly what I have (5359). Paige's proc pls code is finally giving me results!!
@kellychan84 wrote:
Hello Tom,
I don't know why it shows only 34 observations. After I added the "dsd truncover lrecl=1000000" options, the column count finally increased from 5000 to exactly what I have (5359). Paige's proc pls code is finally giving me results!!
I'm going out to celebrate! 🙂 😀😁
But please clear up this issue: do you still have 34 observations? If so, the results are meaningless.
Yeah, I am going to celebrate too.
But the log shows that it only reads 34 observations. "NOTE: The data set WORK.BETA_DIVERSITY has 34 observations and 5360 variables." Why is that happening? I will try the solutions others are giving.
You have to perform some basic troubleshooting to figure out why reading the .csv file produces a SAS data set named beta_diversity that has only 34 observations. This is where you need to start.
And if you do figure out how to get the 180,000 observations, expect your PCA/PLS to take days, given that it takes 15 minutes with 34 observations.
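One possible first troubleshooting step, offered as a sketch: count the physical lines in the raw file without parsing any fields. The log above reports a file size of 411,643 bytes and records of roughly 11,000 bytes each, which would be consistent with the file itself holding only a header plus 34 data lines, but that is an inference from the log, not something confirmed in this thread. The variable names reclen and nlines are made up for the example.

```sas
/* Count lines in the raw CSV and report the length of the first few. */
/* Assumption: reclen and nlines are illustrative variable names.     */
data _null_;
  infile "/home/u39233094/sasuser.v94/Thesis/CSV file/Cecal beta-diversity for SAS PCA delete.csv"
         lrecl=1000000 length=reclen end=eof;
  input;                       /* read one record, parse nothing */
  nlines + 1;                  /* running count of physical lines */
  if _n_ <= 3 then putlog "Line " _n_ "is " reclen "bytes long";
  if eof then putlog "Total lines in file (header included): " nlines;
run;
```

If the total comes back as 35, the file on the server has only 34 data rows, and the next thing to check is how the CSV was exported or uploaded (for example, whether rows and columns were swapped) rather than the INFILE options.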