kellychan84
Fluorite | Level 6

Hello,

I am using the following code to run a PCA on a large data set.

 

data beta_diversity;
  length treatment $20;
  Infile "/home/u39233094/sasuser.v94/Thesis/CSV file/Cecal beta-diversity for SAS PCA delete.csv" dlm="," firstobs=2;
  input treatment$ ASV1-ASV5359;
run;
Proc print data=beta_diversity;
run;
ods graphics on;
proc princomp data=beta_diversity         /* use N= option to specify number of PCs */
              STD               /* optional: stdize PC scores to unit variance */
              out=PCAout         /* only needed to demonstrate corr(PC, orig vars) */
              plots=(scree profile pattern score);
var _numeric_;  /* or use _NUMERIC_ */
ods output Eigenvectors=EV;  /* to create loadings plot, output this table */
run;
proc sgplot data=PCAout aspect=1;
   scatter x=prin1 y=prin2 / group=treatment;
   xaxis grid label="PC1 (%)";
   yaxis grid label="PC2 (%)";
run;

My data matrix is like this:

 

[Screenshot of the data matrix: kellychan84_0-1644598005835.png]

SAS Studio kept running and did not show any results. I don't know what is wrong with the procedures. Could anyone give me some hints? Thank you very much in advance!

 

 

1 ACCEPTED SOLUTION

PaigeMiller
Diamond | Level 26

There is an alternate method of getting PCA vectors and scores from huge data sets, which I tested a long time ago, and my memory says it was much faster than PCA when you only needed a few vectors and had large data sets.

 

So, assuming my memory is correct, you could try this to get PCA results from PROC PLS. The advantage is that PLS doesn't have to invert matrices and doesn't have to compute eigenvalues/eigenvectors from the entire correlation matrix which is 5359x5359.

 

proc pls data=beta_diversity nfac=2 details;
    ods output xloadings=xloadings;
    model _numeric_ = _numeric_;
    output out=_scores_ xscore=prin;
run;

The idea here is that if you fit a PLS model where the x-variables are the same variables as the y-variables, you get PCA!!! (Raise your hand if you knew that). 
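To look at the result, the scores written by the OUTPUT statement above can be plotted in the same way as the PRINCOMP scores in the original post. This is only a sketch: it assumes the XSCORE=prin prefix produced variables named prin1 and prin2, and that treatment was carried through into the _scores_ data set. The axis labels are placeholders.

```sas
/* Sketch: plot the first two x-scores from the PROC PLS run above.      */
/* Assumes _scores_ contains prin1, prin2 (from XSCORE=prin) and the     */
/* original treatment variable.                                          */
proc sgplot data=_scores_ aspect=1;
   scatter x=prin1 y=prin2 / group=treatment;
   xaxis grid label="Score 1";
   yaxis grid label="Score 2";
run;
```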

 

 

--
Paige Miller


36 REPLIES
AMSAS
SAS Super FREQ

First, identify which step is causing SAS Studio to stop responding.
Which DATA step or PROC is running when it hangs?
Have you tried running it step by step (run the DATA step, then the PROC PRINT)?

 

 

sbxkoenk
SAS Super FREQ

Hello,

 

I don't think a PROC PRINT is a good idea on a dataset with thousands of columns and possibly millions of rows.

Writing this output to the Results window in HTML format takes ages.

 

Browse your data in a data pane (from the library pane) right after the data-step.

Or use the obs= data set option in PROC PRINT.
Like :

Proc print data=beta_diversity(obs=10); run;

Cheers,

Koen

kellychan84
Fluorite | Level 6
Hello Koen, even with 10 observations it did not work. I waited 10 minutes and I don't know why. If I read in only 10 variables of the data set, it finishes quickly.
PaigeMiller
Diamond | Level 26

@kellychan84 wrote:
Hello Koen, even with 10 observations it did not work. I waited 10 minutes and I don't know why. If I read in only 10 variables of the data set, it finishes quickly.

As I said earlier, try running PROC PRINCOMP with the option N=2 (or N=5 or whatever). You don't need all 5359 dimensions.

--
Paige Miller
ballardw
Super User

How large is "large"? As in how many observations are in your data?

How long did you allow the program to run?

If your data is large you might well want to remove that Proc print. Just creating the output table for a large data set can take a lot of time and resources. Plus if it is that "large" what do you get out of looking at printed output?

Which piece "kept running"? Run one procedure or data step at a time.

 

You may need to go to this link https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.4/statug/statug_princomp_details05.htm and check on the computation resources.

The above will show you how much memory may be needed. If the PRINCOMP procedure needs more memory than you have in your system, then it may be spending a lot of time writing data used for computations to disk and then rereading it.

Also there is a formula showing the relationship of variables and records to time. Note that the time to calculate the correlation matrix goes up, very roughly, as the square of the number of variables, and for eigenvalues as the cube of the number of variables. So with 5359 variables I am not surprised that it takes some time to run.
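To put rough numbers on that scaling, here is a small DATA step sketch. It uses the variable count from the INPUT statement and the observation count reported elsewhere in this thread. The quantities are order-of-magnitude proxies that I am assuming for illustration (a full p x p matrix of 8-byte values, n*p**2 for the crossproducts, p**3 for the eigen decomposition); they are not the exact formulas from the SAS documentation linked above.

```sas
/* Rough order-of-magnitude estimates only; the proxies below are        */
/* assumptions for illustration, not the documented PRINCOMP formulas.   */
data _null_;
  p = 5359;      /* number of ASV variables, from the INPUT statement    */
  n = 182000;    /* number of observations reported in this thread       */
  corr_cells    = p * (p + 1) / 2;       /* distinct entries of the symmetric matrix */
  corr_mb       = p * p * 8 / 1024**2;   /* full p x p matrix at 8 bytes per value   */
  crossprod_ops = n * p**2;              /* work grows with the square of p          */
  eigen_ops     = p**3;                  /* work grows with the cube of p            */
  put corr_cells= comma20. corr_mb= 8.1 crossprod_ops= e10. eigen_ops= e10.;
run;
```

The values are written to the SAS log, so nothing is sent to the Results window.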

 

 

 

kellychan84
Fluorite | Level 6
Thank you for your reply. I will check the computation resources later. By my calculation there are 182,000 observations in my data. SAS Studio quits easily, and each time I let it run until the platform quits. Even the print step takes forever. Would it be better to use the PC version of SAS to run this kind of large data set?
ballardw
Super User

@kellychan84 wrote:
. Even the print step takes like forever. Will it be better to use SAS PC version to run this kind of large data set?

Why are you bothering to print it at all? 180,000 lines with everything fitting on one line would be something like 2,250 pages. Once you add in the variables, each printed line is likely showing only 12 or so values, so with 5000+ variables (on rereading your code) each observation of the data set would take about 447 lines to display. Multiply that by the 2,250 pages and you are on the order of a document with maybe a million or more pages. You do not read that sort of stuff. Get rid of the PROC PRINT. If the only purpose of the PROC PRINT was to verify that the data was read, use the OBS= data set option to display maybe 10 observations:

Proc print data=beta_diversity (obs=10) ;
run;

But I think the 5000 variables in Princomp still takes a lot of time.

PaigeMiller
Diamond | Level 26

How many observations? How many variables?

 

If you just want to plot PRIN1 and PRIN2, then you can run PROC PRINCOMP with the option N=2, which may take a lot less time depending on how the algorithm is coded in SAS.
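A minimal sketch of that suggestion, reusing the data set and variable names from the original post. NOPRINT is added here as my own assumption to avoid rendering large tables in the Results window; whether N=2 actually shortens the computation depends on the algorithm, as noted above.

```sas
/* Sketch: keep only the first two principal components.                 */
/* N=2 limits the components computed/output; OUT= holds Prin1 and Prin2. */
/* NOPRINT is an added assumption to suppress displayed output.          */
proc princomp data=beta_diversity n=2 out=PCAout noprint;
   var ASV1-ASV5359;
run;
```

The PCAout data set can then be passed to the PROC SGPLOT step from the original post.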

--
Paige Miller
kellychan84
Fluorite | Level 6
Hello Paige, it is 182,000 observations with 5359 variables. Even the PROC PRINT step just keeps running without stopping.
PaigeMiller
Diamond | Level 26

As others have said, it is pointless (and very time consuming) to do a PROC PRINT on this data. Just remove PROC PRINT from your code and run it again.

--
Paige Miller
kellychan84
Fluorite | Level 6
I ran only the first input step. It shows 5000 columns, but my data set has 5359 columns. Does that mean something is wrong here and I cannot proceed?
Tom
Super User

Are you sure you told SAS to read the complete lines of data?

Check the notes in SAS log, it will show the minimum and maximum line lengths read from the file.

 

Note that the default line length INFILE will use is currently 32,767 bytes. If your lines are longer than that, then you need to add the LRECL= option to the INFILE statement. There is no real downside to setting it longer (other than that your data step might require a little more memory while it is running).

While you are at it add other common sense options like DSD (so that missing values and values with commas are treated properly) and TRUNCOVER (so INPUT does not move to a new line if there are too few values on the line.)

data beta_diversity;
  length treatment $20;
  infile "/home/u39233094/sasuser.v94/Thesis/CSV file/Cecal beta-diversity for SAS PCA delete.csv" 
    dlm="," dsd truncover lrecl=1000000 firstobs=2
  ;
  input treatment ASV1-ASV5359;
run;
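If you want to see the longest line directly rather than reading it off the log notes, a sketch like the following may help. It reads each record without parsing it and tracks the maximum length. The variable names len and maxlen are hypothetical, and the LRECL value is just a generous upper bound.

```sas
/* Sketch: report the longest record in the CSV without parsing fields.  */
/* LENGTH=len receives the length of each line once INPUT executes.      */
data _null_;
  infile "/home/u39233094/sasuser.v94/Thesis/CSV file/Cecal beta-diversity for SAS PCA delete.csv"
    lrecl=1000000 length=len end=eof;
  input;                       /* load the record into the input buffer  */
  retain maxlen 0;
  maxlen = max(maxlen, len);   /* keep the largest length seen so far    */
  if eof then put "Longest line read (bytes): " maxlen;
run;
```

If the reported length is above 32,767, the default record length would have truncated the lines, which would be consistent with some columns appearing to be missing.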

 

kellychan84
Fluorite | Level 6
I changed the code according to your suggestions. It is still running.
PaigeMiller
Diamond | Level 26

That should be quicker, but I still don't know how long it will take to compute N=2 dimensions for 180,000 records and 5359 variables.

--
Paige Miller
