Birgithj
Fluorite | Level 6

Hi all, 

 

I have managed to perform a PCA when the data is structured as in the small example below, using the following code:

 

ods select NObsNVar SimpleStatistics Corr Eigenvalues
Eigenvectors ScreePlot PatternProfilePlot;

 

proc princomp data=example;
var ORCL KO GM;
run;

 

Date    ORCL  KO  GM
201801  1%    2%  3%
201802  1%    2%  3%
201803  1%    2%  3%

 

 

My issue is that I do not have information regarding the Industry Classification code and Company Name of each variable (e.g. ORCL, KO, GM...). I would therefore like to perform the PCA when data is structured as the example below instead: 

 

Date    Ticker  Company name     Industry classification  Returns
201801  ORCL    Oracle Corp      7372                     1%
201802  ORCL    Oracle Corp      7372                     1%
201803  ORCL    Oracle Corp      7372                     1%
201801  KO      Coca Cola comp   2086                     2%
201802  KO      Coca Cola comp   2086                     2%
201803  KO      Coca Cola comp   2086                     2%
201801  GM      General Motors   3711                     3%
201802  GM      General Motors   3711                     3%
201803  GM      General Motors   3711                     3%

 

 

Can anyone help with a code to do this? Thank you!

 

9 REPLIES 9
PaigeMiller
Diamond | Level 26

Generally, you would not perform Principal Components Analysis on categorical variables. This is a case where, yes, you can program SAS to do this, but it doesn't make sense, so you should not do it.

 

What is the goal of this analysis?

--
Paige Miller
Birgithj
Fluorite | Level 6

Hi Paige thanks for your answer.

 

The purpose of the analysis is to perform PCA on stock returns for a number of stocks over a certain period of time. Once I have identified the principal components and their eigenvalues, I would like to analyse the principal components according to the industry of the underlying stocks, e.g. PC 3 predominantly contains stocks in Healthcare. I am not able to do this when the data is structured as in the first example.

Rick_SAS
SAS Super FREQ

The main difference I see in your two data sets is that the second is in "long form" whereas the first is in "wide form." Long form is used for longitudinal data and mixed models, but for PCA you must have the data in wide form because each observation must be a k-dimensional vector. It doesn't make sense to perform PCA if, for example, you have 3 observations for Oracle and 5 observations for GM. 

 

If you have the data in long form, you can convert it to wide form by using PROC TRANSPOSE or by using the SAS DATA step.

However, the conversion from long to wide assumes that each group ("Ticker") has the same number of observations and that the date variables are the same for each group.
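
A minimal sketch of the PROC TRANSPOSE route, under stated assumptions: the long data set is called long (a hypothetical name), it has the variables Date, Ticker, and Returns, Returns is numeric rather than text such as "1%", each Date and Ticker pair occurs only once, and every ticker is a valid SAS variable name.

```sas
/* Assumed input: LONG, one row per Date and Ticker (data set name is hypothetical) */
proc sort data=long out=long_sorted;
   by Date Ticker;
run;

/* One row per Date, one column per Ticker, each holding that ticker's return */
proc transpose data=long_sorted out=wide(drop=_name_);
   by Date;
   id Ticker;
   var Returns;
run;

/* Same analysis as in the wide-form example; list the ticker columns explicitly */
proc princomp data=wide;
   var ORCL KO GM;
run;
```

If a ticker has no row for some date, its column is missing for that date in wide, and PROC PRINCOMP leaves that observation out of the analysis. That is one reason the tickers should share the same set of dates.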

Birgithj
Fluorite | Level 6

Hi. Thanks, that's very helpful.

 

We are already able to transform the data to wide form, as in the example below. Our problem is that we cannot figure out how to keep the industry classification information in the dataset while still being able to perform the PCA.

 

Date    ORCL  KO  GM
201801  1%    2%  3%
201802  1%    2%  3%
201803  1%    2%  3%

 

So, would it be possible to get the Industry Code (in addition to ticker) in wide form together with the return data, so that we would be able to infer which stocks are in which industries from the PCA output, without having to go through each stock manually one by one?
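
One possible approach, sketched under assumptions rather than taken from any reply in this thread: keep the industry codes in a separate lookup table with one row per ticker, run the PCA on the wide data as before, and join the lookup table to the eigenvectors afterwards. The data set names long, wide, lookup, pca_stats, loadings, and loadings_with_industry are hypothetical, as are the variable names Company_name and Industry_classification. The sketch also assumes each ticker is a valid SAS variable name, so that the column names in wide match the Ticker values.

```sas
/* One row per ticker with its descriptive fields, taken from the long-form data */
proc sort data=long(keep=Ticker Company_name Industry_classification)
          out=lookup nodupkey;
   by Ticker;
run;

/* OUTSTAT= stores the eigenvectors in the rows where _TYPE_='SCORE' */
proc princomp data=wide outstat=pca_stats noprint;
   var ORCL KO GM;
run;

/* Flip those rows: each ticker becomes a row, Prin1, Prin2, ... become columns */
proc transpose data=pca_stats(where=(_type_='SCORE'))
               out=loadings name=Ticker;
   id _name_;
run;

/* Attach company name and industry code to each ticker's coefficients */
proc sql;
   create table loadings_with_industry as
   select l.*, k.Company_name, k.Industry_classification
   from loadings as l
        left join lookup as k
        on upcase(l.Ticker) = upcase(k.Ticker)
   order by k.Industry_classification, l.Ticker;
quit;
```

The result has one row per stock with its coefficient on each component next to its industry code, so the stocks that dominate a given component can be sorted or summarised by industry without touching the PCA input.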

PaigeMiller
Diamond | Level 26

@Birgithj wrote:

I would like to be able to analyse the principal components according to the industry that the underlying stocks are in

 

You seem to be choosing Principal Components Analysis without understanding that this method does not fit the problem. Principal Components does not work with categorical variables. The variables in principal components are multiple measurements (variables) of the objects you are studying (in this case, it sounds like the objects are companies) but yet you arrange the data so that the different companies are now variables, not objects. So nothing you have said so far fits principal components. PCA doesn't sound like it is headed in a productive direction. In addition, you would not do PCA if you have only one measurement, in this case "returns".


Perhaps you want to know which industries have the highest (or lowest) mean value of returns, and whether the differences between the industries are statistically significant. Is that what you want to know? If so, this is not principal components, this is ANOVA (or MANOVA if you really have multiple measurements on each company), and your original format is the right one for that analysis.

--
Paige Miller
Birgithj
Fluorite | Level 6

Hi Paige, 

 

I am sure that principal component analysis is the right method for my problem. Perhaps I have not been clear enough about the purpose of the analysis, and hence about what I am trying to analyse.

 

The situation is, that I have a large dataset consisting of around 500 stocks and their respective returns for the period 2000-2019. Based on PCA I am:

 

1) trying to reduce dimensionality of the observed variation in returns to a limited number of principal components explaining e.g. 70% of the total variation in stock returns.

 

and 

 

2) trying to analyse the composition of each principal component that we choose to keep. In finance, these could be interpreted as risk factors: you could, for example, expect PC1 to represent general market risk, and PC2 and so on to represent other sources of risk.

 

In order to interpret the principal components, it is vital that we can keep company-specific information in the dataset. So far we have not been able to do so when the data is structured in wide form, which is why we want to know whether PCA can be performed on the long form shown above. Other data structures are also fine, but it is nonetheless vital that we are able to keep information about e.g. sector classification for subsequent analysis.

PaigeMiller
Diamond | Level 26

@Birgithj wrote:

1) trying to reduce dimensionality of the observed variation in returns to a limited number of principal components explaining e.g. 70% of the total variation in stock returns.


Reduce the dimensionality of what variables (plural)? I see in your data only a single variable, Returns. You would not do Principal Components of a single variable.

 

So, the following is what I don't understand.

 

What are the "objects" (in some analyses, "objects" can be different people, or different animals, or different locations, or different chemical samples, etc.) in this PCA study? What are the variables (plural) whose dimensionality you need to reduce?

--
Paige Miller
Birgithj
Fluorite | Level 6

Hi Paige, 

 

As shown in my earlier post, the same data has been structured in two ways:

 

Date    ORCL  KO  GM
201801  1%    2%  3%
201802  1%    2%  3%
201803  1%    2%  3%

 

In this small example, the variables are ORCL, KO and GM (company tickers). Here PCA works fine.

 

My question is whether or not it is possible to run the same analysis when the data is instead structured as below:

Date    Ticker  Company name     Industry classification  Returns
201801  ORCL    Oracle Corp      7372                     1%
201802  ORCL    Oracle Corp      7372                     1%
201803  ORCL    Oracle Corp      7372                     1%
201801  KO      Coca Cola comp   2086                     2%
201802  KO      Coca Cola comp   2086                     2%
201803  KO      Coca Cola comp   2086                     2%
201801  GM      General Motors   3711                     3%
201802  GM      General Motors   3711                     3%
201803  GM      General Motors   3711                     3%

 

I am aware that the returns for all companies are placed in the same column and that the same goes for company tickers, but we thought perhaps it was possible to work around this so that we could keep the information stated in the additional columns. I know this is possible when running the analysis in Python (using ID variables), but I was wondering if the analysis could be done in SAS using the above data structure.

PaigeMiller
Diamond | Level 26

I still find myself unable to see any way that PCA fits based upon your explanation, and I also don't see direct answers to the question "what are the objects" and "what are the variables".


I do see other analyses that might work on this data, but that is because I have created a goal in my mind for analyzing this data, in which companies are objects and results is the only measured variable; and that goal does not sound like your goal.

 

--
Paige Miller

