Hi all,
I have managed to perform a PCA when data is structured as the small example below. This I have done by using the following code:
ods select NObsNVar SimpleStatistics Corr Eigenvalues
Eigenvectors ScreePlot PatternProfilePlot;
proc princomp data=example;
var ORCL KO GM;
run;
| Date | ORCL | KO | GM | 
| 201801 | 1% | 2% | 3% | 
| 201802 | 1% | 2% | 3% | 
| 201803 | 1% | 2% | 3% | 
My issue is that I do not have information regarding the Industry Classification code and Company Name of each variable (e.g. ORCL, KO, GM...). I would therefore like to perform the PCA when data is structured as the example below instead:
| Date | Ticker | Company name | Industry classification | Returns | 
| 201801 | ORCL | Oracle Corp | 7372 | 1% | 
| 201802 | ORCL | Oracle Corp | 7372 | 1% | 
| 201803 | ORCL | Oracle Corp | 7372 | 1% | 
| 201801 | KO | Coca Cola comp | 2086 | 2% | 
| 201802 | KO | Coca Cola comp | 2086 | 2% | 
| 201803 | KO | Coca Cola comp | 2086 | 2% | 
| 201801 | GM | General Motors | 3711 | 3% | 
| 201802 | GM | General Motors | 3711 | 3% | 
| 201803 | GM | General Motors | 3711 | 3% | 
Can anyone help with code to do this? Thank you!
Generally, you would not perform Principal Components Analysis on categorical variables. Yes, you can program SAS to do this, but it doesn't make sense, so you should not do it.
What is the goal of this analysis?
Hi Paige thanks for your answer.
The purpose of the analysis is to perform PCA on stock returns for a number of stocks over a certain period of time. Once I have identified the principal components and their eigenvalues, I would like to be able to analyse the principal components according to the industry that the underlying stocks are in, e.g. PC3 predominantly contains stocks that are in Healthcare. I am not able to do this when the data is structured as in the first example.
The main difference I see in your two data sets is that the second is in "long form" whereas the first is in "wide form." Long form is used for longitudinal data and mixed models, but for PCA you must have the data in wide form because each observation must be a k-dimensional vector. It doesn't make sense to perform PCA if, for example, you have 3 observations for Oracle and 5 observations for GM.
If you have the data in long form, you can convert it to wide form by using PROC TRANSPOSE or by using the SAS DATA step.
However, the conversion from long to wide assumes that each group ("Ticker") has the same number of observations and that the date variables are the same for each group.
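The long-to-wide conversion described above could be sketched as follows. This is only a minimal sketch: the data set name `long` and the variable names `Date`, `Ticker`, and `Returns` are assumptions based on the example tables, and it presumes every ticker has exactly one row per date and that each ticker is a valid SAS variable name.

```sas
/* Assumed input: data set LONG with variables Date, Ticker, Returns */
/* PROC TRANSPOSE with a BY statement needs the data sorted by Date  */
proc sort data=long out=long_sorted;
    by Date Ticker;
run;

/* One output row per Date, one column per Ticker */
proc transpose data=long_sorted out=wide(drop=_name_);
    by Date;
    id Ticker;      /* ticker values become the new variable names */
    var Returns;    /* values that fill the new columns            */
run;

/* PCA on the wide data, as in the original post */
proc princomp data=wide;
    var ORCL KO GM;
run;
```

If a ticker is missing for some date, the corresponding cell in `wide` is missing, and PROC PRINCOMP leaves out any observation that has a missing value in one of the analysis variables.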
Hi. Thanks, that's very helpful.
We are already able to transform the data to the wide form, as in the example below. Our problem is that we cannot figure out how to keep the information regarding industry classification in the dataset while still being able to perform the PCA.
| Date | ORCL | KO | GM | 
| 201801 | 1% | 2% | 3% | 
| 201802 | 1% | 2% | 3% | 
| 201803 | 1% | 2% | 3% | 
So, would it be possible to get the Industry Code (in addition to ticker) in wide form together with the return data, so that we would be able to infer which stocks are in which industries from the PCA output, without having to go through each stock manually one by one?
@Birgithj wrote:
I would like to be able to analyse the principal components according to the industry that the underlying stocks are in
You seem to be choosing Principal Components Analysis without understanding that this method does not fit the problem. Principal Components does not work with categorical variables. The variables in principal components are multiple measurements (variables) of the objects you are studying (in this case, it sounds like the objects are companies), yet you arrange the data so that the different companies are now variables, not objects. So nothing you have said so far fits principal components. PCA doesn't sound like it is headed in a productive direction. In addition, you would not do PCA if you have only one measurement, in this case "returns".
Perhaps you want to know which industries have the highest (or lowest) mean value of returns, and whether the differences between the industries are statistically significant. Is that what you want to know? If so, this is not principal components, this is ANOVA (or MANOVA if you really have multiple measurements on each company), and your original format is the right one for that analysis.
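For reference, the one-way ANOVA mentioned above might look roughly like this on the long-form data. It is a sketch under assumptions: `long` is a hypothetical data set name, and `Industry` and `Returns` are assumed variable names for the "Industry classification" and "Returns" columns.

```sas
/* Assumed input: long-form data set LONG with variables Industry and Returns */
proc glm data=long;
    class Industry;              /* industry code treated as a categorical factor */
    model Returns = Industry;    /* tests whether mean returns differ by industry */
    means Industry / tukey;      /* pairwise comparisons of industry means        */
run;
quit;
```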
Hi Paige,
I am sure that principal component analysis is the right method for my problem. Perhaps I have not been clear enough about the purpose of the analysis, and hence what I am trying to analyse.
The situation is that I have a large dataset consisting of around 500 stocks and their respective returns for the period 2000-2019. Based on PCA I am:
1) trying to reduce dimensionality of the observed variation in returns to a limited number of principal components explaining e.g. 70% of the total variation in stock returns.
and
2) trying to analyse the composition of each principal component that we choose to keep e.g. in finance this could be interpreted as risk factors. In this case you could e.g. expect PC1 to represent general market risk, PC2 and so on to represent other sources of risk.
In order to interpret the principal components, it is vital that we can keep information regarding company specifics in the dataset. So far we have not been able to do so when the data is structured in wide form, which is why we want to know if PCA can be performed on the long form shown above. Other data structures are also fine, but it is nonetheless vital that we are able to keep information about e.g. sector classification for subsequent analysis.
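One possible way to keep the company information without putting it into the PCA input is sketched below. In wide form the tickers are variables, so the descriptive fields belong to the columns rather than to the rows. They can be held in a separate lookup table and joined to the eigenvector (loadings) table after the PCA. The data set names `long`, `wide`, `lookup`, `eigvec`, and `loadings` and the variable names `CompanyName` and `Industry` are assumptions made for illustration. The sketch also assumes the ODS `Eigenvectors` table identifies each input variable in a column named `Variable`, which is worth confirming with PROC CONTENTS on your installation.

```sas
/* Lookup table: one row per ticker with its descriptive fields */
proc sort data=long(keep=Ticker CompanyName Industry)
          out=lookup nodupkey;
    by Ticker;
run;

/* Run the PCA on the wide data and save the eigenvector table */
ods output Eigenvectors=eigvec;
proc princomp data=wide;
    var ORCL KO GM;
run;

/* Attach company name and industry to each ticker's loadings.     */
/* Assumes EIGVEC has a column Variable holding the ticker name    */
/* and columns Prin1, Prin2, ... holding the loadings.             */
proc sql;
    create table loadings as
    select l.Ticker, l.CompanyName, l.Industry, e.*
    from eigvec as e
         left join lookup as l
         on upcase(e.Variable) = upcase(l.Ticker)
    order by l.Industry, l.Ticker;
quit;

/* Example summary: average loading on the first components by industry */
proc means data=loadings mean;
    class Industry;
    var Prin1-Prin3;
run;
```

With the loadings and industry codes in one table, it becomes possible to check, for instance, whether the stocks with large loadings on a given component cluster in one industry, without going through each stock manually.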
@Birgithj wrote:
1) trying to reduce dimensionality of the observed variation in returns to a limited number of principal components explaining e.g. 70% of the total variation in stock returns.
Reduce the dimensionality of what variables (plural)? I see in your data only a single variable, Returns. You would not do Principal Components of a single variable.
So, the following is what I don't understand.
What are the "objects" (in some analyses, "objects" can be different people, or different animals, or different locations, or different chemical samples, etc.) in this PCA study? What are the variables (plural) whose dimensionality you need to reduce?
Hi Paige,
As shown in my additional post, the same data has been structured in two ways:
| Date | ORCL | KO | GM | 
| 201801 | 1% | 2% | 3% | 
| 201802 | 1% | 2% | 3% | 
| 201803 | 1% | 2% | 3% | 
In this minor example, the variables are ORCL, KO and GM (company tickers). Here PCA works fine.
My question is whether or not it is possible to run the same analysis when the data is instead structured as below:
| Date | Ticker | Company name | Industry classification | Returns | 
| 201801 | ORCL | Oracle Corp | 7372 | 1% | 
| 201802 | ORCL | Oracle Corp | 7372 | 1% | 
| 201803 | ORCL | Oracle Corp | 7372 | 1% | 
| 201801 | KO | Coca Cola comp | 2086 | 2% | 
| 201802 | KO | Coca Cola comp | 2086 | 2% | 
| 201803 | KO | Coca Cola comp | 2086 | 2% | 
| 201801 | GM | General Motors | 3711 | 3% | 
| 201802 | GM | General Motors | 3711 | 3% | 
| 201803 | GM | General Motors | 3711 | 3% | 
I am aware that the returns for all companies are placed in the same column and that the same goes for company tickers, but we thought perhaps it was possible to work around it so that we could keep the information stated in the additional columns. I know this is possible when running the analysis in Python (using ID variables), but I was wondering if the analysis could be done in SAS using the above data structure.
I still find myself unable to see any way that PCA fits based upon your explanation, and I also don't see direct answers to the question "what are the objects" and "what are the variables".
I do see other analyses that might work on this data, but that is because I have created a goal in my mind for analyzing this data, in which companies are objects and results is the only measured variable; and that goal does not sound like your goal.
