Re: Principal Component Analysis based on six variables

MAC1430 · Posted 03-29-2017 02:38 AM

Hi everyone,

Can you please help me form a composite index based on six variables, as provided in the SAS data below?

First Step: We standardise these variables to have zero mean and unit variance.

Second Step: we have to estimate the first principal component of the six variables and their lags. This gives us a first-stage index with 12 loadings, one for each of the current and lagged variables.

Third Step: We compute the correlation between the first-stage index (obtained from the second step) and the current and lagged values of each of the variables.

Final step: we define the composite index as the first principal component of the correlation matrix of six variables (each respective proxy's lead or lag, whichever has the higher correlation with the first-stage index), rescaling the coefficients so that the index has unit variance.

It should provide this answer:

Index_t = -0.241 CEFD_t + 0.242 TURN_{t-1} + 0.253 NIPO_t + 0.257 RIPO_{t-1} + 0.112 S_t - 0.283 P^{D-ND}_{t-1}

data have;
infile cards expandtabs truncover;
input date pdnd nipo ripo cefd s turn ;
cards;
1962	31.23	298	-0.41	3.88	0.16	0.1404
1963	29.24	83	4.27	9.38	0.11	0.1421
1964	31.16	97	4.92	14.30	0.22	0.1315
1965	12.36	146	11.67	13.90	0.14	0.1440
1966	2.86	84	10.68	15.70	0.14	0.1999
1967	-27.20	100	40.52	2.10	0.11	0.2224
1968	-33.79	368	54.11	-10.91	0.21	0.2274
1969	-11.81	781	13.44	-7.58	0.31	0.1972
1970	14.13	358	0.98	0.35	0.22	0.1805
1971	13.54	391	19.87	13.35	0.29	0.2105
1972	20.46	562	12.56	16.01	0.31	0.1988
1973	28.24	105	-0.06	16.48	0.33	0.1855
1974	19.53	9	0.67	16.66	0.16	0.1497
1975	16.71	14	-1.67	21.53	0.20	0.1958
1976	18.62	35	2.73	19.64	0.21	0.2018
1977	8.04	35	22.04	14.40	0.22	0.1871
1978	-0.23	50	24.42	23.53	0.22	0.2332
1979	-13.98	81	24.82	19.93	0.22	0.2648
1980	-22.89	238	49.60	13.53	0.28	0.3466
1981	-22.81	450	16.87	12.74	0.36	0.3262
1982	-25.62	222	19.30	2.44	0.36	0.4370
1983	-21.76	883	21.15	2.93	0.43	0.5060
1984	-15.95	552	11.51	1.15	0.17	0.4984
1985	-13.08	507	12.30	3.49	0.18	0.5415
1986	-11.08	953	10.40	0.92	0.16	0.6328
1987	-7.66	630	10.61	12.7	0.15	0.7318
1988	-4.16	227	9.81	13.7	0.11	0.5643
1989	-7.56	204	12.53	8.38	0.10	0.5485
1990	-1.49	172	14.61	7.49	0.08	0.4633
1991	-13.49	367	14.02	4.55	0.15	0.4639
1992	-12.19	509	12.34	1.2	0.15	0.4637
1993	-15.20	627	15.16	5.26	0.14	0.5305
1994	-9.21	568	13.46	12.14	0.12	0.5421
1995	-8.45	566	20.48	11.59	0.12	0.5886
1996	-11.02	845	16.90	11.06	0.20	0.6135
1997	-4.90	602	13.74	5.27	0.14	0.6871
1998	-5.39	344	19.90	12.14	0.12	0.7151
1999	-27.98	505	69.53	11.21	0.14	0.7631
2000	-15.83	397	56.43	6.17	0.14	0.9110
2001	-3.11	84	13.36	3.58	0.09	0.8892
2002	12.01	73	7.75	6.09	0.08	0.9627
2003	-0.05	76	11.58	4	0.07	0.9338
2004	-11.08	213	12.02	-0.1	0.08	0.9416
2005	-11.87	194	9.95	3.53	0.05	1.0551
run;

Thanks a lot for your help.

Best,

Cheema

Ksharp · Posted 03-29-2017 06:35 AM

Here is the code for the first step.

proc standard data=have out=want mean=0 std=1;
var pdnd nipo ripo cefd s turn;
run;
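
A quick way to confirm the standardisation is to check the means and standard deviations of the output. This is only a minimal sketch, assuming the `want` data set created by the PROC STANDARD step above:

```sas
/* Each variable should show mean 0 and std dev 1 (up to rounding) */
proc means data=want n mean std maxdec=4;
   var pdnd nipo ripo cefd s turn;
run;
```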

I don't understand the remaining steps, so I am calling @Rick_SAS.

MAC1430 · Posted 03-29-2017 07:34 PM

Thanks Ksharp.

Rick_SAS · Posted 03-29-2017 08:13 AM

Look at the SAS doc for examples and details of the following steps:

Second Step: Use DATA step and LAG function to define the lagged variables. Then use PROC PRINCOMP and N=1 to compute the first PC. Use the OUT= option to output the first PC scores to a data set.

Third Step: Use PROC CORR and the WITH statement. You might need to write the output to a data set for the final step.

Final Step: ??? I don't understand so I'll defer to someone else.

It'll look something like this (untested):

data Have2;
   set Have;
   LAGpdnd = lag(pdnd);
   LAGnipo = lag(nipo);
   ...
run;

proc princomp data=Have2 N=1 out=Have3 noprint;
   var pdnd nipo ripo cefd s turn LAG:;
run;

proc corr data=Have3;
   var Prin1;
   /* correlate Prin1 with the current and lagged variables */
   with pdnd nipo ripo cefd s turn LAG:;
run;
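
One possible way to carry the step-three correlations into the final step is to write them to a data set and keep, for each proxy, whichever of the current or lagged version has the larger absolute correlation with Prin1. The sketch below is untested and rests on assumptions: that Have3 contains a lagged variable named LAG<name> for all six proxies, and that the data set names Stage1Corr and Picks (and the variables base and abscorr) are purely illustrative.

```sas
/* Write the correlations with Prin1 to a data set instead of the listing */
proc corr data=Have3 outp=Stage1Corr noprint;
   var Prin1;
   with pdnd nipo ripo cefd s turn LAG:;
run;

/* Tag each row with its underlying proxy name (LAG prefix stripped) */
data Picks;
   set Stage1Corr(where=(_TYPE_='CORR'));
   length base $32;
   base = upcase(_NAME_);
   if base =: 'LAG' then base = substr(base, 4);
   abscorr = abs(Prin1);
   keep _NAME_ base Prin1 abscorr;
run;

proc sort data=Picks;
   by base descending abscorr;
run;

/* Keep the row with the highest |correlation| within each proxy */
data Picks;
   set Picks;
   by base;
   if first.base;
run;
```

The _NAME_ column of Picks then lists the six variables to pass to the final PROC PRINCOMP.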

mkeintz · Posted 03-29-2017 11:34 AM

I believe the OP's final step is meant to reduce the number of elements in the first PC from 12 to 6, where the 6 are chosen from each of the original 6 columns, but using either current or lagged values, whichever has the higher absolute correlation with the 12-element component.

I think one could avoid running the final PROC CORR of the original vars with PRIN1 by using the OUTSTAT= option of PROC PRINCOMP. Among other results, this will give a row with _TYPE_='SCORE' and _NAME_='Prin1', and 12 other columns having the score coefficients of each var with PRIN1. Since the score coefficients use the standardized values of the original vars, I believe they can be used just as well as the correlation coefficients for choosing the vars with the highest correlation with Prin1:

proc princomp data=have2 N=1 out=have3 outstat=have3_stats noprint;
  var pdnd nipo ripo cefd s turn lag: ;
run;

Then examine the SCORE row of have3_stats.
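
To make that row easier to read, it could be transposed so that each variable becomes one observation, sorted by absolute coefficient. This is an untested sketch; the names loadings, variable and abscoef are illustrative and not part of the original code:

```sas
/* Turn the single SCORE row into one observation per analysis variable */
proc transpose data=have3_stats(where=(_TYPE_='SCORE' and upcase(_NAME_)='PRIN1'))
               out=loadings name=variable;
   id _NAME_;          /* the transposed column is named Prin1 */
run;

data loadings;
   set loadings;
   abscoef = abs(Prin1);
run;

proc sort data=loadings;
   by descending abscoef;
run;

proc print data=loadings noobs;
   var variable Prin1 abscoef;
run;
```

Comparing each variable with its LAG counterpart in this listing shows which of the two has the larger coefficient in absolute value.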

However, if the goal is to get the optimal combination of 6 current/lagged values of the original data, I don't believe this process guarantees that objective.

To guarantee "optimal" (i.e. the component that would explain the largest multidimensional variance), wouldn't it require running a PC for each of the 2**6 combinations under consideration? Even though it's 2**6, it shouldn't be that expensive to do. Create a correlation/covariance matrix of all 12 vars one time, and then run that matrix through proc princomp for all 2**6 combinations, which is nothing more than a set of matrix manipulations - no data reading required. Then use the result with the largest eigenvalue for prin1 (untested):

/* OUTP= (not OUT=) writes the TYPE=CORR data set that PROC PRINCOMP can read */
proc corr data=have2 sscp outp=outcorr;
  var pdnd nipo ripo cefd s turn LAG:;
run;

data _null_;
   vars='pdnd nipo ripo cefd s turn';
   length princompvars $50;
   do I=0 to 2**countw(vars)-1;   /* one pass per current/lagged combination */
     N=I;
     princompvars=' ';
     do w=1 to countw(vars);
        princompvars=catx(' ',scan(vars,w,' '),princompvars);
        if mod(n,2**w)^=0 then do;   /* this bit of I is set: use the lagged variable */
          princompvars=cats('lag',princompvars);
          n=n-mod(n,2**w);
        end;
     end;
     pcoutstatfile=cats('pcout',I);   /* N is always 0 after the loop, so name by I */
     call execute ('proc princomp data=outcorr N=1 noprint outstat=' || trim(pcoutstatfile) || '; var ' || trim(princompvars) || '; run;');
   end;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

PaigeMiller · Posted 03-29-2017 12:57 PM

To @MAC1430 and anyone else: in the spirit of learning, I'd be curious to know the benefit of performing this rather complicated procedure. What is the purpose of performing the analysis this way? (I realize this question is unrelated to the original question about how to do this in SAS.)

Based on my limited understanding of what the method involves, this seems like overmassaging the data and defeating the purpose of using PCA in the first place.

--
Paige Miller

mkeintz · Posted 03-29-2017 02:40 PM

I think by the time you have taken the first PC, you've already massaged the data a whole lot. Frankly, I'm intrigued by the idea of including serial cross-covariances as part of the variance to be explained. I'd be interested to know the difference between the highest eigenvalue that includes serial cross-covariances and the highest eigenvalue that does not.

MAC1430 · Posted 03-29-2017 08:11 PM

Basically, this is the procedure used to construct the sentiment index in Baker and Wurgler (2006). I have never used PCA before, so I don't know if the process is accurate, but Baker and Wurgler are prominent finance professors, so I assume it is.

PaigeMiller · Posted 03-30-2017 08:10 AM

Thank you, @MAC1430, that paper is on-line at http://people.stern.nyu.edu/jwurgler/papers/wurgler_baker_cross_section.pdf so I will put it on my reading list, and hopefully learn some new things!

--
Paige Miller

MAC1430 · Posted 03-30-2017 04:36 PM

It's really a good paper. Enjoy reading it [😊]

mkeintz · Posted 03-30-2017 05:32 PM

Yes. Any paper that shows something lacking in the notion of complete rational economic behavior is worthy of reading, especially in finance.

MAC1430 · Posted 03-29-2017 07:35 PM

Thanks, the final step is meant to reduce the number of elements in the first PC from 12 to 6.
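
For reference, the final step could be sketched roughly as follows. This is untested and makes several assumptions: Have2 holds the current variables plus lagged versions named LAGturn, LAGripo and LAGpdnd (the naming used in the LAG step earlier in the thread), the six variables are the ones that appear in the expected equation in the question, and the data set names Final, FinalStat and FinalIndex are illustrative only.

```sas
/* First PC of the six selected current/lagged proxies.
   PROC PRINCOMP analyses the correlation matrix by default. */
proc princomp data=Have2 n=1 out=Final outstat=FinalStat noprint;
   var cefd LAGturn nipo LAGripo s LAGpdnd;
run;

/* Rescale the first-PC scores so the index has mean 0 and unit variance */
proc standard data=Final out=FinalIndex mean=0 std=1;
   var Prin1;
run;
```

The SCORE row of FinalStat holds the unscaled coefficients. Since the unstandardised scores have variance equal to the first eigenvalue, dividing those coefficients by its square root gives coefficients for a unit-variance index. The sign of a principal component is arbitrary, so the signs may come out flipped relative to the published equation, and there is no guarantee that this sketch reproduces the published coefficients.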