Hi everyone,
Can you please help me form a composite index from the six variables in the SAS data below?
First step: standardise each of the six variables to zero mean and unit variance.
Second step: estimate the first principal component of the six variables and their one-year lags. This gives a first-stage index with 12 loadings, one for each current and lagged variable.
Third step: compute the correlation between the first-stage index (from the second step) and the current and lagged values of each variable.
Final step: define the composite index as the first principal component of the correlation matrix of six variables (each respective proxy's lead or lag, whichever has the higher correlation with the first-stage index), rescaling the coefficients so that the index has unit variance.
It should produce this answer:
Index_t = -0.241*CEFD_t + 0.242*TURN_{t-1} + 0.253*NIPO_t + 0.257*RIPO_{t-1} + 0.112*S_t - 0.283*PDND_{t-1}
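For anyone who wants to check the arithmetic outside SAS, here is a minimal sketch of the four steps in Python/NumPy. It uses synthetic data in place of the real proxies, and names like `stage1` and `keep` are my own, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def zscore(x):
    # Step 1: standardize each column to zero mean, unit variance
    return (x - x.mean(axis=0)) / x.std(axis=0)

def first_pc_scores(z):
    # First PC via eigendecomposition of the correlation matrix
    c = np.corrcoef(z, rowvar=False)
    vals, vecs = np.linalg.eigh(c)
    v = vecs[:, -1]                        # eigenvector of the largest eigenvalue
    return z @ v, v

# toy stand-in for the six proxies (44 "years" x 6 variables)
x = rng.standard_normal((44, 6))
z = zscore(x)

# Step 2: first PC of the six variables and their one-year lags (12 columns)
z12 = np.column_stack([z[1:], z[:-1]])     # current values | lagged values
stage1, _ = first_pc_scores(zscore(z12))

# Step 3: correlation of the first-stage index with current and lagged values
corrs = np.array([np.corrcoef(stage1, z12[:, j])[0, 1] for j in range(12)])

# Final step: per variable, keep current or lag, whichever |correlation| is larger
keep = [j if abs(corrs[j]) >= abs(corrs[j + 6]) else j + 6 for j in range(6)]
z6 = z12[:, keep]

index, loadings = first_pc_scores(zscore(z6))
index /= index.std()                       # rescale so the index has unit variance
```

The six entries of `loadings` (after the rescaling) play the role of the coefficients in the published equation.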
data have;
infile cards expandtabs truncover;
input date pdnd nipo ripo cefd s turn ;
cards;
1962 31.23 298 -0.41 3.88 0.16 0.1404
1963 29.24 83 4.27 9.38 0.11 0.1421
1964 31.16 97 4.92 14.30 0.22 0.1315
1965 12.36 146 11.67 13.90 0.14 0.1440
1966 2.86 84 10.68 15.70 0.14 0.1999
1967 -27.20 100 40.52 2.10 0.11 0.2224
1968 -33.79 368 54.11 -10.91 0.21 0.2274
1969 -11.81 781 13.44 -7.58 0.31 0.1972
1970 14.13 358 0.98 0.35 0.22 0.1805
1971 13.54 391 19.87 13.35 0.29 0.2105
1972 20.46 562 12.56 16.01 0.31 0.1988
1973 28.24 105 -0.06 16.48 0.33 0.1855
1974 19.53 9 0.67 16.66 0.16 0.1497
1975 16.71 14 -1.67 21.53 0.20 0.1958
1976 18.62 35 2.73 19.64 0.21 0.2018
1977 8.04 35 22.04 14.40 0.22 0.1871
1978 -0.23 50 24.42 23.53 0.22 0.2332
1979 -13.98 81 24.82 19.93 0.22 0.2648
1980 -22.89 238 49.60 13.53 0.28 0.3466
1981 -22.81 450 16.87 12.74 0.36 0.3262
1982 -25.62 222 19.30 2.44 0.36 0.4370
1983 -21.76 883 21.15 2.93 0.43 0.5060
1984 -15.95 552 11.51 1.15 0.17 0.4984
1985 -13.08 507 12.30 3.49 0.18 0.5415
1986 -11.08 953 10.40 0.92 0.16 0.6328
1987 -7.66 630 10.61 12.7 0.15 0.7318
1988 -4.16 227 9.81 13.7 0.11 0.5643
1989 -7.56 204 12.53 8.38 0.10 0.5485
1990 -1.49 172 14.61 7.49 0.08 0.4633
1991 -13.49 367 14.02 4.55 0.15 0.4639
1992 -12.19 509 12.34 1.2 0.15 0.4637
1993 -15.20 627 15.16 5.26 0.14 0.5305
1994 -9.21 568 13.46 12.14 0.12 0.5421
1995 -8.45 566 20.48 11.59 0.12 0.5886
1996 -11.02 845 16.90 11.06 0.20 0.6135
1997 -4.90 602 13.74 5.27 0.14 0.6871
1998 -5.39 344 19.90 12.14 0.12 0.7151
1999 -27.98 505 69.53 11.21 0.14 0.7631
2000 -15.83 397 56.43 6.17 0.14 0.9110
2001 -3.11 84 13.36 3.58 0.09 0.8892
2002 12.01 73 7.75 6.09 0.08 0.9627
2003 -0.05 76 11.58 4 0.07 0.9338
2004 -11.08 213 12.02 -0.1 0.08 0.9416
2005 -11.87 194 9.95 3.53 0.05 1.0551
run;
Thanks a lot for your help.
Best,
Cheema
Here is the first step.
proc standard data=have out=want mean=0 std=1;
var pdnd nipo ripo cefd s turn;
run;
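If it helps to see what PROC STANDARD is doing numerically, here is the same z-scoring in NumPy (note that PROC STANDARD divides by the sample standard deviation, i.e. ddof=1; the three rows are the first pdnd and turn values from the data above):

```python
import numpy as np

x = np.array([[31.23, 0.1404],
              [29.24, 0.1421],
              [31.16, 0.1315]])   # first three rows of pdnd and turn

# subtract the column mean, divide by the sample std (ddof=1), as PROC STANDARD does
z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
```

Afterwards each column of `z` has mean 0 and sample standard deviation 1.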
I don't understand the following steps, so I am calling @Rick_SAS.
Thanks Ksharp.
Look at the SAS doc for examples and details of the following steps:
Second Step: Use a DATA step and the LAG function to define the lagged variables. Then use PROC PRINCOMP with N=1 to compute the first PC. Use the OUT= option to write the first PC scores to a data set.
Third Step: PROC CORR with the WITH statement. You might need to write the correlations to an output data set for the final step.
Final Step: ??? I don't understand so I'll defer to someone else.
It'll look something like this (untested):
data Have2;
set Have;
LAGpdnd = lag(pdnd);
LAGnipo = lag(nipo);
LAGripo = lag(ripo);
LAGcefd = lag(cefd);
LAGs    = lag(s);
LAGturn = lag(turn);
run;
proc princomp data=Have2 N=1 out=Have3 noprint;
var pdnd nipo ripo cefd s turn LAG:;
run;
proc corr data=Have3;
var Prin1;
with pdnd nipo ripo cefd s turn LAG:;
run;
I believe the OP's final step is meant to reduce the number of elements in the first PC from 12 to 6, where the 6 are chosen from each of the original 6 columns, but using either current or lagged values, whichever has the higher absolute correlation with the 12-element component.
I think one could avoid running the final PROC CORR of the original vars with Prin1 by using the OUTSTAT= option of PROC PRINCOMP. Among other results, this gives a row with _TYPE_='SCORE' and _NAME_='Prin1', whose 12 columns hold the score coefficients of each var for Prin1. Since the score coefficients apply to the standardized values of the original vars, I believe they can be used just as well as the correlation coefficients for choosing the vars most highly correlated with Prin1:
proc princomp data=have2 N=1 out=have3 outstat=have3_stats noprint;
var pdnd nipo ripo cefd s turn LAG: ;
run;
Then examine the SCORE row of have3_stats.
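The claim that score coefficients can stand in for correlations checks out numerically: when the input columns are standardized, corr(x_j, Prin1) = sqrt(lambda_1) * v_j, so the correlations are just the first eigenvector scaled by one positive constant, and ranking by either gives the same choice. A quick NumPy check (synthetic data, my own variable names):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((100, 4)) @ rng.standard_normal((4, 4))  # correlated cols
z = (x - x.mean(axis=0)) / x.std(axis=0)      # standardize

c = np.corrcoef(z, rowvar=False)              # correlation matrix
vals, vecs = np.linalg.eigh(c)
lam, v = vals[-1], vecs[:, -1]                # top eigenvalue and eigenvector
pc1 = z @ v                                   # first PC scores

# empirical correlation of each column with the first PC
corrs = np.array([np.corrcoef(pc1, z[:, j])[0, 1] for j in range(4)])

# corr(x_j, PC1) equals sqrt(lambda_1) * v_j when columns are standardized
same = np.allclose(corrs, np.sqrt(lam) * v)
```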
However, if the goal is to get the optimal combination of 6 current/lagged values of the original data, I don't believe this process guarantees that objective.
To guarantee "optimal" (i.e. the component that would explain the largest multidimensional variance), wouldn't it require running a PC for each of the 2**6 combinations under consideration? Even though it's 2**6, it shouldn't be that expensive to do. Create a correlation matrix of all 12 vars one time, and then run that matrix through PROC PRINCOMP for all 2**6 combinations, which is nothing more than a set of matrix manipulations, with no data reading required. Then use the result with the largest eigenvalue for Prin1 (untested):
proc corr data=have2 outp=outcorr noprint;
var pdnd nipo ripo cefd s turn LAG:;
run;
data _null_;
  vars='pdnd nipo ripo cefd s turn';
  length princompvars $50;
  do i=0 to 2**countw(vars)-1;
    n=i;
    princompvars=' ';
    do w=1 to countw(vars);
      princompvars=catx(' ',scan(vars,w,' '),princompvars);
      if mod(n,2**w)^=0 then do;
        princompvars=cats('lag',princompvars);
        n=n-mod(n,2**w);
      end;
    end;
    /* use the loop index i here, not n: n has been zeroed by the bit-clearing above */
    pcoutstatfile=cats('pcout',i);
    call execute('proc princomp data=outcorr n=1 noprint outstat=' || trim(pcoutstatfile) || '; var ' || trim(princompvars) || '; run;');
  end;
run;
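The same enumeration can be sketched outside SAS: compute the 12x12 correlation matrix once, then for each of the 2**6 current-vs-lag subsets take the corresponding 6x6 submatrix and its largest eigenvalue. A NumPy version (synthetic data; names like `best_pick` are mine):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
x = rng.standard_normal((44, 6))
z = (x - x.mean(axis=0)) / x.std(axis=0)

z12 = np.column_stack([z[1:], z[:-1]])    # current cols 0-5, lagged cols 6-11
c12 = np.corrcoef(z12, rowvar=False)      # correlation matrix, computed one time

best_eig, best_pick = -np.inf, None
for bits in product([0, 1], repeat=6):    # 2**6 current-vs-lag combinations
    pick = [j + 6 * b for j, b in enumerate(bits)]
    sub = c12[np.ix_(pick, pick)]         # 6x6 submatrix; no data re-read needed
    eig = np.linalg.eigvalsh(sub)[-1]     # largest eigenvalue of this subset
    if eig > best_eig:
        best_eig, best_pick = eig, pick
```

`best_pick` then names the winning current/lag choice for each of the six variables.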
To @MAC1430 and anyone else, in the spirit of learning, I'd be curious to know the benefit of this rather complicated procedure: what is the purpose of performing the analysis this way? (I realize this question is unrelated to the original question about how to do this in SAS.)
Based on my limited understanding of what the method involves, this seems like overmassaging the data and defeating the purpose of using PCA in the first place.
I think by the time one has taken the first PC, the data have already been massaged a whole lot. Frankly, I'm intrigued by the idea of including serial cross-covariance terms as part of the variance to be explained. I'd be interested to know the difference between the highest eigenvalue that includes serial cross-covariances vs the highest eigenvalue that does not.
Thank you, @MAC1430, that paper is on-line at http://people.stern.nyu.edu/jwurgler/papers/wurgler_baker_cross_section.pdf so I will put it on my reading list, and hopefully learn some new things!
Yes. Any paper that shows something lacking in the notion of complete rational economic behavior is worthy of reading, especially in finance.