The purpose of this blog is to illustrate the relationship between a Pearson correlation coefficient and a slope parameter from a simple regression model.
How Does Correlation Relate to Linear Regression (and Factor Analysis)?
For well over a decade, I’ve been teaching a course in multivariate methods, which begins with a discussion of both principal components analysis (PCA) and exploratory factor analysis (EFA). Most of the course participants have been comfortable with PCA, but when it comes to factor analysis, they often feel challenged. I show them all the related matrix algebra, but I know that’s not really the problem; any basic statistics text can provide that. I want them to understand conceptually what is actually happening when we say that we “infer factors from an observed covariance matrix”.
The matrix algebra of exploratory factor analysis looks exactly like that of linear regression, and you can think of exploratory factor analysis as a series of simultaneous regression models. However, it is not so obvious why the correlation matrices in factor analysis have anything to do with the implied regression models. So, I’m going to present here the first step in a two-step approach to understanding exploratory factor analysis: an explanation of the relationship between regression coefficients and Pearson correlation coefficients. The beauty of presenting it this way is that even if you don’t care about factor analysis at all and only want to understand linear regression a bit better, there’s something here for you. I will mostly avoid mathematical formulas because, as I said, you can find those anywhere. Instead, I’ll describe the concepts and use various statistical procedures in SAS® to illustrate my points.
Let me start with a fairly simple set of regression models. I’ll be using the Baseball data set in the SASHELP library, with data from American Major League Baseball in the 1986 season. I’ll regress the variable Salary on two explanatory variables, nRBI (runs batted in) and nHome (number of home runs). It’s not important that you know what those measures are, but it might make things a bit more interesting if you do.
Most of my course participants are aware that linear regression is related to Pearson correlations, but they might have forgotten how. I’ll start with a correlation matrix of all three variables to be used in my regression models. Documentation for PROC CORR can be found here.
proc corr data=sashelp.baseball
nosimple;
var Salary nRBI nHome;
run;
Pearson Correlation Coefficients
Prob > |r| under H0: Rho=0 / Number of Observations

Variable                            Salary          nRBI            nHome
Salary  1987 Salary in $ Thousands  1.00000         0.51723 <.0001  0.39885 <.0001
                                    263             263             263
nRBI    RBIs in 1986                0.51723 <.0001  1.00000         0.85394 <.0001
                                    263             322             322
nHome   Home Runs in 1986           0.39885 <.0001  0.85394 <.0001  1.00000
                                    263             322             322
I’ll just point out that the three variables are all correlated with one another, to one degree or another.
Now I’ll regress Salary on each of the explanatory variables in separate models and just show tables relevant to this discussion. PROC REG documentation can be found here.
proc reg data=sashelp.baseball;
model Salary=nRBI;
run;
Root MSE          386.82654    R-Square    0.2675
Dependent Mean    535.92588    Adj R-Sq    0.2647
Coeff Var          72.17911

Parameter Estimates
Variable   Label          DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept  Intercept       1            61.43175        54.13626     1.13    0.2575
nRBI       RBIs in 1986    1             9.07446         0.92941     9.76    <.0001
proc reg data=sashelp.baseball;
model Salary=nHome;
run;
Root MSE          414.47484    R-Square    0.1591
Dependent Mean    535.92588    Adj R-Sq    0.1559
Coeff Var          77.33809

Parameter Estimates
Variable   Label              DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept  Intercept           1           294.86052        42.78033     6.89    <.0001
nHome      Home Runs in 1986   1            20.38591         2.90120     7.03    <.0001
At this point, it’s not entirely obvious how the regression results are related to the correlation results, but we’ll get there. I promise. Let me just point out that each of the explanatory variables, nRBI and nHome, has a p-value <.0001 in the parameter estimates table of its respective model. These values would be considered “statistically significant” in most organizations. The next step is to combine the models by including both measures in a single model.
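As a preview of the connection, a simple-regression slope is just the Pearson correlation rescaled by the ratio of the two standard deviations. Here is a quick sketch of that identity in Python on synthetic data (the post’s analyses are in SAS; this is only an illustration, not the baseball data):

```python
import numpy as np

# Synthetic stand-ins for a predictor and a salary-like response.
rng = np.random.default_rng(5)
x = rng.normal(loc=50, scale=20, size=263)
y = 9.0 * x + rng.normal(scale=300, size=263)

r = np.corrcoef(x, y)[0, 1]      # Pearson correlation of x and y
slope = np.polyfit(x, y, 1)[0]   # simple-regression slope of y on x

# The slope equals r times sd(y)/sd(x).
rescaled_r = r * y.std(ddof=1) / x.std(ddof=1)
print(slope, rescaled_r)
```

When both variables are standardized, both standard deviations are one, which is why the standardized slope and the correlation coincide exactly later in the post.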
proc reg data=sashelp.baseball;
model Salary=nRBI nHome;
run;
Analysis of Variance
Source           DF  Sum of Squares  Mean Square  F Value  Pr > F
Model             2        14600270      7300135    49.02  <.0001
Error           260        38718842       148919
Corrected Total 262        53319113

Root MSE          385.89976    R-Square    0.2738
Dependent Mean    535.92588    Adj R-Sq    0.2682
Coeff Var          72.00618

Parameter Estimates
Variable   Label              DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept  Intercept           1            34.66715        56.87140     0.61    0.5427
nRBI       RBIs in 1986        1            11.33615         1.76860     6.41    <.0001
nHome      Home Runs in 1986   1            -7.73751         5.15245    -1.50    0.1344
At this point, some questions arise. Why are the regression coefficients (parameter estimates) for nRBI and nHome so different from what they were in the individual models? And why is the coefficient for nHome now negative? It seems to indicate that for each extra home run a player hits, his expected salary is reduced by over $7,700! Wow, if this were really true, I imagine we’d never see another home run hit in Major League Baseball®. However, the p-value for this estimate is so high that few researchers would consider it “statistically significant”. Still, this is perplexing. The standard explanation is that the parameter estimates and p-values are estimated separately for each explanatory variable, adjusting for the effect of the other explanatory variable. But what does that mean?
Here’s where correlation plays one of its roles in a linear regression model. If the parameter estimate of an explanatory variable changes after adjusting for another explanatory variable, you can infer that the two variables are correlated, to some degree at least. If my friend Peter were to leave SAS®, that would take some adjustment on my part, because we have a great relationship (a COR-relationship, if you will). However, if the night security guard were to leave SAS®, my life wouldn’t be affected at all, unless, I guess, someone stole my computer at night now that there was no security guard. Oh, you get the point. Anyway, don’t just take my word for it. Let’s see what the numbers say.
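Before rerunning the SAS models, the claim can be checked numerically. Below is a minimal Python sketch on synthetic data (all names and numbers here are invented for illustration): two correlated predictors, of which only the first truly drives the response. The simple-regression slope for the second predictor borrows the first predictor’s effect, and it collapses once both predictors are in the model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Two correlated predictors; only x1 truly affects y.
x1 = rng.normal(size=n)
x2 = 0.85 * x1 + np.sqrt(1 - 0.85**2) * rng.normal(size=n)  # corr(x1, x2) ~ 0.85
y = 9.0 * x1 + rng.normal(scale=5.0, size=n)

def ols(X, y):
    """Least-squares coefficients (intercept first)."""
    Xd = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(Xd, y, rcond=None)[0]

b_simple = ols(x2.reshape(-1, 1), y)[1]            # x2 alone
b_adjusted = ols(np.column_stack([x1, x2]), y)[2]  # x2 adjusted for x1

# b_simple is large and positive (it borrows x1's effect);
# b_adjusted shrinks toward x2's true coefficient of zero.
print(b_simple, b_adjusted)
```

If x1 and x2 were uncorrelated, the two estimates for x2 would agree, which is exactly what the principal-components exercise below demonstrates.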
To demonstrate my assertion that adjusted parameter estimates are no different from raw parameter estimates when the explanatory variables are uncorrelated, let me create perfectly uncorrelated variables based on nRBI and nHome. I’ll do this using principal components analysis. Principal components are linear combinations of the input variables, constructed so that the components are perfectly uncorrelated. I’ll use PROC PRINCOMP with an OUT= option to produce two new uncorrelated measures from the two correlated variables. By the way, there are several missing salaries in the Baseball data, so I am limiting the analysis to records with non-missing salaries (there are no missing values for either nHome or nRBI). Documentation for PROC PRINCOMP can be found here.
proc princomp data=sashelp.baseball
out=work.bases
prefix=measure
noprint;
var nrbi nhome;
where Salary ne .;
run;
Let’s see the correlations among the two new measures and Salary.
proc corr data=work.bases
nosimple;
var Salary Measure1 Measure2;
run;
Simple Statistics
Variable    N       Mean    Std Dev     Sum   Minimum   Maximum  Label
Salary    263  535.92588  451.11868  140949  67.50000      2460  1987 Salary in $ Thousands
measure1  263          0    1.36072       0  -2.08519   3.85143
measure2  263          0    0.38527       0  -0.90799   1.09361

Pearson Correlation Coefficients, N = 263
Prob > |r| under H0: Rho=0

Variable                            Salary          measure1        measure2
Salary  1987 Salary in $ Thousands  1.00000         0.47605 <.0001  0.21727 0.0004
measure1                            0.47605 <.0001  1.00000         0.00000 1.0000
measure2                            0.21727 0.0004  0.00000 1.0000  1.00000
The new measures are both correlated with Salary but, as planned, they are perfectly uncorrelated with one another. Keep in mind, however, that the two new measures are linear combinations of nHome and nRBI. Together, the pair of measures contains all the information of those original variables, and we will see evidence of that in the next regression models.
Notice that the means of both measures are zero, but their standard deviations (and therefore variances) differ.
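The decorrelation that PROC PRINCOMP performs can be sketched outside SAS as well. Here is a minimal Python version on synthetic data, assuming only the standard construction: component scores are the centered data projected onto the eigenvectors of its covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

# Two correlated inputs, standing in for nRBI and nHome.
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)
X = np.column_stack([x1, x2])

# Project the centered data onto the eigenvectors of its covariance matrix.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
scores = Xc @ eigvecs  # columns are the component scores

# The scores' covariance matrix is diagonal: the components are uncorrelated,
# but their variances (the eigenvalues) differ, just like Measure1 and Measure2.
cov_scores = np.cov(scores, rowvar=False)
print(cov_scores)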
So, what happens when we run the regression models?
proc reg data=work.bases;
model Salary=Measure1;
run;
Parameter Estimates
Variable   Label      DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept  Intercept   1           535.92588        24.50978    21.87    <.0001
measure1               1           157.82356        18.04668     8.75    <.0001
The raw (unadjusted) parameter estimate for Measure1 is 157.82356.
proc reg data=work.bases;
model Salary=Measure2;
run;
Parameter Estimates
Variable   Label      DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept  Intercept   1           535.92588        27.20462    19.70    <.0001
measure2               1           254.40347        70.74567     3.60    0.0004
The raw parameter estimate for Measure2 is 254.40347.
proc reg data=work.bases;
model Salary=Measure: ;
run;
Analysis of Variance
Source           DF  Sum of Squares  Mean Square  F Value  Pr > F
Model             2        14600270      7300135    49.02  <.0001
Error           260        38718842       148919
Corrected Total 262        53319113

Root MSE          385.89976    R-Square    0.2738
Dependent Mean    535.92588    Adj R-Sq    0.2682
Coeff Var          72.00618

Parameter Estimates
Variable   Label      DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept  Intercept   1           535.92588        23.79560    22.52    <.0001
measure1               1           157.82356        17.52082     9.01    <.0001
measure2               1           254.40347        61.88050     4.11    <.0001
The adjusted parameter estimates for Measure1 and Measure2 are precisely the same as the unadjusted parameter estimates. Without correlation among the predictor variables, the apparent conundrum of the changing parameter estimates doesn’t exist. This is partly why a “balanced and complete design” is used in experimental design: it ensures that the independent variables are uncorrelated.
Another interesting outcome shown in these tables is that the F value and p-value (Pr > F) from the Analysis of Variance table are precisely the same as the values obtained when using the original variables, nRBI and nHome. The R-Square and Adjusted R-Square are also the same. Linear transformations, such as the ones used to create principal component scores, do not affect the explanatory power of a model.
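That invariance is easy to verify directly: any invertible linear transformation of the predictors spans the same column space, so the fitted values, and therefore R-square and the overall F test, are unchanged. A small Python sketch on synthetic data (not the SAS output above):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300
X = rng.normal(size=(n, 2))
y = 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)

def r_squared(X, y):
    """R-square from an ordinary least-squares fit with an intercept."""
    Xd = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

# Mix the predictors with an arbitrary invertible matrix: same column space,
# same fitted values, same R-square.
A = np.array([[2.0, 1.0], [0.5, 3.0]])
r2_orig = r_squared(X, y)
r2_mixed = r_squared(X @ A, y)
print(r2_orig, r2_mixed)
```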
Now, let’s go deeper into the relationship between Pearson correlation coefficients and regression parameters. This is where the magic happens. Okay, so it’s not magic, but would you be excited by more talk about matrices and linear transformations? To this point, I have created explanatory variables that are perfectly uncorrelated. In the next step, I’ll standardize all the variables, including the dependent variable, Salary, so that each has a mean of zero and a variance and standard deviation of one (often known as “z-score standardization”).
I’ll use PROC STDIZE with METHOD=STD, which performs z-score standardization. Documentation for PROC STDIZE can be found here.
proc stdize method=std
data=work.bases
out=work.bases2;
var Salary measure1 measure2;
run;
Let’s look at the new correlation matrix.
proc corr data=work.bases2;
var Salary Measure1 Measure2;
run;
Simple Statistics
Variable    N  Mean  Std Dev  Sum   Minimum   Maximum  Label
Salary    263     0  1.00000    0  -1.03837   4.26512  1987 Salary in $ Thousands
measure1  263     0  1.00000    0  -1.53242   2.83043
measure2  263     0  1.00000    0  -2.35675   2.83852

Pearson Correlation Coefficients, N = 263
Prob > |r| under H0: Rho=0

Variable                            Salary          measure1        measure2
Salary  1987 Salary in $ Thousands  1.00000         0.47605 <.0001  0.21727 0.0004
measure1                            0.47605 <.0001  1.00000         0.00000 1.0000
measure2                            0.21727 0.0004  0.00000 1.0000  1.00000
Are you surprised that the Pearson correlation coefficients are all identical to the ones I obtained using the unstandardized variables? Well, the not-so-well-kept secret is that Pearson correlations are simply covariances of variables that have been z-score standardized. If you don’t standardize the variables yourself, they are effectively standardized in the process of calculating the Pearson correlations anyway.
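That “secret” is simple to confirm numerically. A quick Python check on synthetic data: the covariance of two z-scored variables equals their Pearson correlation.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)

def zscore(v):
    """Standardize to mean zero and unit standard deviation."""
    return (v - v.mean()) / v.std(ddof=1)

r = np.corrcoef(x, y)[0, 1]                  # Pearson correlation
cov_z = np.cov(zscore(x), zscore(y))[0, 1]   # covariance of the z-scores

print(r, cov_z)  # the two values agree up to floating-point error
```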
Let’s see how this affects the regression models.
proc reg data=work.bases2;
model Salary=Measure1;
run;
Parameter Estimates
Variable   Label      DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept  Intercept   1         1.54097E-16         0.05433     0.00    1.0000
measure1               1             0.47605         0.05443     8.75    <.0001
The first thing you might notice is that the parameter estimate for Measure1, 0.47605, is exactly the same as the Pearson correlation coefficient of Measure1 with Salary. What is less obvious is that the intercept is now zero, or at least it should be. The only reason it isn’t is finite numerical precision: the value 1.54097E-16 is infinitesimally close to zero and would be exactly zero if the parameters were estimated with perfect precision. Oh, well.
proc reg data=work.bases2;
model Salary=Measure2;
run;
Parameter Estimates
Variable   Label      DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept  Intercept   1         1.18586E-16         0.06030     0.00    1.0000
measure2               1             0.21727         0.06042     3.60    0.0004
Similarly, the parameter estimate for Measure2 is the same as its Pearson correlation coefficient with Salary. These results are not simply coincidence. They are a consequence of the linear transformations involved in z-score standardization of both the X and Y variables in a regression model: a simple regression model (one explanatory variable) will have an intercept of zero and a parameter estimate equal to the Pearson correlation coefficient of that variable with the dependent variable.
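The same identity is easy to reproduce outside SAS. A minimal Python sketch on synthetic data: z-score both variables, fit a simple regression, and the slope equals the Pearson correlation while the intercept vanishes.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=250)
y = 2.0 * x + rng.normal(size=250)

# z-score both the predictor and the response.
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)

# Fit zy = intercept + slope * zx by least squares.
slope, intercept = np.polyfit(zx, zy, 1)
r = np.corrcoef(x, y)[0, 1]

# slope equals the Pearson correlation; intercept is zero up to
# floating-point precision, like the 1.5E-16-style intercepts above.
print(slope, intercept, r)
```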
Now, let me put this all together in one final model.
proc reg data=work.bases2;
model Salary=Measure1 Measure2;
run;
Analysis of Variance
Source           DF  Sum of Squares  Mean Square  F Value  Pr > F
Model             2        71.74296     35.87148    49.02  <.0001
Error           260       190.25704      0.73176
Corrected Total 262       262.00000

Root MSE          0.85543        R-Square    0.2738
Dependent Mean    1.41838E-16    Adj R-Sq    0.2682
Coeff Var         6.031008E17

Parameter Estimates
Variable   Label      DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept  Intercept   1         1.30845E-16         0.05275     0.00    1.0000
measure1               1             0.47605         0.05285     9.01    <.0001
measure2               1             0.21727         0.05285     4.11    <.0001
Once again, the F value and p-value from the Analysis of Variance table, along with the R-Square and Adjusted R-Square values, are identical to those from the first combined model I ran, where I used the unstandardized variables nHome and nRBI as explanatory variables. The linear transformations involved in creating the two principal component scores didn’t change the predictive power of the model. Nor did standardization.
If you’re interested in learning how all of this relates to exploratory factor analysis, tune in to my next blog post. The important points to take from this post for factor analysis are these: Pearson correlation coefficients are equal to regression parameter estimates (with a Y-intercept of zero) when all variables are on the standardized metric of zero mean and unit variance, and the adjusted parameter estimates (and therefore adjusted correlations with the Y variable) are the same as the raw, unadjusted parameter estimates when the explanatory variables are uncorrelated.