Re: interpreting associations between rows & columns in Proc Corresp

dglassbrenner · Posted 07-06-2017 09:19 AM

What do you look for in the output of Proc Corresp in order to learn which, if any, associations exist between the rows & columns of a contingency table? Specific instructions illustrated on the Getting Started Example (on doctorates in the sciences) would be appreciated!

Rick_SAS · Posted 07-06-2017 12:56 PM

In the GS example, the doc says to look at the Inertia and Chi-Square Decomposition table, which reports the total chi-square statistic, "a measure of the association between the rows and columns." Be sure to use the CHI2P option to get the p-value.
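For readers who want to see what that total chi-square measures, here is a minimal sketch in plain Python. The 3x3 table is made up for illustration (it is not the doctorate data), and the variable names are hypothetical. It computes the Pearson chi-square for independence and the total inertia, which is the chi-square divided by the grand total.

```python
# Hypothetical 3x3 contingency table (made-up counts, NOT the doctorate data).
table = [
    [30, 20, 10],
    [20, 20, 20],
    [10, 20, 30],
]

n = sum(sum(row) for row in table)              # grand total
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

# Pearson chi-square: sum over cells of (observed - expected)^2 / expected,
# where expected = row total * column total / grand total.
chi2 = 0.0
for i, row in enumerate(table):
    for j, observed in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (observed - expected) ** 2 / expected

df = (len(table) - 1) * (len(table[0]) - 1)     # degrees of freedom
total_inertia = chi2 / n                        # total inertia = chi-square / n

print(f"chi-square = {chi2:.3f}, df = {df}, total inertia = {total_inertia:.4f}")
```

A large chi-square relative to its degrees of freedom says only that some association exists; it does not say which rows go with which columns.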

dglassbrenner · Posted 07-06-2017 01:44 PM

I probably didn’t explain myself well. How do you tell which particular rows and columns are associated with one another?

Rick_SAS · Posted 07-06-2017 02:22 PM

Look at the "Simple Correspondence Analysis of US Population" example. The Row Profiles and Row Coordinates tables (and the CA plot) indicate that the "New England" row is similar to the "NY,NJ,PA" row, but is not very similar to the "Pacific" row. Similarly, the "1920" column is similar to the "1930" column, but not very similar to "1970."
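To make "similar rows" concrete, here is a rough sketch in plain Python of row profiles and the chi-square distance between them. The table and the row names A, B, C are invented for illustration and are not the US population data. The distance between two row points in the full set of principal coordinates equals this chi-square distance; a two-dimensional plot only approximates it.

```python
from math import sqrt

# Hypothetical table: three rows, three columns (made-up counts).
table = {
    "A": [30, 20, 10],
    "B": [28, 22, 10],
    "C": [10, 20, 30],
}

n = sum(sum(counts) for counts in table.values())

# Column masses: each column total divided by the grand total.
col_mass = [sum(col) / n for col in zip(*table.values())]

# Row profiles: each row divided by its own total.
profiles = {
    name: [x / sum(counts) for x in counts]
    for name, counts in table.items()
}

def chi2_distance(p, q):
    """Chi-square distance between two row profiles (weights: 1 / column mass)."""
    return sqrt(sum((a - b) ** 2 / m for a, b, m in zip(p, q, col_mass)))

d_ab = chi2_distance(profiles["A"], profiles["B"])
d_ac = chi2_distance(profiles["A"], profiles["C"])

print(f"distance A-B = {d_ab:.3f}")
print(f"distance A-C = {d_ac:.3f}")
```

Rows A and B have nearly the same profile, so their distance is small; row C has the opposite pattern and sits far from A.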

dglassbrenner · Posted 07-06-2017 02:56 PM

I fear I’m still not explaining my question well. I’m interested in associations between rows and columns. For instance, the Getting Started example concludes from Figure 31.1 that Mathematics and Engineering are associated with the earlier years. How did they come to this conclusion? It’s not based on the proximity of the 1973 & 1974 points with the Math and Engineering points, right (since the distance between a row point and a column point has no meaning)?

More generally, how does one figure out which rows in a contingency table are associated with which columns from the output of Proc Corresp?

Ksharp · Posted 07-07-2017 09:55 AM

The closer two points are, the stronger the association between them.

For example:

If two points both fall in the northeast corner, they are positively associated.

If one point falls in the northeast corner and the other falls in the southwest corner, they are negatively associated.

StatDave · Posted 07-07-2017 10:43 AM

As stated by Clausen (1998) when discussing the interpretation of distances between row and column points, "Usually, however, the points i and j will be close to each other when f(ij)>e(ij), and the distance will be great when f(ij)<e(ij), where f(ij) is the observed and e(ij) is the expected frequency... ." Intuitively, observed counts larger than expected (see the EXPECTED and CELLCHI2 options in PROC CORRESP) indicate some association, and this tends to be depicted visually in the plot by closeness. But no, the distances are not chi-square distances as they are between two row points or two column points.

Clausen, S.-E. (1998), Applied Correspondence Analysis: An Introduction, Sage University Papers Series on Quantitative Applications in the Social Sciences, 07-121. Thousand Oaks, CA: Sage.
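The observed-versus-expected comparison can be sketched directly in plain Python. The table and the labels R1..R3 and C1..C3 are made up for illustration. The signed Pearson residual used here, (observed - expected) / sqrt(expected), is a standard quantity whose square is the cell's contribution to the chi-square; it is not claimed to match the exact layout of the PROC CORRESP output tables.

```python
from math import sqrt

# Hypothetical 3x3 table (made-up counts) with hypothetical labels.
rows = ["R1", "R2", "R3"]
cols = ["C1", "C2", "C3"]
table = [
    [30, 20, 10],
    [20, 20, 20],
    [10, 20, 30],
]

n = sum(sum(row) for row in table)
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

# Signed Pearson residual per cell: positive when observed > expected.
residuals = [
    [
        (table[i][j] - row_totals[i] * col_totals[j] / n)
        / sqrt(row_totals[i] * col_totals[j] / n)
        for j in range(len(cols))
    ]
    for i in range(len(rows))
]

# Rank cells by absolute residual, largest first.
cells = sorted(
    ((rows[i], cols[j], residuals[i][j])
     for i in range(len(rows)) for j in range(len(cols))),
    key=lambda cell: abs(cell[2]),
    reverse=True,
)

for r, c, res in cells:
    direction = "more than expected" if res > 0 else "not more than expected"
    print(f"{r} x {c}: residual {res:+.3f} ({direction})")
```

Cells with large positive residuals are the row/column pairs that occur together more often than independence predicts, which is the kind of pairing that tends to show up as closeness in the plot.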

Rick_SAS · Posted 07-07-2017 11:15 AM

To add to StatDave's response, both examples in the PROC CORRESP doc have a very strong eigendirection, so most of the inertia (the "variance" analog) is in one dimension. Thus although the row and column points are scaled differently, an extreme row point that is near an extreme column point indicates that these quantities differ from the expected values in the same direction.

If you are interested in which CELLS in the table are deviating the most from their expected values (under independence), I wouldn't use this CA plot. I'd use PROC FREQ and request either a two-way stacked bar chart or a mosaic plot. You can even color-code the cells of the mosaic plot to represent deviations from expectation, which I think is easier to interpret and gives more information in the two-way case. I'd reserve CA for higher-dimensional problems.

dglassbrenner · Posted 07-10-2017 09:00 AM

Thanks to both StatDave and Rick for their responses. So if I understand correctly, in order to find which rows are associated with which columns, you look in the Correspondence Analysis plot for row and column points that: 1) are among the furthest from the origin (which is what I assume Rick means by “extreme”), and 2) are close to one another.

I had been wondering why the Getting Started example didn’t conclude that Physical Sciences (instead of Math) and Engineering are associated with the earlier years, given that Phys Sci clearly beats Math on criterion 2). I think I understand now that it’s because Phys Sci is relatively close to the centroid.

dglassbrenner · Posted 07-11-2017 04:26 PM

So I was just looking over the Simple Correspondence Analysis in Example 31.1, and I’m confused again. It says: “The fact that the Married with Kids point is close to the American point and the fact that the Japanese point is near the Single point should be ignored.” This would seem to be in direct contradiction to step 2) in my previous post (looking for points that are close to each other). What gives?

chacreton190 · Posted 10-25-2020 04:45 PM

That line in the SAS docs confused me also. I even wrote to SAS documentation support about it. That line even seems to be contradicted by the next few lines in the passage.

When I read the explanations above, they made sense, but I am now nearly completely unclear as to how this output should be interpreted.

Rick_SAS · Posted 10-26-2020 07:33 AM

As the doc says, "Distances between points within a variable have meaning, but distances between points from different variables do not." That's why the doc says to ignore distances between "Japanese" and "Single." These points belong to different variables.
