Canonical Correlation Analysis

In multiple regression, a single dependent variable is modeled against a set of regressors. Canonical correlation analysis instead models the relationship between two sets of variables, each of which may be multidimensional.

cca

Description

Canonical correlation analysis finds basis vectors for two sets of variables such that the correlation between the projections of the variables onto these basis vectors is mutually maximized.

$$ (a^*, b^*) = \underset{a,b}{\operatorname{arg\,max}} \; \operatorname{corr}(a^T X, b^T Y) $$
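The maximization above can be sketched numerically. The following is a minimal NumPy illustration, not the implementation behind `cca`: the helper name `cca_sketch`, the synthetic data, and the choice of Cholesky whitening followed by an SVD are all assumptions made for the example.

```python
import numpy as np

def cca_sketch(X, Y):
    """Hypothetical helper: canonical correlations plus coefficient
    matrices A and B (one canonical vector per column)."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)              # center each variable
    Yc = Y - Y.mean(axis=0)
    Sxx = Xc.T @ Xc / (n - 1)            # sample covariance blocks
    Syy = Yc.T @ Yc / (n - 1)
    Sxy = Xc.T @ Yc / (n - 1)

    # Whitening: with Sxx = Lx Lx^T and Syy = Ly Ly^T, the problem becomes
    # an SVD of M = Lx^{-1} Sxy Ly^{-T}.
    Lx = np.linalg.cholesky(Sxx)
    Ly = np.linalg.cholesky(Syy)
    M = np.linalg.solve(Lx, Sxy)         # Lx^{-1} Sxy
    M = np.linalg.solve(Ly, M.T).T       # ... times Ly^{-T}
    Uw, s, Vwt = np.linalg.svd(M, full_matrices=False)

    A = np.linalg.solve(Lx.T, Uw)        # map back: a = Lx^{-T} u
    B = np.linalg.solve(Ly.T, Vwt.T)     # map back: b = Ly^{-T} v
    return s, A, B

# Synthetic data (illustrative only): Y shares one direction with X.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
Y = np.column_stack([X[:, 0] + 0.5 * rng.normal(size=200),
                     rng.normal(size=200)])

rho, A, B = cca_sketch(X, Y)
U = (X - X.mean(axis=0)) @ A             # canonical scores for X
V = (Y - Y.mean(axis=0)) @ B             # canonical scores for Y

print("canonical correlations:", rho)
print("corr(U_1, V_1):", np.corrcoef(U[:, 0], V[:, 0])[0, 1])
```

The singular values of the whitened cross-covariance are the canonical correlations, so the correlation of the first pair of scores should match the first entry of `rho` up to floating-point error.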

A simple scenario illustrates the intuition. Consider a group of subjects for whom we have exercise-related variables such as timed runs, weight lifted in dead-lifts, and number of push-ups and sit-ups. We also have their blood glucose levels, BMI and blood pressure. Canonical correlation analysis lets us analyze the relationship between these two multivariate sets of variables.

More precisely, for $X = (X_1, \dots, X_p)$ and $Y = (Y_1, \dots, Y_q)$ with $p \le q$, we define the canonical variates $(U_{i},V_{j})$ as follows:

$$ \begin{matrix} U_{1} = a_{11}X_{1} + a_{12}X_{2} + \dots + a_{1p}X_p\\ \vdots \\ U_{p} = a_{p1}X_{1} + a_{p2}X_{2} + \dots + a_{pp}X_p \end{matrix} $$

$$ \begin{matrix} V_{1} = b_{11}Y_{1} + b_{12}Y_{2} + \dots + b_{1q}Y_q\\ \vdots \\ V_{p} = b_{p1}Y_{1} + b_{p2}Y_{2} + \dots + b_{pq}Y_q \end{matrix} $$

The canonical correlation to maximize is the following: $$ \rho^*_i = \dfrac{\text{cov}(U_i, V_i)}{\sqrt{\text{var}(U_i) \text{var}(V_i)}} $$

where the covariance between U and V is: $$ \text{cov}(U_i, V_j) = \sum\limits_{k=1}^{p} \sum\limits_{l=1}^{q}a_{ik}b_{jl}\text{cov}(X_k, Y_l) $$

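and the variances are: $$ \text{var}(U_i) = \sum\limits_{k=1}^{p} \sum\limits_{l=1}^{p} a_{ik}a_{il}\text{cov}(X_k, X_l), \qquad \text{var}(V_j) = \sum\limits_{k=1}^{q} \sum\limits_{l=1}^{q} b_{jk}b_{jl}\text{cov}(Y_k, Y_l) $$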
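As a quick check of these definitions, the double sums can be evaluated directly and compared with the sample covariance and variance of the projected scores. This is an illustrative sketch only: the data are synthetic, and the coefficient vectors `a` and `b` are arbitrary rather than optimal canonical coefficients.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 500, 3, 2
X = rng.normal(size=(n, p))
Y = rng.normal(size=(n, q)) + X[:, :q]   # make Y depend on X
a = rng.normal(size=p)                   # arbitrary coefficients a_{i1..ip}
b = rng.normal(size=q)                   # arbitrary coefficients b_{j1..jq}

# Joint sample covariance, split into its X and Y blocks.
S = np.cov(np.hstack([X, Y]), rowvar=False)
Sxx, Sxy, Syy = S[:p, :p], S[:p, p:], S[p:, p:]

# Double sums written out exactly as in the formulas.
cov_uv = sum(a[k] * b[l] * Sxy[k, l] for k in range(p) for l in range(q))
var_u = sum(a[k] * a[l] * Sxx[k, l] for k in range(p) for l in range(p))
var_v = sum(b[k] * b[l] * Syy[k, l] for k in range(q) for l in range(q))
rho = cov_uv / np.sqrt(var_u * var_v)

# The same quantities computed from the projected scores.
U = X @ a
V = Y @ b
print(cov_uv, np.cov(U, V)[0, 1])
print(var_v, V.var(ddof=1))
print(rho, np.corrcoef(U, V)[0, 1])
```

Each pair of printed values should agree up to rounding, since the double sums are the quadratic forms $a^T S_{XY} b$, $a^T S_{XX} a$ and $b^T S_{YY} b$ written element by element.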

Returns

  • c_corrs: Canonical correlations
  • dfn: Degrees of freedom numerator
  • dfd: Degrees of freedom denominator
  • f stat: F-statistic
  • f p: Right-tailed p-value for the F-statistic
  • chisq stat: Chi-square statistic
  • chisq p: Right-tailed p-value for the chi-square statistic
  • lr (wilks): Wilks' lambda (likelihood ratio), the proportion of variability not explained by the model. Ranges from 0 to 1.
  • A: Canonical coefficients of $ \mathrm{X} $
  • B: Canonical coefficients of $ \mathrm{Y} $
  • U: Canonical scores for X matrix
  • V: Canonical scores for Y matrix
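To show how the test-related outputs relate to the canonical correlations, the sketch below computes Wilks' lambda and a chi-square statistic using Bartlett's approximation. Whether `cca` uses exactly this approximation is an assumption. The helper names `wilks_lambda` and `bartlett_chisq` and the input values are hypothetical.

```python
import numpy as np

def wilks_lambda(c_corrs):
    """Lambda_k = prod over i >= k of (1 - rho_i^2).
    Entry k tests whether correlations k, k+1, ... are all zero."""
    r2 = np.asarray(c_corrs, dtype=float) ** 2
    return np.cumprod((1.0 - r2)[::-1])[::-1]

def bartlett_chisq(c_corrs, n, p, q):
    """Bartlett's chi-square approximation (assumed form) and its
    degrees of freedom, for n observations, p X-variables, q Y-variables."""
    lam = wilks_lambda(c_corrs)
    k = np.arange(len(lam))              # number of leading pairs removed
    stat = -(n - 1 - (p + q + 1) / 2.0) * np.log(lam)
    df = (p - k) * (q - k)
    return stat, df

# Made-up canonical correlations for illustration.
c_corrs = [0.8, 0.3]
lam = wilks_lambda(c_corrs)
stat, df = bartlett_chisq(c_corrs, n=100, p=3, q=2)
print("Wilks' lambda:", lam)
print("chi-square statistics:", stat)
print("degrees of freedom:", df)
```

A lambda close to 0 indicates that the canonical variates explain most of the shared variability, while a value close to 1 indicates little association between the two sets.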