The research reported in this thesis addresses the problem of component analysis, which aims at reducing large data to lower dimensions, to reveal the essential structure of the data. This problem is encountered in almost all areas of science - from physics and biology to finance, economics and psychometrics - where large data sets need to be analyzed.
Several paradigms for component analysis are considered, e.g., principal component analysis, independent component analysis and sparse principal component analysis, which are naturally formulated as an optimization problem subject to constraints that endow the problem with a well-characterized matrix manifold structure. Component analysis is so cast in the realm of optimization on matrix manifolds. Algorithms for component analysis are subsequently derived that take advantage of the geometrical structure of the problem.
When formalizing component analysis into an optimization framework, three main classes of problems are encountered, for which methods are proposed. We first consider the problem of optimizing a smooth function on the set of n-by-p real matrices with orthonormal columns. Then, a method is proposed to maximize a convex function on a compact manifold, which generalizes to this context the well-known power method that computes the dominant eigenvector of a matrix. Finally, we address the issue of solving problems defined in terms of large positive semidefinite matrices in a numerically efficient manner by using low-rank approximations of such matrices.
The efficiency of the proposed algorithms for component analysis is evaluated on the analysis of gene expression data related to breast cancer, which encode the expression levels of thousands of genes gained from experiments on hundreds of cancerous cells. Such data provide a snapshot of the biological processes that occur in tumor cells and offer huge opportunities for an improved understanding of cancer. Thanks to an original framework to evaluate the biological significance of a set of components, well-known but also novel knowledge is inferred about the biological processes that underlie breast cancer.
Hence, to summarize the thesis in one sentence: We adopt a geometric point of view to propose optimization algorithms performing component analysis, which, applied on large gene expression data, enable to reveal novel biological knowledge.