Hierarchical canonical correlation analysis reveals phenotype, genotype, and geoclimate associations in plants

Raphael Petegrosso, Tianci Song, Rui Kuang

Research output: Contribution to journal › Article › peer-review


The local environment of the geographical origin of plants shaped their genetic variations through environmental adaptation. While the characteristics of the local environment correlate with the genotypes and other genomic features of the plants, they can also be indicative of genotype-phenotype associations, providing additional information relevant to environmental dependence. In this study, we investigate how the geoclimatic features from the geographical origin of Arabidopsis thaliana accessions can be integrated with genomic features for phenotype prediction and association analysis using advanced canonical correlation analysis (CCA). In particular, we propose a novel method called hierarchical canonical correlation analysis (HCCA) to combine mutations, gene expressions, and DNA methylations with geoclimatic features for informative coprojections of the features. HCCA uses a condition number of the cross-covariance between pairs of datasets to infer a hierarchical structure for applying CCA to combine the data. In the experiments on Arabidopsis thaliana data from the 1001 Genomes and 1001 Epigenomes projects and on climatic, atmospheric, and soil environmental variables combined by CLIMtools, HCCA provided a joint representation of the genomic data and geoclimate data for better prediction of the flowering time at 10°C (FT10) of Arabidopsis thaliana. We also extended HCCA with information from a protein-protein interaction (PPI) network to guide the feature learning by imposing network modules onto the genomic features, which are shown to be useful for identifying genes with more coherent functions correlated with the geoclimatic features. The findings in this study suggest that environmental data constitute an important component in plant phenotype analysis. HCCA is a useful data integration technique for phenotype prediction and for a better understanding of the interactions between gene functions and environment, as more useful functional information is introduced by coprojections of multiple genomic datasets.
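
The idea of using cross-covariance condition numbers to decide the order in which datasets are combined by CCA can be illustrated with a small sketch. This is not the authors' implementation: the function names (`cross_cov_condition`, `cca`, `hierarchical_cca`), the ridge term `reg`, the rule of merging the pair with the smallest condition number first, and the choice to represent a merged pair by its concatenated projections are all assumptions made for illustration. The data are random toy matrices, not the 1001 Genomes or CLIMtools data.

```python
import itertools
import numpy as np

def cross_cov_condition(X, Y):
    """Condition number (largest / smallest singular value) of cov(X, Y)."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    C = Xc.T @ Yc / (X.shape[0] - 1)
    s = np.linalg.svd(C, compute_uv=False)
    return s[0] / max(s[-1], np.finfo(float).eps)

def cca(X, Y, k, reg=1e-3):
    """Ridge-regularized CCA; returns k-dim projections and correlations."""
    n = X.shape[0]
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    Cxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / (n - 1)
    # Whiten both views with Cholesky factors, then take the SVD.
    Lx, Ly = np.linalg.cholesky(Cxx), np.linalg.cholesky(Cyy)
    M = np.linalg.solve(Lx, Cxy) @ np.linalg.inv(Ly).T
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    Wx = np.linalg.solve(Lx.T, U[:, :k])
    Wy = np.linalg.solve(Ly.T, Vt.T[:, :k])
    return Xc @ Wx, Yc @ Wy, s[:k]

def hierarchical_cca(datasets, k=2):
    """Greedily merge views pairwise; returns final projection and merge order."""
    views = dict(datasets)
    order = []
    while len(views) > 1:
        # Assumed rule: merge the pair whose cross-covariance has the
        # smallest condition number.
        a, b = min(
            itertools.combinations(views, 2),
            key=lambda p: cross_cov_condition(views[p[0]], views[p[1]]),
        )
        Za, Zb, _ = cca(views.pop(a), views.pop(b), k)
        # Assumed representation: the merged view is the concatenation
        # of both projections.
        views[f"({a}+{b})"] = np.hstack([Za, Zb])
        order.append((a, b))
    (Z,) = views.values()
    return Z, order

# Toy data: 60 samples, four hypothetical feature blocks.
rng = np.random.default_rng(0)
shared = rng.normal(size=(60, 2))
toy = {
    name: shared @ rng.normal(size=(2, d)) + rng.normal(size=(60, d))
    for name, d in [("mutations", 8), ("expression", 6),
                    ("methylation", 5), ("geoclimate", 4)]
}
Z, order = hierarchical_cca(toy, k=2)
print(order)
print(Z.shape)
```

The resulting `Z` could then be passed to any regressor to predict a phenotype such as FT10; that step is left out here because the abstract does not specify the predictor.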

Original language: English (US)
Article number: 1969142
Journal: Plant Phenomics
State: Published - 2020

Bibliographical note

Funding Information:
RP was partially supported by CAPES Foundation, Ministry of Education, Brazil (BEX 13250/13-2).

Publisher Copyright:
Copyright © 2020 Raphael Petegrosso et al. Exclusive Licensee Nanjing Agricultural University. Distributed under a Creative Commons Attribution License (CC BY 4.0).

