TY - JOUR

T1 - Intercorrelation of major DNA/RNA sequence descriptors - A preliminary study

AU - Sen, Dwaipayan

AU - Dasgupta, Subhadeep

AU - Pal, Indrajit

AU - Manna, Smarajit

AU - Basak, Subhash C

AU - Nandy, Ashesh

AU - Grunwald, Gregory D.

PY - 2016/9/1

Y1 - 2016/9/1

N2 - Background: A large number of alignment-free techniques for the graphical representation and numerical characterization (GRANCH) of bio-molecular sequences have been proposed in recent years, but the relative efficacy of these methods in determining the degree of similarity and dissimilarity of such sequences has not been ascertained. Objective: Our objective is to assess the relative efficacy of these methods in determining the degree of similarity and dissimilarity of bio-molecular sequences. Method: We chose 7 published/communicated methods that represent various classes of GRANCH techniques and computed the descriptors that are expected to characterize similarities and dissimilarities in several sets of gene sequences. We critically appraise the different methods and determine which of them yield non-redundant structural information that could be used to compute different properties of the sequences, and which are sufficiently correlated with one another that using the simplest representative of the group would suffice. We also perform a principal component analysis (PCA) to determine how the variance in the calculated sequence descriptors is explained by the computed principal components (PCs). Results: We found that some of the descriptors are strongly correlated, implying a commonality of structural information encoded by them, while others are distinctly separate. The PCA results show that the first three PCs explain >97% of the variance. Conclusion: We found that some mathematical DNA descriptors calculated by a few of these techniques correlate strongly with one another, implying a redundancy in the structural information quantified by those descriptors; others are not strongly correlated with one another, suggesting that they encode non-redundant sequence information. From this and our PCA results, we recommend using a minimally correlated set of descriptors, or orthogonal descriptors such as PCs derived from the descriptor set, for the characterization of nucleic acid structure and function.

KW - DNA descriptors

KW - Descriptor correlations

KW - Gene sequences

KW - Graphical representation and numerical characterization (GRANCH) techniques

KW - Principal component (PC)

KW - Principal components analysis (PCA)

UR - http://www.scopus.com/inward/record.url?scp=84991406703&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84991406703&partnerID=8YFLogxK

U2 - 10.2174/1573409912666160525111918

DO - 10.2174/1573409912666160525111918

M3 - Article

C2 - 27222032

AN - SCOPUS:84991406703

VL - 12

SP - 216

EP - 228

JO - Current Computer-Aided Drug Design

JF - Current Computer-Aided Drug Design

SN - 1573-4099

IS - 3

ER -