Optimization of minimum set of protein-DNA interactions: a quasi exact solution with minimum over-fitting.

N. A. Temiz, A. Trapp, O. A. Prokopyev, C. J. Camacho

Research output: Contribution to journal › Article

8 Citations (Scopus)

Abstract

MOTIVATION: A major limitation in modeling protein interactions is the difficulty of assessing the over-fitting of the training set. Recently, an experimentally based approach that integrates crystallographic information of C2H2 zinc finger-DNA complexes with binding data from 11 mutants, 7 from EGR finger I, was used to define an improved interaction code (no optimization). Here, we present a novel mixed integer programming (MIP)-based method that transforms this type of data into an optimized code, demonstrating both the advantages of the mathematical formulation to minimize over- and under-fitting and the robustness of the underlying physical parameters mapped by the code.

RESULTS: Based on the structural models of feasible interaction networks for 35 mutants of EGR-DNA complexes, the MIP method minimizes the cumulative binding energy over all complexes for a general set of fundamental protein-DNA interactions. To guard against over-fitting, we use the scalability of the method to probe against the elimination of related interactions. From an initial set of 12 parameters (six hydrogen bonds, five desolvation penalties and a water factor), we proceed to eliminate five of them with only a marginal reduction of the correlation coefficient to 0.9983. Further reduction of parameters negatively impacts the performance of the code (under-fitting). Besides accurately predicting the change in binding affinity of validation sets, the code identifies possible context-dependent effects in the definition of the interaction networks. Yet, the approach of constraining predictions to within a pre-selected set of interactions limits the impact of these potential errors to related low-affinity complexes.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
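The evaluation step described in the abstract can be illustrated with a small sketch. It assumes an additive energy model (each interaction type contributes its weight times its count), which the abstract does not state explicitly. All interaction names, weights, networks, and observed values below are made up for illustration and are not taken from the paper. The sketch only scores a fixed set of weights; it does not perform the MIP optimization itself, which would require a solver and would re-fit the weights after each elimination.

```python
from math import sqrt

# Hypothetical interaction weights (arbitrary energy units), NOT values from the paper.
WEIGHTS = {"hbond_a": -1.0, "hbond_b": -0.5, "desolv_a": 0.8}

# Made-up toy data: each complex lists its feasible interaction networks,
# given as counts of each interaction type.
COMPLEXES = {
    "mutant_1": [{"hbond_a": 2, "desolv_a": 1}, {"hbond_b": 3}],
    "mutant_2": [{"hbond_a": 1}, {"hbond_b": 1, "desolv_a": 1}],
    "mutant_3": [{"hbond_a": 3, "hbond_b": 1, "desolv_a": 2}],
}

# Made-up "observed" binding energies for the same complexes.
OBSERVED = {"mutant_1": -1.4, "mutant_2": -0.9, "mutant_3": -2.0}


def network_energy(network, weights):
    """Additive energy of one network; interactions absent from `weights` count as zero."""
    return sum(count * weights.get(name, 0.0) for name, count in network.items())


def predict(weights):
    """For each complex, take the feasible network with the lowest energy."""
    return {
        name: min(network_energy(net, weights) for net in networks)
        for name, networks in COMPLEXES.items()
    }


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation(weights):
    """Correlation between predicted and observed energies for the given weights."""
    preds = predict(weights)
    names = sorted(COMPLEXES)
    return pearson([preds[n] for n in names], [OBSERVED[n] for n in names])


# Score the full parameter set, then a reduced set with one parameter eliminated.
full = correlation(WEIGHTS)
reduced = correlation({k: v for k, v in WEIGHTS.items() if k != "desolv_a"})
print(f"r (all parameters):      {full:.4f}")
print(f"r (desolv_a eliminated): {reduced:.4f}")
```

Comparing the two correlation values mirrors, in a very reduced form, the idea of probing whether a parameter can be removed without hurting the fit. With toy data this small, the comparison carries no scientific meaning.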

Original language: English (US)
Pages (from-to): 319-325
Number of pages: 7
Journal: Bioinformatics (Oxford, England)
Volume: 26
Issue number: 3
DOI: 10.1093/bioinformatics/btp664
State: Published - Feb 1 2010

Fingerprint

Overfitting
DNA
Exact Solution
Integer programming
Proteins
Protein
Optimization
Interaction
Structural Models
Bioinformatics
Computational Biology
Binding energy
Mixed Integer Programming
Fingers
Scalability
Zinc
Hydrogen
Mutant
Hydrogen bonds
Affine transformation

PubMed: MeSH publication types

  • Journal Article
  • Research Support, U.S. Gov't, Non-P.H.S.

Cite this

Optimization of minimum set of protein-DNA interactions: a quasi exact solution with minimum over-fitting. / Temiz, N. A.; Trapp, A.; Prokopyev, O. A.; Camacho, C. J.

In: Bioinformatics (Oxford, England), Vol. 26, No. 3, 01.02.2010, p. 319-325.

@article{bcbd470a9d7343a3b2a100500020820b,
title = "Optimization of minimum set of protein-DNA interactions: a quasi exact solution with minimum over-fitting.",
abstract = "MOTIVATION: A major limitation in modeling protein interactions is the difficulty of assessing the over-fitting of the training set. Recently, an experimentally based approach that integrates crystallographic information of C2H2 zinc finger-DNA complexes with binding data from 11 mutants, 7 from EGR finger I, was used to define an improved interaction code (no optimization). Here, we present a novel mixed integer programming (MIP)-based method that transforms this type of data into an optimized code, demonstrating both the advantages of the mathematical formulation to minimize over- and under-fitting and the robustness of the underlying physical parameters mapped by the code. RESULTS: Based on the structural models of feasible interaction networks for 35 mutants of EGR-DNA complexes, the MIP method minimizes the cumulative binding energy over all complexes for a general set of fundamental protein-DNA interactions. To guard against over-fitting, we use the scalability of the method to probe against the elimination of related interactions. From an initial set of 12 parameters (six hydrogen bonds, five desolvation penalties and a water factor), we proceed to eliminate five of them with only a marginal reduction of the correlation coefficient to 0.9983. Further reduction of parameters negatively impacts the performance of the code (under-fitting). Besides accurately predicting the change in binding affinity of validation sets, the code identifies possible context-dependent effects in the definition of the interaction networks. Yet, the approach of constraining predictions to within a pre-selected set of interactions limits the impact of these potential errors to related low-affinity complexes. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.",
author = "Temiz, {N. A.} and Trapp, A. and Prokopyev, {O. A.} and Camacho, {C. J.}",
year = "2010",
month = "2",
day = "1",
doi = "10.1093/bioinformatics/btp664",
language = "English (US)",
volume = "26",
pages = "319--325",
journal = "Bioinformatics",
issn = "1367-4803",
publisher = "Oxford University Press",
number = "3",

}

TY - JOUR

T1 - Optimization of minimum set of protein-DNA interactions

T2 - a quasi exact solution with minimum over-fitting.

AU - Temiz, N. A.

AU - Trapp, A.

AU - Prokopyev, O. A.

AU - Camacho, C. J.

PY - 2010/2/1

Y1 - 2010/2/1

N2 - MOTIVATION: A major limitation in modeling protein interactions is the difficulty of assessing the over-fitting of the training set. Recently, an experimentally based approach that integrates crystallographic information of C2H2 zinc finger-DNA complexes with binding data from 11 mutants, 7 from EGR finger I, was used to define an improved interaction code (no optimization). Here, we present a novel mixed integer programming (MIP)-based method that transforms this type of data into an optimized code, demonstrating both the advantages of the mathematical formulation to minimize over- and under-fitting and the robustness of the underlying physical parameters mapped by the code. RESULTS: Based on the structural models of feasible interaction networks for 35 mutants of EGR-DNA complexes, the MIP method minimizes the cumulative binding energy over all complexes for a general set of fundamental protein-DNA interactions. To guard against over-fitting, we use the scalability of the method to probe against the elimination of related interactions. From an initial set of 12 parameters (six hydrogen bonds, five desolvation penalties and a water factor), we proceed to eliminate five of them with only a marginal reduction of the correlation coefficient to 0.9983. Further reduction of parameters negatively impacts the performance of the code (under-fitting). Besides accurately predicting the change in binding affinity of validation sets, the code identifies possible context-dependent effects in the definition of the interaction networks. Yet, the approach of constraining predictions to within a pre-selected set of interactions limits the impact of these potential errors to related low-affinity complexes. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

UR - http://www.scopus.com/inward/record.url?scp=77951627216&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=77951627216&partnerID=8YFLogxK

U2 - 10.1093/bioinformatics/btp664

DO - 10.1093/bioinformatics/btp664

M3 - Article

C2 - 19965883

AN - SCOPUS:77949526604

VL - 26

SP - 319

EP - 325

JO - Bioinformatics

JF - Bioinformatics

SN - 1367-4803

IS - 3

ER -