## Machine Learning Model to Predict Gibbs Free Energy

Here we describe the development of a machine learning model to predict Gibbs free energy of binding using the program SAnDReS (Xavier et al., 2016). To create this machine-learning model, we employed a dataset composed of 48 high-resolution crystallographic structures for which Delta G data were available (Delta G dataset). We used the energy terms available in the scoring function of the program AutoDock Vina (Trott & Olson, 2010). We employed the atomic coordinates obtained directly from the crystallographic structures. No docked structures were used in this study.

#### Delta G Dataset X

The PDB access codes for the Delta G dataset are listed below.

**PDB access codes (training + test sets)**

1A9T, 1AJ6, 1GFW, 1HXD, 1KZK, 1SG0, 1T64, 1US0, 1YHS, 1ZND, 1ZNG, 1ZNH, 1ZNK, 2AVS, 2BIK, 2BYA, 2C3I, 2DM5, 2FZD, 2G1O, 2G1R, 2G1S, 2I4Q, 2IKH, 2IKO, 2IKU, 2IL2, 2NMZ, 2O3P, 2O9A, 2PDK, 2PZN, 2Q6B, 2QX8, 2UXI, 2UXP, 2UXU, 3AKM, 3CCT, 3CCW, 3CCZ, 3K8Q, 3M4H, 4LCE, 4QA0, 4QOG, 4RXD, 5A14

**PDB access codes (training set)**1GFW, 1HXD, 1US0, 1YHS, 1ZND, 1ZNH, 1ZNK, 2AVS, 2BIK, 2BYA, 2DM5, 2G1O, 2G1R, 2G1S, 2I4Q, 2IKH, 2IKU, 2IL2, 2NMZ, 2O3P, 2O9A, 2PZN, 2Q6B, 2UXP, 2UXU, 3AKM, 3CCT, 3CCW, 3CCZ, 3K8Q, 3M4H, 4LCE, 4QA0, 4QOG, 4RXD, 5A14

**PDB access codes (test set) **1A9T, 1AJ6, 1KZK, 1SG0, 1T64, 1ZNG, 2C3I, 2FZD, 2IKO, 2PDK, 2QX8, 2UXI

This zip file (DeltaG) contains all necessary files to generate the machine-learning model described by (Bitencourt-Ferreira & de Azevedo Jr., 2018) using the program SAnDReS.

#### Download Delta G Dataset from GitHub

You may also download Delta G dataset from GitHub.

#### AutoDock Vina Scoring Function

Here we briefly describe the scoring function of AutoDock Vina, for more details, please see (Trott & Olson, 2010). Part of the function of AutoDock Vina works with the expression described below,

$c={c}_{\text{inter}}+{c}_{\text{intra}}=\sum _{i<j}{h}_{{t}_{i}{t}_{j}}\left({d}_{\mathit{ij}}\right)$ (Equation 1)

where c is the total energy, c_{inter} is the intermolecular contribution and c_{intra} the intramolecular part. The index i is for each atom of the biological system that is assigned a type represented by the term t_{i}.The component h_{titj} is a weighted sum of steric interactions (Equations (3), (4), (5)) similar for all the atoms, and the term d_{ij} is the distance surface which we define by the following equation.

${d}_{\mathit{ij}}={r}_{\mathit{ij}}-{R}_{{t}_{i}}-{R}_{{t}_{j}}$ (Equation 2)

In the above equation, r_{ij} represents the interatomic distance and R_{t} is the van der Waals radius for an atom of type t.
Each atom pair interacts through a steric interaction. The AutoDock Vina program makes use of three terms to access this steric interaction, as follows.

${\text{gauss}}_{1}\left(d\right)={e}^{-(d/0.5\phantom{\rule{0.5em}{0ex}}{\mathrm{\AA}}^{2})}$ (Equation 3)

${\text{gauss}}_{2}\left(d\right)={e}^{-{\left(\left(d-3\phantom{\rule{0.5em}{0ex}}\mathrm{\AA}\right)2\phantom{\rule{0.12em}{0ex}}\mathrm{\AA}\right)}^{2}}$ (Equation 4)

$\text{repulsion}\left(d\right)=\left\{\begin{array}{c}{d}^{2},\text{if}\phantom{\rule{0.12em}{0ex}}d<0\\ 0,\text{if}\phantom{\rule{0.12em}{0ex}}d\ge 0\end{array}\right.$ (Equation 5)

In the AutoDock Vina the hydrophobic and hydrogen bond (Hbond) interactions are determined by piecewise linear functions. Below we have the definition of the Equation (6), which is included in the computation of the steric interactions when we have a hydrophobic pair of atoms. We apply the Equation (7) when the atom pair is composed of a hydrogen bond donor atom and a hydrogen bond acceptor atom. These two terms are determined as follows.

$\text{hydrophobic}\phantom{\rule{0.25em}{0ex}}\left(d\right)=\left\{\begin{array}{c}1,\text{if}\phantom{\rule{0.12em}{0ex}}d<0.5\phantom{\rule{0.12em}{0ex}}\mathrm{\AA}\\ \begin{array}{l}\text{linearly\hspace{0.17em}interpolated\hspace{0.17em}if}\phantom{\rule{0.12em}{0ex}}0.5\phantom{\rule{0.12em}{0ex}}\mathrm{\AA}<d>1.5\phantom{\rule{0.12em}{0ex}}\mathrm{\AA}\\ 0,\text{if}\phantom{\rule{0.12em}{0ex}}d>1.5\phantom{\rule{0.12em}{0ex}}\mathrm{\AA}\end{array}\end{array}\right.$ (Equation 6)

$\text{Hbond}\phantom{\rule{0.25em}{0ex}}\left(d\right)=\left\{\begin{array}{c}1,\text{if}\phantom{\rule{0.12em}{0ex}}d<-0.7\phantom{\rule{0.12em}{0ex}}\mathrm{\AA}\\ \begin{array}{l}\text{similarly\hspace{0.17em}linearly}\phantom{\rule{0.5em}{0ex}}int\text{erpolated\hspace{0.17em}if}-0.7\phantom{\rule{0.12em}{0ex}}\mathrm{\AA}<d>0\\ 0,\text{if}\phantom{\rule{0.12em}{0ex}}d>0\end{array}\end{array}\right.$ (Equation 7)

In the Equations (3), (4), (5), (6), (7), the term d is the surface distance defined by the Equation (2).

#### Machine Learning with SAnDReS

Considering that AutoDock Vina has five energy terms (Gauss1, Gauss2, Repulsion, Hydrophobic, and Hydrogen) in the scoring function and that the polynomial equation built by SAnDReS has three independent variables, we have a total of 10 combinations of different energy terms. To generate the machine learning models, we used the following random seed: 42949672 in the program SAnDReS.

We used the polynomial scoring function methodology implemented in the SAnDReS program to construct a scoring function based on the AutoDock Vina scoring function with the aim to improve the prediction of ΔG binding affinity. Taking Spearman's rank correlation coefficient as a criterion for the selection of the best machine-learning model, we identified the model 155. The polynomial equation 155 (score155), with coefficients determined by the regression method ElasticNetCV (Elastic Net with Cross-Validation), is shown below,

${\text{score}}_{155}=-1.106004-0.006689{x}_{1}+0.000447{x}_{2}{x}_{3}+0.00139{x}_{1}{x}_{3}-0.000002{x}_{1}^{2}-0.002388{x}_{3}^{2}$ (Equation 8)

This polynomial equation uses the energy terms Gauss2 (x1), Gauss1 (x2) and Hydrophobic (x3) as explanatory variables. The Spearman's rank correlation coefficient for training set is 0.721 (p<0.001) and for the test set is 0.886 (p-value < 0.001). Statistical analysis of the predictive performance of AutoDock 4 (Morris et al., 1998), MolDock (Thomsen & Christensen, 2006), and Plants (Korb et al., 2009) scores generated correlation coefficients below 0.886 for the test set (Bitencourt-Ferreira & de Azevedo Jr., 2018).

#### References

Bitencourt-Ferreira G, de Azevedo Jr. WF. Development of a machine-learning model to predict Gibbs free energy of binding for protein-ligand complexes. Biophys Chem. 2018; https://doi.org/10.1016/j.bpc.2018.05.010 PDF

Korb O, Stützle T, Exner TE. Empirical scoring functions for advanced protein-ligand docking with PLANTS. J Chem Inf Model 2009; 49(1): 84–96. PubMed

Morris G, Goodsell D, Halliday R, Huey R, Hart W, Belew R, Olson A. Automated docking using a Lamarckian genetic algorithm and an empirical binding free energy function. J Comput Chem. 1998; 19:1639-1662. Link

Thomsen R, Christensen MH. MolDock: a new technique for high-accuracy molecular docking. J Med Chem. 2006;49:3315-3321. PubMed

Trott O, Olson AJ. AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J Comput Chem. 2010; 31(2):455-461. PubMed

Xavier MM, Heck GS, de Avila MB, Levin NM, Pintro VO, Carvalho NL, Azevedo WF Jr. SAnDReS a Computational Tool for Statistical Analysis of Docking Results and Development of Scoring Functions. Comb Chem High Throughput Screen. 2016; 19(10): 801–812. Link PubMed Go To SAnDReS PDF GitHub