**Please cite the following paper when using SAnDReS: Xavier et al. 2016**

## Machine Learning

In the development of a machine-learning model to predict binding affinity, for instance, the goal is to determine the relative weights (β_{j}) of the explanatory variables, to bring the predicted values (f_{i}) close to the experimental values (y_{i}). In equation 1 below, we have the response variable (f) expressed as a function of the explanatory variables (x_{j}),

$$f(x_1,...,x_N)= \beta_0 + \sum_{j=1}^N\beta_jx_j \text{ (Eq. 1).}$$

Where N indicates the number of explanatory variables and β_{0} represents the regression constant.

#### Ordinary Linear Regression

Among the supervised machine-learning techniques, the oldest method is ordinary linear regression. The idea behind ordinary linear regression is to minimize the cost function known as the residual sum of squares (RSS). Some authors call this cost function the sum of squared residuals (SSR) (Bell, 2014; Bruce and Bruce, 2017). Below we have the equation for the RSS,

$$RSS= \sum_{i=1}^M(y_i-f(x_1,...,x_N))^2 \text{ (Eq. 2).}$$

Where M is the number of observations, y_{i} is the experimental value, and f_{i} is the predicted value. The RSS is the sum of the squared differences between the experimental values (y_{i}) and the predicted values (f_{i}). The regression method optimizes the weights (β_{j}) in equation (1) to minimize the RSS.
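As a minimal sketch of this idea, the fit below uses scikit-learn's `LinearRegression` on synthetic data (the feature matrix, weights, and noise level are illustrative assumptions, not SAnDReS data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))              # M = 100 observations, N = 3 variables
true_beta = np.array([1.5, -2.0, 0.5])
y = 0.7 + X @ true_beta + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)       # optimizes the weights to minimize RSS
residuals = y - model.predict(X)
rss = np.sum(residuals ** 2)               # the cost function of Eq. 2
print(model.intercept_, model.coef_, rss)
```

The recovered intercept and coefficients approximate β_{0} and the true weights, and `rss` is the minimized value of equation 2.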

#### Least Absolute Shrinkage and Selection Operator (Lasso)

The Lasso method adds a term involving the sum of the absolute values of the relative weights to the RSS equation (Tibshirani, 1996), as indicated below,

$$RSS= \sum_{i=1}^M(y_i-f(x_1,...,x_N))^2+\lambda_1\sum_{j=1}^N|\beta_j| \text{ (Eq. 3).}$$

In equation 3, the term λ_{1} ≥ 0 is a coefficient responsible for controlling the strength of the penalty: the larger the value of the penalty, the greater the shrinkage. We call this additional term added to the original RSS equation the penalty term. The Lasso method carries out L1 regularization. It can generate sparse models with fewer coefficients than the ordinary linear regression method; furthermore, some coefficients can be exactly zero. When we increase the penalty, the coefficient values move closer to zero, which is ideal for producing models with fewer explanatory variables.
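The sparsity effect can be sketched with scikit-learn's `Lasso`, where the `alpha` parameter plays the role of λ_{1}; the data below are synthetic, with only two of ten variables truly informative:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))             # 10 candidate explanatory variables
beta = np.array([3.0, -2.0] + [0.0] * 8)   # only the first two matter
y = X @ beta + rng.normal(scale=0.1, size=200)

zero_counts = {}
for alpha in (0.01, 0.1, 1.0):             # alpha plays the role of lambda_1
    lasso = Lasso(alpha=alpha).fit(X, y)
    zero_counts[alpha] = int(np.sum(lasso.coef_ == 0.0))
    print(f"alpha={alpha}: {zero_counts[alpha]} of 10 coefficients are exactly zero")
```

Raising the penalty drives more coefficients exactly to zero, which is the selection behavior described above.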

#### Ridge

In the Ridge method (Tikhonov, 1963), we follow the same principle of adding a penalty term to the original expression of the RSS (equation 2). The penalty term takes the form of a sum of the squared weights, as indicated below,

$$RSS= \sum_{i=1}^M(y_i-f(x_1,...,x_N))^2+\lambda_2\sum_{j=1}^N\beta_j^2 \text{ (Eq. 4).}$$

In the above equation, λ_{2} ≥ 0 is the regularization parameter. The Ridge method performs L2 regularization.
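Unlike Lasso, L2 regularization shrinks all weights toward zero without usually zeroing any of them. A minimal sketch with scikit-learn's `Ridge` on synthetic data (`alpha` again stands in for the regularization parameter):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.1, size=100)

small = Ridge(alpha=0.1).fit(X, y)         # weak penalty
large = Ridge(alpha=100.0).fit(X, y)       # strong penalty
# A larger lambda_2 (alpha) shrinks the L2 norm of the weight vector.
print(np.linalg.norm(small.coef_), np.linalg.norm(large.coef_))
```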

#### Elastic Net

The idea behind the Elastic Net method is to combine the Lasso and Ridge regression methods (Zou and Hastie, 2005), as indicated below,

$$RSS= \sum_{i=1}^M(y_i-f(x_1,...,x_N))^2+\lambda_1\sum_{j=1}^N|\beta_j|+\lambda_2\sum_{j=1}^N\beta_j^2 \text{ (Eq. 5).}$$

In the above equation, the terms λ_{1} ≥ 0 and λ_{2} ≥ 0 are the two regularization parameters.
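scikit-learn's `ElasticNet` parameterizes the two penalties as an overall strength (`alpha`) and an L1/L2 mix (`l1_ratio`); the sketch below uses synthetic data with illustrative hyperparameters:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 8))
beta = np.array([2.0, -1.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta + rng.normal(scale=0.1, size=150)

# alpha sets the overall penalty strength; l1_ratio mixes the two terms:
# l1_ratio=1.0 recovers Lasso, l1_ratio=0.0 recovers Ridge.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(enet.coef_)
```

The fitted coefficients combine the shrinkage of Ridge with the sparsity-inducing behavior of Lasso.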

#### SAnDReS for Machine Learning

The use of machine-learning methods to study biological systems is not new. For instance, we can find applications of artificial neural networks as old as 1985 (Nanard & Nanard, 1985). Considering the application of supervised machine-learning techniques to the prediction of ligand-binding affinity, we have studies dating back to 1994 (Hirst *et al*., 1994a; Hirst *et al*., 1994b).

So, what is new about SAnDReS? SAnDReS (Xavier *et al*., 2016) makes use of supervised machine-learning techniques to generate polynomial equations to predict ligand-binding affinity, which allows improvement of native scoring functions. SAnDReS allows training a model to make it specific for a biological system. Consider the HIV-1 protease system (Pintro & de Azevedo, 2017): we could take a standard scoring function, such as the PLANTS score (Korb *et al*., 2009), and fine-tune its terms to predict log(Ki) for HIV-1 protease (Pintro & de Azevedo, 2017). We could say that we are integrating computational systems biology and machine-learning techniques to improve the predictive power of scoring functions, which gives you the flexibility to test different scenarios for the biological system you are interested in.

*Schematic diagram illustrating the development of a target-based scoring function to predict log(Ki) for the HIV-1 Protease (Pintro & de Azevedo, 2017).*

We can think of the Protein Sequence Space (Smith, 1970) and the Chemical Space of all potential binders to elements of the Protein Sequence Space. SAnDReS (Xavier *et al*., 2016) allows the construction of a third space, which we call the Scoring Function Space (Heck *et al*., 2017), where we find infinite mathematical functions to predict ligand-binding affinity. SAnDReS applies machine-learning techniques to explore this Scoring Function Space, finding the function that predicts the experimental binding affinity as closely as possible.

SAnDReS (Xavier *et al*., 2016) has a flexible interface that allows testing the predictive power of regression models generated by machine-learning techniques such as Linear Regression, Least Absolute Shrinkage and Selection Operator (Lasso), Ridge, Elastic Net, Stochastic Gradient Descent Regressor, and Support Vector Regression. All these methods are available from the scikit-learn library (Pedregosa *et al*., 2011) and implemented as an intuitive workflow in SAnDReS.
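The kind of comparison this workflow enables can be sketched directly with scikit-learn; the loop below fits each of the listed regressors on synthetic data (the dataset and hyperparameters are illustrative assumptions, not SAnDReS defaults):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge, SGDRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(4)
X = rng.normal(size=(120, 4))
y = X @ np.array([1.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.2, size=120)

models = {
    "Linear Regression": LinearRegression(),
    "Lasso": Lasso(alpha=0.01),
    "Ridge": Ridge(alpha=1.0),
    "Elastic Net": ElasticNet(alpha=0.01, l1_ratio=0.5),
    "SGD Regressor": SGDRegressor(random_state=0),
    "SVR": SVR(kernel="linear"),
}
scores = {}
for name, model in models.items():
    scores[name] = model.fit(X, y).score(X, y)   # coefficient of determination R^2
    print(f"{name}: R^2 = {scores[name]:.3f}")
```

In practice one would compare the models on held-out data rather than the training set, as SAnDReS does with its training/test split.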

The SAnDReS (Xavier *et al*., 2016) project has over 25,000 lines of Python code and is able to automatically carry out docking simulations using AutoDock4 (Morris *et al.,* 1998), AutoDock Vina (Trott & Olson, 2010), and Molegro Virtual Docker (Thomsen & Christensen, 2006) without requiring manual preparation of input files. But the soul of the program is its machine-learning module, which allows you to build a target-based scoring function for the biological system you are interested in. SAnDReS uses the scikit-learn library (Pedregosa *et al*., 2011) to build hundreds of polynomial equations whose explanatory variables are taken from the original dataset and determines the relative weight for each explanatory variable in the following polynomial equation,

$$f(x_1,...,x_N)= log(K) = \alpha_0 + \sum_{i=1}^N\alpha_ix_i+ \sum_{i=1}^{N-1}\sum_{j>i}^N\beta_{ij}x_ix_j+\sum_{i=1}^N\omega_ix_i^2$$

where α_{i}, β_{ij}, and ω_{i} are the relative weights for the explanatory variables (x_{i}, x_{j}), and f(x_{1}, x_{2},...,x_{N}) is the response variable. N is the number of explanatory variables and α_{0} is the regression constant. The term log(K) represents the log of the inhibition constant (K).

Taking N = 3, we have the following polynomial equation:

$$f(x_1,x_2,x_3) = log(K) = \alpha_0 + \alpha_1x_1 + \alpha_2x_2 + \alpha_3x_3 + \beta_{12}x_1x_2 + \beta_{13}x_1x_3 + \beta_{23}x_2x_3 + \omega_1x_1^2+ \omega_2x_2^2 + \omega_3x_3^2$$

Considering that the above equation has 9 terms besides the regression constant, we have a total of 511 possible polynomial equations. We don't consider the constant-only equation log(K) = α_{0}.
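The count follows from the combinatorics of choosing a non-empty subset of the nine candidate terms, i.e. 2⁹ − 1 = 511; a quick check:

```python
from itertools import combinations

# The nine candidate terms of the N = 3 polynomial above.
terms = ["x1", "x2", "x3", "x1*x2", "x1*x3", "x2*x3", "x1^2", "x2^2", "x3^2"]

# Every non-empty subset of terms defines one candidate equation;
# the constant-only model log(K) = alpha_0 is excluded.
subsets = [c for r in range(1, len(terms) + 1) for c in combinations(terms, r)]
print(len(subsets))  # → 511
```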

#### HIV-1 Protease Dataset

Let's consider the HIV-1 protease (Pintro & de Azevedo, 2017), for which crystallographic and inhibition constant (Ki) data are available. There are a total of 71 structures satisfying both criteria (table available here). Evaluation of binding affinity using the scoring functions available in the program Molegro Virtual Docker (Thomsen & Christensen, 2006) generated Spearman's correlation coefficients ranging from -0.245 to 0.38; the highest correlation was obtained for the Interaction Score. The figure below shows the scatter plot for the predicted and experimental binding affinity using all 71 structures.

*Scatter plot for Interaction Score vs log(Ki) for HIV-1 protease dataset. In the plot, au represents arbitrary units.*

The Polscore methodology implemented in SAnDReS makes it possible to test different scoring schemes, using polynomial equations whose terms are taken from the original scoring functions generated by the molecular docking programs. Here, we consider a polynomial equation involving the PLANTS, Interaction, and Ligand Efficiency 3 scores. We generated a total of 511 new polynomial scoring functions using SAnDReS. The table below summarizes the training and test set results for the original scoring functions and the top-ranked polynomial equation. The best result was obtained for polynomial equation 504, with ρ = 0.525 (p-value < 0.001) for the training set (51 structures) and ρ = 0.368 (p-value = 0.1106) for a test set with 20 structures. The figure shows the scatter plot for polynomial equation 504 *vs* log(Ki), with training set data.

**Correlation between scoring functions and log(Ki)**

| Scoring Function | ρ (training set) | p-value | ρ (test set) | p-value |
| --- | --- | --- | --- | --- |
| PLANTS Score | 0.264 | 0.06162 | 0.010 | 0.9674 |
| MolDock Score | 0.218 | 0.1247 | 0.086 | 0.7193 |
| Re-rank Score | 0.350 | 0.1184 | -0.086 | 0.7169 |
| Interaction Score | 0.479 | 0.00038 | 0.080 | 0.7383 |
| Co-factor Score | -0.143 | 0.3176 | -0.384 | 0.09459 |
| Protein Score | 0.223 | 0.1154 | 0.165 | 0.4877 |
| Water Score | 0.043 | 0.766 | 0.214 | 0.3658 |
| H-Bond Score | 0.027 | 0.8525 | -0.288 | 0.2181 |
| LE1 Score | 0.187 | 0.1886 | 0.256 | 0.2750 |
| LE3 Score | 0.045 | 0.7559 | -0.140 | 0.5563 |
| Score504 | 0.525 | 0.000077 | 0.368 | 0.1106 |

*Scatter plot for polynomial equation 504 (Score504) vs log(Ki) for 51 structures in HIV-1 Protease training set. In the plot, au represents arbitrary units.*

As we can see, the application of the machine-learning technique generated a model with superior predictive power.
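The evaluation metric used throughout is Spearman's rank correlation, which can be computed with SciPy; the scores below are synthetic stand-ins for real docking data, not the HIV-1 protease dataset:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(5)
log_ki = rng.normal(size=51)                        # stand-in for experimental log(Ki)
score = log_ki + rng.normal(scale=0.8, size=51)     # a noisy predicted score

rho, p_value = spearmanr(score, log_ki)             # rank correlation and its p-value
print(f"rho = {rho:.3f}, p-value = {p_value:.2g}")
```

Because Spearman's ρ depends only on ranks, it measures whether a scoring function orders the ligands correctly, regardless of the (arbitrary) units of the score.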

**References**

Bell J. Machine Learning: Hands-On for Developers and Technical Professionals. John Wiley and Sons: Indianapolis, 2014.

Bruce P, Bruce A. Practical Statistics for Data Scientists: 50 Essential Concepts. O'Reilly Media: Sebastopol, 2017.

Heck GS, Pintro VO, Pereira RR, de Ávila MB, Levin NMB, de Azevedo WF. Supervised machine learning methods applied to predict ligand-binding affinity. Curr Med Chem. 2017; 24(23): 2459–70.

Hirst JD, King RD, Sternberg MJ. Quantitative structure-activity relationships by neural networks and inductive logic programming. I. The inhibition of dihydrofolate reductase by pyrimidines. J Comput Aided Mol Des. 1994a; 8(4): 405–20.

Hirst JD, King RD, Sternberg MJ. Quantitative structure-activity relationships by neural networks and inductive logic programming. II. The inhibition of dihydrofolate reductase by triazines. J Comput Aided Mol Des. 1994b; 8(4): 421–32.

Korb O, Stützle T, Exner TE. Empirical scoring functions for advanced protein-ligand docking with PLANTS. J Chem Inf Model. 2009; 49(1): 84–96.

Morris G, Goodsell D, Halliday R, Huey R, Hart W, Belew R, Olson A. Automated docking using a Lamarckian genetic algorithm and an empirical binding free energy function. J Comput Chem. 1998; 19: 1639–62.

Nanard M, Nanard J. A user-friendly biological workstation. Biochimie. 1985; 67(5): 429–32.

Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E. Scikit-learn: Machine Learning in Python. J Mach Learn Res. 2011; 12: 2825–30.

Pintro VO, de Azevedo WF. Optimized virtual screening workflow. Towards target-based polynomial scoring functions for HIV-1 protease. Comb Chem High Throughput Screen. 2017. doi: 10.2174/1386207320666171121110019.

Smith JM. Natural selection and the concept of a protein space. Nature. 1970; 225(5232): 563–4.

Thomsen R, Christensen MH. MolDock: a new technique for high-accuracy molecular docking. J Med Chem. 2006; 49: 3315–21.

Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Series B Stat Methodol. 1996; 58(1): 267–88.

Tikhonov AN. On the regularization of ill-posed problems. Dokl Akad Nauk SSSR. 1963; 153: 49–52 (Russian). MR 0162378.

Trott O, Olson AJ. AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J Comput Chem. 2010; 31(2): 455–61.

Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc Series B Stat Methodol. 2005; 67(2): 301–20.