Abstract
Popular methods in supervised learning, from regression and neural networks to sup-port vector machines, are commonly presented from the perspective of statistics or biology. Instead, we present common techniques in supervised learning as applications of mathematical optimization and examine the practical benefits this perspective brings.Under the optimization perspective, linear regression is understood as the function f(X) = XW + [vector]b that is trained by solving arg min [subscript] W, [vector]b ξ(f(X),Y ), where W is a weight matrix, X is a matrix of training arguments, Y is a matrix of training targets, and ξ is an error function such as mean squared error. Similarly, a multilayer perceptron neural network is understood as a function of the form f(X) = t[subscript]n−1(· · ·(t2(t1(XW1 + [vector]b)W2)· · ·)W[subscript]n−1), where tᵢ is the i[superscript]th transfer function, Wᵢ is the i[superscript]th weight matrix, and n is the number of layers. Under the optimization perspective, training a multilayer perceptron is the same as training a regression model: solve arg min[subscript]W1,W2,...,Wn−1,[vector]bξ(f(X),Y ). Mathematical optimization serves as the workhorse for training by solving the arg min problem. Powerful optimization methods such as Broyden-Fletcher-Goldfarb-Shanno (BFGS) and its limited-memory L-BFGS variant can efficiently solve this problem. The inclusion of line search or trust region techniques removes the need for hand tuned learning rates and drastically improve performance and consistency. Through the rapid training enabled by efficient optimization, more complex models can be applied and larger datasets learned. Bigger data, faster real time learning, and more effective image recognition are possible. From the optimization perspective, explanation of models is simplified and implementation is naturally modular and flexible. Optimization techniques are easily reused between models. The development of a new generalization improving error function is easily propagated to existing and future models. Datasets are better learned and accuracy improved by easily applying, developing, and testing a multitude of models. When supervised learning is performed from the optimization perspective, an equation with adjustable parameters or an error function is all that is necessary to implement a new model and better solve problems in data science and machine learning.