The final part of the basic supervised machine learning trinity is the gradient descent algorithm. Given a hypothesis and a cost function, this algorithm iterates through different values of ‘theta,’ (remember this is a parameter set), to find something called the local optima for a particular data set.
Right, but that explanation is full of odd terms. I think the simplest to understand is the term ‘iterate.’ In the world of algorithms, ‘iterating’ is doing something multiple times using different values each time. This happens in the first part of the equation:
‘OMG’ is just a placeholder for everything after the minus sign, because that part scares me with mathematical notation. What I do recognize is ‘:=’ also known as the assignment operator. It means to assign a value to the thing on the left that equals the value of the thing on the right. In this case, ‘theta of j’ changes by subtracting whatever OMG is from ‘theta of j’ each time the algorithm runs or ‘iterates.’ As OMG gets close to zero, ‘theta of j’ or the set of parameters will not change very much. In machine learning parlance, this is called ‘convergence.’
At which point, I can no longer avoid thinking about the part of the equation previous known as ‘OMG’:
Translating this to English, I say ‘alpha multiplied by the partial derivative of the cost function J at theta_0 and theta_1.’ I already know what the cost function is. That leaves two parts I don’t understand.
This is the symbol for ‘alpha.’ Here, ‘alpha’ represents a scalar, or single numeric value, called ‘the learning rate.’ Essentially, it is a stable, unchanging number (a.k.a. ‘constant’) that represents how big each step in the iteration to take. An analogy that works for me is how I approach learning material from a textbook. I can take notes sentence by sentence, paragraph by paragraph, page by page, or chapter by chapter. If I try to take notes on every sentence, I might not be able to finish my coursework on time because I am going too slowly. In contrast, if I take notes only at the end of every chapter, I might miss some important information because I am going too quickly.
The learning rate or ‘alpha’ is kind of like that. It is a numeric way for the machine learning algorithm to decide how big the chunks of learning are. If alpha is too big, OMG might be too big causing the algorithm to miss the local optima. If alpha is too small, OMG might be too small. In this case, the algorithm might never reach the local optima. Choosing the value for alpha is a topic for another post.
Which leaves this scary thing:
This is the mathematical notation for ‘the partial derivative of a function theta at j.’ Before I can understand the partial derivative part of the algorithm, I dredged the depths of my memory (actually, searched online for the definitions) to understand three terms from geometry, trigonometry and calculus.
The first is a parabola and its corresponding quadratic function. A quadratic function, when plotted on a graph, makes a parabola. The second is a tangent line. This is a straight line that intersects exactly one point on a curve. The third is what the derivative of a function does. Stated using the previous two terms, the derivative calculates the tangent line for a given point on a curve like a parabola.
Now, I recognize that this cost function is a quadratic function. Therefore, when plotted on a graph, it makes a u-shaped parabola. The x-axis is the value of predicted_y based on the current set of parameters ‘theta.’ The y-axis plots the error or cost when predicted_y is plugged into the cost function. With each iteration of the gradient descent algorithm, I have the location or value of one point on the parabola. Ideally, I want ‘theta’ (or parameters) that put my error as close to the vertex, or lowest point, of the parabola as possible. In machine learning terms, this vertex is the ‘local optima.’
But how to measure whether this iteration of the algorithm is close to the local optima? Why, the derivative of the cost function, of course! This provides a tangent line for this iteration’s point on the parabola. If the slope of this tangent line is very steep, the point is far away from the vertex. But if the slope is very flat (or close to horizontal), the point is very close to the vertex. And if my algorithm is working, the slope of this tangent line will decrease with each iteration of my algorithm.
So, at a very general level, this gradient descent algorithm uses lines tangential to the cost function’s parabola to change, at a particular learning rate, the value of theta which gives the best prediction of y.
Oh, yeah, that is so much clearer.