Machine Learning Study Notes

Machine Learning: A Cost Function

    \begin{equation*} J(\theta_0, \theta_1) = \frac{1}{2m} \sum^m_{i=1}(\hat{y}_i-y_i)^{2} \end{equation*}

I call this a cost function rather than the cost function because one cost function does not fit all: cost functions vary with the kind of data set and the kind of machine learning model. For the purposes of machine learning, a cost function is the algorithm that measures how well a hypothesis performs. This equation comes from the mean squared error function, abbreviated MSE, which is closely related to variance and standard deviation. (The extra division by 2 is a convenience: it cancels against the exponent when the function is differentiated during gradient descent.) It can be written many ways:

    \begin{equation*} \begin{aligned}J(\theta_0, \theta_1)& = \frac{1}{2m} \sum^m_{i=1}(\hat{y}_i-y_i)^{2}\\  &= \frac{1}{2m} \sum^m_{i=1}(\hat{y}^{(i)} -y^{(i)})^{2}\\  &=\frac{1}{2m} \sum^m_{i=1}(h_\theta(x^{(i)})-y^{(i)})^{2}\end{aligned} \end{equation*}

A new symbol in this equation is the capital Greek letter Σ. English speakers call it ‘sigma.’ Mathematicians read it as ‘the sum of’ whatever follows it. The term written below Σ indicates the starting point of the sum, while the term above it indicates the end point. So:

    \begin{equation*} \sum^m_{i=1} \end{equation*}

means ‘the sum, over every case in the set of ‘m’ cases, starting with the first case.’ \sum_{i=1}^{m}, with the limits in the opposite order, is an alternate way of writing it.
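Sigma notation maps directly onto a loop or a built-in sum. A minimal sketch in Python, using a made-up data set for illustration:

```python
# Hypothetical data set of m = 4 values, one per case.
values = [3, 1, 4, 1]
m = len(values)

# Sigma notation: start at i = 1, end at i = m.
# Python lists are zero-indexed, so case i lives at values[i - 1].
total = sum(values[i - 1] for i in range(1, m + 1))

print(total)  # 3 + 1 + 4 + 1 = 9
```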

(\hat{y}_i-y_i)^{2} means: for a particular case, or record, in the data set, subtract the known value of ‘y’ from the predicted value of ‘y,’ then square the result. Another way of saying this is ‘for the case in the data set indicated by ‘i,’ square the difference between the predicted value of ‘y’ and the known value of ‘y.’’

\sum^m_{i=1}(\hat{y}_i-y_i)^{2} puts the pieces together. It says: for every case in the data set, sum the squared differences between the predicted ‘y’ and the known ‘y.’ The linear hypothesis post talks about how the predicted ‘y’ is calculated from the known ‘x’ in the data set. Because this algorithm is part of a supervised machine learning model, the data set provides a known ‘y’ for every known ‘x.’
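The whole equation can be sketched in code. This sketch assumes the usual linear hypothesis h_θ(x) = θ₀ + θ₁x; the data set and function name are made up for illustration:

```python
def cost(theta0, theta1, xs, ys):
    """J(theta0, theta1): mean squared error with the extra 1/2 factor."""
    m = len(xs)
    # h_theta(x) = theta0 + theta1 * x is the predicted y for each known x.
    squared_errors = [
        (theta0 + theta1 * x - y) ** 2  # (y-hat minus y) squared for one case
        for x, y in zip(xs, ys)
    ]
    return sum(squared_errors) / (2 * m)

# Supervised learning: every known x comes paired with a known y.
xs = [1, 2, 3]
ys = [2, 4, 6]

print(cost(0, 2, xs, ys))  # hypothesis y = 2x fits perfectly, so J = 0
print(cost(0, 0, xs, ys))  # hypothesis y = 0 leaves errors: (4 + 16 + 36) / 6
```

A perfect hypothesis drives the cost to zero; the worse the fit, the larger J becomes, which is what makes this a useful performance measure.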

Different forms of the equation use slightly different notation: x_i is the same as x^{(i)}. The tiny ‘i’ can appear in different places. Because it sits in parentheses in the superscript, it means ‘the value of x for case i in the set.’ It does not mean the same thing as x^{i}, which would mean ‘x raised to the ith power.’
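The distinction between x^{(i)} (the value for case i) and x^{i} (x to the ith power) is the difference between indexing and exponentiation. A small illustrative sketch with made-up values:

```python
x = [5, 7, 9]  # made-up data set: x^(1) = 5, x^(2) = 7, x^(3) = 9

i = 2
x_case_i = x[i - 1]    # x^(i): the value of x for case i (zero-indexed list)
x_power_i = x[0] ** i  # x^i: a value of x raised to the ith power

print(x_case_i)   # 7
print(x_power_i)  # 5 squared is 25
```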