Linear Regression in detail
Introduction
If you’re thinking of building a model that predicts house prices in your city or the temperature in your locality on any given day, then linear regression is a perfect fit.
So, what exactly is linear regression, and how do machines actually learn to make predictions? This blog will clear up all your doubts about the how, why, and what. Let’s start!
What is Linear Regression?
Linear regression is a supervised learning algorithm that models the relationship between a dataset’s input features and its output, and predicts the output as a continuous value. That definition is a bit technical, so let’s understand it with an example of house price prediction.
The house price prediction example is one of the most popular examples of linear regression. Here, the machine has to predict the price of a house (the output) from variables such as the area of the house and the number of rooms. The number of rooms, the area of the house, etc. are the features (independent variables) that play an important role in producing the output price. Wait! how??? Some variables, like the number of rooms, matter more than others, like having a garden, so each feature has its own importance. In machine learning, the importance of each feature is expressed by its weight: if a variable has a larger weight, it makes a larger contribution to the output.
Mathematical representation
Now, let’s represent what we’ve learnt mathematically
Output price = weight of variable 1 * variable 1 + weight of variable 2 * variable 2 + ... + weight of variable n * variable n
y(x) = w0 + w1*x1 + w2*x2 + w3*x3 + ... + wn*xn
where ‘n’ is the number of features (independent variables), ‘w’ denotes the weights, and w0 is the intercept term, which is called the bias.
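For example, here is a minimal sketch in Python (the feature values and weights below are made up purely for illustration, not learned from real data) of how this weighted sum produces a predicted price:

import numpy as np

# hypothetical features of one house: [area in sq. ft., number of rooms, has a garden]
x = np.array([1200.0, 3.0, 1.0])

# hypothetical weights for those features, and the bias (intercept) term w0
w = np.array([150.0, 20000.0, 5000.0])
w0 = 10000.0

# y(x) = w0 + w1*x1 + w2*x2 + ... + wn*xn
predicted_price = w0 + np.dot(w, x)
print(predicted_price)  # 255000.0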
When we plot this, the graph will be (n+1)-dimensional. So, to keep the concept simple, I will take only one feature, and the equation becomes
y(x) = b + w1*x1
In machine learning, linear regression is trained on the data to find better values of b and w1 so that it outputs good predictions. So, what are the initial values of b and w1? Generally, b and w1 are initialized with zeros or with random values.
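As a quick sketch (either choice works as a starting point), this is what that initialization might look like in Python:

import numpy as np

# initialize the parameters with zeros ...
b, w1 = 0.0, 0.0

# ... or with small random values
rng = np.random.default_rng(seed=0)
b, w1 = rng.normal(size=2)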
Alright then, if we’ve initialized them with zeros or random values, will the model predict the wrong output? Of course, the answer is “yes”, it will predict wrong values. To solve this, the loss function comes into the picture: it measures the error between the predicted value (y) and the original value (Y).
Loss Function
The loss function calculates the loss between the predicted output and the original value. There are different loss functions for different types of problems, such as Squared Error Loss, Absolute Error Loss, Huber Loss, etc.
Let’s take the squared difference between the original value and the predicted value as the loss for a single example:

L = (Y - y)^2

The above is only for one training example. In a real dataset there is more than one training example, so in that case we have to add up the losses of all the individual training examples; this total over the dataset is called the cost function, J(w, b).
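Here is a minimal sketch of that cost in Python (I average the squared errors over the dataset; whether you sum, average, or divide by 2m is only a scaling convention and does not change where the minimum lies). The dataset below is made up for illustration:

import numpy as np

def predict(x, w1, b):
    # y(x) = b + w1*x1, the one-feature model
    return b + w1 * x

def cost(x, Y, w1, b):
    # squared-error loss of each example, averaged over the training set
    y = predict(x, w1, b)
    return np.mean((Y - y) ** 2)

# tiny made-up dataset: x = area, Y = true price
x = np.array([1.0, 2.0, 3.0])
Y = np.array([3.0, 5.0, 7.0])
print(cost(x, Y, w1=0.0, b=0.0))  # cost with zero-initialized parameters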
The plot shows the graph between the feature and the output value, with the loss L1, L2, ... computed for each training example. After finding the loss, the learning algorithm tries to change the slope and intercept of the line, i.e., the weight and the bias. The weight and bias are updated using the gradient descent algorithm, which keeps searching for the weight and bias that give a smaller loss.
Gradient Descent
Let’s plot the graph of J(w, b) against w and b. The graph will be three-dimensional. Since the cost function is an equation of degree 2 in the parameters, the surface looks like a 3D paraboloid (a bowl) and has a single global optimum.
To find the global optimum, which is the point of minimum loss, we make the algorithm descend towards the global minimum and update the previous weights and bias with new ones. The update rule is

w = w - alpha * ∂J/∂w
b = b - alpha * ∂J/∂b

That is, the previous weight and bias are replaced by new ones obtained by subtracting the learning rate times the corresponding gradient.
Here alpha is the learning rate, which says how fast the descent should be. If alpha is large, then the step size of the descent will be large, and vice versa.
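A sketch of a single update step for the one-feature model (the gradients below are the partial derivatives of the averaged squared-error cost used in the earlier sketch):

import numpy as np

def gradient_step(x, Y, w1, b, alpha):
    error = (b + w1 * x) - Y            # prediction minus target for every example
    dJ_dw1 = np.mean(2 * error * x)     # partial derivative of the cost w.r.t. w1
    dJ_db = np.mean(2 * error)          # partial derivative of the cost w.r.t. b
    w1 = w1 - alpha * dJ_dw1            # step against the gradient, scaled by alpha
    b = b - alpha * dJ_db
    return w1, b

# one step on the tiny dataset from the cost-function sketch
x = np.array([1.0, 2.0, 3.0])
Y = np.array([3.0, 5.0, 7.0])
print(gradient_step(x, Y, w1=0.0, b=0.0, alpha=0.1))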
By taking a contour plot of the 3D graph, let’s plot the successive updates made by gradient descent.
The algorithm tries to decrease the cost on the data in order to achieve greater accuracy.
For better visualization, let us take b = 0 so that we get a 2D plot.
By iterating more times, the algorithm converges at the global minimum with a minimal value of the cost function, resulting in better accuracy.
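Putting the pieces together, a minimal training loop (a sketch under the same assumptions as above: made-up data, averaged squared-error cost, and a fixed number of iterations) might look like this:

import numpy as np

def train(x, Y, alpha=0.1, num_iterations=1000):
    w1, b = 0.0, 0.0                          # zero initialization
    for _ in range(num_iterations):
        error = (b + w1 * x) - Y              # predictions minus targets
        w1 -= alpha * np.mean(2 * error * x)  # gradient step for the weight
        b -= alpha * np.mean(2 * error)       # gradient step for the bias
    return w1, b

x = np.array([1.0, 2.0, 3.0])
Y = np.array([3.0, 5.0, 7.0])                 # generated from Y = 2*x + 1
print(train(x, Y))                            # converges towards w1 ≈ 2, b ≈ 1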
Parameters Vs Hyperparameters
In machine learning, the weights and bias are called parameters because varying ‘w’ and ‘b’ changes the results, so they act as the parameters of the model. The learning rate, the number of iterations, the choice of loss function, etc. act as hyperparameters because they control how the parameters themselves are learned.
We need to tune the hyperparameters for better accuracy. More iterations mean the model takes more time to run; a smaller learning rate gives good accuracy but takes very small steps while descending, which makes the process slower. If the learning rate is too large, the updates overshoot the path and just oscillate across the graph without ever reaching the global minimum. So, good tuning of the hyperparameters is required to make more accurate predictions.
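Using the hypothetical train function and dataset from the previous sketch, the effect of these hyperparameters is easy to see:

# assumes train, x, Y from the training-loop sketch above

# learning rate too small: after 100 iterations it is still short of w1 ≈ 2, b ≈ 1
print(train(x, Y, alpha=0.001, num_iterations=100))

# a reasonable learning rate: gets close to the true values
print(train(x, Y, alpha=0.1, num_iterations=1000))

# learning rate too large: the updates overshoot and the values blow up
print(train(x, Y, alpha=1.0, num_iterations=100))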
Conclusion
This is the way the machine learns to make predictions. First, the model’s weight and bias are initialized with zeros or random values, and it predicts some continuous value. It measures the loss using the loss function and sums up all the losses over the training set. Once the loss is found, the machine tries to reduce it by taking descending steps towards the global minimum, subtracting the learning rate times the gradient. This continues until it reaches the global minimum, based on the number of iterations we defined in the program.
I hope this blog cleared up your doubts about linear regression. To get more clarity on linear regression, the coding part will be covered in the next part.
I hope you guys liked the blog. So, let me know your opinion in the comments.