To start off, let’s find out why we need regularization; doing so gives us a better intuition for what it is and how it works. Before getting to the why, we have to know: “what is overfitting?”
Overfitting
On the whole, overfitting is a modelling error. When a model starts to learn too much from the training data, or tries to “adapt” to the training data, this leads to high training accuracy.
Wait, “isn’t high accuracy a good thing?”
Ans: Not necessarily. When training accuracy becomes very high, the model has adapted to the training data only. For example, say you are preparing for an exam with a huge syllabus, and you have no idea which questions will come up. So you “memorize” all the available questions in the textbook and sit the exam. If you’re lucky, you may get the same questions you studied; otherwise there could be questions with changed numbers, or a different style of question altogether. Here, high training accuracy means you are memorizing the Q&A, and this kind of learning would surely fail on new questions.
Similarly, high training accuracy doesn’t mean high testing accuracy, and testing accuracy is the thing to look out for, as it is one of the most important reports of your model’s performance.
For a little mathematical visualization, let’s see how the classifier looks graphically.
In the above graph, the blue line is the trace of the overfit model: it is a high-degree polynomial curve with 100% training accuracy. Now let’s think about the disadvantages of this model. A high-degree polynomial is expensive to compute, and the curve has abnormal spikes that make little sense for almost linear data. The black line, on the other hand, is a simple line whose training accuracy is not great but decent; it has no spikes, meaning it would handle new data sensibly.
“How do we keep the model from overfitting?” That’s where regularization comes into play. There are a few more methods to prevent overfitting, but we’ll take a look at regularization.
Regularization
Let’s assume for now that regularization is the addition of a penalty; we’ll justify this in a moment.
Now, how does a machine learning algorithm work? For simplicity, let’s see how logistic regression works. In overview, logistic regression is an iterative process of learning the weight matrix with the help of gradient descent and a loss function. What happens in the case of overfitting? The complexity of the weight matrix increases, by which we mean that the magnitudes of the weights increase. We look at magnitude because the repeated additions and subtractions during training can leave the weights either negative or positive, so the magnitude represents the actual weightage of the feature.
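The iterative process described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the synthetic data, the function names (`sigmoid`, `log_loss`, `train_logistic`) and the values of `step_size` and `n_iters` are all assumptions made for this example. It records the log loss and the total weight magnitude at every iteration, so you can watch the magnitudes grow as training goes on.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(w, X, y):
    # Average log loss; eps guards against log(0).
    p = sigmoid(X @ w)
    eps = 1e-12
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def train_logistic(X, y, step_size=0.1, n_iters=200):
    w = np.zeros(X.shape[1])
    history = []  # (loss, sum of weight magnitudes) per iteration
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)  # gradient of the log loss
        w -= step_size * grad                       # gradient descent update
        history.append((log_loss(w, X, y), np.abs(w).sum()))
    return w, history

# Synthetic, linearly separable data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

w, history = train_logistic(X, y)
print("first iteration:", history[0])
print("last iteration: ", history[-1])
```

Comparing the first and last entries of `history` shows the training loss going down while the sum of weight magnitudes goes up.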
In the above diagram, the y axis denotes the loss and the x axis denotes the number of iterations. The dotted line represents the training loss and the solid line represents the testing loss. As we can see, after a few iterations the testing loss starts increasing while the training loss keeps decreasing; this indicates that our model is starting to overfit, and anything before this point is referred to as underfitting. What we want is the sweet spot where the testing loss just starts to increase.
As stated earlier, there are a few other methods to prevent overfitting, such as:
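Finding that spot can be sketched as a small helper. The function name `stopping_iteration` and the loss values below are made up for illustration; they are not taken from the diagram.

```python
def stopping_iteration(test_losses):
    """Return the index of the last iteration before the testing loss rises."""
    for i in range(1, len(test_losses)):
        if test_losses[i] > test_losses[i - 1]:
            return i - 1
    return len(test_losses) - 1  # the loss never rose: keep the last iteration

# Made-up testing losses, for illustration only.
example_test_losses = [0.90, 0.70, 0.55, 0.48, 0.52, 0.61]
best = stopping_iteration(example_test_losses)
print(best)
```

Real loss curves are usually noisy, so in practice one would likely wait for several consecutive increases before stopping rather than react to the first one.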
adjusting step size
adjusting iterations / stopping gradient descent
The way the above methods prevent overfitting is fairly intuitive; regularization, however, takes a different approach, so let’s see how it’s done.
How does regularization work
As we noticed, in overfitting the weights get loaded with large magnitudes. Can we find a way to control this? Let’s see: the weights are increased or decreased by the gradient descent algorithm, which in turn is driven by the loss function. So you might be getting the intuition now: we can penalize the loss function when the weights grow, and by penalizing we mean increasing the loss. Since gradient descent concentrates on decreasing the loss, a penalty on large weights changes how gradient descent updates the values. There we go, that’s the solution.
Let’s take a look at the mathematical representation of how the penalty should be applied.
From the above diagram we get an intuition for how our penalty should be applied. Getting into the specifics, there are mainly two ways of applying regularization to logistic regression: the L1 and L2 norms.
L1 Regularization
The L1 norm says that if we add the sum of the magnitudes of the weight values to the loss function, we can control overfitting. Simple, but let’s take a look at how it’s done mathematically, and then we’ll get into the details.
In the above snippet, the loss of the logistic regression is calculated as the sum of the error and the sum of the magnitudes of the weights, that is, loss = error + lambda × (|W1| + |W2| + … + |WN|). The error could be anything from log loss to MSE, depending on the model you are using; since we are using logistic regression as our example, we take the error to be log loss. In the second part of the equation, the sum of the magnitudes of the weights is multiplied by a constant, lambda. Lambda is a hyperparameter, and it controls how much weightage you want to give to the regularizer.
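The L1-regularized loss described above can be written out directly. This is a small sketch: the function names, the tiny data set and `lam=0.1` are assumptions chosen for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(w, X, y):
    # The "error" term: average log loss of logistic regression.
    p = sigmoid(X @ w)
    eps = 1e-12
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def l1_regularized_loss(w, X, y, lam):
    error = log_loss(w, X, y)
    penalty = lam * np.sum(np.abs(w))  # lambda times the sum of |W_i|
    return error + penalty

# Tiny made-up example.
X = np.array([[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3]])
y = np.array([1.0, 0.0, 1.0])
w = np.array([1.0, -2.0])

plain = log_loss(w, X, y)
penalized = l1_regularized_loss(w, X, y, lam=0.1)
print(penalized - plain)  # the penalty: 0.1 * (|1.0| + |-2.0|)
```

Raising `lam` makes the same weights cost more, which is what “giving more weightage to the regularizer” means in practice.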
How does it work?
In the regularized loss, the first term is the usual error, and the second term is the penalty that kicks in when the model overfits. As the number of iterations keeps increasing and the model tends to overfit, the weights grow, so the second term increases and with it the loss. Since the gradient used by gradient descent is calculated from the loss function, this extra term changes the update rule, pushing the weights back towards zero and preventing the model from overfitting.
Some common questions
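To see the effect on the update rule, here is a hedged sketch that trains the same logistic regression with and without the L1 penalty. The data is synthetic, and `train`, `lam`, `step_size` and `n_iters` are assumptions for this example. At zero the modulus has no derivative, so the sketch uses `np.sign`, which returns 0 there; that is one common convention, not the only one.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, lam=0.0, step_size=0.1, n_iters=500):
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        error_grad = X.T @ (sigmoid(X @ w) - y) / len(y)  # gradient of log loss
        penalty_grad = lam * np.sign(w)                   # (sub)gradient of lam * sum|w|
        w -= step_size * (error_grad + penalty_grad)
    return w

# Synthetic, linearly separable data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

w_plain = train(X, y, lam=0.0)
w_l1 = train(X, y, lam=0.1)
print("sum |w| without penalty:", np.abs(w_plain).sum())
print("sum |w| with L1 penalty:", np.abs(w_l1).sum())
```

With the penalty switched on, every update includes a small push of each weight towards zero, so the total magnitude ends up smaller than in the unpenalized run.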
Why are we taking the magnitude on the second term?
Ans: Weights can be both negative and positive, and while overfitting they tend to increase in magnitude rather than in signed value, hence we need to take the magnitude.
What does i represent in the equation?
Ans: i is the index of the feature weight, used as a subscript on W. Wi represents the ith feature weight.
What is N in the equation?
Ans: N denotes the total number of features.
Problems with L1 regularization
The main problem with L1 regularization comes from the mod function on the weights. To calculate the gradient for gradient descent, the loss function has to be differentiated, and the modulus function is not differentiable at zero, so the added term needs special handling there, which can make the optimization more awkward and more expensive. L2 regularization tackles this problem.
Before going into L2 regularization, we have to talk a little about the advantages of L1 regularization. The modulus function is the source of the problem, but it also brings an advantage: if there is an outlier among our weights, the modulus function deals with it better. We’ll talk more about this in the next section, when we discuss L2 regularization.
L2 Regularization
The process and concept of L2 regularization are similar to those of L1; the major difference is the penalty. Here the penalty is the sum of the squares of the weights, multiplied by lambda. The lambda constant is functionally the same in both cases, so I won’t say more about it. Below is the equation of the loss function with the L2 regularizer.
We can notice that there’s not much of a change from L1; the only difference is the second part of the loss function, so we’ll take a look at that. In the second part, the penalty is changed from the modulus to the square operation. The main effect is that the total loss function can now be differentiated easily, everywhere, without extra computation.
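As a quick sketch of why the square is convenient, the L2 penalty and its gradient each fit on one line. The function names and the example values are assumptions for illustration.

```python
import numpy as np

def l2_penalty(w, lam):
    return lam * np.sum(w ** 2)   # lambda times the sum of squared weights

def l2_penalty_grad(w, lam):
    return 2.0 * lam * w          # defined for every w, including w = 0

w = np.array([0.5, -1.5, 0.0])
lam = 0.1
penalty = l2_penalty(w, lam)      # 0.1 * (0.25 + 2.25 + 0.0)
grad = l2_penalty_grad(w, lam)
print(penalty, grad)
```

In a training loop, this gradient would simply be added to the gradient of the error before the weight update, with no special case needed at zero.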
What does the square imply?
Ans: We take the square of each weight so that the result is positive. Since the weights contain both positive and negative values, we only want their size to count, not their sign.
Below is the graph of the modulus and x² functions. It should help you better understand the penalty and the differentiability of the two functions.
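The difference in differentiability can also be checked numerically. The helper below is a hypothetical one written for this example; it estimates the slope just to the left and just to the right of a point.

```python
def one_sided_slopes(f, x, h=1e-6):
    left = (f(x) - f(x - h)) / h    # slope approaching x from the left
    right = (f(x + h) - f(x)) / h   # slope approaching x from the right
    return left, right

abs_left, abs_right = one_sided_slopes(abs, 0.0)
sq_left, sq_right = one_sided_slopes(lambda x: x ** 2, 0.0)

print("modulus at 0:", abs_left, abs_right)  # slopes disagree: -1 and +1
print("square at 0: ", sq_left, sq_right)    # both slopes are close to 0
```

For the modulus, the two slopes at zero disagree, so there is no single derivative there; for the square they agree, which is why the L2 penalty is smooth.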
Talking about outliers, L2 performs poorly with outliers whereas L1 is good at handling them. How?
Suppose L2 regularization encounters an outlier, that is, one weight much larger than the rest. It squares that weight before adding it to the penalty, so the outlier’s contribution is blown up far out of proportion to the other weights and dominates the penalty. The L1 regularizer, on the other hand, just adds the magnitude of the weight, so the difference is kept comparatively low. Hence we could say that the L1 regularizer deals with outliers better.
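A small numeric example makes the comparison concrete. The weight values here are made up for illustration, with the last one playing the role of the outlier.

```python
import numpy as np

# Made-up weights; the last one is the outlier.
weights = np.array([0.5, -0.3, 0.2, 10.0])

l1_terms = np.abs(weights)   # what each weight adds to the L1 penalty
l2_terms = weights ** 2      # what each weight adds to the L2 penalty

l1_share = l1_terms[-1] / l1_terms.sum()  # about 0.91
l2_share = l2_terms[-1] / l2_terms.sum()  # about 0.996

print("L1 penalty (before lambda):", l1_terms.sum())
print("L2 penalty (before lambda):", l2_terms.sum())
print("outlier share under L1 and L2:", l1_share, l2_share)
```

Under L2 the outlier accounts for almost the entire penalty, while under L1 the other weights still contribute a noticeable share.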
We just took a look at regularization for logistic regression; similarly, different models have different ways of applying regularization. As an end note, can you guess how regularization is implemented on trees?
Ans: One common approach is to apply a penalty in proportion to the number of nodes in the tree.