Regularization

To start off with, let's find out why we need regularization; doing so will give us a better intuition of the what and the how. Before starting with the why, we have to know: "what is overfitting?"

Overfitting

Wait, "isn't high accuracy a good thing?"

Ans: Not necessarily. When training accuracy becomes very high, the model adapts only to the training data. For example: suppose you are preparing for an exam with a huge syllabus, you have sat down to study, and you have no idea which questions will come in the exam. So you "memorize" all the available questions in the textbook and attend the exam. Now in the exam, if you're lucky, you may get the same questions you studied; otherwise there could be questions with the numbers changed, or a different model of questions entirely. Here, high training accuracy means you are memorizing the Q&A, and this kind of learning would surely fail on new questions.

Similarly, high training accuracy doesn't mean high testing accuracy, and testing accuracy is the thing to look out for, as it is one of the most important measures of your model's performance.

Moving into a little mathematical visualization, let's see how these classifiers look graphically.

fig1
Fig1. (Source-Wikipedia)

In the above graph, the blue line is the trace of the overfit model: it follows a high-degree polynomial curve and achieves 100% training accuracy. Now let's think about the disadvantages of this model. The high-degree polynomial is computationally expensive, and the curve has abnormal spikes; these spikes are not efficient for almost-linear data. Looking at the black line, it is a simple line with a decent (though not perfect) training accuracy, and it has no spikes, meaning it would handle new data wisely.
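The contrast above can be sketched numerically. Below is a minimal, illustrative example (not from the original article): we fit a degree-1 and a degree-9 polynomial to nearly linear noisy data and compare train/test errors. The data and degrees are hypothetical choices for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Nearly linear data with a little noise (illustrative)
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(0, 0.1, size=x_train.shape)
x_test = np.linspace(0.05, 0.95, 10)
y_test = 2 * x_test + rng.normal(0, 0.1, size=x_test.shape)

def mse(deg):
    # Fit a polynomial of the given degree on the training points
    coeffs = np.polyfit(x_train, y_train, deg)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_err, test_err

for deg in (1, 9):
    tr, te = mse(deg)
    print(f"degree {deg}: train MSE {tr:.4f}, test MSE {te:.4f}")
```

The degree-9 fit interpolates the 10 training points almost exactly (near-zero training error, like the blue line) but its spikes hurt it on the held-out points, while the straight line stays consistent across both sets.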

"How do we keep the model from overfitting?" That's where regularization comes into play. There are a few other methods to prevent overfitting, but we'll take a look at regularization.

Regularization

Now, how does a machine learning algorithm work? For simplicity, let's see how logistic regression works. At a high level, logistic regression is an iterative process of learning the weight matrix with the help of gradient descent and a loss function. In the case of overfitting, what is the outcome? The complexity of the weight matrix increases; by this we mean the magnitudes of the weights increase. Magnitude is used here because, in the weight matrix, repeated additions and subtractions can produce both negative and positive numbers, so the magnitude represents the actual weightage of the feature.

Fig:2(Source-GitHub)

In the above diagram, the y-axis denotes the loss and the x-axis denotes the number of iterations. The dotted line represents training loss and the solid line represents testing loss. As we can see, after a few iterations the testing loss starts increasing while the training loss keeps decreasing; this means our model is starting to overfit, and anything before this point is referred to as underfitting. What we desire is the apt spot just before the testing loss starts to increase.

As stated earlier, there are a few methods to prevent overfitting, such as:

adjusting step size

adjusting iterations / stopping gradient descent

The ways the above methods prevent overfitting are self-intuitive; regularization, however, takes a unique approach, so let's see how it's done.
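The second method, stopping at the apt spot on the loss curve, can be sketched as code. This is an illustrative early-stopping loop on toy logistic-regression data (all names and hyperparameters here are hypothetical choices, not from the article): we keep the weights from the iteration with the lowest validation loss and stop once it hasn't improved for a while.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy binary classification data (illustrative)
X_train = rng.normal(size=(80, 5))
w_true = np.array([1.5, -2.0, 0.5, 0.0, 0.0])
y_train = (X_train @ w_true + rng.normal(0, 1.0, 80) > 0).astype(float)
X_val = rng.normal(size=(40, 5))
y_val = (X_val @ w_true + rng.normal(0, 1.0, 40) > 0).astype(float)

def log_loss(w, X, y):
    p = 1 / (1 + np.exp(-(X @ w)))
    eps = 1e-12
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def train_with_early_stopping(lr=0.5, max_iters=500, patience=20):
    w = np.zeros(5)
    best_w, best_val, since_best = w.copy(), np.inf, 0
    for _ in range(max_iters):
        # Plain gradient-descent step on the training log loss
        p = 1 / (1 + np.exp(-(X_train @ w)))
        w -= lr * (X_train.T @ (p - y_train) / len(y_train))
        val = log_loss(w, X_val, y_val)
        if val < best_val:
            best_val, best_w, since_best = val, w.copy(), 0
        else:
            since_best += 1
            if since_best >= patience:  # validation loss stopped improving
                break
    return best_w, best_val

w, val = train_with_early_stopping()
```

Stopping at the validation minimum corresponds to picking the "apt spot" on the curve in Fig. 2.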

How does regularization work

Let's take a look at the mathematical representation of how the penalty should be applied.

Fig:3

By seeing the above diagram we get an intuition of how our penalty should be applied. Getting into the specifics, there are mainly two ways of applying regularization to logistic regression: the L1 and L2 norms.

L1 Regularization

Fig:4

In the above snippet, the loss of logistic regression is calculated as the sum of the error and the sum of the magnitudes of the weights. The error could be anything from log loss to MSE; it depends on the model you are using. In our case, since we have considered the logistic regression example, we take the error to be log loss. In the second part of the equation, the sum of the magnitudes of the weights is multiplied by a constant lambda. Lambda is a hyperparameter here, and it controls the weightage you want to give the regularizer.
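That equation translates directly into code. Below is a minimal sketch of the L1-regularized logistic loss described above (the function name and test data are illustrative, not from the article):

```python
import numpy as np

def logistic_l1_loss(w, X, y, lam):
    """Log loss plus the L1 penalty: loss = logloss + lam * sum(|w_i|)."""
    p = 1 / (1 + np.exp(-(X @ w)))          # sigmoid predictions
    eps = 1e-12                              # guard against log(0)
    log_loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    penalty = lam * np.sum(np.abs(w))        # sum of weight magnitudes
    return log_loss + penalty
```

Note how large-magnitude weights inflate the second term even when the log loss itself is tiny, which is exactly the penalizing effect lambda controls.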

How does it work?

In the regularized loss, the first term is the usual loss function; in addition to this is the second term, which penalizes the loss while overfitting. Notice that as the number of iterations keeps increasing and the model tends to overfit, this second term increases, thus increasing the loss. Since the gradient descent step is calculated via the loss function, and we are increasing the loss as the weights tend to overfit, this affects the update rule and prevents the model from overfitting.
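The effect on the update rule can be written out explicitly. One caveat: |w| is not differentiable at zero, so a common workaround is to use np.sign(w) as a subgradient. This is an illustrative sketch of one such update step, not the article's own code:

```python
import numpy as np

def l1_update(w, grad_logloss, lam, lr):
    """One gradient-descent step on logloss + lam * sum(|w_i|).

    np.sign(w) acts as a subgradient of |w| (it is 0 at w == 0),
    so the penalty always pushes each weight toward zero.
    """
    return w - lr * (grad_logloss + lam * np.sign(w))
```

Even when the log-loss gradient is zero, the penalty term keeps shrinking nonzero weights, which is how the regularizer counteracts growing weight magnitudes.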

Some common questions

Why are we taking the magnitude on the second term?

Ans: Since weights can be both -ve and +ve, and while overfitting they tend to increase in magnitude rather than in value, it is necessary to take the magnitude.

What does i represent in the equation?

Ans: i is the feature-weight index used as a subscript on W; Wi represents the ith feature weight.

What is N in the equation?

Ans: N denotes the total number of features.

Problems with L1 regularization

Before going into L2 regularization, we should talk a little about the advantages of L1 regularization. While the modulus function is the problem here (it is not differentiable at zero, which complicates the gradient computation), there is also an advantage to it: if there is an outlier among our weights, the modulus function deals with it better. We'll talk more on this in the next section when we discuss L2 regularization.

L2 Regularization

Fig:5

We can notice that there's not much of a change from L1; the only difference is the second part of the loss function, so we'll take a look at that. In the second part, the penalty changes from the modulus to the square operation. The main effect of this is that the total loss function can now be differentiated easily, without excess computation.
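Because the square is smooth everywhere, the L2-regularized loss has a clean closed-form gradient: the penalty lam * sum(w²) simply contributes 2 * lam * w. Below is an illustrative sketch (function name and shapes are hypothetical):

```python
import numpy as np

def logistic_l2_loss_and_grad(w, X, y, lam):
    """Log loss + lam * sum(w_i^2); the penalty's gradient is just 2*lam*w."""
    p = 1 / (1 + np.exp(-(X @ w)))          # sigmoid predictions
    eps = 1e-12
    loss = (-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
            + lam * np.sum(w ** 2))
    grad = X.T @ (p - y) / len(y) + 2 * lam * w
    return loss, grad
```

Compare this with the L1 case, where we needed a subgradient at zero; here the penalty is differentiable everywhere, which is exactly the "easy differentiation" advantage mentioned above.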

what does the square imply?

Ans: We take the square of the weights so that the resultant is positive. As we know, weights contain both +ve and -ve values, hence it is necessary to consider the magnitude only.

Below is the graph of the modulus and X² functions. This should help you better understand the penalty and the differentiability of the two functions.

Fig:6 Blue indicates X² and red indicates mod(X) plots. (Source: Towards Data Science)

Talking about outliers, L2 performs poorly with outliers whereas L1 is good at handling them. How?

When L2 regularization encounters an outlier weight, it squares that weight before adding it, so the outlier's contribution to the penalty is blown far out of proportion to the rest of the weights, making the penalty less accurate. The L1 regularizer, on the other hand, just adds the weight's magnitude, so the difference is kept comparatively low; hence we could say the L1 regularizer deals with outliers better.
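A quick numeric sketch makes this concrete (the weight values here are made up for illustration): with one outlier weight, the outlier dominates the squared penalty far more than the absolute one.

```python
import numpy as np

# Weights with one outlier: squaring lets the outlier dominate the L2
# penalty, while the L1 penalty grows only linearly with it.
w = np.array([0.5, -0.3, 0.8, 10.0])   # 10.0 is the outlier weight

l1 = np.sum(np.abs(w))     # 11.6
l2 = np.sum(w ** 2)        # 100.98

outlier_share_l1 = 10.0 / l1        # outlier's share of the L1 penalty (~86%)
outlier_share_l2 = 10.0 ** 2 / l2   # outlier's share of the L2 penalty (~99%)
```

Under L2 the single outlier accounts for nearly the entire penalty, drowning out the other weights, which is the sense in which L1 "handles" outliers better.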

We just took a look at regularization for logistic regression; similarly, for different models there are different ways of applying regularization. On an end note, can you guess how regularization is implemented on trees?

Ans: A penalty is applied in proportion to the number of nodes in the tree.
