Confidence Interval in Machine Learning

A confidence interval is a concept from probability and statistics that is widely used in machine learning. It is one of the most important topics to learn because it helps a machine learning practitioner estimate the range in which a parameter such as the mean, median, or variance is likely to be found, along with a confidence percentage (how sure we are that the parameter lies in the specified range).

Since this topic is important and needs to be understood well, I will take it slowly and keep each step short, so that you become confident when dealing with confidence intervals in the future.

Let’s begin…

Introduction:

Here are some common questions that arise in everyone's mind when they first encounter confidence intervals. I will answer them in this part:

What is a Confidence Interval?

Why do we need to study this topic?

How will it be useful in Machine Learning?

Suppose you are a machine learning engineer and you have the distribution of salaries of the people working in your company.

Now suppose you have to calculate the interval (range) of salaries in which the population mean, population standard deviation, median, or some other parameter lies, together with a confidence value (which tells how confident you are that the required value lies in the calculated range).

For example:

You say that the population mean lies in {20,000 to 50,000} with 90% confidence.

Let me elaborate on the above statement. Suppose the population mean is actually 35,550. By making the above statement, you are saying that you are 90% confident that the mean lies between 20,000 and 50,000. Strictly speaking, the 90% describes the method: if you repeated the sampling many times and built an interval each time, about 90% of those intervals would contain the true mean.

In a confidence interval we find a range (like 20,000 to 50,000 above) together with a confidence level (like 90% above).

So, technically speaking, a confidence interval is a range calculated from the given data, along with a percentage that tells how confident you are in that range.

Why do we study this? As explained above, it gives us a range for a parameter. Suppose we require the mean value for further processing in some scenario. In that case we can find an interval (range) for it from a sample, and hence the amount of data needed for the analysis is also reduced.

So, I think you now have an intuitive feel for confidence intervals. Now it's time to understand them in detail.

Explanation:

To explain this I will take an example, so that you understand how it works.

Let's take a distribution X whose type we don't know; that is, we don't know whether it is a Gaussian distribution, a log-normal distribution, a uniform distribution, etc.

Here I take X to be the distribution of the heights of people living in a particular city, with population mean µ (mu) and standard deviation σ (sigma).

So, our distribution can be summarized as X with mean µ and standard deviation σ.

This is a large data set, known as the POPULATION, and processing the whole population takes a lot of time even for a small analysis on a personal computer. So, for analysis we take a small amount of data, known as a SAMPLE. Here I am doing random sampling with n = 10 (that is, taking 10 points in one sample), and we take m such samples. So now we have m samples S1, S2, …, Sm, each containing 10 points.
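As a rough illustration of the population/sample split described above, here is a minimal Python sketch. The synthetic heights, the helper name draw_samples, and the choice m = 5 are assumptions made only for this example; they are not part of the original setup.

```python
import random

def draw_samples(population, n=10, m=5, seed=0):
    """Draw m random samples of n points each from the population."""
    rng = random.Random(seed)
    return [rng.sample(population, n) for _ in range(m)]

# Hypothetical population: 100,000 heights in cm (illustrative values only).
rng = random.Random(42)
population = [rng.gauss(170, 10) for _ in range(100_000)]

samples = draw_samples(population, n=10, m=5)
for i, s in enumerate(samples, start=1):
    print(f"S{i}: mean = {sum(s) / len(s):.1f}")
```

Each sample is tiny compared with the population, which is the point: the analysis that follows only ever touches the samples.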

I hope you now understand the data we have here. So, let's look at what we want to find from the above data set.

Suppose we have to find the 95% Confidence Interval of µ (population mean).

Let me elaborate on the goal. From the distribution X we have to find two values such that we can say, with 95% confidence, that the population mean lies between them.

Here there are two cases that you will encounter:

CASE 1: σ (Sigma) is given.

In this case we know the σ (sigma) of our distribution.

Here you need to know about the central limit theorem. If you already know it, very good; if you don't, there is no need to worry, as I will explain it too.

Central Limit Theorem (CLT):

It says that even if you have a distribution X and you do not know anything about the type of distribution, you can still work with something called the "sampling distribution of sample means".

For this we draw m small samples s from our distribution X, each of size n, and take the mean of each sample. This gives m sample means X̄1, X̄2, …, X̄m.

According to the Central Limit Theorem, the distribution of these sample means, X̄, will be approximately Gaussian (the approximation improves as n grows) with:

Mean = µ (mu)

Standard Deviation = σ/sqrt(n)

Constraint for the CLT: the distribution X should have a finite mean µ and a finite variance σ^2. An example of a distribution that can have infinite mean and variance is the Pareto distribution (for small values of its shape parameter).
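The two CLT claims above (mean of the sample means is µ, their standard deviation is σ/sqrt(n)) can be checked with a small simulation. This is only a sketch: the exponential population, the number of samples m = 5000, and the seed are assumptions chosen for illustration.

```python
import random
import statistics

rng = random.Random(0)

# Assumed population: exponential, which is clearly not Gaussian.
population = [rng.expovariate(1.0) for _ in range(100_000)]
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

n, m = 10, 5_000  # sample size and number of samples (illustrative choices)
sample_means = [statistics.fmean(rng.sample(population, n)) for _ in range(m)]

print(f"population mean      : {mu:.3f}")
print(f"mean of sample means : {statistics.fmean(sample_means):.3f}")
print(f"sigma / sqrt(n)      : {sigma / n ** 0.5:.3f}")
print(f"std of sample means  : {statistics.stdev(sample_means):.3f}")
```

The first two printed values should be close to each other, and so should the last two. With n as small as 10 the histogram of sample means is still somewhat skewed for this population; a larger n makes it look more Gaussian.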

Now let's get back to our topic. Starting from our unknown distribution X, we have moved to a known distribution, the Gaussian distribution, as explained above.

The Gaussian distribution has some very interesting features.

From the familiar plot of the Gaussian curve you may conclude that:

1 The Gaussian distribution is a bell-shaped curve.

2 It is symmetric about µ (mu).

3 A Gaussian distribution has two parameters (µ, σ^2): the mean and the variance.

4 The mean is the center point and the variance controls the spread of the curve.

For more information about the Gaussian distribution: https://en.wikipedia.org/wiki/Normal_distribution

So, here we will use the 68-95-99.7 rule of the Gaussian curve.

This rule says that:

1 You will find about 68% of data points within 1 standard deviation of µ, i.e. between µ-σ and µ+σ.

2 You will find about 95% of data points within 2 standard deviations of µ, i.e. between µ-2σ and µ+2σ.

3 You will find about 99.7% of data points within 3 standard deviations of µ, i.e. between µ-3σ and µ+3σ.
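The percentages in this rule are rounded. A short Python check using the standard library's error function shows the values for a Gaussian; the helper name prob_within is made up for this example.

```python
import math

def prob_within(k):
    """P(mu - k*sigma <= X <= mu + k*sigma) for a Gaussian X."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {prob_within(k):.4f}")
# prints 0.6827, 0.9545 and 0.9973
```

So 2 standard deviations actually cover slightly more than 95%; the multiplier that gives exactly 95% is about 1.96.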

We are dealing with the case where we know σ, and using the Central Limit Theorem (CLT) we have obtained a Gaussian distribution of sample means with mean µ and standard deviation σ/sqrt(n).

Let's recall what we have to find: the 95% Confidence Interval of µ.

From the 68-95-99.7 rule we know that about 95% of the sample means lie within 2 standard deviations of µ. Since the standard deviation of the sample mean is σ/sqrt(n), that is between µ - 2σ/sqrt(n) and µ + 2σ/sqrt(n).

We know that X̄ (the sample mean) ~ N(µ, σ^2/n), i.e. it has mean µ and standard deviation σ/sqrt(n).

From the above relation, our 95% Confidence Interval for µ will be:

µ ∈ { (X̄ – 2σ/sqrt(n)) , (X̄ + 2σ/sqrt(n)) }, where X̄ = sample mean.

Here 2 is the rounded multiplier from the 68-95-99.7 rule; the more precise value for 95% is 1.96.

This is Case 1, where we know σ.
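A minimal Python sketch of the Case 1 interval follows. The function name mean_ci_known_sigma, the ten height values, and σ = 10 are hypothetical and used only for illustration. Instead of the rounded multiplier 2, the sketch takes the Gaussian quantile from statistics.NormalDist (about 1.96 for 95%), which is a choice made here and not something the derivation above requires.

```python
import math
from statistics import NormalDist, fmean

def mean_ci_known_sigma(sample, sigma, confidence=0.95):
    """Confidence interval for the population mean when sigma is known."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z * sigma / math.sqrt(len(sample))
    center = fmean(sample)
    return center - margin, center + margin

# Hypothetical sample of n = 10 heights in cm; sigma is assumed known (10 cm).
heights = [162.0, 171.5, 168.2, 175.9, 180.1, 158.4, 169.7, 173.3, 166.0, 177.8]
low, high = mean_ci_known_sigma(heights, sigma=10.0)
print(f"95% CI for the mean: {low:.1f} to {high:.1f}")
```

Raising the confidence argument widens the interval, and a larger sample narrows it, because the margin scales with 1/sqrt(n).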

CASE 2: σ (Sigma) is not given.

If we do not know σ, then we use the t-distribution (Student's t-distribution) and estimate σ with the sample standard deviation s.

We have to find:

What is the 95% confidence interval for the population mean?

The formula for the confidence interval for one population mean, using the t-distribution, is

X̄ ± t_(n–1) × s/sqrt(n)

In this case,

The sample mean, X̄, is 4.8

The sample standard deviation, s, is 0.4

The sample size, n, is 30

The degrees of freedom, n – 1, is 29.

That means t_(n–1) = 2.05.

Now, plug in the numbers:

4.8 ± 2.05 × 0.4/sqrt(30) = 4.8 ± 0.15

Rounded to two decimal places, the answer is 4.65 to 4.95.
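The same arithmetic can be reproduced in a few lines of Python. This sketch takes the t critical value as an input (2.05, as quoted above) instead of computing it; the function name mean_ci_t is made up for the example.

```python
import math

def mean_ci_t(sample_mean, s, n, t_crit):
    """Confidence interval for the mean using a supplied t critical value."""
    margin = t_crit * s / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

low, high = mean_ci_t(sample_mean=4.8, s=0.4, n=30, t_crit=2.05)
print(f"{low:.2f} to {high:.2f}")  # prints "4.65 to 4.95"
```

For a different sample size or confidence level, the critical value has to be looked up again in a t-table for n – 1 degrees of freedom.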

This is how we compute the interval when we do not know σ of our population. So, what we have discussed so far: Case 1 uses the Gaussian distribution when σ is known, and Case 2 uses the t-distribution when σ is unknown.

If you notice, in both of the above cases we calculated the Confidence Interval for the population mean. But what if we need a Confidence Interval for the median, standard deviation, etc.? In these cases we use something known as BOOTSTRAPPING.

CONFIDENCE INTERVAL USING BOOTSTRAPPING

When you want to calculate a confidence interval for parameters other than the mean, such as the median or variance, you can use bootstrapping.

I am taking the median as an example; the method is the same for other parameters too.

So, we have a sample from some distribution X for which we need to calculate a 95% confidence interval.

X ~ { x1, x2, x3, x4, ……………, xn }

Now we will:

1 Take 1000 samples, drawing points at random with replacement from the data above; here I am taking the sample size as 10.

2 Since we are calculating for the median, take the median of each sample.

3 Sort all the medians.

Now we have the medians arranged as:

m’1 <= m’2 <= m’3 <= …….. <= m’1000

all in sorted order. As we are calculating the 95% confidence interval of the median, we take the range {m’25 to m’975}, as it contains 95% of the medians, leaving 2.5% at each end.

Follow the same method when calculating for other parameters.
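The bootstrap steps above can be sketched in Python as follows. The function name bootstrap_median_ci and the synthetic salary data are assumptions for illustration only. Note one deliberate difference: each resample here is drawn with replacement and has the same size as the data, which is the usual bootstrap convention, whereas the walkthrough above uses a sample size of 10.

```python
import random
import statistics

def bootstrap_median_ci(data, n_resamples=1000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the median."""
    rng = random.Random(seed)
    # Steps 1-3: resample with replacement, take each median, sort them.
    medians = sorted(
        statistics.median(rng.choices(data, k=len(data)))
        for _ in range(n_resamples)
    )
    # Rank of the lower cut-off: 25 when n_resamples=1000, confidence=0.95.
    cut = max(1, round(n_resamples * (1 - confidence) / 2))
    # Ranks are 1-based in the text (25 and 975); list indices are 0-based.
    return medians[cut - 1], medians[n_resamples - cut - 1]

# Hypothetical skewed data: 200 salaries drawn from a log-normal distribution.
rng = random.Random(1)
data = [rng.lognormvariate(10.5, 0.4) for _ in range(200)]

low, high = bootstrap_median_ci(data)
print(f"sample median: {statistics.median(data):.0f}")
print(f"95% bootstrap CI for the median: {low:.0f} to {high:.0f}")
```

To get an interval for another parameter, replace statistics.median with the statistic of interest, for example statistics.variance.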

This covers the numerical and theoretical concepts of Confidence Intervals. Thank you for reading.