
I am learning machine learning from Andrew Ng’s open-class notes and coursera.org. I am trying to understand how the cost function for the logistic regression is derived. I will start with the cost function for linear regression and then get to my question with logistic regression.

(Btw a similar question was asked here, which answers the question how the derivative of cost function was derived but not the cost function itself.)

1) **Linear regression** uses the following hypothesis: $$ h_\theta(x) = \theta_0 + \theta_1 x$$


Accordingly, the cost function is defined as:

$$J(\theta) = \dfrac {1}{2m} \displaystyle \sum_{i=1}^m \left (h_\theta (x^{(i)}) - y^{(i)} \right)^2$$
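For concreteness, this squared-error cost can be sketched in a few lines of Python (the names `linear_cost`, `theta0`, `theta1` are mine, not from the notes):

```python
def linear_cost(theta0, theta1, xs, ys):
    """Mean squared error with the conventional 1/(2m) factor."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# A perfect fit (here y = 1 + 2x exactly) gives zero cost.
print(linear_cost(1.0, 2.0, [0.0, 1.0, 2.0], [1.0, 3.0, 5.0]))  # -> 0.0
```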


2) The **logistic regression** uses a sigmoid/logistic function, whose output satisfies $ 0 \leq h_\theta (x) \leq 1 $. It is defined as:

$$
h_\theta (x) = \dfrac{1}{1 + e^{-\theta^T x}}
$$
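A minimal sketch of this hypothesis in Python (the helper names `sigmoid` and `h` are my own, and `theta`/`x` are plain lists for illustration):

```python
import math

def sigmoid(z):
    """The logistic function 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def h(theta, x):
    """Hypothesis h_theta(x) = sigmoid(theta^T x)."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

# The output always lies strictly between 0 and 1, and sigmoid(0) = 0.5.
print(sigmoid(0.0))  # -> 0.5
```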

Accordingly, our cost function has also changed. However, instead of plugging the new $h_\theta(x)$ into the squared-error cost directly, we use logarithms:

$$
\begin{align*}
& J(\theta) = \dfrac{1}{m} \sum_{i=1}^m \mathrm{Cost}(h_\theta(x^{(i)}),y^{(i)}) \newline
& \mathrm{Cost}(h_\theta(x),y) = -\log(h_\theta(x)) \; & \text{if } y = 1 \newline
& \mathrm{Cost}(h_\theta(x),y) = -\log(1-h_\theta(x)) \; & \text{if } y = 0
\end{align*}
$$

And the new cost function is defined as:

$$ J(\theta) = - \frac{1}{m} \sum_{i=1}^m \left[ y^{(i)}\log (h_\theta (x^{(i)})) + (1 - y^{(i)})\log (1 - h_\theta(x^{(i)})) \right]$$
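This combined cost is easy to compute directly; here is a sketch (the names `logistic_cost` and `sigmoid` are mine, and the data below are made up):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_cost(theta, X, ys):
    """Average cross-entropy cost over m points; rows of X are feature vectors."""
    m = len(X)
    total = 0.0
    for x, y in zip(X, ys):
        p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / m

# At x = 0 the hypothesis outputs 0.5, so the cost is -log(0.5) = log 2,
# whatever the label; a confident correct prediction costs almost nothing.
print(logistic_cost([1.0], [[0.0]], [1]))
print(logistic_cost([10.0], [[1.0]], [1]))
```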

From class notes “.. the more hypothesis is off from y, the larger the cost function output. If our hypothesis is equal to y, then our cost is 0.”

It’s also mentioned in the class notes that MLE (maximum-likelihood estimation) is used to derive the logs in the cost function. I can see how the logs act as penalties that grow as the hypothesis gets further from $y$, but I don’t see how we came to choose them for the cost function in the first place.


Let's try to derive from first principles why the logarithm appears in the cost function of logistic regression.

So we have a dataset **X** consisting of *m* data points and *n* features, and a class variable **y**, a vector of length *m* whose entries can take two values, 1 or 0.

Now logistic regression says that the probability that the class variable value $y_i = 1$, for $i=1,2,\dots,m$, can be modelled as follows:

$$
P( y_i =1 | \mathbf{x}_i ; \theta) = h_{\theta}(\mathbf{x}_i) = \dfrac{1}{1+e^{- \theta^T \mathbf{x}_i}}
$$

so $y_i = 1$ with probability $h_{\theta}(\mathbf{x}_i)$ and $y_i=0$ with probability $1-h_{\theta}(\mathbf{x}_i)$.

This can be combined into a single equation as follows (in fact, $y_i$ follows a Bernoulli distribution):

$$ P(y_i ) = h_{\theta}(\mathbf{x}_i)^{y_i} (1 - h_{\theta}(\mathbf{x}_i))^{1-y_i}$$

$P(y_i)$ is known as the likelihood of the single data point $\mathbf{x}_i$: given the value of $y_i$, it is the probability of $\mathbf{x}_i$ occurring, i.e. the conditional probability $P(\mathbf{x}_i | y_i)$.

The likelihood of the entire dataset $\mathbf{X}$ is the product of the individual data-point likelihoods. Thus

$$ P(\mathbf{X}|\mathbf{y}) = \prod_{i=1}^{m} P(\mathbf{x}_i | y_i) = \prod_{i=1}^{m} h_{\theta}(\mathbf{x}_i)^{y_i} (1 - h_{\theta}(\mathbf{x}_i))^{1-y_i}$$

Now the principle of maximum likelihood says that we find the parameters that maximise likelihood $P(\mathbf{X}|\mathbf{y})$.

As mentioned in the comment, logarithms are used because they convert products into sums and, being monotone increasing functions, do not alter the location of the maximum. Here the likelihood has a product form, so we take the natural logarithm: maximising the likelihood is the same as maximising the log-likelihood. The log-likelihood $L(\theta)$ is then:

$$ L(\theta) = \log(P(\mathbf{X}|\mathbf{y})) = \sum_{i=1}^{m} y_i \log(h_{\theta}(\mathbf{x}_i)) + (1-y_i) \log(1 - h_{\theta}(\mathbf{x}_i)) $$
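A quick numerical check, with made-up numbers, that the log of the product likelihood equals the sum form above (the names `likelihood` and `log_likelihood` are mine):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def likelihood(theta, X, ys):
    """Product over data points of p^y * (1-p)^(1-y)."""
    prod = 1.0
    for x, y in zip(X, ys):
        p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        prod *= p ** y * (1 - p) ** (1 - y)
    return prod

def log_likelihood(theta, X, ys):
    """Sum over data points of y log p + (1-y) log(1-p)."""
    total = 0.0
    for x, y in zip(X, ys):
        p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

theta, X, ys = [0.5, -1.0], [[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]], [0, 1, 1]
assert abs(math.log(likelihood(theta, X, ys)) - log_likelihood(theta, X, ys)) < 1e-12
```

In practice only the sum form is ever computed: the raw product underflows to zero once $m$ is large, which is another reason the logarithm is taken.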

Since in linear regression we found the $\theta$ that minimizes our cost function, here too, for the sake of consistency, we would like to have a minimization problem, and we want the average cost over all the data points. Currently we have a maximization of $L(\theta)$. Maximization of $L(\theta)$ is equivalent to minimization of $-L(\theta)$, and taking the average over all data points, our cost function for logistic regression comes out to be:

$$ J(\theta) = - \dfrac{1}{m} L(\theta)$$

$$ = - \dfrac{1}{m} \left( \sum_{i=1}^{m} y_i \log (h_{\theta}(\mathbf{x}_i)) + (1-y_i) \log (1 - h_{\theta}(\mathbf{x}_i)) \right )$$

Now we can also understand why the cost for a single data point comes out as follows:

The cost for a single data point is $-\log( P(\mathbf{x}_i | y_i))$, which can be written as $ - \left ( y_i \log (h_{\theta}(\mathbf{x}_i)) + (1 - y_i) \log (1 - h_{\theta}(\mathbf{x}_i)) \right )$.

We can now split the above into two depending upon the value of $y_i$. Thus we get

$J(h_{\theta}(\mathbf{x}_i), y_i) = - \log (h_{\theta}(\mathbf{x}_i)) , \text{ if } y_i=1$, and

$J(h_{\theta}(\mathbf{x}_i), y_i) = - \log (1 - h_{\theta}(\mathbf{x}_i) ) , \text{ if } y_i=0 $.
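The case split can be verified mechanically: since $y_i \in \{0,1\}$, one of the two terms in the combined expression always vanishes. A minimal check (function names are mine):

```python
import math

def combined_cost(p, y):
    """The single combined expression: -(y log p + (1-y) log(1-p))."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def piecewise_cost(p, y):
    """The two-branch form from the class notes."""
    return -math.log(p) if y == 1 else -math.log(1 - p)

# The two forms agree for every predicted probability p and label y.
for p in (0.1, 0.5, 0.9):
    for y in (0, 1):
        assert abs(combined_cost(p, y) - piecewise_cost(p, y)) < 1e-12
```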

@user76170 should be correct. But there is one thing I don’t understand, and as I am new here I am not allowed to comment, so I hope it is all right to write it here.

I think, what @user76170 means by :

$\ p(y_i)=h_\theta(x_i)^{y_i}(1-h_\theta(x_i))^{1-y_i}$

is:

$$
p(y_i|x_i;\theta)=h_\theta(x_i)^{y_i}(1-h_\theta(x_i))^{1-y_i}
$$

And I don’t understand why he/she changes the above to the conditional probability $P(x_i|y_i)$.

As I have read from here http://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/06/lecture-06.pdf

The expression of the log likelihood function of parameter value can be:

$$
L(\theta) =\log \left(\prod_{i=1}^{m}p(y_i|x_i;\theta)\right)\\
L(\theta) =\sum_{i=1}^{m}\log(p(y_i|x_i;\theta))\\
L(\theta) = \sum_{i=1}^{m}y_i\log(h_\theta(x_i))+(1-y_i)\log(1-h_\theta(x_i))
$$

After that, in my schoolwork, we use the first and second derivatives to find the value of $\theta$ at which the above log-likelihood reaches its maximum, and we call that the MLE method.

Here, we want to use the cost-function method instead, so we need a cost function whose minimisation maximises the log-likelihood. Hence we just need to add a minus sign, as suggested by @user76170.

For a reason I don’t know, the cost function given below and in the lecture is the average of the cost over all data points.

$$
J(\theta)=-\frac{1}{m}L(\theta)
$$

Why do we have to divide by $m$? Will that make the gradient descent method go quicker? Probably I should raise a separate question.
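One way to see the effect of the $1/m$ factor is a toy experiment (a made-up 1-D quadratic cost; all names are mine): dividing the cost by $m$ does not move the minimizer, and gradient descent on the summed cost with learning rate $\alpha/m$ takes exactly the same steps as descent on the averaged cost with learning rate $\alpha$.

```python
def descend(grad, theta, alpha, steps):
    """Plain gradient descent: theta <- theta - alpha * grad(theta)."""
    for _ in range(steps):
        theta = theta - alpha * grad(theta)
    return theta

m = 10

def grad_sum(t):
    return 2 * m * (t - 3.0)   # gradient of the summed cost m * (t - 3)^2

def grad_avg(t):
    return 2 * (t - 3.0)       # gradient of the averaged cost (t - 3)^2

a = descend(grad_sum, 0.0, 0.01 / m, 100)
b = descend(grad_avg, 0.0, 0.01, 100)
assert abs(a - b) < 1e-9       # identical trajectories, both approaching 3
```

So dividing by $m$ only rescales the gradient: it makes the cost, and hence a sensible learning rate, independent of the dataset size, rather than making gradient descent intrinsically faster.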
