Information Entropy - Shannon Entropy

1. Shannon Entropy
- Shannon entropy, S = -∑_i p(x_i) log p(x_i), measures the uncertainty of information (how hard the information is to guess). It also expresses the minimum expected number of yes/no questions needed to identify the information.
Note: log is base 2
- We have 3 strings: "AAAAAAAA", "AAAABBCD", "AABBCCDD"
"AAAAAAAA" has S = -1*log(8/8) = 0
"AAAABBCD" has S = -(4/8)*log(4/8) - (2/8)*log(2/8) - (1/8)*log(1/8) - (1/8)*log(1/8) = 1.75
"AABBCCDD" has S = -(2/8)*log(2/8) - (2/8)*log(2/8) - (2/8)*log(2/8) - (2/8)*log(2/8) = 2
- The uncertainty of "AABBCCDD" is the largest: all four characters are equally likely (these values are reproduced in the sketch below).
Refer this.
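A minimal sketch that reproduces these entropies (assuming plain Python with only the standard library; the function name shannon_entropy is just for this example):

import math
from collections import Counter

def shannon_entropy(s):
    # Character counts -> probabilities p(x_i) = count / length
    counts = Counter(s)
    n = len(s)
    # S = -sum over distinct characters of p * log2(p)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

for s in ["AAAAAAAA", "AAAABBCD", "AABBCCDD"]:
    print(s, shannon_entropy(s))   # entropies: 0, 1.75 and 2 bits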
2. Application in Deep Learning - Classification
Deep learning classification uses a modified version of Shannon entropy => Cross-Entropy.
2.1 Binary Cross-Entropy Loss
The output takes only 2 classes. The loss is

BCE = -(1/N) ∑_i [ y_i log p(y_i) + (1 - y_i) log(1 - p(y_i)) ]

y_i:    True label (0 or 1)
p(y_i): Predicted probability
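As an illustration, here is a minimal sketch of the binary cross-entropy loss (plain Python, no framework assumed; the function name and the sample values are made up for this example; natural log is used, as is common for loss functions, since the base only rescales the loss by a constant):

import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    # Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)   # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

print(binary_cross_entropy([1, 0, 1], [0.9, 0.2, 0.6]))   # ~0.28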
2.2 Cross-Entropy Loss
The output can take n (> 2) classes. The loss is

CE = -∑_c q(y_c) log p(y_c)

q(y_c): True label, one-hot encoded
p(y_c): Predicted probability, obtained by passing the logits through softmax
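A minimal sketch of this loss with softmax (plain Python; the logits and the one-hot label are made-up illustrative values):

import math

def softmax(logits):
    # Subtract the max logit for numerical stability
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(q_onehot, logits):
    # CE = -sum_c q(y_c) * log p(y_c), with p(y_c) = softmax(logits)
    p = softmax(logits)
    return -sum(q * math.log(pc) for q, pc in zip(q_onehot, p) if q > 0)

q = [0, 0, 1]              # true class is index 2, one-hot encoded
logits = [1.0, 2.0, 3.0]   # raw model outputs
print(cross_entropy(q, logits))   # ~0.41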
Comparing with Shannon Entropy
If p(y_c) moves closer to q(y_c) (minimizing the cross-entropy), the cross-entropy approaches the Shannon entropy of q(y_c). Cross-entropy is never smaller than Shannon entropy, and the gap between them is the Kullback-Leibler divergence, which measures how much p(y_c) diverges from q(y_c).
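A small numeric example (illustrative values, log base 2): for q = (0.5, 0.5) and p = (0.8, 0.2),

H(q)       = -0.5*log(0.5) - 0.5*log(0.5) = 1
H(q, p)    = -0.5*log(0.8) - 0.5*log(0.2) ≈ 1.32
D_KL(q||p) = H(q, p) - H(q) ≈ 0.32

so the cross-entropy exceeds the entropy exactly by the KL divergence.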
Refer this.

What is the difference between Cross-Entropy and KL divergence?


You need some conditions to claim the equivalence between minimizing cross-entropy and minimizing KL divergence. I will put the question in the context of classification problems that use cross-entropy as the loss function.

Let us first recall that entropy is used to measure the uncertainty of a system, which is defined as

S(v) = -∑_i p(v_i) log p(v_i),
where p(v_i) is the probability of state v_i of the system. From an information theory point of view, S(v) is the amount of information needed to remove the uncertainty.

For instance, the event I, "I will die within 200 years", is almost certain (the word almost is there because we may solve the aging problem), therefore it has low uncertainty: it requires only the single piece of information "the aging problem cannot be solved" to make it certain. However, the event II, "I will die within 50 years", is more uncertain than event I, so it needs more information to remove its uncertainty. Here entropy quantifies the uncertainty of the distribution "When will I die?", which can be regarded as the expectation of the uncertainties of individual events like I and II.

Now look at the definition of KL divergence between distributions A and B

D_KL(A||B) = ∑_i [ p_A(v_i) log p_A(v_i) - p_A(v_i) log p_B(v_i) ],
where the first term on the right-hand side is the negative entropy of distribution A, and the second term, -∑_i p_A(v_i) log p_B(v_i), can be interpreted as the expectation of -log p_B under A. So D_KL(A||B) describes how different B is from A, from the perspective of A. It is worth noting that A usually stands for the data, i.e. the measured distribution, while B is the theoretical or hypothetical distribution. That means you always start from what you observed.

To relate cross entropy to entropy and KL divergence, we formalize the cross entropy in terms of distributions A and B as

H(A,B) = -∑_i p_A(v_i) log p_B(v_i).
From the definitions, we can easily see
H(A,B) = D_KL(A||B) + S_A.
If S_A is a constant, then minimizing H(A,B) is equivalent to minimizing D_KL(A||B).
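A minimal numeric check of this identity (plain Python; p_A and the candidate p_B distributions are made-up illustrative values):

import math

def entropy(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

def cross_entropy(p_a, p_b):
    return -sum(a * math.log2(b) for a, b in zip(p_a, p_b) if a > 0)

def kl(p_a, p_b):
    return sum(a * math.log2(a / b) for a, b in zip(p_a, p_b) if a > 0)

p_A = [0.5, 0.25, 0.25]
for p_B in ([0.6, 0.2, 0.2], [0.5, 0.25, 0.25], [0.4, 0.3, 0.3]):
    # H(A,B) - D_KL(A||B) equals S_A for every candidate B,
    # so the two objectives differ only by a constant.
    print(cross_entropy(p_A, p_B) - kl(p_A, p_B), entropy(p_A))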

A further question naturally follows: how can the entropy be a constant? In a machine learning task, we start with a dataset (denoted as P(D)) which represents the problem to be solved, and the learning purpose is to make the model-estimated distribution (denoted as P(model)) as close as possible to the true distribution of the problem (denoted as P(truth)). P(truth) is unknown and is represented by P(D). Therefore in an ideal world, we expect

P(model) ≈ P(D) ≈ P(truth)
and minimize D_KL(P(D)||P(model)). And luckily, in practice D is given, which means its entropy S(D) is fixed as a constant.


In practice, the models usually work with samples packed in mini-batches. For KL divergence and cross-entropy, the relation can be written as

H(p,q) = D_KL(p||q) + H(p) = -∑_i p_i log(q_i)
so we have
D_KL(p||q) = H(p,q) - H(p)
From the equation, we can see that the KL divergence decomposes into the cross-entropy of p and q (the first part) minus the global entropy of the ground truth p (the second part).

In many machine learning projects, mini-batches are used to expedite training, and the p of a mini-batch may be different from the global p. In such a case, cross-entropy is relatively more robust in practice, while KL divergence needs a more stable H(p) to do its job.
