Feature Selection using Mutual Information

Feature selection is important in practical machine learning, aiming to identify the features with the most discriminative power for classification. One principled approach to feature selection is based on mutual information. Here we will introduce the basic concepts behind it.

Theory

1. Shannon Information Content (SIC)

The measure of information should contain the following intuitive properties:

  1. Information contained in the events should be defined according to some measure of uncertainty of the events.
  2. Less certain events should contain more information than more certain events.
  3. The information of unrelated events taken as a single event should equal the sum of the information of the unrelated events.
  4. The probabilities of various possible events should be taken into account.

A natural way to measure the uncertainty of an event X is its probability. Based on this, and according to properties 2–4 above, the information content of an event can be expressed as:
$$\mathrm{INFO}(X) = -\log_2 p_x = \log_2\frac{1}{p_x}$$
Averaging over all possible events gives the expected information content of the whole distribution:
$$\mathrm{SIC} = \sum_x p_x\log_2\frac{1}{p_x} = -\sum_x p_x\log_2{p_x}$$
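As a quick illustration, here is a minimal Python sketch (the helper names are my own) that evaluates the information content of single events and the expected information of a distribution, in bits:

```python
import numpy as np

def information_content(p):
    """Information content (in bits) of an event that occurs with probability p."""
    return -np.log2(p)

def expected_information(probs):
    """Expected information (SIC, in bits) of a full distribution."""
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]              # by convention 0 * log(0) = 0
    return float(-np.sum(probs * np.log2(probs)))

print(information_content(0.5))           # a fair coin flip carries 1 bit
print(information_content(1 / 8))         # a 1-in-8 event carries 3 bits
print(expected_information([0.5, 0.5]))   # expected information of a fair coin: 1 bit
```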

2. Entropy

Based on the derivation above, we can define entropy as:
$$H(X) = -\sum_x{p(x)\log_2{p(x)}}$$
By its definition, entropy is a measure of the uncertainty of a random variable: the more uncertain the variable is (i.e., the flatter its probability distribution), the larger its entropy.

Similar to conditional probability, we can also define the conditional entropy $H(Y|X)$ as the remaining uncertainty about the random variable Y when the random variable X is known:
$$H(Y|X) = \sum_x p(x)H(Y|X=x)$$
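The two quantities above can be computed directly from a joint distribution table. Below is a minimal sketch (the joint distribution is an arbitrary example of mine) that evaluates $H(X)$ and $H(Y|X)$ in bits:

```python
import numpy as np

def entropy(p):
    """H(X) in bits for a 1-D probability vector p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                                 # by convention 0 * log(0) = 0
    return float(-np.sum(p * np.log2(p)))

def conditional_entropy(joint):
    """H(Y|X) = sum_x p(x) H(Y|X=x), with joint[i, j] = p(X=i, Y=j)."""
    joint = np.asarray(joint, dtype=float)
    p_x = joint.sum(axis=1)                      # marginal p(x)
    return sum(px * entropy(joint[i] / px)       # p(x) * H(Y|X=x)
               for i, px in enumerate(p_x) if px > 0)

# Two correlated binary variables.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
print(entropy(joint.sum(axis=0)))   # H(Y)   = 1.0 bit
print(conditional_entropy(joint))   # H(Y|X) ≈ 0.72 bits: knowing X reduces uncertainty
```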

3. Information Gain

In information theory, information gain is known as mutual information:

$$I(X, Y) = H(X) - H(X|Y)$$

Mutual information measures the information that X and Y share. In other words, it measures how much knowing one of these variables reduces uncertainty about the other. Concretely, if X and Y are independent, the mutual information is zero; if X and Y are identical, then $I(X, Y) = H(X) = H(Y)$.
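Continuing the sketch above (the entropy helper is repeated so the snippet runs on its own), mutual information can be computed as $H(X) - H(X|Y)$ from the same kind of joint table; the two limiting cases mentioned here come out as expected:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_information(joint):
    """I(X, Y) = H(X) - H(X|Y), with joint[i, j] = p(X=i, Y=j)."""
    joint = np.asarray(joint, dtype=float)
    p_x = joint.sum(axis=1)                      # marginal p(x)
    p_y = joint.sum(axis=0)                      # marginal p(y)
    h_x_given_y = sum(py * entropy(joint[:, j] / py)
                      for j, py in enumerate(p_y) if py > 0)
    return entropy(p_x) - h_x_given_y

# Independent variables: knowing Y tells us nothing about X.
print(mutual_information(np.outer([0.5, 0.5], [0.5, 0.5])))    # 0.0
# Identical variables: knowing Y removes all uncertainty about X.
print(mutual_information(np.array([[0.5, 0.0], [0.0, 0.5]])))  # 1.0 = H(X) = H(Y)
```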

Example

Let’s take a machine learning case as an example: X is a labeled dataset with several features, and we want to find out which feature is most useful for classifying it. How do we decide? Using the information gain criterion, we simply select the feature Q with the highest mutual information $I(X, Q)$.

A brief explanation: $H(X)$ is the uncertainty of the dataset (and its labels); $H(X|Q)$ is the uncertainty of the dataset once feature Q is known; and the mutual information is the difference between them, which can be seen as the reduction in classification uncertainty gained from knowing Q. This is also why it is called “information gain”. Therefore, the feature with the highest information gain is the most useful for classifying the dataset.
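As a practical sketch (assuming scikit-learn is available; the iris dataset and k=2 are arbitrary choices for illustration), the criterion can be applied with mutual_info_classif and SelectKBest:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Estimate I(feature, label) for each feature and keep the two highest-scoring ones.
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_selected = selector.fit_transform(X, y)

scores = mutual_info_classif(X, y, random_state=0)
for name, score in zip(load_iris().feature_names, scores):
    print(f"{name}: {score:.3f}")
print("Selected feature indices:", selector.get_support(indices=True))
```

The features with the largest estimated mutual information with the label are exactly the ones kept by the selector, which mirrors the information gain argument above.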