Naive Bayes classification

1. Usage

• A naive Bayes classifier (NBC) is extremely fast for both training and prediction

• NBC is often very easily interpretable

• NBC has very few (if any) tunable parameters

• NBC performs well when the data match the naive assumptions (very rare in practice)

• NBC suits very well-separated categories, when a simple model is sufficient

• NBC suits very high-dimensional data, when a simple model is sufficient

2. Implementation

Classes c₁, c₂, c₃

Features x₁, x₂

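The result of the classifier is the posterior probability of each class given the features, obtained from Bayes' rule:

    p(c | x₁, x₂) = p(x₁, x₂ | c) p(c) / p(x₁, x₂)

The predicted class is the one with the highest posterior probability.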

p(c) is the prior probability of class c, estimated as the relative frequency with which c is observed in the labeled dataset.
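As a minimal sketch of this frequency estimate, the priors can be computed by counting labels. The label array `y` below is hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical labels for illustration: 10 samples spread over three classes.
y = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])

# Count how often each class occurs, then normalise the counts to get p(c).
classes, counts = np.unique(y, return_counts=True)
priors = counts / counts.sum()

for c, p in zip(classes, priors):
    print(f"p(c={c}) = {p:.2f}")
```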

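With the naive assumption that x₁ and x₂ are conditionally independent given the class, the likelihood factorizes:

    p(x₁, x₂ | c) = p(x₁ | c) p(x₂ | c)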
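A small numeric sketch may make the independence assumption concrete. All probabilities below are hypothetical values for a single sample, not estimates from any dataset.

```python
import numpy as np

# Hypothetical numbers for one sample (x1, x2) and three classes c1, c2, c3.
priors = np.array([0.5, 0.3, 0.2])      # p(c_j)
lik_x1 = np.array([0.10, 0.40, 0.05])   # p(x1 | c_j)
lik_x2 = np.array([0.20, 0.10, 0.30])   # p(x2 | c_j)

# Naive assumption: p(x1, x2 | c_j) = p(x1 | c_j) * p(x2 | c_j),
# so the unnormalised posterior is the product of the three terms.
joint = priors * lik_x1 * lik_x2

# Dividing by the sum over classes gives p(c_j | x1, x2).
posterior = joint / joint.sum()
predicted = int(np.argmax(posterior))

print(posterior)   # posterior probability of each class
print(predicted)   # index of the most probable class
```

Note that the class with the largest prior (index 0) is not the one predicted here, because the likelihood of x₁ favours the second class.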

How should we model p(x₁|c₁), p(x₂|c₁), p(x₁|c₂), p(x₂|c₂), p(x₁|c₃) and p(x₂|c₃)?

If the features are 0 and 1 only, you could use a Bernoulli distribution.

If the features are non-negative integer counts, you could use a multinomial distribution.

If the features are real values, a Gaussian distribution.
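The Bernoulli and multinomial cases can be sketched as follows. The data arrays, the helper names `bernoulli_likelihood` and `multinomial_log_score`, and the add-one smoothing constant `alpha` are assumptions introduced for this illustration.

```python
import numpy as np

# Hypothetical training rows for a single class c (rows = samples, columns = features).
X_binary = np.array([[1, 0, 1],
                     [1, 1, 0],
                     [1, 0, 0],
                     [0, 0, 1]])
X_counts = np.array([[2, 0, 1],
                     [3, 1, 0],
                     [1, 0, 2]])

# Bernoulli: theta_i estimates p(x_i = 1 | c) as the fraction of ones in column i.
theta_bern = X_binary.mean(axis=0)

def bernoulli_likelihood(x, theta):
    """p(x | c) = prod_i theta_i^x_i * (1 - theta_i)^(1 - x_i)."""
    x = np.asarray(x)
    return float(np.prod(theta**x * (1 - theta) ** (1 - x)))

# Multinomial: theta_i is the share of all counts in class c that fall on feature i.
# alpha = 1 is add-one (Laplace) smoothing, assumed here so that no theta_i is zero.
alpha = 1.0
totals = X_counts.sum(axis=0)
theta_multi = (totals + alpha) / (totals.sum() + alpha * X_counts.shape[1])

def multinomial_log_score(x, theta):
    """log prod_i theta_i^x_i; the multinomial coefficient is omitted because
    it is the same for every class and does not change the argmax."""
    return float(np.sum(np.asarray(x) * np.log(theta)))

print(theta_bern, bernoulli_likelihood([1, 0, 1], theta_bern))
print(theta_multi, multinomial_log_score([1, 0, 1], theta_multi))
```

Without smoothing, a feature value never seen in a class would get probability zero and wipe out the whole product, which is why some form of smoothing is commonly applied.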

For each class cⱼ, estimate from the data the mean μᵢ,ⱼ and the standard deviation σᵢ,ⱼ of each feature i.
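Written out, the Gaussian estimates and the resulting per-feature density might look as follows. This is a sketch that uses the population standard deviation (dividing by N_j), which is what NumPy's `np.std` computes by default. The symbols N_j (the number of training samples in class c_j) and x_i^{(n)} (feature i of sample n) are notation introduced here.

```latex
\[
\mu_{i,j} = \frac{1}{N_j} \sum_{n:\, y^{(n)} = c_j} x_i^{(n)},
\qquad
\sigma_{i,j}^2 = \frac{1}{N_j} \sum_{n:\, y^{(n)} = c_j} \bigl(x_i^{(n)} - \mu_{i,j}\bigr)^2
\]
\[
p(x_i \mid c_j) = \frac{1}{\sqrt{2\pi\sigma_{i,j}^2}}
\exp\!\left(-\frac{(x_i - \mu_{i,j})^2}{2\sigma_{i,j}^2}\right)
\]
```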

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_blobs

sns.set()  # apply seaborn's default plot styling

# 20 two-dimensional points drawn around three cluster centres, one per class
X, y = make_blobs(n_samples=20, centers=[(0, 0), (4, 4), (-4, 4)], random_state=2)
print(X.shape)  # (20, 2)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='RdBu')
plt.show()

class GNB:
    """Gaussian naive Bayes classifier."""

    def fit(self, X, y):
        """Estimate the prior and the per-feature mean and std of each class."""
        total = len(y)
        self.unique_y = np.unique(y)
        self.params = {}
        for j in self.unique_y:
            x_class_j = X[y == j]                       # rows that belong to class j
            prob_class_j = len(x_class_j) / total       # prior p(c_j)
            mean_class_j = np.mean(x_class_j, axis=0)   # mean of each feature in class j
            std_class_j = np.std(x_class_j, axis=0)     # std of each feature in class j
            self.params[j] = [prob_class_j, mean_class_j, std_class_j]

    def find_prob(self, X):
        """Return the posterior p(c_j | x) of every class for each row x of X."""
        probs = []
        for x in X:
            prob = []
            for j in self.unique_y:
                prob_class_j, mean_class_j, std_class_j = self.params[j]
                # Gaussian density of each feature under class j
                pij = (1 / np.sqrt(2 * np.pi * std_class_j**2)) * np.exp(
                    -0.5 * ((np.array(x) - mean_class_j) / std_class_j) ** 2
                )
                # Naive assumption: multiply the per-feature densities, then the prior
                pij = np.prod(pij) * prob_class_j
                prob.append(pij)
            prob = np.array(prob)
            prob /= np.sum(prob)  # normalise so the posteriors sum to 1
            probs.append(prob)

        return probs

my_gauss = GNB()
my_gauss.fit(X, y)
rrs = my_gauss.find_prob([[-2, 5], [0,0], [6, -0.3]])

for r in rrs:
    print(r)
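The `find_prob` method above multiplies the per-feature densities directly, which can underflow to zero when there are many features. One common alternative is to work with log-probabilities and normalise with the log-sum-exp trick. The sketch below is a standalone variant under stated assumptions: the function name `gaussian_log_posterior` is hypothetical, and the priors, means and standard deviations are hand-picked values rather than parameters fitted by `GNB`.

```python
import numpy as np

def gaussian_log_posterior(X, priors, means, stds):
    """Return log p(c_j | x) for each row of X, shape (n_samples, n_classes)."""
    X = np.asarray(X, dtype=float)[:, None, :]        # (n, 1, d) for broadcasting
    # Log of the Gaussian density of every feature under every class: (n, k, d)
    log_pdf = -0.5 * np.log(2 * np.pi * stds**2) - 0.5 * ((X - means) / stds) ** 2
    # Sum over features (the log of the naive product) and add the log prior: (n, k)
    log_joint = np.log(priors) + log_pdf.sum(axis=2)
    # Log-sum-exp: subtract the row maximum before exponentiating for stability
    m = log_joint.max(axis=1, keepdims=True)
    log_norm = m + np.log(np.exp(log_joint - m).sum(axis=1, keepdims=True))
    return log_joint - log_norm

# Hypothetical parameters for three classes and two features.
priors = np.array([0.3, 0.35, 0.35])
means = np.array([[0.0, 0.0], [4.0, 4.0], [-4.0, 4.0]])
stds = np.ones((3, 2))

log_post = gaussian_log_posterior([[-2, 5], [0, 0], [6, -0.3]], priors, means, stds)
post = np.exp(log_post)

print(post.round(3))         # posterior of each class, one row per query point
print(post.argmax(axis=1))   # index of the most probable class for each point
```

Each row of `post` sums to 1, as with `find_prob`, but the intermediate values stay in log space until the final `np.exp`.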