Naive Bayes

Naive Bayes is a technique that has found wide application, notably in spam filters.

While Bayes’ Theorem is a theorem in mathematics, there is no “Naive Bayes’ Theorem”. Rather, the “naive” comes from the naive assumption that the probability of some value occurring in your data is independent of the probability of various other values.

For example, we may assume that the probability of a spam email containing the word “wealthy” is independent of the probability of a spam email containing the word “monies”. Since in reality these words are often found together, this is an incorrect assumption.
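In symbols, the naive assumption says the joint probability factorizes into a product of per-word probabilities:

```
P(wealthy, monies | spam) ≈ P(wealthy | spam) × P(monies | spam)
```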

However, naive Bayes classifiers often do a great job of sorting out spam from non-spam.

The general idea is to assign probabilities to various items occurring in your data, and score the items of interest in such a way that the score increases, or decreases, as additional occurrences of those items are found.
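Here is a minimal sketch of that scoring idea. The per-word probabilities below are made-up numbers chosen for illustration, not estimates from real mail; each word found adds its log-likelihood ratio to a running spam score.

```python
import math

# Hypothetical per-word probabilities (illustrative values, not from real data)
p_word_given_spam = {"wealthy": 0.30, "monies": 0.20, "meeting": 0.01}
p_word_given_ham = {"wealthy": 0.01, "monies": 0.005, "meeting": 0.20}

def spam_score(words):
    # Naive assumption: each word contributes independently,
    # so the log-likelihood ratios simply add up.
    score = 0.0
    for w in words:
        if w in p_word_given_spam:
            score += math.log(p_word_given_spam[w] / p_word_given_ham[w])
    return score

print(spam_score(["wealthy", "monies"]))  # positive: leans spam
print(spam_score(["meeting"]))            # negative: leans non-spam
```

A positive score means the words seen are, under these assumed probabilities, more likely in spam than in non-spam.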

Here we’ll look at an example of using Naive Bayes to classify irises from the Iris Flower Dataset.

from sklearn.naive_bayes import ComplementNB, GaussianNB, MultinomialNB, BernoulliNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, accuracy_score
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
target = iris['target']
df = iris['data']

X_train, X_test, y_train, y_test = train_test_split(df, target, shuffle=True, train_size=0.7)

model = MultinomialNB().fit(X_train, y_train)

y_predicted = model.predict(X_test)

print(model.__class__, accuracy_score(y_test, y_predicted))
cm = confusion_matrix(y_test, y_predicted)
ConfusionMatrixDisplay(cm, display_labels=iris['target_names']).plot()
plt.show()

You can see from the confusion matrix that this does a pretty good job of classifying irises once it’s trained. Also notice that we can use Scikit-learn’s Naive Bayes models in much the same way that we’ve used other models.

The main thing that has to be decided at the start is what type of Naive Bayes model to use. We can choose from the following.

  • GaussianNB: models each feature as normally distributed within each class; good when data features are floating-point numbers. This doesn’t mean the values of your data series have to be normally distributed overall; only that they are approximately normally distributed within each class.
  • MultinomialNB: when your predictor values are counts — non-negative integers such as word frequencies in a document.
  • CategoricalNB: used when data features take categorical values.
  • BernoulliNB: for data features with boolean values.
  • ComplementNB: instead of estimating each class’s parameters from the samples in that class, this estimates them from the complement — all the samples not in that class. Since this often works well when classes are imbalanced, it can be particularly good for text classification tasks.

In practice, if you can’t remember which is which, you can just as well try all of them and see which one works best. You might get surprising results. For example, your data features are supposed to be boolean if you’re using BernoulliNB, but BernoulliNB binarizes its input (via its `binarize` parameter, with a default threshold of 0.0), so it will run on non-boolean data and might work unexpectedly well.
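As a sketch of that try-them-all approach, the loop below fits four of the variants on the same iris split and compares accuracies (a `random_state` is fixed here just to make the split reproducible; CategoricalNB is left out since the iris features are continuous floats rather than encoded categories):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import BernoulliNB, ComplementNB, GaussianNB, MultinomialNB

iris = load_iris(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    iris['data'], iris['target'], shuffle=True, train_size=0.7, random_state=0)

# Fit each variant on the same split and record its test accuracy.
accuracies = {}
for cls in (GaussianNB, MultinomialNB, ComplementNB, BernoulliNB):
    model = cls().fit(X_train, y_train)
    accuracies[cls.__name__] = accuracy_score(y_test, model.predict(X_test))
    print(cls.__name__, accuracies[cls.__name__])
```

Expect GaussianNB to do well here (the features are continuous), while BernoulliNB, which binarizes the all-positive measurements to 1, has little to work with.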
