
Naïve Bayes
Overview
Naive Bayes (NB) is a classic supervised machine learning algorithm based on a traditional statistical principle: Bayes' Theorem. The technique is called "naive" because of the simplifying assumption that the input variables are independent of one another. In general, NB models calculate probabilities: given input data, the algorithm determines the likelihood that an observation belongs to a specific class. Before it can interpret new input data and make class predictions, an NB model must first be trained on labeled training data. NB is most commonly used with text data and/or categorical or count-based features.
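For reference, Bayes' Theorem gives the posterior probability of a class C given observed features x_1, ..., x_n, and the naive independence assumption then factorizes the likelihood into one term per feature:

P(C \mid x_1, \ldots, x_n) = \frac{P(C)\, P(x_1, \ldots, x_n \mid C)}{P(x_1, \ldots, x_n)} \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)

The class with the largest value of this product is the model's prediction.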
More specifically, the Multinomial NB algorithm is designed to work with discrete count data. This version assumes the features follow a multinomial distribution and takes into account how often an item appears. Multinomial NB is often used in text classification; for example, given the frequencies of certain words, the algorithm can predict who the author of a text is likely to be. Bernoulli NB, by contrast, is designed for binary or boolean features: rather than counting frequencies, the algorithm considers only the presence or absence of a feature. This approach is commonly used in email spam detection, where the question is binary: is the incoming email spam or not? Another common variant, Gaussian NB, assumes the features follow a Gaussian (normal) distribution, which makes it most applicable to continuous numeric features such as ages or salaries. Finally, Categorical NB is a helpful approach when the features take a fixed number of discrete categories, as is often the case with population demographic attributes. Each variant is sketched in code below.
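A minimal sketch of how these four variants can be instantiated with scikit-learn; the tiny feature matrices here are made-up illustrations, not the project data:

import numpy as np
from sklearn.naive_bayes import MultinomialNB, BernoulliNB, GaussianNB, CategoricalNB

# Multinomial NB: non-negative count features (e.g. word frequencies)
X_counts = np.array([[2, 0, 1], [0, 3, 0], [1, 1, 4]])
# Bernoulli NB: binary presence/absence features
X_binary = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 1]])
# Gaussian NB: continuous features such as age or finish time
X_cont = np.array([[34.0, 11.5], [52.0, 14.2], [28.0, 9.8]])
# Categorical NB: integer-encoded category indices
X_cat = np.array([[0, 2], [1, 0], [2, 1]])
y = np.array([0, 1, 0])  # two example classes

MultinomialNB().fit(X_counts, y)
BernoulliNB().fit(X_binary, y)
GaussianNB().fit(X_cont, y)
CategoricalNB().fit(X_cat, y)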
For all NB models (although it is often less necessary for Gaussian models dealing with continuous data), smoothing is required to adjust the probability estimates and avoid zero values. Smoothing assigns very small probabilities to unseen events rather than zero. Avoiding zero is necessary because a single zero factor in the probability product drives the entire probability to 0, even if the class is otherwise an accurate fit.
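Concretely, with additive (Laplace) smoothing in a multinomial model, the estimated probability of feature i given class c becomes

\hat{P}(x_i \mid c) = \frac{N_{ci} + \alpha}{N_c + \alpha n}

where N_{ci} is the count of feature i within class c, N_c is the total count of all features in class c, n is the number of features, and alpha > 0 is the smoothing parameter (alpha = 1 gives classic Laplace smoothing; this corresponds to the alpha argument of scikit-learn's NB classes).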
Process of splitting data to train and test Naive Bayes model
Basics of Bayes’ Theorem
Visualization of a basic Naive Bayes’ Classifier
Data Prep for Naive Bayes
For Multinomial and Gaussian Naive Bayes
Snippet of data to be used in Gaussian and Multinomial NB (above)
Snippet of training data to be used in Gaussian and Multinomial NB
Snippet of test data to be used in Gaussian and Multinomial NB
For the Categorical NB data preparation, the features must be categorical, so continuous features were discretized where necessary, as sketched below.
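As an illustration of that discretization step, a continuous column can be binned into labeled groups and then integer-encoded before being passed to Categorical NB; the column names here are hypothetical stand-ins for the survey data:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical example frame; the real project data has its own columns
df = pd.DataFrame({"age": [24, 37, 41, 58, 63],
                   "gender": ["M", "F", "F", "M", "F"]})

# Bin the continuous column into a small number of labeled categories
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 45, 60, 120],
                         labels=["<30", "30-44", "45-59", "60+"])

# CategoricalNB expects integer-encoded categories, so encode all features
X_cat = OrdinalEncoder().fit_transform(df[["age_group", "gender"]])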
After the data had been adequately prepared, it was similarly split into a training and a test set. For the following analysis, the training set was composed of 80% of the responses and the model was then tested on the remaining 20% of responses.
For each of the Naive Bayes models, the label is the race distance: each response corresponds to either a 50 mile (4,447 responses) or a 100 kilometer (4,652 responses) race. For the Multinomial and Gaussian NB data preparation, the features are converted to count and continuous formats.
After the data has been adequately prepared, it must be split into a training and a test set. For the following models, the training set was composed of 80% of the responses and the models were then tested on the remaining 20%. It is crucial for these two sets to be completely disjoint from one another so that model performance can be accurately gauged on previously unseen data.
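A minimal sketch of that 80/20 split with scikit-learn; here df and the "distance" label column are placeholders for the actual prepared dataset:

from sklearn.model_selection import train_test_split

# Features and label; "distance" holds the 50 mi / 100 km class
X = df.drop(columns=["distance"])
y = df["distance"]

# 80% train / 20% test, kept disjoint so evaluation uses unseen rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)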
Categorical Naive Bayes
Snippet of data to be used in Categorical NB (above)
Results
Multinomial Naive Bayes
When predicting the label for the 1,820 responses in the test set, the Multinomial Naive Bayes model correctly predicted the race distance for 72.5% of responses. The model performs reasonably well, but not exceptionally.
More specifically, the confusion matrix shows that the model tends to over-predict the shorter race (50 miles / 80 km): it predicted this class for 55% of responses, when in reality the class made up only 50% of the test set.
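The accuracy and confusion-matrix figures reported here can be produced along these lines; this is a sketch assuming the count-formatted training and test sets from the split above:

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix

nb = MultinomialNB()
nb.fit(X_train, y_train)

y_pred = nb.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
# Rows are the true classes, columns are the predicted classes
print(confusion_matrix(y_test, y_pred))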
Confusion Matrix for Multinomial Naive Bayes
Categorical Naive Bayes
As mentioned previously, due to the mathematical nature of the model, the Categorical NB dataset is slightly different from the Multinomial and Gaussian datasets, so direct model comparisons are not as applicable. That said, the categorical model achieved an accuracy of 81.2% when predicting classes for the test set.
As visualized in the confusion matrix, the categorical model showed a similar bias to the multinomial model in that it over-predicted the 50 mile (80 km) race. The categorical model's bias was even larger, as it predicted this race nearly 58% of the time rather than the expected 50%.
Confusion Matrix for Categorical Naive Bayes
Gaussian Naive Bayes
Working with the exact same training and test sets, the Gaussian NB model outperformed the multinomial model, achieving a much higher accuracy of 87.3%. Given the nature of the data, this model performed very well.
Furthermore, as evidenced by the confusion matrix, the model was not biased towards either response class: the misclassifications were split roughly evenly, with approximately 115 incorrect predictions for each class.
Confusion Matrix for Gaussian Naive Bayes
Summary of Conclusions
As discussed above, of the three Naive Bayes models analyzed, the Gaussian model was by far the strongest performer. It correctly predicted the test class 87.3% of the time and did not exhibit any clear biases.
This aligns with intuition about the data. Many of the variables of interest, such as age or hours to complete a race, are continuous and likely roughly normally distributed, so the data lent itself to being interpreted most accurately by this approach. It is also interesting that both the Multinomial and Categorical NB models were consistently biased towards choosing the shorter race class.
Overall, all three models performed adequately when predicting race distance. Again, this makes logical sense given feature inputs such as hours to complete the race, which will be highly correlated with the length of the race. However, the fact that none of the models scored above 90% suggests there are additional contributing factors that are not being accounted for. One potential missing component is the race course surface: is it on pavement? Grass? Sand? Rocky trails?