The naive Bayes classifier (NBC) comes from classical mathematical theory; it has a solid mathematical foundation and stable classification efficiency. The NBC model also needs few estimated parameters, is not very sensitive to missing data, and is a relatively simple algorithm. It is called “naive” because the formulation makes only the simplest assumption, namely that the features are conditionally independent given the class. Naive Bayes remains effective when little data is available and can handle multi-class problems.

For a detailed explanation of the naive Bayes algorithm, see: https://boywithacoin.cn/article/fen-lei-su…

Email spam filtering follows these steps:

- Collect the data. The data set is at https://github.com/Freen247/database/tree/…
- Parse each text file into a token vector
- Check the tokens to make sure the parsing is correct
- Train, test, and use the algorithm
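The parsing step in the list above can be sketched as follows. This is a minimal sketch, not the author's code: the function name `text_parse`, the regular expression, and the two-character length threshold are assumptions made for illustration.

```python
import re

def text_parse(text):
    """Split raw text into lowercase tokens, dropping very short ones."""
    # Split on runs of non-word characters (punctuation, whitespace).
    tokens = re.split(r"\W+", text)
    # Keep tokens longer than two characters; this threshold is an assumption.
    return [tok.lower() for tok in tokens if len(tok) > 2]

sample = "Hi Peter, the Python course starts on Monday at 9 AM!"
print(text_parse(sample))
```

Printing the tokens for a few real emails is a quick way to carry out the "check the tokens" step before training.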

The program relies on a third-party library, so install the dependency first:

`pip install feedparser`

## 0x00 Implement vocabulary-to-vector conversion

Taking an object-oriented approach, define a `Bayes` class:

```
#!/usr/bin/python
# -*- coding: utf-8 -*-
#__author__ : stray_camel
#pip_source : https://mirrors.aliyun.com/pypi/simple
import sys,os
class Bayes():
    def __init__(self,
        # Default to the directory that contains this file
        absPath:"Directory of the current file" = os.path.dirname(os.path.abspath(__file__)),
        ):
        self.absPath = absPath
```

The `createVocabList` function returns a list of the unique words that appear across all documents:

```
    # Build a list of the unique words that appear across all documents
    def createVocabList(self,
        dataSet:dict(type="", help = "the source data"),
        )->dict(type=list, help = "Deduplicated list"):
        vocabSet = set()  # start with an empty set; a set holds no duplicates
        for document in dataSet:
            vocabSet = vocabSet | set(document)  # union of the two sets
        return list(vocabSet)
```

We also need a function that takes the vocabulary (or all the words we want to check) as input and builds one feature per word. Given a document, it converts that document into a word vector.

`#determine if a term appears in the documents`
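A minimal sketch of such a conversion, assuming a set-of-words model in which each feature is 1 if the vocabulary word occurs in the document and 0 otherwise. The function name `set_of_words_to_vec` and the sample data are hypothetical and are not taken from the original listing.

```python
def set_of_words_to_vec(vocab_list, input_words):
    """Return a 0/1 vector marking which vocabulary words occur in input_words."""
    # Map each vocabulary word to its position in the output vector.
    index = {word: i for i, word in enumerate(vocab_list)}
    vec = [0] * len(vocab_list)
    for word in input_words:
        if word in index:
            vec[index[word]] = 1
        # Words outside the vocabulary are silently ignored in this sketch.
    return vec

vocab = ["buy", "cheap", "hello", "meeting", "now"]
doc = ["hello", "buy", "now", "buy", "unknown"]
print(set_of_words_to_vec(vocab, doc))
```

Repeated words still produce a 1 here; counting occurrences instead would turn this into a bag-of-words model.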

## 0x01 Implement the Bayes classifier training function

The training function for the naive Bayes classifier:

```
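    # Naive Bayes classifier training function
    # Requires `from numpy import ones, log` at the top of the module
    def trainNB0(self,trainMatrix,trainCategory):
        numTrainDocs = len(trainMatrix)
        numWords = len(trainMatrix[0])
        # Prior probability of class 1: the fraction of documents labelled 1
        pAbusive = sum(trainCategory)/float(numTrainDocs)
        # Start counts at 1 and denominators at 2.0 so that a word unseen
        # in one class never produces a zero probability
        p0Num = ones(numWords)
        p1Num = ones(numWords)
        p0Denom = 2.0
        p1Denom = 2.0
        for i in range(numTrainDocs):  # iterate through all documents
            if trainCategory[i]==1:
                p1Num += trainMatrix[i]
                p1Denom += sum(trainMatrix[i])
            else:
                p0Num += trainMatrix[i]
                p0Denom += sum(trainMatrix[i])
        # Take logs so that products of many small probabilities become sums
        p1Vect = log(p1Num / p1Denom)
        p0Vect = log(p0Num / p0Denom)
        return p0Vect, p1Vect, pAbusive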
```
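The classification step that uses these return values is not shown here. A minimal sketch of it, assuming the two log-probability vectors and the class-1 prior returned by `trainNB0`, might look like the following. The name `classify_nb` and the toy numbers are hypothetical, and plain lists with `math.log` are used so the example runs without NumPy.

```python
import math

def classify_nb(vec, p0_vect, p1_vect, p_class1):
    """Return 1 if the class-1 log score is higher, otherwise 0."""
    # Sum the log-likelihoods of the words that are present (vec entry is 1),
    # then add the log-prior of each class.
    score1 = sum(x * lp for x, lp in zip(vec, p1_vect)) + math.log(p_class1)
    score0 = sum(x * lp for x, lp in zip(vec, p0_vect)) + math.log(1.0 - p_class1)
    return 1 if score1 > score0 else 0

# Made-up log-probability vectors for a three-word vocabulary.
p0_vect = [math.log(0.5), math.log(0.4), math.log(0.1)]
p1_vect = [math.log(0.1), math.log(0.2), math.log(0.7)]

print(classify_nb([0, 0, 1], p0_vect, p1_vect, 0.5))  # third word favours class 1
print(classify_nb([1, 1, 0], p0_vect, p1_vect, 0.5))  # first two words favour class 0
```

Because the scores are sums of logarithms, comparing them is equivalent to comparing the products of the underlying probabilities, without the risk of floating-point underflow.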

## 0x02 Implement the spam test function

Use `spamTest()` to automate the Bayesian spam classifier. It imports the text files under the `spam` and `ham` folders and parses them into word lists. The example contains 20 emails, 10 of which are randomly selected as the test set. The probabilities the classifier needs are computed only from the documents in the training set. Randomly selecting part of the data as the training set and using the rest as the test set is called hold-out cross-validation.

The `spamTest()` function:

`#filtering email, training+testing`
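The random train/test split described above, commonly called hold-out validation, can be sketched as follows. This is an illustration under assumptions rather than the original `spamTest()` body: the helper names `holdout_split` and `error_rate` and the `seed` parameter are hypothetical.

```python
import random

def holdout_split(n_docs, n_test, seed=None):
    """Randomly pick n_test document indices for testing; the rest are for training."""
    rng = random.Random(seed)
    indices = list(range(n_docs))
    test_idx = rng.sample(indices, n_test)  # sampled without replacement
    test_set = set(test_idx)
    train_idx = [i for i in indices if i not in test_set]
    return train_idx, test_idx

def error_rate(predictions, labels):
    """Fraction of predictions that differ from the true labels."""
    wrong = sum(1 for p, y in zip(predictions, labels) if p != y)
    return wrong / len(labels)

# 20 emails with 10 held out for testing, matching the numbers in the text.
train_idx, test_idx = holdout_split(n_docs=20, n_test=10, seed=0)
print(len(train_idx), len(test_idx))
print(error_rate([1, 0, 1, 1], [1, 0, 0, 1]))
```

Because the split is random, the error rate varies from run to run; averaging over several runs gives a steadier estimate.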

Run the function as follows; its output lists the misclassified emails:

```
if __name__ == "__main__":
    test = Bayes()
    test.spamTest()
```

`The wrong classification is:`

This work is released under a CC license; reprints must credit the author and link to this article.