Filter spam using naive Bayes

Time:2022-7-4

The naive Bayes classifier (NBC) is rooted in classical probability theory, which gives it a solid mathematical foundation and stable classification performance. The model needs few parameters to be estimated, is not very sensitive to missing data, and the algorithm is relatively simple. It is called “naive” because the formulation makes only the simplest possible assumption: that the features are conditionally independent given the class. Naive Bayes remains effective when little data is available and can handle multi-class problems.

Detailed explanation of the naive Bayes algorithm: https://boywithacoin.cn/article/fen-lei-su…

Email spam filtering, step by step

The program depends on a third-party library, so install the dependency first:

pip install feedparser

0x00 Implement vocabulary-to-vector conversion

Taking an object-oriented approach, define a Bayes class:

#!/usr/bin/python
# -*- coding: utf-8 -*-
#__author__ : stray_camel
#pip_source : https://mirrors.aliyun.com/pypi/simple
import os
import sys

from numpy import log, ones  # used by trainNB0

class Bayes():
    def __init__(self,
    absPath:"Directory of the current file" = os.path.dirname(os.path.abspath(__file__)),
    ):
        # Defaults to the directory that contains this script
        self.absPath = absPath

The createVocabList function returns a list of the unique words that appear across all documents:

    # Build a list of every distinct word that appears in the documents
    def createVocabList(self,
    dataSet:dict(type="", help = "the source data"),
    )->dict(type=list, help = "Deduplicated list"):
        vocabSet = set()  # start with an empty set; a set holds no duplicates
        for document in dataSet:
            vocabSet = vocabSet | set(document)  # union with this document's words
        return list(vocabSet)
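
To make the result concrete, here is a minimal standalone sketch of the same idea. The function name `create_vocab_list` and the sample documents are made up for this illustration, and unlike the method above it sorts the result so that the order is reproducible.

```python
def create_vocab_list(documents):
    """Return the distinct words across all documents, sorted for a stable order."""
    vocab = set()
    for doc in documents:
        vocab |= set(doc)  # add this document's words, duplicates are dropped
    return sorted(vocab)


# Hypothetical tokenized documents, not taken from the email data set
docs = [["cheap", "pills", "now"], ["meeting", "now"]]
vocab = create_vocab_list(docs)
print(vocab)  # ['cheap', 'meeting', 'now', 'pills']
```

Sorting is only a convenience here: the classifier works with any fixed word order, as long as the same vocabulary list is used for training and testing.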

We also need a function that takes the vocabulary (the full list of words we want to check for) as input and builds one feature per word. Given a document, the function converts it into a word vector.

#determine if a term appears in the documents
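
The conversion described above can be sketched roughly as follows. This is an assumption about what such a function looks like, not the article's own code: the name `set_of_words_to_vec` and the sample data are hypothetical, and the sketch uses a set-of-words model in which each entry is 1 if the vocabulary word occurs in the document and 0 otherwise.

```python
def set_of_words_to_vec(vocab_list, input_words):
    """Mark with 1 every vocabulary word that occurs in input_words."""
    vec = [0] * len(vocab_list)
    # Map each vocabulary word to its position for fast lookup
    index = {word: i for i, word in enumerate(vocab_list)}
    for word in input_words:
        if word in index:
            vec[index[word]] = 1
        # Words outside the vocabulary are silently ignored in this sketch
    return vec


# Hypothetical vocabulary and document
vocab = ["cheap", "meeting", "now", "pills"]
vec = set_of_words_to_vec(vocab, ["now", "cheap", "unknown"])
print(vec)  # [1, 0, 1, 0]
```

The resulting vector has the same length as the vocabulary, which is what lets the training step add document vectors together word by word.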

0x01 Implement the Bayes classifier training function

The naive Bayes classifier training function:

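    # naive Bayes classifier training function
    # (ones and log come from numpy: from numpy import ones, log)
    def trainNB0(self,trainMatrix,trainCategory):
        numTrainDocs=len(trainMatrix)   # number of training documents
        numWords=len(trainMatrix[0])    # vocabulary size
        pAbusive=sum(trainCategory)/float(numTrainDocs)  # prior P(class = 1)
        # Start counts at 1 and denominators at 2 (smoothing) so that
        # no word ends up with a probability of zero
        p0Num = ones(numWords)
        p1Num = ones(numWords)
        p0Denom = 2.0
        p1Denom = 2.0
        for i in range(numTrainDocs):  # iterate through all documents
            if trainCategory[i]==1:
                p1Num+=trainMatrix[i]
                p1Denom+=sum(trainMatrix[i])
            else:
                p0Num+=trainMatrix[i]
                p0Denom+=sum(trainMatrix[i])

        # Take logs so that combining many small probabilities does not underflow
        p1Vect = log(p1Num / p1Denom)
        p0Vect = log(p0Num / p0Denom)
        return p0Vect, p1Vect, pAbusive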
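
The article does not show how the trained vectors are used, so the following is only a sketch of the usual next step. Because the vectors hold log-probabilities, the product of per-word probabilities becomes a sum, and the class with the larger score wins. The function name `classify_nb` and the numbers are hypothetical, and the sketch uses plain Python lists and `math.log` instead of NumPy arrays to stay self-contained.

```python
import math


def classify_nb(vec, p0_vect, p1_vect, p_class1):
    """Return 1 if class 1 has the higher log-posterior score, else 0."""
    # Sum the log-likelihoods of the words present in the document,
    # then add the log of the class prior
    score1 = sum(x * lp for x, lp in zip(vec, p1_vect)) + math.log(p_class1)
    score0 = sum(x * lp for x, lp in zip(vec, p0_vect)) + math.log(1.0 - p_class1)
    return 1 if score1 > score0 else 0


# Made-up log-probability vectors for a three-word vocabulary
p0_vect = [math.log(0.5), math.log(0.4), math.log(0.1)]
p1_vect = [math.log(0.1), math.log(0.2), math.log(0.7)]

label_a = classify_nb([0, 0, 1], p0_vect, p1_vect, 0.5)  # only the third word present
label_b = classify_nb([1, 1, 0], p0_vect, p1_vect, 0.5)  # first two words present
print(label_a, label_b)  # 1 0
```

Working in log space is the reason trainNB0 returns logs: multiplying many probabilities below 1 quickly underflows to zero in floating point, while adding their logs does not.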

0x02 Implement the spam test function

spamTest() automates the Bayesian spam classifier. It imports the text files under the spam and ham folders and parses them into word lists. The example uses 20 emails, of which 10 are randomly selected as the test set. The probabilities the classifier needs are computed only from the documents in the training set. Randomly choosing part of the data as the training set and using the rest as the test set is called hold-out cross-validation.

spamTest()

#filtering email, training+testing
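
The hold-out split and the error-rate calculation can be sketched as follows. This is not the article's spamTest() body: the helper names `holdout_split` and `error_rate`, the `seed` parameter, and the sample labels are assumptions made for this illustration, and the sketch works on document indices rather than on the actual email files.

```python
import random


def holdout_split(n_docs, n_test, seed=None):
    """Randomly pick n_test document indices for testing; the rest are for training."""
    rng = random.Random(seed)  # a seed makes the split reproducible
    indices = list(range(n_docs))
    test_idx = rng.sample(indices, n_test)
    held_out = set(test_idx)
    train_idx = [i for i in indices if i not in held_out]
    return train_idx, test_idx


def error_rate(predicted, actual):
    """Fraction of test documents whose predicted label differs from the true one."""
    errors = sum(1 for p, a in zip(predicted, actual) if p != a)
    return errors / len(actual)


# 20 emails with 10 held out for testing, matching the numbers in the text
train_idx, test_idx = holdout_split(20, 10, seed=0)
print(len(train_idx), len(test_idx))  # 10 10

# Hypothetical predictions versus true labels: one mistake out of four
print(error_rate([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.25
```

Since the split is random, the error rate varies from run to run. Repeating the split several times and averaging the error rates gives a steadier estimate.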

Running the function produces the result shown in the following figure:

if __name__ == "__main__":
    test = Bayes()
    test.spamTest()

The wrong classification is:

This work is licensed under a CC license. Reprints must credit the author and link to this article.

This article was first published on my blog, Stray_Camel(^U^)ノ~YO