Spam Filtering with Machine Learning in C#

Time:2020-7-31

In this chapter, we will build a spam filtering classification model. We will use a raw email dataset that contains both spam and ham emails and use it to train our ML model, following the steps for developing ML models discussed in the previous chapter. This will help us solidify our understanding of the workflow.

In this chapter, we will discuss the following topics:

    - Defining the problem

    - Preparing the data

    - Data analysis

    - Building data features

    - Email spam filtering with logistic regression and Naive Bayes

    - Validating the classification model

Defining the problem

Let's start by defining the problem this chapter addresses. We may already be familiar with spam; spam filtering is a basic feature of mass email services. Beyond being a nuisance, spam can carry real risks. For example, spam can be designed to obtain credit card numbers or bank account information that can be used for credit card fraud or money laundering. Spam can also be used to obtain personal data, which can then be used for identity theft and various other crimes. Spam filtering technology is therefore an important safeguard for e-mail services to protect users from such crimes. However, getting spam filtering right is difficult. We want to filter out suspicious emails, but at the same time we do not want to filter so aggressively that legitimate mail lands in the spam folder and is never seen by users. To solve this problem, we will let our ML model learn from a raw e-mail dataset and use the subject lines to classify suspicious e-mails as spam. We will look at two performance metrics to measure our success: precision and recall. We will discuss these metrics in detail in the following sections.

To summarize our problem definition:

What is the problem to be solved? We need a spam filtering solution that prevents our users from becoming victims of fraud while improving the user experience.

Why is this a problem? It is difficult to strike the right balance between filtering out suspicious messages and not filtering too aggressively, so that spam is caught while legitimate mail still reaches the inbox. We will rely on an ML model to learn how to classify suspicious messages statistically.

What are the solutions to this problem? We will build a classification model that flags potential spam based on the subject line of a message, and we will use precision and recall to balance the number of messages that are filtered.

What are the criteria for success? We want high recall (the percentage of actual spam retrieved out of the total amount of spam) without sacrificing too much precision (the percentage of messages predicted as spam that really are spam).

 

Preparing the data

Now that we have clearly stated and defined the problem to be solved with ML, we need to prepare the data. Usually, an additional data-collection step precedes data preparation, but for now we will use a precompiled and labeled dataset that is publicly available. In this chapter, we will use the CSDMC2010 SPAM corpus to train and test our model. The dataset includes a text file named SPAMTrain.label, which encodes the label of each email in the training folder: 0 for spam and 1 for ham. We will use this text file and the email data in the training folder to build a spam classification model.
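Each line of SPAMTrain.label pairs a label with an email file name, separated by a space; the entries look roughly like this (the file names here are illustrative):

0 TRAIN_00000.eml
1 TRAIN_00001.eml
1 TRAIN_00002.eml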

What we have now is a raw dataset consisting of many EML files, each containing the information for a single email, plus a text file containing the label information. To make this raw dataset usable for building a spam classification model, we need to do the following work:

  1. Extract the subject and body from the EML files: The first step in preparing the data is to extract the subject and body from the individual EML files. We will use a package called EAGetMail to load the EML files and extract information from them. With the EAGetMail package, we can easily extract the subject and body content from an EML file. Once the subject and body are extracted from an email, each record needs to be appended to a Deedle data frame as a row.
  2. Combine the extracted data with the labels: After extracting the subject and body content from the individual EML files, we need to do one more thing. We need to map the encoded labels (0 for spam, 1 for ham) to each row of the data frame we created in the previous step. If we open SPAMTrain.label with any text editor, we can see the encoded label in the first column and the corresponding email file name in the second column, separated by a space. Using the ReadCsv function of Deedle's Frame class, we can easily load this label data into a data frame by specifying a space as the separator. Once the label data is loaded into a data frame, we can simply add its first column to the data frame created in the previous step using the AddColumn function of the Deedle framework.
  3. Export the merged data as a CSV file: Now that we have a data frame containing both email and label data, we can export it to a CSV file for future use. Using the SaveCsv function of Deedle's Frame class, we can easily save a data frame as a CSV file.

The code for this data preparation step is as follows:

using Deedle;
using EAGetMail;
using System;
using System.IO;
using System.Linq;

namespace PrepareData
{
    internal class Program
    {
        private static void Main(string[] args)
        {
            // Get all the raw EML-format email files
            // TODO: change the path to your data directory
            string rawDataDirPath = @"D:\work\code-base\AI\spam-filtering\raw-data";
            string[] emailFiles = Directory.GetFiles(rawDataDirPath, "*.eml");

            // Parse the subject and body from the email files
            var emailDF = ParseEmails(emailFiles);
            // Get the label of each email (spam vs. ham)
            var labelDF = Frame.ReadCsv(rawDataDirPath + "\\SPAMTrain.label", hasHeaders: false, separators: " ", schema: "int,string");
            // Add these labels to the email data frame
            emailDF.AddColumn("is_ham", labelDF.GetColumnAt<int>(0));
            // Save the parsed emails and labels as a CSV file
            emailDF.SaveCsv("transformed.csv");

            Console.WriteLine("Data preparation step completed!");
            Console.ReadKey();
        }

        private static Frame<int, string> ParseEmails(string[] files)
        {
            // We will parse the subject and body of each email and store each record as a key-value pair
            var rows = files.AsEnumerable().Select((x, i) =>
            {
                // Load each email file into a Mail object
                Mail email = new Mail("TryIt");
                email.Load(x, false);

                // Extract the subject and body
                string emailSubject = email.Subject;
                string textBody = email.TextBody;

                // Create a record with the email ID (emailNum), subject, and body
                return new { emailNum = i, subject = emailSubject, body = textBody };
            });

            // Create a data frame from the rows created above
            return Frame.FromRecords(rows);
        }
    }
}


After running this code, the program will create a file called transformed.csv, which will contain four columns (emailNum, subject, body, and is_ham). We will use this output data as the input to the next steps of building the ML model for the spam filtering project. However, feel free to experiment with the Deedle framework and the EAGetMail package to adjust and prepare the data in different ways. The code presented here is one way to prepare this raw e-mail data for future use, covering some of the information we can extract from it. With the EAGetMail package, we can also extract other features, such as the sender's e-mail address and the attachments of an e-mail. These additional features may help improve the spam classification model.
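For instance, a minimal sketch of pulling the sender's address and attachment count out of an EML file might look like the following. This helper is illustrative only, and the From and Attachments properties are assumptions about the EAGetMail Mail class; verify them against the package documentation:

// Hypothetical helper (not part of the original program): extract extra features from one EML file
private static void PrintExtraFeatures(string emlFilePath)
{
    Mail email = new Mail("TryIt");
    email.Load(emlFilePath, false);

    string senderAddress = email.From.Address;      // the sender's e-mail address (assumed API)
    int attachmentCount = email.Attachments.Length; // the number of attachments (assumed API)

    Console.WriteLine("{0} sent this email with {1} attachment(s)", senderAddress, attachmentCount);
}

Columns built this way could be appended to the data frame alongside subject and body before saving the CSV.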

Data analysis

In the data preparation step, we converted the raw dataset into a more readable and usable one. We now have a single file we can inspect to find out which messages are spam and which are not, and we can easily look up the subject lines of spam and ham messages. With this transformed data, let's start looking at what the data actually looks like and see whether we can find any patterns or problems in it.

Because we are working with text data, the first thing to look at is how the word distribution differs between spam and ham. To do this, we need to convert the output of the previous step into a matrix representation of word occurrences. Let's walk through this step by step using the first three subject lines in the data, which are as follows:

[Figure: the first three subject lines in the dataset]

If we transform the data so that each column corresponds to a word from the subject lines, encoding each cell as 1 if the given subject line contains that word and 0 if it does not, the resulting matrix looks as follows:

[Figure: one-hot encoded word matrix for the first three subject lines]

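For illustration only, suppose the three subject lines were the hypothetical examples "win a free prize now", "meeting agenda for monday", and "free prize inside" (these are stand-ins, not rows from the actual dataset). The one-hot encoded matrix would then look roughly like this:

           win  a  free  prize  now  meeting  agenda  for  monday  inside
subject 1   1   1   1     1      1      0       0      0     0       0
subject 2   0   0   0     0      0      1       1      1     1       0
subject 3   0   0   1     1      0      0       0      0     0       1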
This particular encoding method is called one-hot encoding. We only care about whether a particular word appears in a subject line, not how many times it actually appears. In the example above, we also removed all punctuation marks, such as colons, question marks, and exclamation marks. To do this programmatically, we can use a regular expression to split each subject line into tokens containing only alphanumeric characters, and then build a data frame with one-hot encoding. The code for this encoding step is as follows:

private static Frame<int, string> CreateWordVec(Series<int, string> rows)
{
    var wordsByRows = rows.GetAllValues().Select((x, i) =>
    {
        var sb = new SeriesBuilder<string, int>();

        ISet<string> words = new HashSet<string>(
            Regex.Matches(
                // Alphanumeric characters only (this version still admits digits; we refine it later)
                x.Value, @"\w+('(s|d|t|ve|m))?"
            ).Cast<Match>().Select(
                // Then, convert each word to lowercase
                y => y.Value.ToLower()
            ).ToArray()
        );

        // Encode each word that appears in the line as 1
        foreach (string w in words)
        {
            sb.Add(w, 1);
        }

        return KeyValue.Create(i, sb.Series);
    });

    // Create a data frame from the rows we just created and encode missing values as 0
    var wordVecDF = Frame.FromRows(wordsByRows).FillMissing(0);

    return wordVecDF;
}


With the words represented as a one-hot encoded matrix, our data analysis becomes much easier. For example, to see the top ten words in spam emails, we can simply sum the values of each column of the one-hot encoded word matrix for spam emails and take the ten words with the highest sums. This is exactly what the following code does:

// hamEmailCount and spamEmailCount (defined elsewhere) are the counts of ham and spam emails
var hamTermFrequencies = subjectWordVecDF.Where(
    x => x.Value.GetAs<int>("is_ham") == 1
).Sum().Sort().Reversed.Where(x => x.Key != "is_ham");

var spamTermFrequencies = subjectWordVecDF.Where(
    x => x.Value.GetAs<int>("is_ham") == 0
).Sum().Sort().Reversed;

// Look at the top 10 terms in ham and spam emails
var topN = 10;

var hamTermProportions = hamTermFrequencies / hamEmailCount;
var topHamTerms = hamTermProportions.Keys.Take(topN);
var topHamTermsProportions = hamTermProportions.Values.Take(topN);

System.IO.File.WriteAllLines(
    dataDirPath + "\\ham-frequencies.csv",
    hamTermFrequencies.Keys.Zip(
        hamTermFrequencies.Values, (a, b) => string.Format("{0},{1}", a, b)
    )
);

var spamTermProportions = spamTermFrequencies / spamEmailCount;
var topSpamTerms = spamTermProportions.Keys.Take(topN);
var topSpamTermsProportions = spamTermProportions.Values.Take(topN);

System.IO.File.WriteAllLines(
    dataDirPath + "\\spam-frequencies.csv",
    spamTermFrequencies.Keys.Zip(
        spamTermFrequencies.Values, (a, b) => string.Format("{0},{1}", a, b)
    )
);


As you can see from this code, we use the Sum method of Deedle's data frame to sum the values in each column and sort the result in reverse order, once for spam and once for ham. Then we use the Take method to get the top ten words in spam and ham emails. When we run this code, it generates two CSV files: ham-frequencies.csv and spam-frequencies.csv. These two files record how often each word appears in ham and spam emails; we will use them later in the feature engineering and model building steps.

Now let's visualize some of the data for further analysis. First, take a look at the distribution of ham and spam emails in the dataset:

[Figure: bar chart of ham vs. spam counts in the sample set]

As you can see from this bar chart, the dataset contains more ham than spam, just as in the real world: our inboxes receive more ham than spam.

We used the following code to generate this histogram to visualize the distribution of ham and spam emails in the dataset:

var barChart = DataBarBox.Show(
    new string[] { "Ham", "Spam" },
    new double[] {
        hamEmailCount,
        spamEmailCount
    }
);
barChart.SetTitle("Ham vs. Spam in Sample Set");


Using the DataBarBox class in Accord.NET, we can easily visualize the data as a bar chart. Now let's look at the top ten words that appear most frequently in ham and spam emails. You can use the following code to generate bar charts for the top ten terms in ham and spam emails:

var hamBarChart = DataBarBox.Show(
    topHamTerms.ToArray(),
    new double[][] {
        topHamTermsProportions.ToArray(),
        spamTermProportions.GetItems(topHamTerms).Values.ToArray()
    }
);
hamBarChart.SetTitle("Top 10 Terms in Ham Emails (blue: HAM, red: SPAM)");
System.Threading.Thread.Sleep(3000);
hamBarChart.Invoke(
    new Action(() =>
    {
        hamBarChart.Size = new System.Drawing.Size(5000, 1500);
    })
);

var spamBarChart = DataBarBox.Show(
    topSpamTerms.ToArray(),
    new double[][] {
        hamTermProportions.GetItems(topSpamTerms).Values.ToArray(),
        topSpamTermsProportions.ToArray()
    }
);
spamBarChart.SetTitle("Top 10 Terms in Spam Emails (blue: HAM, red: SPAM)");

Similarly, we use the DataBarBox class to display the bar charts. When running this code, we will see the following figure, which shows the top ten terms that appear most frequently in ham emails:

[Figure: top 10 terms in ham emails]

The bar chart of the ten most common terms in spam email is as follows:

[Figure: top 10 terms in spam emails]

As expected, the word distribution in spam is quite different from that in ham. For example, the words spam and hibody appear very often in spam emails but rarely in ham. However, some of it does not make much sense. Looking closely, all spam and ham subject lines appear to contain the words trial and version, which is unlikely. If you open some of the raw EML files in a text editor, you can easily see that not all subject lines contain these two words.

So what happened? Was our data contaminated in the earlier data preparation or data analysis steps?

Further investigation shows that one of the packages we use is causing this problem. The EAGetMail package we use to load and extract e-mail content automatically appends (Trial Version) to the end of the subject line when using its trial version. Now that we know the root cause of this data problem, we need to go back and fix it. One solution is to go back to the data preparation step and update the ParseEmails function with the following code, which simply strips the appended (Trial Version) flag from the subject line:

private static Frame<int, string> ParseEmails(string[] files)
{
    // We will parse the subject and body of each email and store each record as a key-value pair
    var rows = files.AsEnumerable().Select((x, i) =>
    {
        // Load each email file into a Mail object
        Mail email = new Mail("TryIt");
        email.Load(x, false);

        // Extract the subject and body
        string EATrialVersionRemark = "(Trial Version)"; // the trial version of EAGetMail appends "(Trial Version)" to the subject
        string emailSubject = email.Subject.EndsWith(EATrialVersionRemark) ?
            email.Subject.Substring(0, email.Subject.Length - EATrialVersionRemark.Length) : email.Subject;
        string textBody = email.TextBody;

        // Create a record with the email ID (emailNum), subject, and body
        return new { emailNum = i, subject = emailSubject, body = textBody };
    });

    // Create a data frame from the rows created above
    return Frame.FromRecords(rows);
}


After updating this code and running the data preparation and analysis steps again, the bar charts of the word distributions become much more meaningful.

The following bar chart shows the top ten terms that appear most frequently in ham messages after the fix, with the (Trial Version) tag removed:

[Figure: top 10 terms in ham emails after the fix]

The bar chart below shows the top ten terms that appear most frequently in spam messages after removing the (Trial Version) flag:

[Figure: top 10 terms in spam emails after the fix]

This is a good example of why the data analysis step matters when building ML models. It is very common to iterate between the data preparation and data analysis steps, because we usually discover problems with the data during analysis, and we can often improve data quality by updating some of the data preparation code. Now that we have clean data in a matrix representation of the words used in the subject lines, it is time to start building the actual features we will use for the ML model.

Building data features

In the previous steps, we briefly looked at the word distributions for spam and ham, and we noticed a few things. First, many of the most frequent words are common words that carry little meaning. For example, words such as to, the, for, and a are used everywhere, and our ML algorithm will not learn much from them. These words are called stop words, and they are usually ignored or removed from the feature set. We will use NLTK's stop-word list to filter common words out of our feature set.

One way to filter these stop words is as follows:

// Read in the stop-word list
// stopWordsFilePath: path to your stop-word list file (one word per line)
ISet<string> stopWords = new HashSet<string>(File.ReadLines(stopWordsFilePath));
// Filter the stop words out of the term-frequency series
var spamTermFrequenciesAfterStopWords = spamTermFrequencies.Where(
    x => !stopWords.Contains(x.Key)
);


After filtering out stop words, the ten most common words in ham emails are as follows:

[Figure: top 10 terms in ham emails after removing stop words]

After filtering out stop words, the ten most common words in spam emails are as follows:

[Figure: top 10 terms in spam emails after removing stop words]

As you can see from these bar charts, filtering stop words out of the feature set pushes more meaningful words to the top of the frequent-word lists. However, we also notice something else: numbers appear among the most common words. For example, the numbers 3 and 2 made it into the top ten ham words, and 80 and 70 made it into the top ten spam words. It is hard to say whether these numbers will help the ML model classify e-mails as spam or ham.

There are several ways to filter these numbers out of the feature set, but we will show just one here. We updated the regular expression used in the previous step to match only words consisting of alphabetic characters, rather than all alphanumeric tokens. The following code shows how to update the CreateWordVec function to filter numbers out of the feature set:

private static Frame<int, string> CreateWordVec(Series<int, string> rows)
{
    var wordsByRows = rows.GetAllValues()
        .Select((x, i) =>
        {
            var sb = new SeriesBuilder<string, int>();
            ISet<string> words = new HashSet<string>(
                // Alphabetic characters only
                Regex.Matches(x.Value, "[a-zA-Z]+('(s|d|t|ve|m))?")
                .Cast<Match>()
                // Then, convert each word to lowercase
                .Select(y => y.Value.ToLower())
                .ToArray()
            );
            // Encode each word that appears in the line as 1
            foreach (string w in words)
            {
                sb.Add(w, 1);
            }
            return KeyValue.Create(i, sb.Series);
        });
    // Create a data frame from the rows we just created and encode missing values as 0
    var wordVecDF = Frame.FromRows(wordsByRows).FillMissing(0);
    return wordVecDF;
}


After filtering the numbers out of the feature set, the word distribution for ham emails looks as follows:

[Figure: top 10 terms in ham emails after removing numbers]

The word distribution for spam emails, after filtering the numbers out of the feature set, looks like this:

[Figure: top 10 terms in spam emails after removing numbers]
As you can see from these bar charts, more meaningful words now sit at the top of the lists, and the word distributions of spam and ham now look clearly different from each other. Words that frequently appear in spam rarely appear in ham, and vice versa.

Once you run this code, it will generate bar charts showing the word distributions for spam and ham, along with two CSV files: one listing the words appearing in ham emails with their occurrence counts, and another listing the words appearing in spam emails with their occurrence counts. In the model building section below, we will use this term-frequency output for feature selection when building the spam classification model, as sketched next.
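The model-building code in the next section assumes two arrays, input and output, which are not shown in the listings above. As a rough bridging sketch (the names subjectWordVecDF and wordFeatures are carried over from earlier listings, and the conversion details are assumptions rather than the author's exact code), the one-hot frame can be turned into the arrays Accord.NET expects like this:

// Sketch: convert the one-hot word-vector frame into Accord.NET-style arrays.
// wordFeatures holds the selected term columns; "is_ham" is the label column.
var rows = subjectWordVecDF.Rows;
double[][] input = rows.Keys.Select(
    k => wordFeatures.Select(w => rows[k].GetAs<double>(w)).ToArray()
).ToArray();
int[] output = subjectWordVecDF.GetColumn<int>("is_ham").Values.ToArray();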

Email spam filtering with logistic regression and Naive Bayes

We have come a long way toward building our first ML model in C#. In this section, we will train logistic regression and Naive Bayes classifiers to classify e-mails as spam or ham. We will run both learning algorithms through cross-validation to better understand how our classification models would perform in practice. As discussed briefly in the previous chapter, in k-fold cross-validation the training set is divided into k subsets of equal size; one subset serves as the validation set while the remaining k-1 subsets are used to train the model. The process is repeated k times, using a different subset or fold as the validation set in each iteration, and the k validation results are then averaged to report a single estimate.

Let's first look at how to instantiate a cross-validation algorithm with Accord.NET in C#. The code is as follows:

var cvLogisticRegressionClassifier = CrossValidation.Create<LogisticRegression, IterativeReweightedLeastSquares<LogisticRegression>, double[], int>(
    // Number of folds
    k: numFolds,
    // Learning algorithm
    learner: (p) => new IterativeReweightedLeastSquares<LogisticRegression>()
    {
        MaxIterations = 100,
        Regularization = 1e-6
    },
    // Use the 0-1 loss function as the cost function
    loss: (actual, expected, p) => new ZeroOneLoss(expected).Loss(actual),
    // Fit the classifier
    fit: (teacher, x, y, w) => teacher.Learn(x, y, w),
    // Input
    x: input,
    // Output
    y: output
);
// Run cross-validation
var result = cvLogisticRegressionClassifier.Learn(input, output);


Let's take a closer look at this code. By providing the model type to be trained, the learning algorithm type that fits the model, the input data type, and the output data type, we can create a new cross-validation algorithm with the static Create function. In this example, we create a new cross-validation algorithm with LogisticRegression as the model, IterativeReweightedLeastSquares as the learning algorithm, a double array as the input type, and an integer (the label) as the output type. You can try different learning algorithms to train the logistic regression model; in Accord.NET, you can alternatively choose the logistic gradient descent algorithm as the learning algorithm for a logistic regression model.

For the parameters, we can specify the number of folds for k-fold cross-validation (k), the learning method with custom parameters (learner), the chosen loss/cost function (loss), a function that knows how to fit the model with the learning algorithm (fit), the input (x), and the output (y). For illustration in this section, we set a relatively small number, 3, for k-fold cross-validation. We also chose a relatively small number, 100, for the maximum iterations, and 1e-6 (1/1,000,000) for the regularization of the iterative reweighted least squares learning algorithm. For the loss function, we use a simple 0-1 loss function, which assigns 0 to a correct prediction and 1 to a wrong prediction; this is the cost our learning algorithm tries to minimize. All of these parameters can be tuned: we can choose a different loss/cost function, a different number of folds, and different maximum iterations and regularization for the learning algorithm. We can even use a different learning algorithm to fit the logistic regression model, such as logistic gradient descent, which iteratively searches for a local minimum of the loss function.
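For example, here is a hedged sketch of swapping in logistic gradient descent as the learner while keeping everything else the same. Treat it as an illustration: the generic parameters may need adjusting, since the label type this learner accepts may differ from int (for example, bool):

var cvLogisticGradientDescent = CrossValidation.Create<LogisticRegression, LogisticGradientDescent, double[], int>(
    k: numFolds,
    // Logistic gradient descent iteratively moves toward a local minimum of the loss function
    learner: (p) => new LogisticGradientDescent(),
    loss: (actual, expected, p) => new ZeroOneLoss(expected).Loss(actual),
    fit: (teacher, x, y, w) => teacher.Learn(x, y, w),
    x: input,
    y: output
);
var gdResult = cvLogisticGradientDescent.Learn(input, output);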
We can train a Naive Bayes classifier with k-fold cross-validation in the same way. The code for k-fold cross-validation with the Naive Bayes learning algorithm is as follows:

var cvNaiveBayesClassifier = CrossValidation.Create<NaiveBayes<BernoulliDistribution>, NaiveBayesLearning<BernoulliDistribution>, double[], int>(
    // Number of folds
    k: numFolds,
    // Naive Bayes classifier with Bernoulli distribution
    learner: (p) => new NaiveBayesLearning<BernoulliDistribution>(),
    // Use the 0-1 loss function as the cost function
    loss: (actual, expected, p) => new ZeroOneLoss(expected).Loss(actual),
    // Fit the classifier
    fit: (teacher, x, y, w) => teacher.Learn(x, y, w),
    // Input
    x: input,
    // Output
    y: output
);
// Run cross-validation
var result = cvNaiveBayesClassifier.Learn(input, output);


The only difference from the previous code is the model and learning algorithm we choose. Instead of LogisticRegression and IterativeReweightedLeastSquares, we use NaiveBayes as the model and NaiveBayesLearning as the learning algorithm to train our Naive Bayes classifier. Since all our input values are binary (0 or 1), we use a BernoulliDistribution for our Naive Bayes classifier model.
When you run this code, you should see an output as follows:

[Console output: cross-validation results for the Naive Bayes classifier]

In the next section on model validation, we will take a closer look at what these numbers represent. To try different ML models, you can replace this code with the logistic regression code we discussed earlier, or try a different learning algorithm.

Validating the classification model

Using the Accord.NET framework, we have built our first ML model in C#. However, we are not quite done yet. Looking more closely at the previous console output, one thing is quite worrying: the training error is about 0.03, while the validation error is about 0.26. This means that our classification model predicted correctly 97 times out of 100 on the training set, but only 74 times out of 100 on the validation (test) set. This is a typical example of overfitting: the model fits the training set so closely that its predictions on unseen data are unreliable and unpredictable. If we shipped this model to a spam filtering system, its actual spam-filtering performance would be unreliable and would differ from what we saw on the training set.
Overfitting typically occurs when the model is too complex for the given dataset or when too many parameters are used to fit the model. The overfitting problem of the Naive Bayes classifier we built in the previous section is most likely due to the complexity and the number of features we used to train it.

Looking again at the console output at the end of the previous section, the number of features used to train the Naive Bayes model is 2,212. That is far too many, considering that we only have about 4,200 email records and only two-thirds of them (about 2,800 records) were used to train the model (this is because we use three-fold cross-validation, and only two of the three folds are used for training in each iteration). To solve this overfitting problem, we have to reduce the number of features used to train the model. To do that, we can filter out terms that do not occur often. The code for this task is as follows:

// Change the number of features to reduce overfitting
int minNumOccurrences = 1;
string[] wordFeatures = indexedSpamTermFrequencyDF.Where(
    x => x.Value.GetAs<int>("num_occurences") >= minNumOccurrences
).RowKeys.ToArray();
Console.WriteLine("Num features selected: {0}", wordFeatures.Count());


As you can see from this code, with minNumOccurrences set to 1, the Naive Bayes classifier model we built in the previous section uses every word that appears at least once in spam emails.

If we look at the word frequencies in spam emails, about 1,400 words appear only once (see the spam-frequencies.csv file created in the data analysis step). Intuitively, words with such low occurrence counts only produce noise and give our model little information to learn from. This tells us how much noise our model was exposed to when we first built the classification model in the previous section.

Now that we know the cause of this overfitting problem, let's fix it. We will try different thresholds for selecting features. We tried 5, 10, 15, 20, and 25 as the minimum number of occurrences in spam emails (that is, we set minNumOccurrences to 5, 10, 15, and so on) and trained Naive Bayes classifiers with each of these thresholds, as sketched below.
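A minimal sketch of this experiment might look like the following loop (the frame name indexedSpamTermFrequencyDF is carried over from the code above; the retraining step is summarized in a comment rather than repeated):

// Sketch: select features per minimum-occurrence threshold, then retrain and compare
foreach (int minNumOccurrences in new[] { 5, 10, 15, 20, 25 })
{
    string[] wordFeatures = indexedSpamTermFrequencyDF.Where(
        x => x.Value.GetAs<int>("num_occurences") >= minNumOccurrences
    ).RowKeys.ToArray();

    Console.WriteLine("Threshold {0}: {1} features", minNumOccurrences, wordFeatures.Length);
    // Rebuild the input matrix from the reduced feature set, rerun the k-fold
    // cross-validation shown earlier, and compare training vs. validation errors.
}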
First, the results for the Naive Bayes classifier using words that appear at least five times are as follows:

[Console output: cross-validation results with minNumOccurrences = 5]

The results using words that appear at least 10 times are as follows:

[Console output: cross-validation results with minNumOccurrences = 10]

The results using words that appear at least 15 times are as follows:

[Console output: cross-validation results with minNumOccurrences = 15]

The results using words that appear at least 20 times are as follows:

[Console output: cross-validation results with minNumOccurrences = 20]

From these experimental results, we can see that as we increase the minimum number of word occurrences and thereby reduce the number of features used to train the model, the gap between the training error and the validation error decreases, and the training error starts to approximate the validation error. Having resolved the overfitting problem, we can be more confident about how the model will behave on unseen data and in production systems.

Now that we have covered how to deal with overfitting, there are a few more model performance metrics we want to look at:

Confusion matrix: a confusion matrix is a table that summarizes the overall prediction performance of a model. Each column represents an actual class and each row represents a predicted class. For a binary classification problem, the confusion matrix is a 2 x 2 matrix, where the first row represents negative predictions and the second row represents positive predictions, and the first column represents actual negatives and the second column represents actual positives. The following table shows what each cell of a binary classification confusion matrix represents:

                     Actual negative        Actual positive
Predicted negative   True Negative (TN)     False Negative (FN)
Predicted positive   False Positive (FP)    True Positive (TP)

TP, true positive: predicted positive, actually positive
FP, false positive: predicted positive, actually negative
FN, false negative: predicted negative, actually positive
TN, true negative: predicted negative, actually negative
As can be seen from the table, the confusion matrix describes the overall performance of the model. In our example, looking at the last console output in the previous screenshot, which shows the results of the logistic regression classification model, the number of TNs is 2,847, the number of FNs is 606, the number of FPs is 102, and the number of TPs is 772. With this information, we can further calculate the true positive rate (TPR), true negative rate (TNR), false positive rate (FPR), and false negative rate (FNR) as follows:

TPR = TP / (TP + FN)
TNR = TN / (TN + FP)
FPR = FP / (FP + TN)
FNR = FN / (FN + TP)
Using the numbers above, the true positive rate in our example is 0.56, the TNR is 0.97, the FPR is 0.03, and the FNR is 0.44.
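As a small sketch, these rates can be computed directly from the confusion-matrix counts (the truePositive, trueNegative, falsePositive, and falseNegative variables are extracted from Accord.NET's GeneralConfusionMatrix, as shown in the validation code at the end of this section):

// Sketch: derive the four rates from the confusion-matrix counts
float tpr = truePositive / (truePositive + falseNegative); // 772 / (772 + 606) ≈ 0.56
float tnr = trueNegative / (trueNegative + falsePositive); // 2847 / (2847 + 102) ≈ 0.97
float fpr = falsePositive / (falsePositive + trueNegative); // 102 / (102 + 2847) ≈ 0.03
float fnr = falseNegative / (falseNegative + truePositive); // 606 / (606 + 772) ≈ 0.44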
Accuracy: accuracy is the proportion of correct predictions. Using the same notation as in the confusion matrix above, accuracy is calculated as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Accuracy is often used as a headline performance metric, but sometimes it does not represent overall model performance well. For example, if the sample set is largely unbalanced, say five spam emails and 95 ham emails, then a trivial classifier that labels every email as ham would achieve 95% accuracy yet never catch a single spam email. That is why we also need to look at the confusion matrix and at other performance metrics such as precision and recall.
Precision: precision is the proportion of correct positive predictions out of the total number of positive predictions. Using the same notation as before, we can calculate precision as follows:

Precision = TP / (TP + FP)

Looking at the logistic regression classification results in the earlier console output screenshot, precision is calculated by dividing the number of TPs from the confusion matrix, 772, by the sum of TPs and FPs, 772 + 102, which gives 0.88.
Recall: recall is the proportion of correct positive predictions out of the total number of actual positives. It tells us how many of the actual positive cases the model retrieves. Using the same notation as before, we can calculate recall as follows:

Recall = TP / (TP + FN)

Looking at the same logistic regression classification results, recall is calculated by dividing the number of TPs from the confusion matrix, 772, by the sum of TPs and FNs, 772 + 606, which gives 0.56.
With these performance metrics, we can choose the best model for our purposes. There is always a trade-off between precision and recall: a model with higher precision than another will tend to have lower recall, and vice versa. For our spam filtering problem, if we think it is more important that the messages we filter are genuinely spam, and we can tolerate some spam slipping through into users' inboxes, then we should optimize for precision. On the other hand, if we think it is more important to filter out as much spam as possible, even at the cost of filtering out some ham, then we should optimize for recall. Choosing the right model is not a simple decision; carefully considering the requirements and success criteria is the key to making the right choice.
To summarize, here is the code we can use to compute the performance metrics from the cross-validation results and the confusion matrix:

// Run cross-validation
var result = cvNaiveBayesClassifier.Learn(input, output);
// Training error
double trainingError = result.Training.Mean;
// Validation error
double validationError = result.Validation.Mean;

// Confusion matrix: true positives, true negatives, false positives, and false negatives
GeneralConfusionMatrix gcm = result.ToConfusionMatrix(input, output);
float truePositive = (float)gcm.Matrix[1, 1];
float trueNegative = (float)gcm.Matrix[0, 0];
float falsePositive = (float)gcm.Matrix[1, 0];
float falseNegative = (float)gcm.Matrix[0, 1];


The training and validation (test) errors above are used to identify overfitting. Finally, from the confusion-matrix counts we can compute accuracy, precision, and recall:

// Compute accuracy, precision, and recall
// numberOfSamples: the total number of records used in cross-validation
float accuracy = (truePositive + trueNegative) / numberOfSamples;
float precision = truePositive / (truePositive + falsePositive);
float recall = truePositive / (truePositive + falseNegative);


Summary

In this chapter, we built our first ML model in C#, one that can be used for spam filtering. We started by defining and clearly stating the problem we set out to solve and the criteria for success. Then we extracted relevant information from the raw email data and transformed it into a format we could use for the data analysis, feature engineering, and ML model building steps.

In the data analysis step, we learned how to apply one-hot encoding and build a matrix representation of the words used in the subject lines.

We also found a data problem from the data analysis process and learned how to iterate back and forth between the data preparation and analysis steps.

Then we further improved our feature set by filtering out stop words and by using a regular expression to drop tokens containing digits or other non-alphabetic characters.

With this feature set, we built our first classification models using the logistic regression and Naive Bayes algorithms, briefly saw the danger of overfitting, and learned how to evaluate and compare model performance using accuracy, precision, and recall.

Finally, we learned about the trade-offs between precision and recall, and how to select models based on these metrics and business requirements.
