Text classification in Spark NLP with BERT and Universal Sentence Encoders


By Veysel Kocaman
Compiled by: VK
Source: Towards Data Science

Natural language processing (NLP) is a key component of many data science systems that must understand or reason about text. Common use cases include text classification, question answering, paraphrasing or summarization, sentiment analysis, natural language BI, language modeling, and disambiguation.

NLP matters in a growing number of AI applications. If you are building a chatbot, searching patent databases, matching patients to clinical trials, grading customer service or sales calls, or extracting summaries from financial reports, you must extract accurate information from text.

Text classification is one of the main tasks in modern NLP. It is the task of assigning an appropriate category to a sentence or document. The categories depend on the chosen dataset and can range across topics.

Every text classification problem follows similar steps but can be solved with different algorithms. Besides classical and popular machine learning classifiers such as random forest or logistic regression, more than 150 deep learning frameworks have been proposed for various text classification problems.

Several benchmark datasets are used for text classification problems, and the latest benchmark results on them are tracked online.

[Table: basic statistics for these datasets]

A simple text categorization application usually follows these steps:

  • Text preprocessing and cleaning
  • Feature engineering (manually creating features from text)
  • Feature vectorization (TF-IDF, frequency, encoding) or embedding (word2vec, doc2vec, BERT, ELMo, sentence embeddings, etc.)
  • Training the model with ML or DL algorithms
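
As a minimal sketch of the vectorization step, the snippet below computes TF-IDF weights for a toy corpus in plain Python. The corpus, the whitespace tokenizer, and the unsmoothed formula tf * log(N / df) are illustrative assumptions; libraries such as Spark ML or scikit-learn use smoothed variants of this formula.

```python
import math
from collections import Counter

# Toy corpus (illustrative only, not taken from the AG News dataset).
docs = [
    "oil prices rise",
    "stocks rise on oil news",
    "team wins the cup",
]

def tokenize(text):
    # Simplest possible preprocessing: lowercase and split on whitespace.
    return text.lower().split()

def tfidf(corpus):
    tokenized = [tokenize(d) for d in corpus]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term frequency (count / document length) times unsmoothed IDF.
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

vectors = tfidf(docs)
```

Terms that occur in fewer documents get a larger weight: in the first document, "prices" (one document) outweighs "oil" (two documents).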

Text classification in Spark NLP

In this article, we will use Universal Sentence Embeddings to build a text classification model in Spark NLP. Then we will compare it with other ML and DL approaches and text vectorization methods.

There are several text classification options in Spark NLP:

  • Text preprocessing in Spark NLP and ML algorithms from Spark ML
  • Text preprocessing and word embeddings (GloVe, BERT, ELMo) in Spark NLP and ML algorithms
  • Text preprocessing and sentence embeddings (Universal Sentence Encoders) in Spark NLP and ML algorithms
  • Text preprocessing and the ClassifierDL module in Spark NLP (based on TensorFlow)

As we discussed in depth in our previous article on Spark NLP, all of the text processing steps before ClassifierDL can be implemented in a pipeline defined as an ordered sequence of stages, where each stage is either a transformer or an estimator. The stages run in order, and the input DataFrame is transformed as it passes through each stage. Each stage's transform() method updates the dataset and passes it on to the next stage. With pipelines, we can ensure that training and test data go through identical feature processing steps.
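
The transformer/estimator idea can be sketched in plain Python. This is a toy model of the concept, not Spark ML's actual API: the class names (Lowercase, VocabularyIndexer, ToyPipeline, and so on) and the list-of-dicts "dataset" are hypothetical and exist only for illustration.

```python
class Lowercase:
    """A 'transformer' stage: it only has transform(), nothing is learned."""
    def transform(self, rows):
        return [dict(row, text=row["text"].lower()) for row in rows]

class VocabularyIndexer:
    """An 'estimator' stage: fit() learns a vocabulary and returns a fitted model."""
    def fit(self, rows):
        vocab = sorted({w for row in rows for w in row["text"].split()})
        return VocabularyModel({w: i for i, w in enumerate(vocab)})

class VocabularyModel:
    def __init__(self, index):
        self.index = index

    def transform(self, rows):
        # Words that were not seen during fit() are skipped.
        return [
            dict(row, ids=[self.index[w] for w in row["text"].split() if w in self.index])
            for row in rows
        ]

class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted on the output of the previous stages.
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return ToyPipelineModel(fitted)

class ToyPipelineModel:
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        # Every dataset goes through the same fitted stages, in the same order.
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows

train = [{"text": "Oil prices rise"}, {"text": "Team wins cup"}]
test = [{"text": "Oil cup"}]

model = ToyPipeline([Lowercase(), VocabularyIndexer()]).fit(train)
train_out = model.transform(train)
test_out = model.transform(test)
```

Because the fitted model is applied unchanged to both sets, the test rows are lowercased and indexed with the vocabulary learned from the training rows.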

Universal Sentence Encoders

In natural language processing (NLP), text embeddings play an important role before any deep learning model is built. A text embedding converts text (words or sentences) into vectors.

Basically, the text embedding method encodes words and sentences in fixed length vectors to greatly improve the processing of text data. The idea is simple: words that appear in the same context tend to have similar meanings.

Techniques such as word2vec and GloVe work by converting a word into a vector. As a result, the vector for "cat" is closer to the vector for "dog" than to the vector for "eagle". When embedding a sentence, however, the context of the whole sentence needs to be captured in that vector. This is where Universal Sentence Encoders come in.
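
To make "closer" concrete, here is a small sketch that compares vectors with cosine similarity. The three-dimensional vectors are hand-made for illustration; real word2vec or GloVe vectors are learned from text and have far more dimensions.

```python
import math

# Hand-made vectors, chosen only to illustrate the idea.
toy_vectors = {
    "cat":   [0.9, 0.8, 0.1],
    "dog":   [0.8, 0.9, 0.2],
    "eagle": [0.1, 0.3, 0.9],
}

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
cat_eagle = cosine_similarity(toy_vectors["cat"], toy_vectors["eagle"])
```

With these toy values, cat_dog is much higher than cat_eagle, which is the relationship the text describes.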

Universal Sentence Encoders can be used for text classification, semantic similarity, clustering, and other natural language tasks. Pre-trained Universal Sentence Encoders are available on TensorFlow Hub. There are two variants: one trained with a Transformer encoder and the other trained with a Deep Averaging Network (DAN).

Spark NLP uses the TensorFlow Hub version, wrapped so that it runs in a Spark environment. In other words, you just plug and play this embedding in Spark NLP and then train a model in a distributed fashion.

Since USE generates sentence embeddings directly, no further computation is needed: we do not have to average the word embeddings of each word in a sentence to obtain a sentence embedding.

ClassifierDL and USE in action for text classification in Spark NLP

In this article, we will use the AG News dataset (one of the benchmark datasets for text classification) to build a text classifier in Spark NLP using USE and ClassifierDL, the latest module, added in Spark NLP 2.4.4.

ClassifierDL is the first multi-class text classifier in Spark NLP, and it takes various text embeddings as input for text classification. The ClassifierDL annotator uses a deep learning model (DNN) built in TensorFlow and supports up to 50 classes.

In other words, with ClassifierDL you can build a text classifier in Spark NLP using BERT, ELMo, GloVe, or Universal Sentence Encoders.

Let’s start writing code!

First, load the necessary packages and start a Spark session.

import sparknlp
spark = sparknlp.start() 
# sparknlp.start(gpu=True) >> to train on GPU
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
import pandas as pd
print("Spark NLP version", sparknlp.version())
print("Apache Spark version:", spark.version)
>> Spark NLP version 2.4.5
>> Apache Spark version: 2.4.4

We can then download the AG News dataset from the GitHub repo ( Trainings/Public).

! wget
! wget
trainDataset = spark.read \
      .option("header", True) \
      .csv("news_category_train.csv")
trainDataset.show(10, truncate=50)
|category|                                       description|
|Business| Short sellers, Wall Street's dwindling band of...|
|Business| Private investment firm Carlyle Group, which h...|
|Business| Soaring crude prices plus worries about the ec...|
|Business| Authorities have halted oil export flows from ...|
|Business| Tearaway world oil prices, toppling records an...|
|Business| Stocks ended slightly higher on Friday but sta...|
|Business| Assets of the nation's retail money market mut...|
|Business| Retail sales bounced back a bit in July, and n...|
|Business|" After earning a PH.D. in Sociology, Danny Baz...|
|Business| Short sellers, Wall Street's dwindling  band o...|
only showing top 10 rows

The AG News dataset has four categories: World, Sci/Tech, Sports, and Business.

from pyspark.sql.functions import col
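trainDataset.groupBy("category") \
    .count() \
    .orderBy(col("count").desc()) \
    .show()
|   World|30000|
|  Sports|30000|
# Test file name assumed to mirror the training file name
testDataset = spark.read \
      .option("header", True) \
      .csv("news_category_test.csv")
testDataset.groupBy("category") \
    .count() \
    .orderBy(col("count").desc()) \
    .show()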
|Sci/Tech| 1900|
|  Sports| 1900|
|   World| 1900|
|Business| 1900|

Now we can feed this data to the Spark NLP DocumentAssembler, which is the entry point to Spark NLP for any Spark DataFrame.

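# The actual content is in the description column
document = DocumentAssembler()\
    .setInputCol("description")\
    .setOutputCol("document")  # output column names below are illustrative
# We can download pre-trained embeddings
use = UniversalSentenceEncoder.pretrained()\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")
# Classes / labels / categories are in the category column
classsifierdl = ClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")\
    .setLabelColumn("category")\
    .setMaxEpochs(5)\
    .setEnableOutputLogs(True)
use_clf_pipeline = Pipeline(stages=[document, use, classsifierdl])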

Above, we take the dataset, feed it in, get the sentence embeddings from USE, and then train ClassifierDL on them.

Now we start training. We will train for 5 epochs, set with .setMaxEpochs() in ClassifierDL. In the Colab environment, this takes about 10 minutes to complete.

use_pipelineModel = use_clf_pipeline.fit(trainDataset)

When you run this command, Spark NLP writes the training logs to the annotator_logs folder in your home directory.

[Figure: training log]

As you can see, we achieved more than 90% validation accuracy in less than 10 minutes without text preprocessing, which is usually the most time-consuming and laborious step in any NLP modeling.

Now let's get predictions on the test set. We will use the test set downloaded above.

Below are the test results obtained with classification_report from the sklearn library.

We achieved 89.3% test set accuracy! Looks good!
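
For reference, the per-class figures that classification_report prints (precision, recall, F1) and the overall accuracy can be computed by hand. The sketch below does this in plain Python; the function name per_class_report and the label lists are hypothetical and are not the article's actual predictions.

```python
from collections import Counter

def per_class_report(y_true, y_pred):
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class was wrong
            fn[truth] += 1   # true class was missed
    report = {}
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    accuracy = sum(tp.values()) / len(y_true)
    return report, accuracy

# Hypothetical labels for illustration only.
y_true = ["Sports", "Sports", "World", "Business", "Sci/Tech", "World"]
y_pred = ["Sports", "World", "World", "Business", "Business", "World"]

report, accuracy = per_class_report(y_true, y_pred)
```

Accuracy is the share of rows where the predicted category equals the true category, which is the quantity quoted for the test set above.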

Text classification with text preprocessing in Spark NLP using BERT and GloVe embeddings

As with any text classification problem, there are many useful text preprocessing techniques, including lemmatization, stemming, spell checking, and stop word removal. Except for spell checking, almost all NLP libraries in Python have tools to apply these techniques. At the time of writing, the Spark NLP library is the only available NLP library with a spell checking capability.

Let's apply these steps in a Spark NLP pipeline and then train a text classifier with GloVe embeddings. We will first apply several text preprocessing steps (normalization that keeps only alphabetic characters, stop word removal, and lemmatization), then get the word embedding of each token (the lemma of the token), and finally average the word embeddings in each sentence to get one sentence embedding per row.
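
The averaging step can be sketched in a few lines of plain Python. The two-dimensional word vectors and the function name average_embedding are made up for illustration; real GloVe vectors are much larger and are looked up from a pre-trained model.

```python
# Made-up word vectors, for illustration only.
word_vectors = {
    "oil":    [1.0, 0.0],
    "prices": [0.0, 1.0],
    "rise":   [1.0, 1.0],
}

def average_embedding(tokens, vectors):
    # Keep only tokens that have a vector, then average dimension by dimension.
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return None
    dim = len(found[0])
    return [sum(vec[i] for vec in found) / len(found) for i in range(dim)]

# The unknown token is ignored, so the average is taken over three vectors.
sentence_vector = average_embedding(["oil", "prices", "rise", "unknownword"], word_vectors)
```

Every sentence ends up with one vector of the same length regardless of how many tokens it has, which is what lets a classifier consume it as a fixed-size input.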

For all of these text preprocessing tools in Spark NLP and more, you can find detailed instructions and code examples in this Colab notebook ( Trainings/Public/2.Text_Preprocessing_with_SparkNLP_Annotators_Transformers.ipynb ).

Then we can train.

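# clf_pipeline: assumed name of the preprocessing + GloVe pipeline described above
clf_pipelineModel = clf_pipeline.fit(trainDataset)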

Get the test results.

Now we have 88% test set accuracy! Even after all these text cleaning steps, we still cannot beat Universal Sentence Embeddings + ClassifierDL. This is mainly because USE performs better on the raw text than on the data-cleaned version.

To train the same classifier with BERT, we can replace the glove_embeddings stage with a bert_embeddings stage in the same pipeline built above.

word_embeddings = BertEmbeddings\
    .pretrained('bert_base_cased', 'en') \
    .setPoolingLayer(-2) # default 0

We can also use ELMo embeddings.

word_embeddings = ElmoEmbeddings\
      .pretrained('elmo', 'en')

Fast inference with LightPipeline

As we discussed in depth in a previous article, LightPipelines are Spark NLP-specific pipelines, equivalent to Spark ML pipelines but meant for smaller amounts of data. They are useful when working with small datasets, debugging results, or running either training or prediction from an API that serves one-off requests.

Spark NLP LightPipelines are Spark ML pipelines converted into a single-machine but multi-threaded task, which makes them more than 10 times faster for smaller amounts of data ("small" is relative, but roughly 50,000 sentences is a reasonable upper bound). To use them, we simply plug in a trained pipeline and annotate plain text. We do not even need to convert the input text to a DataFrame, although the underlying pipeline was built to accept a DataFrame as input. This is useful when you need predictions for a few lines of text from a trained ML model.

LightPipelines are easy to create and save you from dealing with Spark Datasets. They are also very fast and, while working only on the driver node, execute computations in parallel. Let's see how this applies to the case described above:

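light_model = LightPipeline(clf_pipelineModel)
text="Euro 2020 and the Copa America have both been moved to the summer of 2021 due to the coronavirus outbreak."
# "class" is assumed to be the output column set on the classifier
light_model.annotate(text)["class"][0]
>> "Sports"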

You can also save this trained model to disk and use it later in another Spark pipeline with ClassifierDLModel.load().


In this article, we trained a multi-class text classification model in Spark NLP using word embeddings and Universal Sentence Encoders, and obtained good model accuracy in less than 10 minutes of training time. The entire code can be found in this GitHub repo (Colab compatible, Trainings/Public/5.Text_Classification_with_ClassifierDL.ipynb ). We have also prepared another notebook that covers almost all possible text classification combinations in Spark NLP and Spark ML (CV, TF-IDF, GloVe, BERT, ELMo, USE, LR, RF, ClassifierDL, DocClassifier): Trainings/Public/5.1_Text_classification_examples_In_SparkML_SparkNLP.ipynb .

We are also starting to offer online Spark NLP training for both the public and enterprise (healthcare) versions. Here are the links to all public Colab notebooks ( Trainings/Public).

John Snow Labs will organize virtual Spark NLP training sessions. The following is a link to the next training:
