NLP (20): Using BERT for Text Classification

Time: 2020-02-26

When we do event extraction, we rely on trigger words to decide whether a document belongs to a specific event type. Take political visit events as an example: they usually contain the word "visit", but judging by the trigger word "visit" alone is not reliable. For instance, we may encounter situations like the following:
[Figure: example documents containing the word "visit"]
The examples above show that documents about things like access speed and visit counts contain the word "visit" (the Chinese word "访问" means both "visit" and "access") but do not belong to political visit events. We therefore need text classification, and this is clearly a binary classification model.
This article shows how to use a BERT + DNN model to determine whether a document describes a political visit event.

Dataset

The dataset is divided into a training set (250 samples) and a test set (50 samples), a 5:1 ratio. The samples are few, but with the help of BERT we can still achieve good results on such a small dataset.
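The 250/50 split already comes with the dataset; purely as an illustration of a 5:1 split, it could be sketched like this (the split_dataset helper is hypothetical, not part of the project):

```python
import random

def split_dataset(samples, test_ratio=1/6, seed=42):
    """Shuffle and split samples so that train:test is roughly 5:1."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

samples = list(range(300))      # stand-ins for 300 labeled documents
train, test = split_dataset(samples)
print(len(train), len(test))    # 250 50
```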
Part of the training set is shown below:
[Figure: sample rows of the training set]

Code

The structure of the project is as follows:
[Figure: project directory structure]
Because the sample size is small, we use BERT; and because the text is Chinese, we download BERT's pre-trained Chinese model file chinese_L-12_H-768_A-12.
Following the experience from NLP (19): A First Visual Guide to BERT, we need code that calls the BERT model file and handles tokenization, padding, masking, and extraction of the output vectors that BERT produces. Fortunately, someone has already done this well; we only need to call their code. It lives in the bert folder of the project, which readers can find at the GitHub address at the end of the article. Since this is a text classification model, each document is encoded as a single 768-dimensional vector (the code below uses the REDUCE_MEAN pooling strategy, i.e. the mean of the token vectors).
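As a rough illustration of what the padding and masking step does (a simplified sketch, not the actual code in the bert folder; the token ids below are made up):

```python
def pad_and_mask(token_ids, max_seq_len):
    """Truncate or zero-pad a token-id sequence to max_seq_len and build
    the attention mask: 1 for real tokens, 0 for padding."""
    ids = list(token_ids)[:max_seq_len]
    mask = [1] * len(ids)
    pad = max_seq_len - len(ids)
    return ids + [0] * pad, mask + [0] * pad

# Made-up token ids for a short sentence
ids, mask = pad_and_mask([101, 2769, 4263, 102], max_seq_len=8)
print(ids)   # [101, 2769, 4263, 102, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```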
First, the data is read in with load_data.py:

# -*- coding: utf-8 -*-
# author: Jclian91
# place: Pudong Shanghai
# time: 2020-02-12 12:57
import pandas as pd


# Read a TXT file where each line is: label<whitespace>text
def read_txt_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        content = [_.strip() for _ in f.readlines()]

    labels, texts = [], []
    for line in content:
        parts = line.split()
        label, text = parts[0], ''.join(parts[1:])
        labels.append(label)
        texts.append(text)

    return labels, texts


file_path = 'data/train.txt'
labels, texts = read_txt_file(file_path)
train_df = pd.DataFrame({'label': labels, 'text': texts})

file_path = 'data/test.txt'
labels, texts = read_txt_file(file_path)
test_df = pd.DataFrame({'label': labels, 'text': texts})

print(train_df.head())
print(test_df.head())

train_df['text_len'] = train_df['text'].apply(lambda x: len(x))
print(train_df.describe())

The output results are as follows:

label                                               text
On February 10 local time, the White House issued a statement saying that the US president's wife Melania trump will visit India from February 24 to 25
Russian satellite news agency said Monday that Philippine President duterte had ordered the termination of the VFA with the United States.
Russian military delegation will visit Ankara in the near future to discuss Syria, Turkish presidential spokesman Karin said
First of all, let's talk about lpddr5: you know, there are two kinds of memory particles in mobile phones. One is DRAM, which is what we often say
At the critical moment of the epidemic, something touching happened. Let's understand that this is the real good friend, and we are not afraid to visit our country
  label                                               text
At the invitation of the Prime Minister of Pakistan, Imran Khan, the Prime Minister of the kingdom of the Netherlands, and the federal government of Germany, Wang Qishan, vice president of the state, will
1 1 German Chancellor Angela Merkel arrived in India for a visit, and New Delhi was welcomed by the military honor guard in the haze. Merkel praised Germany
From May 6 to 12, Ulan, deputy secretary of the provincial Party committee, led a delegation to visit South Korea, Thailand, Korea International Exchange alliance, new village sports central meeting, etc
Ma Xiaoguang, spokesman for the Taiwan Affairs Office, said today (May 22) that Yu Muming, chairman of the new party and honorary president of the new Chinese people's Association, will lead Taiwan's various
From June 13 to 15, under secretary general for counter terrorism of the United Nations, vorenkov, visited Beijing and Xinjiang on invitation, and had a meeting with Chinese Vice Foreign Minister le
         text_len
count  250.000000
mean    77.540000
std     36.804493
min     11.000000
25%     47.500000
50%     73.000000
75%    100.750000
max    192.000000

It can be seen that the 75th percentile of text length in the training set is 100.75, so when training the model we pad all documents to a uniform length of 100.
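The 100.75 figure is just the 75th percentile that describe() reports. The same quartile can be computed with the standard library alone; the lengths below are made-up stand-ins for train_df['text_len']:

```python
from statistics import quantiles

# Made-up text lengths standing in for train_df['text_len']
text_lens = [11, 47, 73, 100, 101, 192]
q1, median, q3 = quantiles(text_lens, n=4, method='inclusive')
print(q3)                # 100.75
max_seq_len = int(q3)    # round down to a convenient padding length
print(max_seq_len)       # 100
```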
After data preprocessing, we use BERT to extract document features: each document, padded to length 100, is encoded as a single 768-dimensional vector. We then build a DNN with Keras and train it. After training we evaluate on the test set and save the model file for later prediction. The training script is model_train.py; the complete Python code is as follows:

# -*- coding: utf-8 -*-
# author: Jclian91
# place: Pudong Shanghai
# time: 2020-02-12 13:37

import os
# Uncomment to specify which GPUs to use for training
# os.environ["CUDA_VISIBLE_DEVICES"] = "4,5,6,7,8"

import numpy as np
from load_data import train_df, test_df
from keras.utils import to_categorical
from keras.models import Model
from keras.optimizers import Adam
from keras.layers import Input, BatchNormalization, Dense
from bert.extract_feature import BertVector

# Read the data and convert each text to a BERT vector
bert_model = BertVector(pooling_strategy="REDUCE_MEAN", max_seq_len=100)
print('begin encoding')
f = lambda text: bert_model.encode([text])["encodes"][0]
train_df['x'] = train_df['text'].apply(f)
test_df['x'] = test_df['text'].apply(f)
print('end encoding')

x_train = np.array([vec for vec in train_df['x']])
x_test = np.array([vec for vec in test_df['x']])
y_train = np.array([vec for vec in train_df['label']])
y_test = np.array([vec for vec in test_df['label']])
print('x_train: ', x_train.shape)

# Convert class vectors to binary class matrices.
num_classes = 2
y_train = to_categorical(y_train, num_classes)
y_test = to_categorical(y_test, num_classes)

# Create the DNN model
x_in = Input(shape=(768, ))
x_out = Dense(32, activation="relu")(x_in)
x_out = BatchNormalization()(x_out)
x_out = Dense(num_classes, activation="softmax")(x_out)
model = Model(inputs=x_in, outputs=x_out)
print(model.summary())

model.compile(loss='categorical_crossentropy',
              optimizer=Adam(),
              metrics=['accuracy'])

# Train, evaluate and save the model
model.fit(x_train, y_train, batch_size=8, epochs=20)
model.save('visit_classify.h5')
print(model.evaluate(x_test, y_test))

Model training

The structure of the DNN model we created is as follows:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 768)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 32)                24608     
_________________________________________________________________
batch_normalization_1 (Batch (None, 32)                128       
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 66        
=================================================================
Total params: 24,802
Trainable params: 24,738
Non-trainable params: 64
_________________________________________________________________
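The parameter counts in this summary can be checked by hand: a Dense layer has in×out weights plus out biases, and BatchNormalization stores four vectors of size 32 (gamma and beta are trainable, the moving mean and variance are not). A quick sanity check:

```python
def dense_params(n_in, n_out):
    # weight matrix plus bias vector
    return n_in * n_out + n_out

d1 = dense_params(768, 32)   # first Dense layer: 24608
bn = 4 * 32                  # BatchNormalization: gamma, beta, moving mean/var
d2 = dense_params(32, 2)     # output Dense layer: 66
print(d1 + bn + d2)          # total params: 24802
print(d1 + 2 * 32 + d2)      # trainable params: 24738 (moving stats excluded)
```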

The output during model training is as follows:

Epoch 1/20

  8/250 [..............................] - ETA: 43s - loss: 1.0427 - acc: 0.3750
250/250 [==============================] - 1s 6ms/step - loss: 0.3345 - acc: 0.8640
Epoch 2/20

  8/250 [..............................] - ETA: 0s - loss: 0.2664 - acc: 0.8750
250/250 [==============================] - 0s 133us/step - loss: 0.2147 - acc: 0.9320

...... (omit part of output result)

Epoch 19/20

  8/250 [..............................] - ETA: 0s - loss: 0.2481 - acc: 0.8750
250/250 [==============================] - 0s 136us/step - loss: 0.0716 - acc: 0.9760
Epoch 20/20

  8/250 [..............................] - ETA: 0s - loss: 0.0149 - acc: 1.0000
250/250 [==============================] - 0s 140us/step - loss: 0.0560 - acc: 0.9800

32/50 [==================>...........] - ETA: 0s
50/50 [==============================] - 0s 4ms/step
[0.3687818288803101, 0.9199999928474426]

After 20 epochs of training, the model reaches an accuracy of 0.9800 on the training set and about 0.9200 on the test set. BERT's effect is remarkable: even a simple DNN on top of it achieves results this good.

Model prediction

To further verify the model, the author collected another 20 documents from the web and predicted them. The prediction script is model_predict.py; the complete Python code is as follows:

# -*- coding: utf-8 -*-
# author: Jclian91
# place: Pudong Shanghai
# time: 2020-02-12 17:33

import pandas as pd
import numpy as np
from bert.extract_feature import BertVector
from keras.models import load_model
load_model = load_model("visit_classify.h5")

# Sentences to predict
texts = [
    "In access restrictions, users can choose to disable iPhone features, including Siri, iTunes purchases and installing/deleting apps, and can even turn the iPhone into a feature phone. Here are some of the specific functions that access restrictions can implement.",
    "IT Home, April 23 news: recently Google said in its official forum that it has added a new feature to Android Auto: access to the full contact list. Users can now access the complete contact list by opening the menu in the upper left corner of Auto's phone dialing interface. Note that this feature is only supported while the vehicle is stopped.",
    "To access the router through Telnet, you first need to configure the router through the console port, setting the IP address, password and so on.",
    "IT Home, March 26 news: recently Muso, an international anti-piracy consulting company, released its 2017 annual report, which shows that visits to pirate resource websites reached 300 billion last year, an increase of 1.6% over the previous year (2016). The United States has the largest number of visits to pirate sites, 27.9 billion in total, followed by Russia, India and Brazil, with China ranked 18th.",
    "At the invitation of the Portuguese Parliament, Ji Bingxuan, vice chairman of the Standing Committee of the National People's Congress, led a delegation to visit Portugal from December 14 to 16 and met with Vice Speaker Felipe and Deputy Secretary-General of the Socialist Party Cannello.",
    "From February 26 to March 2, at the invitation of the Hong Kong SAR government's 'Mainland VIP Visit Programme', Chen Xiangqun, member of the Standing Committee of the provincial Party committee and executive vice governor, visited Hong Kong. Focusing on 'what Hong Kong excels at and what Hunan needs', he held in-depth exchanges with relevant departments and institutions of the SAR government and promoted new progress in exchanges and cooperation between Hunan and Hong Kong.",
    "At present, site A has resumed access: users can log in directly, the web page loads normally, and videos play normally.",
    "On June 8, UNHCR Special Envoy Angelina Jolie concluded her two-day visit to the refugee camps on the border between Colombia and Venezuela, praising the humanity and courage shown by the Colombian people.",
    "German Chancellor Angela Merkel plans to go to Ankara next January for talks with Turkish President Recep Tayyip Erdogan, according to the Sueddeutsche Zeitung.",
    "From September 14 to 18, a working delegation led by Ruan Wenping, member of the Political Bureau of the Central Committee of the Communist Party of Vietnam, secretary of the Secretariat of the Central Committee and head of the Central Economic Commission, paid a working visit to Greece.",
    "A Win7 computer prompts that there is a problem with the wireless adapter or access point. Many users connecting to the Internet over a wireless network find that the network shows as connected but has a yellow exclamation mark beside it, and the network cannot be used. Diagnosis reports a problem with the computer's wireless adapter or access point, in an unrepaired state. What should we do? Next I will share the solution to this Win7 problem.",
    "From October 13 to 14, 2019, Vice Foreign Minister Ma Zhaoxu visited Chile, met with Chilean Foreign Minister Ricardo Rivera, held talks with Salas, the Chilean president's foreign affairs adviser, and exchanged in-depth views on Chile's hosting of the 27th APEC Leaders' Informal Meeting.",
    "Before all the security groups are opened, FTP can be connected, but it is slow to open; establishing the connection takes 1-2 minutes.",
    "Users of Win7 computers sometimes find they suddenly cannot access the network after connecting to WiFi; checking the connected WiFi shows a 'limited access' prompt.",
    "UN Secretary-General Ban Ki-moon visited Fukushima Prefecture, Japan on August 8, talking with local victims and visiting a high school.",
    "Premier Wen Jiabao arrived in Buenos Aires by special plane on the afternoon of the 23rd local time to begin his official visit to Argentina.",
    "Prime Minister Stuart of Barbados, who is on a visit to China, visited Xi'an, Shaanxi Province on the 15th.",
    "According to foreign media reports, on the 10th local time the White House announced that President Trump will visit India at the end of February for a strategic dialogue with Indian Prime Minister Modi.",
    "On February 28, Zhao Lijun, chairman of Tangshan Caofeidian Blue Ocean Science and Technology Co., Ltd., and his party of five visited the Yellow Sea Fisheries Research Institute. Xin Fuyan, deputy director of the institute, and relevant department heads and experts attended the meeting.",
    "On July 2, 2018, Jiang Yanbin, president of the Confucius Cultural Promotion Association in Moscow, and Chen Guojian, executive vice president, visited the Surikov Moscow State Academy of Fine Arts, accompanied by Professor Muk, a noted Chinese oil painter based in Russia. They were received by First Vice President Igor Gorbachek."
]

labels = []

bert_model = BertVector(pooling_strategy="REDUCE_MEAN", max_seq_len=100)

# Predict each of the above sentences
for text in texts:

    # Convert the sentence to a vector
    vec = bert_model.encode([text])["encodes"][0]
    x_train = np.array([vec])

    #Model prediction
    predicted = load_model.predict(x_train)
    y = np.argmax(predicted[0])
    label = 'Y' if y else 'N'
    labels.append(label)

for text,label in zip(texts, labels):
    print('%s\t%s'%(label, text))

# Save the results as an xlsx file
df = pd.DataFrame({'sentence': texts, 'is_visit_event': labels})
df.to_excel('./result.xlsx', index=False)

The prediction results are printed and saved to an Excel file, whose contents are as follows:
[Figure: prediction results in result.xlsx]
All 20 documents are predicted correctly!
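Note that to_excel needs an extra package such as openpyxl installed. If a plain-text export is enough, the standard library's csv module would also do; a small alternative sketch (the example rows are made up):

```python
import csv

# Made-up rows in the same shape as the script's labels/texts lists
rows = [("Y", "Premier Wen Jiabao began his official visit to Argentina."),
        ("N", "To access the router through Telnet you need the console port.")]

with open("result.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["label", "sentence"])
    writer.writerows(rows)
```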

Summary

The project has been open-sourced; GitHub address: https://github.com/percent4/b.
Thank you for reading~