Text classification in Python with a recurrent neural network


This article describes how to use a recurrent neural network in Python to solve a text classification problem. The details are as follows:

1. Concept

1.1. Recurrent neural network

A recurrent neural network (RNN) is a neural network that takes sequence data as input, recurses along the direction in which the sequence evolves, and links all of its nodes (recurrent units) in a chain.

A convolutional network receives only the input data x. In a recurrent network, the output of each step is also fed into the next step together with that step's input, and the same activation function and parameters are reused at every step. In each step, the input x0 is multiplied by the coefficient U to obtain the state s0, and s0 is then passed through the coefficient W into the next step. Repeating this forms the forward propagation of the recurrent neural network.
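
The forward propagation described above can be sketched in a few lines of plain Python. This is a minimal scalar illustration, not the article's TensorFlow code: the function name rnn_forward, the tanh activation and the sample values of u and w are assumptions chosen for the example.

```python
import math

def rnn_forward(xs, u, w, s_init=0.0):
    """Scalar RNN sketch: s_t = tanh(u * x_t + w * s_{t-1})."""
    states = []
    s = s_init
    for x in xs:
        # The same u and w are reused at every step
        s = math.tanh(u * x + w * s)
        states.append(s)
    return states

# Hypothetical three-step input sequence
states = rnn_forward([1.0, 0.5, -1.0], u=0.8, w=0.5)
print(states)
```

Each state depends on the current input and on the previous state, which is how information from earlier steps reaches later ones.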


In back propagation, the derivative of the loss function E with respect to the parameter W is required. It is obtained with the chain rule, as in the formula at the bottom right of the figure.


A convolutional neural network is one-to-one: one input passes through the network and produces one output. A recurrent neural network can also realize one input with multiple outputs (generating a picture description), multiple inputs with one output (text classification), and multiple inputs with multiple outputs (machine translation, video description).

An RNN uses the tanh activation function, whose output lies between -1 and 1, so the gradient vanishes easily: steps far from the output contribute little to the gradient.

If the output of a lower layer is used as the input of a higher layer, a multi-layer RNN is formed, and residual connections can be added to make the deeper network easier to train.
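
The shrinking contribution of distant steps can be checked numerically. The sketch below is an assumption-laden illustration for a scalar tanh RNN (the name state_gradients and the values of u and w are made up for the example): it multiplies the local derivatives w * (1 - s_t^2) backwards from the last step.

```python
import math

def state_gradients(xs, u, w):
    """Return d s_T / d s_k for every step k of a scalar tanh RNN."""
    s = 0.0
    local = []
    for x in xs:
        s = math.tanh(u * x + w * s)
        # Local derivative d s_t / d s_{t-1}
        local.append(w * (1.0 - s * s))
    grads = []
    g = 1.0
    for d in reversed(local):
        grads.append(g)  # gradient of the final state w.r.t. this step's state
        g *= d           # chain rule: go one more step back in time
    return list(reversed(grads))

grads = state_gradients([1.0] * 10, u=1.0, w=0.5)
print(grads[0], grads[-1])
```

With these values every local derivative is below 0.5 in magnitude, so the gradient that reaches the first step is a tiny fraction of the one at the last step.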

1.2. Long short-term memory network

Between successive steps of an RNN there is only the single parameter W, and it is difficult to capture a large amount of complex information with this one parameter. The long short-term memory (LSTM) network was introduced for this reason. It uses gating to selectively input and output the information that is needed and to selectively forget the information that is not. Each gate is a sigmoid function whose output lies between 0 and 1: 0 means forget, 1 means remember, and 0.5 means remember 50%.

The network structure of the LSTM is shown in the figure below.


The upper right of the figure shows the cell state (hidden state) of the current step. The current state is obtained by multiplying the previous state element-wise by the result of the forget gate and then adding the new input of this step.

The left of the figure shows the forget gate. The previous output h_{t-1} and the current data x_t pass through the forget gate, which decides what to forget and produces the result f_t.

The bottom of the figure shows the input gate. h_{t-1} and x_t produce the input gate result i_t and the tanh candidate state; their element-wise product is the input of this step.

The right of the figure shows the output gate. h_{t-1} and x_t produce the output gate result o_t, and the output is the element-wise product of o_t with the tanh of the current state.
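
The four structures can be summarized as equations. This is the standard LSTM formulation without peephole connections, written with row vectors; the weight names W_{x\cdot}, W_{h\cdot} and b_{\cdot} are notation assumed here, not taken from the original figure.

```latex
\begin{aligned}
f_t &= \sigma\left(x_t W_{xf} + h_{t-1} W_{hf} + b_f\right) \\
i_t &= \sigma\left(x_t W_{xi} + h_{t-1} W_{hi} + b_i\right) \\
o_t &= \sigma\left(x_t W_{xo} + h_{t-1} W_{ho} + b_o\right) \\
\tilde{C}_t &= \tanh\left(x_t W_{xc} + h_{t-1} W_{hc} + b_c\right) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
h_t &= o_t \odot \tanh\left(C_t\right)
\end{aligned}
```

Here \sigma is the sigmoid function and \odot is element-wise multiplication.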


To implement the LSTM network, first define the function _generate_paramas, which generates the parameters required by one gate. It is called four times to define the parameters of the input gate, the output gate, the forget gate, and the intermediate tanh state. Each of them has three parameters: the weight applied to the input x, the weight applied to h, and the bias.

Next, the LSTM loop begins. For the input gate, the input embedded_input is matrix-multiplied by the input gate parameter x_in, the product of h and its parameter h_in is added, then the bias b_in is added, and the sum passes through a sigmoid to give the result of the input gate.

The results of the forget gate and the output gate are obtained in the same way, by matrix multiplication plus bias followed by a sigmoid. The intermediate state is computed like the three gates, except that it passes through tanh instead of sigmoid.

Multiply the previous hidden state state by the forget gate, multiply the input gate by the intermediate state, and add the two to get the current hidden state.

The current state is passed through tanh and multiplied by the output gate to get the output h of the current step.

After all input steps have been processed, the last h is the final output of the LSTM network.

  # Implement the LSTM network by hand. This fragment sits inside the model
  # function and needs `import math` and `import tensorflow as tf` (TensorFlow 1.x).
  # Generate the parameters required by one gate
  def _generate_paramas(x_size, h_size, b_size):
    x_w = tf.get_variable('x_weight', x_size)
    h_w = tf.get_variable('h_weight', h_size)
    bias = tf.get_variable('bias', b_size, initializer=tf.constant_initializer(0.0))
    return x_w, h_w, bias
  scale = 1.0 / math.sqrt(embedding_size + lstm_nodes[-1]) / 3.0
  lstm_init = tf.random_uniform_initializer(-scale, scale)
  with tf.variable_scope('lstm_nn', initializer=lstm_init):
    # Input gate parameters
    with tf.variable_scope('input'):
      x_in, h_in, b_in = _generate_paramas(
        x_size=[embedding_size, lstm_nodes[0]],
        h_size=[lstm_nodes[0], lstm_nodes[0]],
        b_size=[1, lstm_nodes[0]])
    # Output gate parameters
    with tf.variable_scope('output'):
      x_out, h_out, b_out = _generate_paramas(
        x_size=[embedding_size, lstm_nodes[0]],
        h_size=[lstm_nodes[0], lstm_nodes[0]],
        b_size=[1, lstm_nodes[0]])
    # Forget gate parameters
    with tf.variable_scope('forget'):
      x_f, h_f, b_f = _generate_paramas(
        x_size=[embedding_size, lstm_nodes[0]],
        h_size=[lstm_nodes[0], lstm_nodes[0]],
        b_size=[1, lstm_nodes[0]])
    # Intermediate state parameters
    with tf.variable_scope('mid_state'):
      x_m, h_m, b_m = _generate_paramas(
        x_size=[embedding_size, lstm_nodes[0]],
        h_size=[lstm_nodes[0], lstm_nodes[0]],
        b_size=[1, lstm_nodes[0]])
    # Two initial values: the hidden state `state` and the initial output h
    state = tf.Variable(tf.zeros([batch_size, lstm_nodes[0]]), trainable=False)
    h = tf.Variable(tf.zeros([batch_size, lstm_nodes[0]]), trainable=False)
    # Traverse every step of the LSTM, that is, feed in one word at a time
    for i in range(max_words):
      # Take the input of this step; the second dimension of the
      # three-dimensional array embedded_inputs is the step (word position)
      embedded_input = embedded_inputs[:, i, :]
      # Reshape the result to two dimensions
      embedded_input = tf.reshape(embedded_input, [batch_size, embedding_size])
      # Forget gate
      forget_gate = tf.sigmoid(tf.matmul(embedded_input, x_f) + tf.matmul(h, h_f) + b_f)
      # Input gate
      input_gate = tf.sigmoid(tf.matmul(embedded_input, x_in) + tf.matmul(h, h_in) + b_in)
      # Output gate
      output_gate = tf.sigmoid(tf.matmul(embedded_input, x_out) + tf.matmul(h, h_out) + b_out)
      # Intermediate state
      mid_state = tf.tanh(tf.matmul(embedded_input, x_m) + tf.matmul(h, h_m) + b_m)
      # Update the hidden state, then compute the output h of this step
      state = state * forget_gate + input_gate * mid_state
      # The output gate scales tanh(state) element-wise
      h = output_gate * tf.tanh(state)
    # The h of the last step is the output of the LSTM
    last_output = h

1.3. Text classification

Text classification analyzes an input text string and outputs a category. A string cannot be fed directly into an RNN, so the text is first split into individual words, and each word is converted into a vector by embedding encoding. When the last word has been input, the output of the network is also a vector. Embedding maps each word to a vector in which every dimension is a floating-point value; these values are adjusted during training so that the embedding code reflects the meaning of the word. The input and output of the network are therefore both vectors, and a fully connected layer then maps the output to the different categories.

The problem with this RNN is that the final output is influenced mainly by the most recent inputs, while inputs far back may hardly affect the result. This is the information bottleneck. To address it, a bidirectional LSTM is introduced. It adds information propagation in the reverse direction and also produces an output at every step. These outputs are combined and then passed to the fully connected layer.

Another text classification model is HAN (hierarchical attention network). The text is divided into a sentence level and a word level: the codes of the input words are summed to get the sentence code, and the sentence codes are summed to get the final text code. Attention means that each code is multiplied by a weight before the codes of a level are summed, so the sum is taken according to different weights.
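
The weighted accumulation can be sketched as follows. This is only an illustration of attention pooling under stated assumptions: in a real HAN the scores are produced by learned parameters, whereas here they are passed in directly, and the names softmax and attention_pool are made up for the example.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(vectors, scores):
    """Weighted sum of equally sized vectors, weights = softmax(scores)."""
    weights = softmax(scores)
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]

# Hypothetical word codes of one sentence and their attention scores
word_codes = [[1.0, 0.0], [0.0, 1.0]]
sentence_code = attention_pool(word_codes, [0.0, 0.0])
print(sentence_code)
```

Equal scores give equal weights, so the result is the plain average; a larger score for one word pulls the sentence code towards that word.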


Because input texts differ in length, a convolutional network cannot learn from them directly. To solve this, every input text is padded or cut to the same maximum length, after which a convolutional neural network can be used for learning. This is TextCNN. The text convolution uses multi-channel one-dimensional convolution. Compared with two-dimensional convolution, one-dimensional convolution means that the kernel moves in only one direction. For example, in the left figure, 1 × 1 + 5 × 2 + 2 × 2 + 4 × 3 + 3 × 3 + 3 × 4 = 48; the kernel then moves down one cell to get 45, and so on. In the right figure, several texts of different lengths are input. They are first padded into embedding arrays with six channels, then convolved from top to bottom with six-channel one-dimensional kernels to get one-dimensional arrays, which finally pass through a pooling layer and a fully connected layer to produce the output.
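
The arithmetic of the example can be reproduced with a small sketch. The figure itself is not available, so the layout below is an assumption: two channels, a kernel covering three positions, and a fourth input row picked so that the second window gives 45. The function name conv1d_multichannel is also hypothetical.

```python
def conv1d_multichannel(inputs, kernel):
    """1-D convolution: the kernel slides along the position axis only.

    inputs: list of positions, each a list of channel values
    kernel: list of positions with the same number of channels
    """
    k = len(kernel)
    out = []
    for start in range(len(inputs) - k + 1):
        total = 0
        for offset in range(k):
            # Multiply matching channels and accumulate over the window
            for x, w in zip(inputs[start + offset], kernel[offset]):
                total += x * w
        out.append(total)
    return out

# Assumed values, chosen to reproduce the sums 48 and 45 from the text
inputs = [[1, 5], [2, 4], [3, 3], [4, 2]]
kernel = [[1, 2], [2, 3], [3, 4]]
result = conv1d_multichannel(inputs, kernel)
print(result)  # [48, 45]
```

Because all channels are summed into one number per position, the output is a one-dimensional array no matter how many channels the input has.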


CNN cannot handle sequences of different input lengths perfectly, but it can process multiple words in parallel and is therefore more efficient, while RNN handles sequential input better. The R-CNN model combines the advantages of the two. Features are first extracted from the input by a bidirectional RNN and then further extracted by a CNN. The features of all steps are then fused by a pooling layer and finally classified by a fully connected layer.

Whatever the model, embedding is needed to convert the input into vectors. When the vocabulary is very large, the embedding layer has too many parameters, which costs storage and also encourages overfitting, so the embedding layer needs to be compressed. In the original embedding, each input word corresponds to one parameter vector: for example, wait corresponds to x1, for corresponds to x2, and the corresponds to x3. With many input words the number of encoding parameters becomes very large. Instead, each word can be encoded by a combination of two shared parameters: for example, wait corresponds to (x1, x2), for corresponds to (x1, x3), and so on. This greatly reduces the number of parameters and is called shared compression.

2. Text classification with TextRNN

2.1. Data preprocessing

The text classification dataset downloaded from the Internet is organized as follows. It is divided into a test set and a training set. The training set contains four folders, one per category. Each category has 1000 txt files, and each file holds one text of that category.
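
The pairing idea can be sketched in plain Python. Everything here is an illustrative assumption rather than a specific published scheme: the names build_pair_codes and embed_word are made up, and the two shared vectors of a word are simply added to form its code.

```python
from itertools import combinations

def build_pair_codes(words, num_shared):
    """Assign each word a distinct pair of indices into the shared vectors."""
    pairs = list(combinations(range(num_shared), 2))
    if len(pairs) < len(words):
        raise ValueError('not enough shared vectors for this vocabulary')
    return {word: pair for word, pair in zip(words, pairs)}

def embed_word(word, codes, shared):
    """Combine the two shared vectors of a word (here: element-wise sum)."""
    a, b = codes[word]
    return [x + y for x, y in zip(shared[a], shared[b])]

codes = build_pair_codes(['wait', 'for', 'the'], num_shared=3)
shared = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]  # x1, x2, x3
print(codes)
print(embed_word('for', codes, shared))
```

With n shared vectors there are n * (n - 1) / 2 distinct pairs, so 1000 shared vectors can already code 499500 words.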


After traversing all training set files with os.walk, each text is split into individual words with the Jieba library, and the words are separated by spaces. The category of the text is then added at the beginning of the line, separated from the words by a tab, and the result is written to train_segment.txt.

import os

import jieba

# Split the sentences in each file into single words with the Jieba library
def segment_word(input_file, output_file):
  # Loop through every file of the training data set
  for root, folders, files in os.walk(input_file):
    print('root:', root)
    for folder in folders:
      print('dir:', folder)
    for file in files:
      file_dir = os.path.join(root, file)
      with open(file_dir, 'rb') as in_file:
        # Read the text from the file
        sentence = in_file.read()
        # Split the sentence into single words with the Jieba library
        words = jieba.cut(sentence)
        # The last two characters of the folder path are the category name
        content = root[-2:] + '\t'
        # Strip spaces from each word and skip empty words
        for word in words:
          word = word.strip(' ')
          if word != '':
            content += word + ' '
      # End the line and append it to the output file
      content += '\n'
      with open(output_file, 'a') as outfile:
        outfile.write(content.strip(' '))

The results were as follows:

Some words appear only a few times and have no statistical significance, so they need to be excluded. The get_list() function counts the frequency of each word. With Python's dictionary type, the counts are easily kept in the form {"keyword": frequency}, where frequency records how many times keyword appears. If a word is new, it is added to the dictionary as a new entry; otherwise its frequency is increased by 1.

# Count the frequency of each word
def get_list(segment_file, out_file):
  # Keep the frequency of each word in a dictionary
  word_dict = {}
  with open(segment_file, 'r') as seg_file:
    lines = seg_file.readlines()
    # Traverse each line of the file
    for line in lines:
      line = line.strip('\r\n')
      # Drop the category label before the first tab, then split the sentence
      # on whitespace; split() without arguments also skips empty strings, so
      # every entry written below has exactly the form word<TAB>count
      for word in line.split('\t', 1)[-1].split():
        # If the word is not yet in word_dict, create the entry with count 0
        word_dict.setdefault(word, 0)
        # Add one to the count of this word
        word_dict[word] += 1
    # Sort the dictionary items by count (index 1) in descending order
    sorted_list = sorted(word_dict.items(), key=lambda d: d[1], reverse=True)
    with open(out_file, 'w') as outfile:
      # Write each sorted entry to the file
      for item in sorted_list:
        outfile.write('%s\t%d\n' % (item[0], item[1]))
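
The same counting can also be written with collections.Counter from the standard library. This is an alternative sketch, not the article's code: the function name count_words is hypothetical, and it works on a list of lines instead of files so that it is easy to try out.

```python
from collections import Counter

def count_words(lines):
    """Count words in lines of the form 'label<TAB>word word ...'."""
    counts = Counter()
    for line in lines:
        # Keep only the sentence part after the first tab, if there is one
        text = line.rstrip('\r\n').split('\t', 1)[-1]
        counts.update(text.split())
    # most_common() returns (word, count) pairs sorted by descending count
    return counts.most_common()

# Hypothetical sample lines
sample = ['A\tx y x\n', 'B\ty x\n']
print(count_words(sample))  # [('x', 3), ('y', 2)]
```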

The statistical results are as follows:

2.2. Data reading

Words cannot be used for learning directly; they first have to be converted into IDs that index the embedding codes. Using the train_list file generated above, each word is numbered in order from first to last, and words whose frequency is below the threshold are excluded. The Word_list class builds the word object for the training and test data; its constructor __init__() assigns the IDs. Its method sentence2id converts a segmented sentence into the corresponding array of IDs. If a word is not in the word list, its ID is set to -1.

Before defining the class, some hyperparameters are specified for later use.

# Define the hyperparameters
embedding_size = 32    # length of each word vector
max_words = 10         # maximum number of words in a sentence
lstm_layers = 2        # number of LSTM layers
lstm_nodes = [64, 64]  # number of nodes in each LSTM layer
fc_nodes = 64          # number of nodes in the fully connected layer
batch_size = 100       # number of samples in each batch
lstm_grads = 1.0       # gradient clipping threshold of the LSTM network
learning_rate = 0.001  # learning rate
word_threshold = 10    # word frequency threshold; words below it are not counted
num_classes = 4        # the final classification has four categories

class Word_list:
  def __init__(self, filename):
    # Map each word to its ID in a dictionary
    self._word_dic = {}
    with open(filename, 'r', encoding='GB2312', errors='ignore') as f:
      lines = f.readlines()
    for line in lines:
      word, freq = line.strip('\r\n').split('\t')
      freq = int(freq)
      # Skip words whose frequency is below the threshold
      if freq < word_threshold:
        continue
      # Words in the list are unique and are added to _word_dic in order,
      # so the ID of the next word is the current length of _word_dic
      word_id = len(self._word_dic)
      self._word_dic[word] = word_id
  def sentence2id(self, sentence):
    # Convert a space-separated sentence into the IDs of its words in
    # _word_dic; a word that is not present gets -1
    sentence_id = [self._word_dic.get(word, -1)
            for word in sentence.split()]
    return sentence_id
train_list = Word_list(train_list_dir)

Define the TextData class to read and manage the data. Its __init__() function reads the train_segment.txt file produced above, splits each line at the tab into the category label and the sentence, and converts both into numeric IDs. If a sentence has more words than the maximum, the extra words are cut off; if it has fewer, it is padded with -1. The method _shuffle_data() shuffles the data, next_batch() returns inputs and labels one batch at a time, and get_size() returns the size of the vocabulary.

import numpy as np

class TextData:
  def __init__(self, segment_file, word_list):
    self.inputs = []
    self.labels = []
    # Map the text categories to numeric IDs; the keys must match the
    # category written at the start of each line of segment_file
    self.label_dic = {'Sports': 0, 'Campus': 1, 'female': 2, 'Publishing': 3}
    self.index = 0
    # Vocabulary size; it is used later as the row count of the embedding table
    self.content_size = len(word_list._word_dic)
    with open(segment_file, 'r') as f:
      lines = f.readlines()
      for line in lines:
        # Each line is split at the tab: category first, sentence after
        label, content = line.strip('\r\n').split('\t')[0:2]
        # Convert the category to a numeric ID
        label_id = self.label_dic.get(label)
        # Convert the sentence to an array of word IDs
        content_id = word_list.sentence2id(content)
        # If the sentence is too long, keep only the first max_words IDs
        content_id = content_id[0:max_words]
        # If it is too short, pad with -1 up to max_words
        padding_num = max_words - len(content_id)
        content_id = content_id + [-1 for i in range(padding_num)]
        # Store the sample and its label
        self.inputs.append(content_id)
        self.labels.append(label_id)
    self.inputs = np.asarray(self.inputs, dtype=np.int32)
    self.labels = np.asarray(self.labels, dtype=np.int32)
    # Shuffle once so that a batch mixes the categories
    self._shuffle_data()
  # Shuffle the data, keeping (input, label) pairs together
  def _shuffle_data(self):
    r_index = np.random.permutation(len(self.inputs))
    self.inputs = self.inputs[r_index]
    self.labels = self.labels[r_index]
  # Return the data of one batch
  def next_batch(self, batch_size):
    # Current index + batch size gives the end index of the batch
    end_index = self.index + batch_size
    # If the end index exceeds the number of samples, shuffle all samples
    # and start again from the beginning
    if end_index > len(self.inputs):
      self._shuffle_data()
      self.index = 0
      end_index = batch_size
    # Slice one batch by index
    batch_inputs = self.inputs[self.index:end_index]
    batch_labels = self.labels[self.index:end_index]
    self.index = end_index
    return batch_inputs, batch_labels
  # Get the size of the vocabulary
  def get_size(self):
    return self.content_size
# Training data set object
train_set = TextData(train_segment_dir, train_list)
# print(train_set.next_batch(10))
# Vocabulary size of the training data set
train_list_size = train_set.get_size()

2.3. Building the computational graph model

Define the function create_model to build the computational graph. First, define the model's placeholders: the input text inputs, the labels outputs, and the dropout keep probability keep_prob.

First, the embedding layer is constructed: the embedding codes of the input IDs are looked up and stacked into a matrix. For example, for the input [1, 8, 3], embedding[1], embedding[8] and embedding[3] are extracted and stacked into a matrix.

Next, the LSTM network is constructed. Two layers are built here, and the number of nodes in each layer is given by the lstm_nodes array defined earlier. Each cell is created with tf.contrib.rnn.BasicLSTMCell and then wrapped with dropout. The two cells are merged into one LSTM network, and tf.nn.dynamic_rnn feeds embedded_inputs into it to get the output rnn_output. This is a three-dimensional array whose second dimension is the time step. Only the result of the last step is kept, that is, the one at index -1.

Next, the fully connected layers are defined with tf.layers.dense. After a dropout operation, the output is mapped to the num_classes categories to get the predicted logits.

Then the loss, accuracy and other evaluation values are calculated. The cross-entropy loss between the predicted logits and the labels outputs is computed, then argmax gives the predicted class, from which the accuracy is calculated.

Next, the training operation is defined. The gradients are clipped before being applied to the variables, which prevents them from exploding.

Finally, the placeholders, the loss and other evaluation values, and the training operations are returned to the caller.
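
The lookup for the input [1, 8, 3] can be sketched without TensorFlow. The helper names make_embedding_table and embed are hypothetical, and the table is filled with random values only to have something to look up.

```python
import random

def make_embedding_table(vocab_size, embedding_size, seed=0):
    """One row of floating-point values per word ID."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(embedding_size)]
            for _ in range(vocab_size)]

def embed(word_ids, table):
    """Pick the row of every ID and stack the rows into a matrix."""
    return [table[i] for i in word_ids]

table = make_embedding_table(vocab_size=10, embedding_size=4)
matrix = embed([1, 8, 3], table)
print(len(matrix), len(matrix[0]))  # 3 4
```

The result has one row per input word and embedding_size columns, which is the shape the LSTM receives for each sentence.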
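
What the loss and accuracy compute can be shown in plain Python. This is a sketch of the underlying arithmetic with hypothetical function names, not the TensorFlow implementation.

```python
import math

def sparse_softmax_cross_entropy(logits, label):
    """-log(softmax(logits)[label]), computed in a numerically stable way."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def accuracy(batch_logits, labels):
    """Fraction of samples whose argmax equals the label."""
    correct = 0
    for logits, label in zip(batch_logits, labels):
        predicted = max(range(len(logits)), key=logits.__getitem__)
        correct += int(predicted == label)
    return correct / len(labels)

# Hypothetical logits for two samples with five classes
batch = [[1, 1, 5, 3, 2], [0, 9, 1, 1, 1]]
print(accuracy(batch, [2, 0]))  # 0.5
print(sparse_softmax_cross_entropy([0.0, 0.0, 0.0, 0.0], 2))
```

For four equal logits every class has probability 1/4, so the loss is log(4). Taking argmax of the logits gives the same class as argmax of their softmax, because softmax preserves order.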
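
Clipping by global norm rescales all gradients together when their combined norm exceeds a limit. The sketch below shows that rule on nested Python lists; the function name mirrors the TensorFlow one, but the code is a simplified illustration.

```python
import math

def clip_by_global_norm(grads, clip_norm):
    """Scale every gradient by clip_norm / max(global_norm, clip_norm)."""
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = clip_norm / max(global_norm, clip_norm)
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, global_norm

# Two hypothetical gradient vectors with global norm 5
clipped, norm = clip_by_global_norm([[3.0], [4.0]], clip_norm=1.0)
print(norm, clipped)
```

All gradients are multiplied by the same factor, so their direction is kept and only the length of the update is limited. Gradients whose global norm is already below the limit pass through unchanged.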

import math

import tensorflow as tf  # TensorFlow 1.x API (tf.placeholder, tf.contrib)

# Create the computational graph model
def create_model(list_size, num_classes):
  # Define the input and output placeholders
  inputs = tf.placeholder(tf.int32, (batch_size, max_words))
  outputs = tf.placeholder(tf.int32, (batch_size,))
  # Dropout keep probability
  keep_prob = tf.placeholder(tf.float32, name='keep_rate')
  # Record the total number of training steps
  global_steps = tf.Variable(tf.zeros([], tf.float32), name='global_steps', trainable=False)
  # Convert the input to embedding codes
  with tf.variable_scope('embedding',
              initializer=tf.random_normal_initializer(-1.0, 1.0)):
    embeddings = tf.get_variable('embedding', [list_size, embedding_size], tf.float32)
    # Look up the embedding rows of the given IDs
    embedded_inputs = tf.nn.embedding_lookup(embeddings, inputs)
  # Implement the LSTM network
  scale = 1.0 / math.sqrt(embedding_size + lstm_nodes[-1]) / 3.0
  lstm_init = tf.random_uniform_initializer(-scale, scale)
  with tf.variable_scope('lstm_nn', initializer=lstm_init):
    # Build two LSTM layers; layer i has lstm_nodes[i] nodes
    cells = []
    for i in range(lstm_layers):
      cell = tf.contrib.rnn.BasicLSTMCell(lstm_nodes[i], state_is_tuple=True)
      # Apply dropout to the cell output
      cell = tf.contrib.rnn.DropoutWrapper(cell, output_keep_prob=keep_prob)
      # Collect the cell so that MultiRNNCell receives both layers
      cells.append(cell)
    # Merge the two LSTM cells
    cell = tf.contrib.rnn.MultiRNNCell(cells)
    # Feed embedded_inputs into the RNN
    initial_state = cell.zero_state(batch_size, tf.float32)
    # rnn_output: [batch_size, num_timestep, lstm_nodes[-1]]
    rnn_output, _ = tf.nn.dynamic_rnn(cell, embedded_inputs, initial_state=initial_state)
    last_output = rnn_output[:, -1, :]
  # Build the fully connected layers
  fc_init = tf.uniform_unit_scaling_initializer(factor=1.0)
  with tf.variable_scope('fc', initializer=fc_init):
    fc1 = tf.layers.dense(last_output, fc_nodes, activation=tf.nn.relu, name='fc1')
    fc1_drop = tf.contrib.layers.dropout(fc1, keep_prob)
    logits = tf.layers.dense(fc1_drop, num_classes, name='fc2')
  # Define the evaluation metrics
  with tf.variable_scope('matrics'):
    # Calculate the loss
    softmax_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=outputs)
    loss = tf.reduce_mean(softmax_loss)
    # Predicted class: index of the maximum along dimension 1,
    # for example argmax of [1, 1, 5, 3, 2] is 2
    y_pred = tf.argmax(tf.nn.softmax(logits), 1, output_type=tf.int32)
    correct_prediction = tf.equal(outputs, y_pred)
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
  # Define the training operation
  with tf.variable_scope('train_op'):
    train_var = tf.trainable_variables()
    # for var in train_var:
    #   print(var)
    # Clip the gradients by global norm to keep them from exploding
    grads, _ = tf.clip_by_global_norm(tf.gradients(loss, train_var), clip_norm=lstm_grads)
    # Apply the gradients to the variables
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
    train_op = optimizer.apply_gradients(zip(grads, train_var), global_steps)
  # Return the results as tuples
  return ((inputs, outputs, keep_prob),
      (loss, accuracy),
      (train_op, global_steps))
# Call the build function and unpack what it returns
placeholders, matrics, others = create_model(train_list_size, num_classes)
inputs, outputs, keep_prob = placeholders
loss, accuracy = matrics
train_op, global_steps = others

2.4. Training

Run the computational graph in a session: fetch the training data from train_set batch by batch, feed it into the placeholders, and call sess.run to obtain the loss, accuracy and other intermediate values for printing.

init_op = tf.global_variables_initializer()
train_keep_prob = 0.8  # dropout keep probability for the training set
train_steps = 10000
with tf.Session() as sess:
  # Initialize all variables before the first training step
  sess.run(init_op)
  for i in range(train_steps):
    # Get the training data by batch
    batch_inputs, batch_labels = train_set.next_batch(batch_size)
    # Run the computational graph
    res = sess.run([loss, accuracy, train_op, global_steps],
            feed_dict={inputs: batch_inputs, outputs: batch_labels,
                 keep_prob: train_keep_prob})
    loss_val, acc_val, _, g_step_val = res
    if g_step_val % 20 == 0:
      print('Step %d, loss: %3.3f, accuracy: %3.5f' % (g_step_val, loss_val, acc_val))

After 10000 training steps on my dataset, the accuracy on the training set hovered around 90%.


Source code and related data files:
