PyTorch implementation of a recurrent neural network for text generation

Date: 2020-11-01

By Dr. Vaibhav Kumar
Translated by VK
Source: Analytics India Magazine

Natural language processing (NLP) has many interesting applications, and text generation is one of them.

Sequence models such as recurrent neural networks (RNNs), LSTMs, and GRUs can learn to generate the next elements of an input text sequence.

PyTorch provides a powerful set of tools and libraries that give momentum to these NLP tasks: it not only requires less preprocessing, but also speeds up training.

In this article, we will train a recurrent neural network (RNN) in PyTorch. After successful training, the model will generate language-specific names that begin with the input letters.

PyTorch implementation

This implementation is done in Google Colab, with the dataset retrieved from Google Drive. So, first, we will mount Google Drive in the Colab notebook.

from google.colab import drive
drive.mount('/content/gdrive')
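
After mounting, it is worth verifying that the dataset files are visible. A quick check (the path below is the one this article uses later; adjust it to wherever the name files live in your Drive):

import glob
print(glob.glob('gdrive/My Drive/Dataset/data/data/names/*.txt'))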

Now we will import all the required libraries.

from __future__ import unicode_literals, print_function, division
from io import open
import glob
import os
import unicodedata
import string
import torch
import torch.nn as nn
import random
import time
import math
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

The following code fragment reads the dataset: a set of plain-text files, one file per language, with one name per line.

all_let = string.ascii_letters + " .,;'-"
n_let = len(all_let) + 1  # plus one for the EOS marker

def getFiles(path):
  return glob.glob(path)

#Unicode string to ASCII
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_let
    )
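
For example, the normalization strips diacritics while keeping the base letters:

print(unicodeToAscii('Ślusàrski'))  # prints: Slusarski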

#Read a file and divide it into lines
def getLines(filename):
    lines = open(filename, encoding='utf-8').read().strip().split('\n')
    return [unicodeToAscii(line) for line in lines]

# Build the cat_lin dictionary, which maps each category (language) to its list of names
cat_lin = {}
all_ctg = []
for filename in getFiles('gdrive/My Drive/Dataset/data/data/names/*.txt'):
    categ = os.path.splitext(os.path.basename(filename))[0]
    all_ctg.append(categ)
    lines = getLines(filename)
    cat_lin[categ] = lines

n_ctg = len(all_ctg)
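
A quick sanity check that the data loaded correctly (the exact categories and names depend on the .txt files in your dataset folder):

print(n_ctg, 'categories:', all_ctg)
print(cat_lin[all_ctg[0]][:5])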

In the next step, we will define the module class that generates names. The module is a recurrent neural network.

class NameGeneratorModule(nn.Module):
    def __init__(self, inp_size, hid_size, op_size):
        super(NameGeneratorModule, self).__init__()
        self.hid_size = hid_size

        # Each layer sees the category, the current letter, and the hidden state
        self.i2h = nn.Linear(n_ctg + inp_size + hid_size, hid_size)
        self.i2o = nn.Linear(n_ctg + inp_size + hid_size, op_size)
        self.o2o = nn.Linear(hid_size + op_size, op_size)
        self.dropout = nn.Dropout(0.1)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, category, input, hidden):
        # Concatenate the category, input letter, and hidden state
        inp_comb = torch.cat((category, input, hidden), 1)
        hidden = self.i2h(inp_comb)
        output = self.i2o(inp_comb)
        # Combine the new hidden state with the raw output for a second layer
        op_comb = torch.cat((hidden, output), 1)
        output = self.o2o(op_comb)
        output = self.dropout(output)
        output = self.softmax(output)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, self.hid_size)
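
A single step of this network takes one-hot tensors for the category and the current letter, plus the hidden state, and returns log-probabilities over the n_let characters together with the new hidden state. A quick shape check (rnn_test is a throwaway instance used only for illustration):

rnn_test = NameGeneratorModule(n_let, 128, n_let)
cat = torch.zeros(1, n_ctg); cat[0][0] = 1
inp = torch.zeros(1, n_let); inp[0][0] = 1
out, hid = rnn_test(cat, inp, rnn_test.initHidden())
print(out.shape, hid.shape)  # torch.Size([1, 59]) torch.Size([1, 128])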

The following functions select a random item from a list and a random (category, line) training pair.

def randChoice(l):
    return l[random.randint(0, len(l) - 1)]

def randTrainPair():
    category = randChoice(all_ctg)
    line = randChoice(cat_lin[category])
    return category, line
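
Each call returns a random (language, name) pair from the dataset, for example (the output varies from call to call):

for _ in range(3):
    print(randTrainPair())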

The following functions convert the data into a format the RNN module can consume: a one-hot tensor for the category, a sequence of one-hot tensors for the input letters, and a tensor of target letter indices with the EOS marker appended.

# One-hot tensor for the category
def categ_Tensor(categ):
    li = all_ctg.index(categ)
    tensor = torch.zeros(1, n_ctg)
    tensor[0][li] = 1
    return tensor

# One-hot tensors for the letters of a line, from first to last (no EOS)
def inp_Tensor(line):
    tensor = torch.zeros(len(line), 1, n_let)
    for li in range(len(line)):
        letter = line[li]
        tensor[li][0][all_let.find(letter)] = 1
    return tensor

def tgt_Tensor(line):
    # Indices of the second letter through the last, followed by the EOS marker
    letter_id = [all_let.find(line[li]) for li in range(1, len(line))]
    letter_id.append(n_let - 1)  # EOS
    return torch.LongTensor(letter_id)
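
For instance, for the (hypothetical) name 'Kai', the input is a (3, 1, n_let) one-hot tensor, and the target holds the indices of 'a', 'i', and EOS:

print(inp_Tensor('Kai').shape)  # torch.Size([3, 1, 59])
print(tgt_Tensor('Kai'))        # tensor([ 0,  8, 58])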

The following function creates a random training example, consisting of the category, input, and target tensors; the training loop below calls it as rand_train_exp().
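
This helper simply chains the conversions defined above; a minimal version consistent with the rest of the code:

def rand_train_exp():
    category, line = randTrainPair()
    category_tensor = categ_Tensor(category)
    input_line_tensor = inp_Tensor(line)
    target_line_tensor = tgt_Tensor(line)
    return category_tensor, input_line_tensor, target_line_tensor

Next, we define the loss criterion, the learning rate, and the training function, which runs one name through the network letter by letter.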

#Loss
criterion = nn.NLLLoss()
#Learning rate
lr_rate = 0.0005

def train(category_tensor, input_line_tensor, target_line_tensor):
    target_line_tensor.unsqueeze_(-1)
    hidden = model.initHidden()

    model.zero_grad()

    loss = 0

    # Feed the name one letter at a time and accumulate the loss
    for i in range(input_line_tensor.size(0)):
        output, hidden = model(category_tensor, input_line_tensor[i], hidden)
        l = criterion(output, target_line_tensor[i])
        loss += l

    loss.backward()

    # Manual SGD step: move each parameter against its gradient
    for p in model.parameters():
        p.data.add_(p.grad.data, alpha=-lr_rate)

    return output, loss.item() / input_line_tensor.size(0)
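
The parameter loop above is a manual stochastic gradient descent step. An equivalent formulation with PyTorch's optimizer API (a sketch, not part of the original article) would create the optimizer once and use it inside train():

optimizer = torch.optim.SGD(model.parameters(), lr=lr_rate)
# inside train(): replace model.zero_grad() with optimizer.zero_grad(),
# and replace the manual parameter loop with optimizer.step()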

To display the elapsed time during training, we define the following helper function.

def time_taken(since):
    now = time.time()
    s = now - since
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)

In the next step, we will instantiate the RNN model with 128 hidden units.

model = NameGeneratorModule(n_let, 128, n_let)

Printing the model shows its layers and their parameters.

print(model)
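
With the 18 language files from the PyTorch tutorial dataset (n_ctg = 18, so each concatenated input has 18 + 59 + 128 = 205 features), the printout would look like this:

NameGeneratorModule(
  (i2h): Linear(in_features=205, out_features=128, bias=True)
  (i2o): Linear(in_features=205, out_features=59, bias=True)
  (o2o): Linear(in_features=187, out_features=59, bias=True)
  (dropout): Dropout(p=0.1, inplace=False)
  (softmax): LogSoftmax(dim=1)
)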

Next, we will train the model for 100,000 iterations.

epochs = 100000
print_every = 5000
plot_every = 500
all_losses = []
total_loss = 0  # reset every plot_every iterations

start = time.time()

for iter in range(1, epochs + 1):
    output, loss = train(*rand_train_exp())
    total_loss += loss

    if iter % print_every == 0:
        print('Time: %s, Iteration: %d (%d%%), Loss: %.4f' % (time_taken(start), iter, iter / epochs * 100, loss))

    if iter % plot_every == 0:
        all_losses.append(total_loss / plot_every)
        total_loss = 0

We will visualize the training loss.

plt.figure(figsize=(7,7))
plt.title("Loss")
plt.plot(all_losses)
plt.xlabel("Iterations (x500)")
plt.ylabel("Loss")
plt.show()

Finally, we will test the model by generating language-specific names from a given initial letter.

max_length = 20

# Sample from a category and a starting letter
def sample_model(category, start_letter='A'):
    with torch.no_grad():  # no need to track history while sampling
        category_tensor = categ_Tensor(category)
        input = inp_Tensor(start_letter)
        hidden = model.initHidden()

        output_name = start_letter

        for i in range(max_length):
            output, hidden = model(category_tensor, input[0], hidden)
            # Pick the most likely next letter
            topv, topi = output.topk(1)
            topi = topi[0][0]
            if topi == n_let - 1:  # EOS marker
                break
            else:
                letter = all_let[topi]
                output_name += letter
            input = inp_Tensor(letter)

        return output_name

#Get multiple samples from a category and multiple starting letters
def sample_names(category, start_letters='XYZ'):
    for start_letter in start_letters:
        print(sample_model(category, start_letter))

Now we will sample from the trained model, generating names for a given language and initial letters.

print("Italian:-")
sample_names('Italian', 'BPRT')
print("\nKorean:-")
sample_names('Korean', 'CMRS')
print("\nRussian:-")
sample_names('Russian', 'AJLN')
print("\nVietnamese:-")
sample_names('Vietnamese', 'LMT')

So, as we saw above, our model generated names that belong to the requested language category and begin with the input letters.

References:

  1. Trung Tran, “Text Generation with Pytorch”.
  2. “NLP from scratch: Generating names with a character level RNN”, PyTorch Tutorial.
  3. Francesca Paulin, “Character-Level LSTM in PyTorch”, Kaggle.

Link to the original text: https://analyticsindiamag.com/recurrent-neural-network-in-pytorch-for-text-generation/
