Python for NLP: deep learning text generation using Keras

Date: 2021-07-30

Original link: http://tecdat.cn/?p=8448

Text generation is one of the latest applications of NLP. Deep learning techniques have been used for a variety of text generation tasks, such as writing poetry, generating movie scripts, and even composing music. In this article, however, we will look at a very simple example of text generation in which, given an input string of words, we will predict the next word. We will use the raw text of Shakespeare's famous play Macbeth and predict the next word from a given sequence of input words.

After completing this article, you will be able to perform text generation with a dataset of your choice.

Import libraries and datasets

The first step is to import the libraries and datasets needed to execute the scripts in this article. The following code imports the required libraries:

import numpy as np
from keras.models import Sequential, load_model
from keras.layers import Dense, Embedding, LSTM, Dropout

The next step is to download the dataset. We will download the dataset using Python's nltk library and then list the files available in the Gutenberg corpus:

import nltk
nltk.download('gutenberg')

from nltk import corpus
print(corpus.gutenberg.fileids())

You should see the following output:

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']

The shakespeare-macbeth.txt file contains the raw text of the play "Macbeth". To read the text from this file, you can use the raw method of the gutenberg corpus reader:

macbeth_text = corpus.gutenberg.raw('shakespeare-macbeth.txt')

Let's start by printing the first 500 characters of the dataset:

print(macbeth_text[:500])

This is the output:

Actus Primus. Scoena Prima.

Thunder and Lightning. Enter three Witches.

  1. When shall we three meet againe?
In Thunder, Lightning, or in Raine?
  2. When the Hurley-burley's done,
When the Battaile's lost, and wonne

   3. That will be ere the set of Sunne

   1. Where the place?
  2. Vpon the Heath

   3. There to meet with Macbeth

   1. I come, Gray-Malkin

   All. Padock calls anon: faire is foule, and foule is faire,
Houer through

You will see that the text contains many special characters and numbers. The next step is to clean up the dataset.

Data preprocessing

To remove punctuation and special characters, we will define a function called preprocess_text():

import re

def preprocess_text(sen):
    # remove punctuation marks and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sen)
    # remove single characters left over from the previous step
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)
    # collapse multiple spaces into a single space
    sentence = re.sub(r'\s+', ' ', sentence)
    return sentence.lower()

The preprocess_text function takes a text string as an argument and returns a cleaned, lowercase text string.

Now let's clean the text and again print the first 500 characters:

macbeth_text = preprocess_text(macbeth_text)
macbeth_text[:500]

This is the output:

the tragedie of macbeth by william shakespeare actus primus scoena prima thunder and lightning enter three witches when shall we three meet againe in thunder lightning or in raine when the hurley burley done when the battaile lost and wonne that will be ere the set of sunne where the place vpon the heath there to meet with macbeth come gray malkin all padock calls anon faire is foule and foule is faire houer through the fogge and filthie ayre exeunt scena secunda alarum within enter king malcom

Convert words to numbers

Deep learning models are based on statistical algorithms. Therefore, in order to use a deep learning model, we need to convert words into numbers.

In this article, we will use a very simple approach in which each word is converted to a single integer. Before we can convert words to integers, we need to tokenize the text into individual words.

The following script tokenizes the text in our dataset and then prints the total number of words and the number of unique words in the dataset:

from nltk.tokenize import word_tokenize

nltk.download('punkt')  # tokenizer models required by word_tokenize

# tokenize the cleaned text into a list of words
macbeth_text_words = word_tokenize(macbeth_text)
n_words = len(macbeth_text_words)
unique_words = len(set(macbeth_text_words))

print('Total Words: %d' % n_words)
print('Unique Words: %d' % unique_words)

The output is as follows:

Total Words: 17250
Unique Words: 3436

Our text has a total of 17250 words, of which 3436 are unique. To convert the tokenized words to numbers, you can use the Tokenizer class from the keras.preprocessing.text module. You need to call the fit_on_texts method and pass it the list of words. A dictionary will be created in which the keys represent words and the integers represent the corresponding dictionary values.

Look at the following script:

from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=3437)
tokenizer.fit_on_texts(macbeth_text_words)

To access the dictionary that contains the words and their corresponding indexes, you can use the word_index attribute of the tokenizer object:

vocab_size = len(tokenizer.word_index) + 1
word_2_index = tokenizer.word_index

If you check the length of the dictionary, it will contain 3436 words, which is the total number of unique words in our dataset.
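
For example, as a quick sanity check using the variables defined above:

print(len(word_2_index))  # 3436, the number of unique words in the dataset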

Now let's print the word at index 500 of the tokenized text, together with its integer value from the dictionary:

print(macbeth_text_words[500])
print(word_2_index[macbeth_text_words[500]])

This is the output:

comparisons
1456

Modify data shape

An LSTM accepts data in a three-dimensional format (number of samples, number of time steps, features per time step). Since the output will be a single word, its shape will be two-dimensional (number of samples, number of unique words in the corpus).

The following script modifies the shape of the input sequence and the corresponding output.

input_sequence = []
output_words = []
input_seq_length = 100

for i in range(0, n_words - input_seq_length, 1):
    # the 100 words starting at position i form one input sequence
    in_seq = macbeth_text_words[i:i + input_seq_length]
    # the word that immediately follows is the corresponding output
    out_seq = macbeth_text_words[i + input_seq_length]
    input_sequence.append([word_2_index[word] for word in in_seq])
    output_words.append(word_2_index[out_seq])

In the above script, we declare two empty lists, input_sequence and output_words. The input_seq_length is set to 100, which means that each input sequence will consist of 100 words. Next, we execute a loop in which, in the first iteration, the integer values of the first 100 words of the text are appended to the input_sequence list, and the 101st word is appended to the output_words list. In the second iteration, the sequence of words from the second word to the 101st word is stored in input_sequence, the 102nd word is stored in output_words, and so on. Since there are 17250 words in the dataset, a total of 17150 input sequences are generated (100 fewer than the total number of words).

Now let's print the first sequence in the input_sequence list:

print(input_sequence[0])

Output:

[1, 869, 4, 40, 60, 1358, 1359, 408, 1360, 1361, 409, 265, 2, 870, 31, 190, 291, 76, 36, 30, 190, 327, 128, 8, 265, 870, 83, 8, 1362, 76, 1, 1363, 1364, 86, 76, 1, 1365, 354, 2, 871, 5, 34, 14, 168, 1, 292, 4, 649, 77, 1, 220, 41, 1, 872, 53, 3, 327, 12, 40, 52, 1366, 1367, 25, 1368, 873, 328, 355, 9, 410, 2, 410, 9, 355, 1369, 356, 1, 1370, 2, 874, 169, 103, 127, 411, 357, 149, 31, 51, 1371, 329, 107, 12, 358, 412, 875, 1372, 51, 20, 170, 92, 9]

Let's normalize the input sequences by dividing the integers in each sequence by the largest integer value (the vocabulary size). The same step also converts the output into a two-dimensional one-hot encoded format.
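
A minimal sketch of this step, assuming the input_sequence, output_words, input_seq_length, and vocab_size variables defined earlier (to_categorical is Keras's one-hot encoding utility):

from keras.utils import to_categorical

# reshape the input to (samples, time steps, features) and scale it to [0, 1]
X = np.reshape(input_sequence, (len(input_sequence), input_seq_length, 1))
X = X / float(vocab_size)

# one-hot encode the output words into a 2D matrix
y = to_categorical(output_words)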

The following script prints the shapes of the input and the corresponding output:

print("X shape:", X.shape)
print("y shape:", y.shape)

Output:

X shape: (17150, 100, 1)
y shape: (17150, 3437)

Training model

The next step is to train our model. There are no hard and fast rules on how many layers and neurons should be used to train the model.

We will create three LSTM layers, each with 800 neurons. Finally, a dense output layer with one neuron per unique word (3437 units with softmax activation) is added to predict the next word, as follows:

model = Sequential()
model.add(LSTM(800, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(LSTM(800, return_sequences=True))
model.add(LSTM(800))
model.add(Dense(y.shape[1], activation='softmax'))
model.summary()
model.compile(loss='categorical_crossentropy', optimizer='adam')

Since the output word can be one of 3436 unique words, our problem is a multi-class classification problem, hence the categorical_crossentropy loss function is used. For binary classification, the binary_crossentropy function would be used instead. When you execute the above script, you will see the model summary:

Model: "sequential_1"
\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_
Layer (type)                 Output Shape              Param #
=================================================================
lstm_1 (LSTM)                (None, 100, 800)          2566400
\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_
lstm_2 (LSTM)                (None, 100, 800)          5123200
\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_
lstm_3 (LSTM)                (None, 800)               5123200
\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_
dense_1 (Dense)              (None, 3437)              2753037
=================================================================
Total params: 15,565,837
Trainable params: 15,565,837
Non-trainable params: 0

To train the model, we can simply use the fit() method:

model.fit(X, y, batch_size=64, epochs=10, verbose=1)
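
Training for 10 epochs can take a long time, especially on a CPU. As a small optional sketch (not part of the original walkthrough; the file name is illustrative), the load_model import from the beginning can be used to save and restore the trained model so the prediction step can be rerun without retraining:

# save the trained model to disk, then load it back later
model.save('macbeth_generator.h5')
model = load_model('macbeth_generator.h5')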

Making predictions

To make a prediction, we will randomly select a sequence from the input_sequence list, convert it into a 3D shape, and then pass it to the predict() method of the trained model. The predicted index value is then passed to the index_2_word dictionary, in which the word indexes are used as keys. The index_2_word dictionary returns the word that corresponds to the index passed in as a key.

The following script randomly selects an integer sequence and outputs the corresponding word sequence:

index_2_word = dict(map(reversed, word_2_index.items()))
random_seq = input_sequence[np.random.randint(0, len(input_sequence)-1)]
word_sequence = [index_2_word[value] for value in random_seq]
print(' '.join(word_sequence))

For the run shown in this article, the following sequence was randomly selected:

amen when they did say god blesse vs lady consider it not so deepely mac but wherefore could not pronounce amen had most need of blessing and amen stuck in my throat lady these deeds must not be thought after these wayes so it will make vs mad macb me thought heard voyce cry sleep no more macbeth does murther sleepe the innocent sleepe sleepe that knits vp the rauel sleeue of care the death of each dayes life sore labors bath balme of hurt mindes great natures second course chiefe nourisher in life feast lady what doe you meane

Next, we will print the 100 words that follow the above sequence of words:

for i in range(100):
    int_sample = np.reshape(random_seq, (1, len(random_seq), 1))
    int_sample = int_sample / float(vocab_size)
    # predict the next word and map its index back to the actual word
    predicted_word_index = model.predict(int_sample, verbose=0)
    predicted_word_id = np.argmax(predicted_word_index)
    word_sequence.append(index_2_word[predicted_word_id])
    # slide the window forward: append the prediction, drop the first word
    random_seq.append(predicted_word_id)
    random_seq = random_seq[1:len(random_seq)]

The word_sequence variable now contains our input sequence of words along with the next 100 predicted words. The word_sequence variable is a list of words, so we can simply join the words in this list to obtain the final output sequence, as shown below:

final_output = ""
for word in word_sequence:
...
print(final_output)

This is the final output:

amen when they did say god blesse vs lady consider it not so deepely mac but wherefore could not pronounce amen had most need of blessing and amen stuck in my throat lady these deeds must not be thought after these wayes so it will make vs mad macb me thought heard voyce cry sleep no more macbeth does murther sleepe the innocent sleepe sleepe that knits vp the rauel sleeue of care the death of each dayes life sore labors bath balme of hurt mindes great natures second course chiefe nourisher in life feast lady what doe you meane and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and

Conclusion

In this article, we saw how to create a text generation model using deep learning with Python's Keras library. The repetitive output above is not surprising for a model trained for only 10 epochs on a fairly small corpus: the network falls back on the most frequent word. To achieve better results, you can train the model for more epochs and on a larger corpus.