Text generation is one of the latest applications of NLP. Deep learning techniques have been used for a variety of text generation tasks, such as writing poetry, generating movie scripts and even composing music. However, in this article we will look at a very simple example of text generation where, given an input string of words, we predict the next word. We will use the raw text of Shakespeare's famous play Macbeth and predict the next word from a given sequence of input words.
After completing this article, you will be able to perform text generation using a dataset of your choice.
Import libraries and datasets
The first step is to import the libraries and datasets needed to execute the scripts in this article. The following code imports the required libraries:
```python
import numpy as np
import re
from nltk import corpus
from keras.models import Sequential, load_model
from keras.layers import Dense, Embedding, LSTM, Dropout
```

The `re` and `corpus` imports are used by the cleaning and file-reading scripts later in the article.
The next step is to download the dataset. We will download the dataset using Python’s nltk library.
You should see the following output:
```
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
```
The file `shakespeare-macbeth.txt` contains the raw text of the play "Macbeth". To read the text from this file, you can use the `raw` method of the `gutenberg` corpus:
```python
macbeth_text = corpus.gutenberg.raw('shakespeare-macbeth.txt')
```
Let's start by printing the first 500 characters of the dataset:
This is the output:
Actus Primus. Scoena Prima. Thunder and Lightning. Enter three Witches. 1. When shall we three meet againe? In Thunder, Lightning, or in Raine? 2. When the Hurley-burley's done, When the Battaile's lost, and wonne 3. That will be ere the set of Sunne 1. Where the place? 2. Vpon the Heath 3. There to meet with Macbeth 1. I come, Gray-Malkin All. Padock calls anon: faire is foule, and foule is faire, Houer through
You will see that the text contains many special characters and numbers. The next step is to clean up the dataset.
To remove punctuation and special characters, we will define a function called `preprocess_text()`:
```python
def preprocess_text(sen):
    # Remove punctuation marks and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sen)

    # Remove single-character words left behind by the previous step
    sentence = re.sub(r'\s+[a-zA-Z]\s+', ' ', sentence)

    # Collapse runs of whitespace into a single space
    sentence = re.sub(r'\s+', ' ', sentence)

    return sentence.lower()
```
The `preprocess_text()` function accepts a text string as a parameter and returns a cleaned text string in lowercase.
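As a quick self-contained check of this cleaning approach (the function is repeated here so the snippet runs on its own; the sample line is taken from the raw text shown earlier):

```python
import re

def preprocess_text(sen):
    sentence = re.sub('[^a-zA-Z]', ' ', sen)             # drop punctuation and digits
    sentence = re.sub(r'\s+[a-zA-Z]\s+', ' ', sentence)  # drop stray single characters
    sentence = re.sub(r'\s+', ' ', sentence)             # collapse runs of whitespace
    return sentence.lower()

print(preprocess_text("1. When shall we three meet againe?"))
# → ' when shall we three meet againe '
```

Note that the digits and punctuation become spaces, which the final substitution collapses into single spaces.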
Now let's clean the text and print the first 500 characters again:
```python
macbeth_text = preprocess_text(macbeth_text)
macbeth_text[:500]
```
This is the output:
the tragedie of macbeth by william shakespeare actus primus scoena prima thunder and lightning enter three witches when shall we three meet againe in thunder lightning or in raine when the hurley burley done when the battaile lost and wonne that will be ere the set of sunne where the place vpon the heath there to meet with macbeth come gray malkin all padock calls anon faire is foule and foule is faire houer through the fogge and filthie ayre exeunt scena secunda alarum within enter king malcom
Convert words to numbers
Deep learning models are based on statistical algorithms. Therefore, in order to work with a deep learning model, we need to convert words into numbers.
In this article, we will use a very simple approach where each word is converted into a single integer. Before we can convert words to integers, we need to tokenize the text into individual words.
The following script tokenizes the text in our dataset and then prints the total number of words as well as the number of unique words:
```python
from nltk.tokenize import word_tokenize

macbeth_text_words = word_tokenize(macbeth_text)
n_words = len(macbeth_text_words)
unique_words = len(set(macbeth_text_words))

print('Total Words: %d' % n_words)
print('Unique Words: %d' % unique_words)
```
The output is as follows:
```
Total Words: 17250
Unique Words: 3436
```
Our text has a total of 17250 words, of which 3436 are unique. To convert the tokenized words to numbers, you can use the `Tokenizer` class from `keras.preprocessing.text`. You need to call the `fit_on_texts` method and pass it the list of words. A dictionary will be created in which the keys are the words and the values are the corresponding integer indexes.
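Conceptually, the mapping the tokenizer builds can be sketched without Keras: each unique word gets an integer index, with more frequent words receiving smaller indexes (index 0 is reserved). A toy illustration:

```python
from collections import Counter

words = ['the', 'cat', 'sat', 'on', 'the', 'mat']

# Rank words by frequency; the most frequent word gets index 1
counts = Counter(words)
word_2_index = {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

print(word_2_index)  # {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5}
```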
Look at the following script:
```python
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=3437)
tokenizer.fit_on_texts(macbeth_text_words)
```
To access the dictionary that contains the words and their corresponding indexes, you can use the `word_index` attribute of the tokenizer object:
```python
vocab_size = len(tokenizer.word_index) + 1
word_2_index = tokenizer.word_index
```
If you check the length of the dictionary, it will contain 3436 words, which is the total number of unique words in our dataset.
Now let's look up the 500th word of the text and its integer value in the dictionary.
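The lookup itself is ordinary list indexing plus a dictionary lookup. A self-contained sketch with toy stand-ins for `macbeth_text_words` and `word_2_index` (the real script would index position 500 of the full word list):

```python
# Toy stand-ins for the variables built from the real corpus
macbeth_text_words = ['the', 'tragedie', 'of', 'macbeth']
word_2_index = {'the': 1, 'tragedie': 2, 'of': 3, 'macbeth': 4}

n = 3  # the article uses n = 500 on the full corpus
print(macbeth_text_words[n])                # macbeth
print(word_2_index[macbeth_text_words[n]])  # 4
```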
Modify data shape
LSTMs accept data in a 3-dimensional format (number of samples, number of time steps, features per time step). Since the output will be a single word, its shape will be 2-dimensional (number of samples, number of unique words in the corpus).
The following script modifies the shape of the input sequence and the corresponding output.
```python
input_sequence = []
output_words = []
input_seq_length = 100

for i in range(0, n_words - input_seq_length, 1):
    in_seq = macbeth_text_words[i:i + input_seq_length]
    out_seq = macbeth_text_words[i + input_seq_length]
    input_sequence.append([word_2_index[word] for word in in_seq])
    output_words.append(word_2_index[out_seq])
```
In the script above, we declare two empty lists, `input_sequence` and `output_words`. The `input_seq_length` is set to 100, which means that each input sequence will consist of 100 words. We then execute a loop: in the first iteration, the integer values of the first 100 words of the text are appended to the `input_sequence` list, and the 101st word is appended to the `output_words` list. In the second iteration, the words from the second through the 101st are stored in `input_sequence` and the 102nd word in `output_words`, and so on. Since there are 17250 words in the dataset, a total of 17150 input sequences will be generated (100 fewer than the total number of words).
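The windowing logic can be illustrated on a toy list of "word ids" with a window of 3 instead of 100:

```python
# Six word ids, window length 3 -> 6 - 3 = 3 training pairs
words = [10, 20, 30, 40, 50, 60]
input_seq_length = 3

input_sequence = []
output_words = []
for i in range(0, len(words) - input_seq_length):
    input_sequence.append(words[i:i + input_seq_length])  # 3-word input window
    output_words.append(words[i + input_seq_length])      # the word that follows it

print(input_sequence)  # [[10, 20, 30], [20, 30, 40], [30, 40, 50]]
print(output_words)    # [40, 50, 60]
```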
Now let's print the first sequence in the `input_sequence` list:
```
[1, 869, 4, 40, 60, 1358, 1359, 408, 1360, 1361, 409, 265, 2, 870, 31, 190, 291, 76, 36, 30, 190, 327, 128, 8, 265, 870, 83, 8, 1362, 76, 1, 1363, 1364, 86, 76, 1, 1365, 354, 2, 871, 5, 34, 14, 168, 1, 292, 4, 649, 77, 1, 220, 41, 1, 872, 53, 3, 327, 12, 40, 52, 1366, 1367, 25, 1368, 873, 328, 355, 9, 410, 2, 410, 9, 355, 1369, 356, 1, 1370, 2, 874, 169, 103, 127, 411, 357, 149, 31, 51, 1371, 329, 107, 12, 358, 412, 875, 1372, 51, 20, 170, 92, 9]
```
Next, we normalize the input sequences by dividing each integer in them by `vocab_size`, the largest integer value, and convert the outputs to a 2-dimensional one-hot encoded format.
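A self-contained sketch of this normalization and one-hot step, using toy values (the article's real `X` ends up with shape `(17150, 100, 1)`; `keras.utils.to_categorical` can replace the manual one-hot encoding below):

```python
import numpy as np

# Toy stand-ins: 3 sequences of length 4 over a vocabulary of size 6
input_sequence = [[1, 2, 3, 4], [2, 3, 4, 5], [3, 4, 5, 1]]
output_words = [5, 1, 2]
input_seq_length = 4
vocab_size = 6

# Reshape to (samples, time steps, features) and scale into [0, 1]
X = np.reshape(input_sequence, (len(input_sequence), input_seq_length, 1))
X = X / float(vocab_size)

# One-hot encode the output words (what keras.utils.to_categorical does)
y = np.zeros((len(output_words), vocab_size))
y[np.arange(len(output_words)), output_words] = 1

print(X.shape)  # (3, 4, 1)
print(y.shape)  # (3, 6)
```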
The following script prints the shapes of the inputs and the corresponding outputs:
```python
print("X shape:", X.shape)
print("y shape:", y.shape)
```
```
X shape: (17150, 100, 1)
y shape: (17150, 3437)
```
The next step is to train our model. There are no hard and fast rules on how many layers and neurons should be used to train the model.
We will create three LSTM layers, each with 800 neurons. Finally, a dense layer with 3437 neurons (one for each possible output word) and softmax activation will be added to predict the next word, as follows:
```python
model = Sequential()
model.add(LSTM(800, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(LSTM(800, return_sequences=True))
model.add(LSTM(800))
model.add(Dense(y.shape[1], activation='softmax'))

model.summary()
model.compile(loss='categorical_crossentropy', optimizer='adam')
```
Since the output word can be one of 3436 unique words, our problem is a multi-class classification problem, which is why the `categorical_crossentropy` loss function is used. In the case of binary classification, the `binary_crossentropy` function is used instead. After executing the above script, you should see the model summary:
```
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
lstm_1 (LSTM)                (None, 100, 800)          2566400
_________________________________________________________________
lstm_2 (LSTM)                (None, 100, 800)          5123200
_________________________________________________________________
lstm_3 (LSTM)                (None, 800)               5123200
_________________________________________________________________
dense_1 (Dense)              (None, 3437)              2753037
=================================================================
Total params: 15,565,837
Trainable params: 15,565,837
Non-trainable params: 0
```
To train the model, we can simply use the `fit()` method:
```python
model.fit(X, y, batch_size=64, epochs=10, verbose=1)
```
For predictions, we will randomly select a sequence from the `input_sequence` list, convert it into a 3-dimensional shape, and then pass it to the `predict()` method of the trained model. The predicted index value is then passed to the `index_2_word` dictionary, which uses word indexes as keys and returns the corresponding word.
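Building `index_2_word` is just a reversal of the tokenizer's word-to-index dictionary. A self-contained sketch with a toy mapping:

```python
# Toy stand-in for tokenizer.word_index
word_2_index = {'the': 1, 'cat': 2, 'sat': 3}

# Swap keys and values so integer predictions can be mapped back to words
index_2_word = dict(map(reversed, word_2_index.items()))

print(index_2_word)  # {1: 'the', 2: 'cat', 3: 'sat'}
```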
The following script randomly selects an integer sequence and outputs the corresponding word sequence:
```python
random_seq_index = np.random.randint(0, len(input_sequence) - 1)
random_seq = input_sequence[random_seq_index]

index_2_word = dict(map(reversed, word_2_index.items()))
word_sequence = [index_2_word[value] for value in random_seq]

print(' '.join(word_sequence))
```
For the script in this article, the following sequence was randomly selected:
amen when they did say god blesse vs lady consider it not so deepely mac but wherefore could not pronounce amen had most need of blessing and amen stuck in my throat lady these deeds must not be thought after these wayes so it will make vs mad macb me thought heard voyce cry sleep no more macbeth does murther sleepe the innocent sleepe sleepe that knits vp the rauel sleeue of care the death of each dayes life sore labors bath balme of hurt mindes great natures second course chiefe nourisher in life feast lady what doe you meane
Next, we will print the 100 words that follow the above sequence of words:
```python
for i in range(100):
    int_sample = np.reshape(random_seq, (1, len(random_seq), 1))
    int_sample = int_sample / float(vocab_size)

    predicted_word_index = model.predict(int_sample, verbose=0)
    predicted_word_id = np.argmax(predicted_word_index)

    word_sequence.append(index_2_word[predicted_word_id])

    # Slide the window: append the prediction and drop the first word
    random_seq.append(predicted_word_id)
    random_seq = random_seq[1:len(random_seq)]
```
The `word_sequence` variable now contains our input sequence of words along with the next 100 predicted words, stored as a list. We can simply join the words in this list to obtain the final output string, as follows:
```python
final_output = ""
for word in word_sequence:
    final_output = final_output + " " + word

print(final_output)
```
This is the final output:
amen when they did say god blesse vs lady consider it not so deepely mac but wherefore could not pronounce amen had most need of blessing and amen stuck in my throat lady these deeds must not be thought after these wayes so it will make vs mad macb me thought heard voyce cry sleep no more macbeth does murther sleepe the innocent sleepe sleepe that knits vp the rauel sleeue of care the death of each dayes life sore labors bath balme of hurt mindes great natures second course chiefe nourisher in life feast lady what doe you meane and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and
In this article, we saw how to create a text generation model using deep learning with Python's Keras library. As the output shows, the predictions quickly collapse into the most frequent word, "and", which suggests the model needs more training epochs, and ideally a larger corpus, to produce convincing text.