The Text Must Flow: Generating Fluent Text

Time: 2020-10-25

By Aaron Abrahamson
Source: Towards Data Science

Training a text generation model on Dune

Dune is the story of a distant feudal society. It focuses on a duke and his family, who are forced to become administrators of the desert planet Arrakis. Frank Herbert published the classic in 1965, and elements of almost any modern science fiction can be traced back to Dune.

I recently finished Dune's sequel, Dune Messiah, and just started the third book in the series, Children of Dune. Herbert wrote the first six books, and his son wrote many more; I haven't read those.

I've been exploring text generation models, and I thought it would be fun to try one on Dune. Many "classic" machine learning models are used for prediction or clustering. Generative modeling lets a model create new data resembling the data it was trained on. A recent example of what generative modeling can do is StyleGAN; take a look at this video: https://www.youtube.com/watch?v=kSLJriaOumA

Here's a link to the Colab notebook I used for this project: https://drive.google.com/file/d/15Z7SNBnBL12acmUGvvMLQ-OoMspb-B5k/view?usp=sharing

The process

  • Obtain the text corpus.

  • Clean the data. There were some stray Unicode characters, and the word "page" appeared wherever there was a page break, which was useless. Each chapter also opened with an excerpt from an in-world memoir or book, and I decided to take those out. I also deleted the second half of each chapter to help with processing time.

  • Tokenize. This removes the punctuation, lowercases the text, and splits the long string into individual word tokens. The model will learn the order and frequency of these tokens. Note that for this NLP task we do not remove stop words (see the first sketch after this list).

  • Build the model. Make sure you use an LSTM layer and that the output layer is the size of the vocabulary. Essentially, given a small amount of seed text, the model works out what the next word is likely to be (see the second sketch after this list).

  • Train the model. Keras suggests at least 20 epochs; I ran 33.

  • Generate text. I'll show some of the model's output below (a generation sketch also follows this list).
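
The article doesn't reproduce the notebook code inline, so here is a minimal sketch of the tokenization and sequence-preparation step using Keras' Tokenizer. Names like corpus, dune_cleaned.txt, and seq_length are my own illustrative assumptions, not the author's:

    import numpy as np
    from tensorflow.keras.preprocessing.text import Tokenizer

    # Hypothetical file holding the cleaned book text as one long string.
    corpus = open("dune_cleaned.txt").read()

    # Keras' Tokenizer lowercases and strips punctuation by default;
    # we deliberately keep stop words for this task.
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts([corpus])
    tokens = tokenizer.texts_to_sequences([corpus])[0]
    vocab_size = len(tokenizer.word_index) + 1  # +1 for the reserved 0 index

    # Slide a fixed-length window over the token stream: the first
    # seq_length words are the input, the following word is the label.
    seq_length = 50  # illustrative window size
    sequences = np.array([tokens[i : i + seq_length + 1]
                          for i in range(len(tokens) - seq_length)])
    X, y = sequences[:, :-1], sequences[:, -1]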
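
And a sketch of the model itself: an embedding layer feeding LSTMs, with a softmax output layer the size of the vocabulary, as the list above describes. The layer sizes are guesses, not the notebook's actual values:

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Embedding, LSTM, Dense

    model = Sequential([
        # Map each word index to a dense vector; 100 dims is a guess.
        Embedding(input_dim=vocab_size, output_dim=100),
        LSTM(128, return_sequences=True),  # feed the full sequence onward
        LSTM(128),
        Dense(128, activation="relu"),
        # One probability per vocabulary word, as described above.
        Dense(vocab_size, activation="softmax"),
    ])

    # Integer labels, so sparse categorical cross-entropy avoids
    # one-hot-encoding a vocabulary-sized target.
    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer="adam",
                  metrics=["accuracy"])

    model.fit(X, y, batch_size=128, epochs=33)  # the author ran 33 epochs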
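
Finally, a sketch of the generation loop: feed in a seed word, predict the next token, append it, and repeat.

    from tensorflow.keras.preprocessing.sequence import pad_sequences

    def generate(seed_text, n_words=50):
        """Repeatedly predict the next word and append it to the text."""
        text = seed_text
        for _ in range(n_words):
            encoded = tokenizer.texts_to_sequences([text])[0]
            # Keep only the last seq_length tokens, left-padded with zeros.
            encoded = pad_sequences([encoded], maxlen=seq_length)
            probs = model.predict(encoded, verbose=0)[0]
            next_id = int(np.argmax(probs))  # greedy decoding
            text += " " + tokenizer.index_word.get(next_id, "")
        return text

    print(generate("baron"))

Greedy argmax decoding like this is one likely reason a model falls into repetitive loops like the "Baron Of The Baron Of..." output below; sampling from the softmax distribution, optionally with a temperature, is the usual remedy.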

First test: the Baron

I wanted to test the model partway through training to see what the results looked like. The seed word is "Baron", the villain of the book.

‘Baron The Baron Of The Baron Of The Baron Of The Baron Of The Baron Of The Baron Of The Baron Of The Baron Of The Baron Of The Baron Of The Baron Of The Baron Of The Baron Of The Baron Of The Baron Of The Baron Of The Baron Of The Baron Of The Baron Of The Baron Of The Baron Of The Baron Of The Baron’

It goes on like that the whole way. Not good at all.

After 33 epochs the model works much better, but it still gets stuck in loops, constantly emitting strings of nouns. Here is the output for the seed word "Spice":

The Spice Itself Stood Out To The Left Wall The Fremen Seeker Followed The Chains The Troop Was A Likely Shadow And The Natural Place Of The Great Places That Was A Subtle City Of The Room'S Features That The Man Was A Master Of The Cavern The Growing The Bronze The Sliding Hand

Here's the output for "Paul":

Paul Stood Unable To The Duke And The Reverend Mother Ramallo To The Guard Captain And The Man Looked At Him And The Child Was A Relief One Of The Fremen Had Been In The Doorway And The Fedaykin Control Them To Be Like The Spice Diet Out Of The Wind And The Duke Said I Am The Fremen To Get The Banker Said When The Emperor Asked His Fingers Nefud I Know You Can Take The Duchy Of Government The Sist The Duke Said He Turned To The Hand Beside The Table The Baron Asked The Emperor Will Hold

Here is the output for "she looked":

'She Looked At The Transparent End Of The Table Saw A Small Board In The Room And The Way Of The Old Woman He Had Been Sent By The Wind Of The Duke And The Worms They Had Seen The Waters Of The Desert And The Sandworms The Troop Had Been Subtly Prepared By The Wind Of The Worm Had Been Subtly Always In The Deep Sinks Of The Women And The Duke Had Been Given Last Of Course But The Others Had Been In The Fremen Had Been Shaped On The Light Of The Light Of The Hall Had Had Seen'

Thoughts and next steps

This is definite progress. I wanted to train it for at least 100 epochs, but progress was slow: each epoch took about 11 minutes, so 100 epochs would take over 18 hours. I need a better computer.

Finally, the irony of this project is not lost on me. In the Dune universe, at some point in the ancient past, "thinking machines" rebelled against humanity and nearly wiped it out. By the era of the books, computers have been replaced by "mentats": humans trained to imitate the computational power of computers.

Link to the original article: https://towardsdatascience.com/the-text-must-flow-3bb4edff7b5b
