Littlewhiteone basic design


LittleWhite One

A new version of Xiaobai, implemented with the ChatterBot framework.

Brief introduction

  • Core: the core framework, a modified ChatterBot
  • Corpus: English and Chinese training corpora
  • Nltk_data: the required NLTK data sets

Text similarity algorithms

  • LevenshteinDistance
  • SpacySimilarity
  • JaccardSimilarity

Levenshtein Distance Algorithm

Chinese name: 莱文斯坦距离 (Levenshtein distance)

Levenshtein distance measures the difference between two strings. Three standard edit operations are defined: substitution, insertion and deletion. Each operation applied adds 1 to the distance, and the Levenshtein distance is the minimum number of operations needed to turn one string into the other. Taking the words kitten and sitting as an example, kitten needs two substitutions (k to s, e to i) and one insertion (g at the end) to become sitting, so the Levenshtein distance is 3. By definition, the larger the distance, the more the two strings differ.

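The Levenshtein distance between two strings a and b, of lengths |a| and |b|, is lev_{a,b}(|a|, |b|), where:

$$
\mathrm{lev}_{a,b}(i, j) =
\begin{cases}
\max(i, j) & \text{if } \min(i, j) = 0, \\
\min
\begin{cases}
\mathrm{lev}_{a,b}(i-1, j) + 1 \\
\mathrm{lev}_{a,b}(i, j-1) + 1 \\
\mathrm{lev}_{a,b}(i-1, j-1) + 1_{(a_i \neq b_j)}
\end{cases} & \text{otherwise.}
\end{cases}
$$

Here 1_{(a_i \neq b_j)} is 0 when a_i = b_j and 1 otherwise. The three terms inside the minimum correspond to deletion, insertion and substitution.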

Edit distance is a basic NLP algorithm for measuring text similarity. It can serve as an important feature in text similarity tasks such as spell checking, duplicate detection in papers and gene sequence analysis. Its shortcoming is equally clear: it is computed purely from the surface structure of the text, so it cannot capture semantic information.
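As a minimal sketch of the edit distance described in this section, the following computes it with dynamic programming, keeping only two rows of the table. The function names are hypothetical, and the normalization into a 0 to 1 similarity score is an assumption for illustration, not necessarily what the project's comparison class does.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    # previous[j] is the distance between the prefix of a processed so far
    # (initially the empty string) and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Map the distance into [0, 1]; this normalization is an assumption."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein_distance(a, b) / longest


print(levenshtein_distance("kitten", "sitting"))  # 3
```

Dividing by the length of the longer string keeps the score comparable across sentence pairs of different lengths, which is convenient when ranking candidate responses.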

SpacySimilarity algorithm

SpacySimilarity is a semantic similarity measure.


First, average the word vectors of each sentence to obtain a semantic representation of that sentence. Then compute the cosine similarity between the semantic representations of the two sentences.
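The two steps above (average the word vectors, then take the cosine) can be sketched without any NLP library. The vectors in TOY_VECTORS are made-up three-dimensional values for illustration only, and all function names are hypothetical. A real setup would load pretrained vectors instead.

```python
import math

# Hypothetical toy word vectors; real word vectors have hundreds of dimensions.
TOY_VECTORS = {
    "hello": [0.9, 0.1, 0.0],
    "hi": [0.8, 0.2, 0.1],
    "weather": [0.0, 0.9, 0.4],
    "today": [0.1, 0.5, 0.8],
}


def sentence_vector(tokens, vectors):
    """Average the vectors of all known tokens; None if no token is known."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return None
    dimensions = len(known[0])
    return [sum(vec[k] for vec in known) / len(known) for k in range(dimensions)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sentence_similarity(tokens_a, tokens_b, vectors=TOY_VECTORS):
    """Cosine similarity of the averaged word vectors of two token lists."""
    vec_a = sentence_vector(tokens_a, vectors)
    vec_b = sentence_vector(tokens_b, vectors)
    if vec_a is None or vec_b is None:
        return 0.0  # assumption: sentences with no known words score 0
    return cosine_similarity(vec_a, vec_b)


print(sentence_similarity(["hello"], ["hi"]))
print(sentence_similarity(["hello"], ["weather", "today"]))
```

With these toy values the first pair scores higher than the second. Because averaging discards word order, two sentences containing the same words in a different order receive the same representation.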

JaccardSimilarity algorithm


Given two sets A and B, the Jaccard coefficient is defined as the ratio of the size of the intersection of A and B to the size of their union:

$$
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
$$

When both sets A and B are empty, J(A, B) is defined as 1.

The measure related to the Jaccard coefficient is the Jaccard distance, which describes the dissimilarity between sets. The larger the Jaccard distance, the lower the similarity of the samples. It is defined as:

$$
d_J(A, B) = 1 - J(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}
$$
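A minimal sketch of the Jaccard coefficient and Jaccard distance as defined in this section, applied to sets of tokens. The function names are hypothetical, and splitting on whitespace is an assumption that only suits space-delimited text. Chinese input would need a word segmenter first.

```python
def jaccard_coefficient(a, b) -> float:
    """Size of the intersection divided by the size of the union."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 1.0  # both sets empty: defined as 1
    return len(set_a & set_b) / len(union)


def jaccard_distance(a, b) -> float:
    """Dissimilarity between two sets: 1 minus the Jaccard coefficient."""
    return 1.0 - jaccard_coefficient(a, b)


tokens_a = "how are you today".split()
tokens_b = "how are you".split()
print(jaccard_coefficient(tokens_a, tokens_b))  # 0.75
print(jaccard_distance(tokens_a, tokens_b))     # 0.25
```

Since the inputs are converted to sets, repeated words count once and word order is ignored.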


Algorithm optimization

  • [] Try other methods of computing text similarity based on word vectors:

    • [] Compute similarity from the plain average of word vectors
    • [] Compute similarity from the TF-IDF weighted average of word vectors
    • [] Compute similarity from weighted word vectors with PCA
  • [] Compute the semantic similarity of sentences with deep learning

Training optimization

  • [] Train with a new large-scale corpus
  • [] Access Tai to retrieve chat corpus
  • [] Learn from user conversations with LittleWhite V3's dual engine