New version of Xiaobai, implemented with the ChatterBot framework.
- Core: the core framework, a modified ChatterBot
- Corpus: English and Chinese training corpora
- Nltk_data: the required NLTK data sets
Text similarity algorithms
Levenshtein Distance Algorithm
Chinese name: 莱文斯坦距离 (Levenshtein distance)
Reference: https://en.wikipedia.org/wiki/Levenshtein_distance
Levenshtein distance measures the difference between two strings. It defines three standard edit operations (substitution, insertion and deletion) that can be applied to turn one string into the other, and each operation applied adds 1 to the distance. Take the words kitten and sitting as an example: kitten needs two substitutions (k to s, e to i) and one insertion (g at the end) to become sitting, so the Levenshtein distance between them is 3. By definition, the larger the distance, the more the two strings differ.
The Levenshtein distance between two strings a and b is denoted lev(a, b).
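The formula itself does not appear in the text here. For reference, the usual recursive definition (the one given in the Wikipedia article cited above) can be written as follows, where |a| is the length of a, a[0] is its first character, and tail(a) is a without its first character. The names lev and tail are notation chosen for this sketch:

$$
\operatorname{lev}(a, b) =
\begin{cases}
|a| & \text{if } |b| = 0, \\
|b| & \text{if } |a| = 0, \\
\operatorname{lev}(\operatorname{tail}(a), \operatorname{tail}(b)) & \text{if } a[0] = b[0], \\
1 + \min
\begin{cases}
\operatorname{lev}(\operatorname{tail}(a), b) \\
\operatorname{lev}(a, \operatorname{tail}(b)) \\
\operatorname{lev}(\operatorname{tail}(a), \operatorname{tail}(b))
\end{cases} & \text{otherwise.}
\end{cases}
$$

The three branches inside the minimum correspond to deletion, insertion and substitution respectively.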
Edit distance is a basic NLP algorithm for measuring text similarity. It can serve as an important feature in text similarity tasks such as spell checking, plagiarism detection for papers, and gene sequence analysis. Its shortcoming is equally clear: it is computed purely from the surface form of the text, so it cannot capture any semantic information.
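As an illustration of the edit distance described above, here is a minimal Python sketch using two-row dynamic programming. It is not the code used in Core; the function name `levenshtein` is chosen for this example only.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the Levenshtein distance between a and b (illustrative sketch)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory

    # previous[j] holds the distance between the current prefix of a and b[:j];
    # initially the prefix of a is empty, so the distance is j insertions.
    previous = list(range(len(b) + 1))

    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match when cost is 0)
            ))
        previous = current

    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3, matching the example above
```

Because only character-level edits are counted, two sentences with the same meaning but different wording still get a large distance, which is the limitation noted above.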
SpacySimilarity is a method for computing semantic similarity.
Reference: https://spacy.io/api/doc#similarity
It first averages the word vectors of each sentence to obtain a semantic representation of that sentence, then computes the cosine similarity between the representations of the two sentences.
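The two steps above (average the word vectors, then take the cosine) can be sketched without any model download. The vectors in `TOY_VECTORS` are invented for illustration and all function names are hypothetical. In practice the vectors would come from a trained model, for example through spaCy's `Doc.similarity`.

```python
import math

# Hypothetical 3-dimensional word vectors, made up purely for this example.
TOY_VECTORS = {
    "i": [0.1, 0.3, 0.5],
    "like": [0.7, 0.2, 0.1],
    "love": [0.8, 0.3, 0.1],
    "cats": [0.2, 0.9, 0.4],
    "dogs": [0.3, 0.8, 0.5],
}


def sentence_vector(tokens, vectors):
    """Average the vectors of the known tokens; unknown tokens are skipped."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[k] for v in known) / len(known) for k in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def sentence_similarity(s1, s2, vectors=TOY_VECTORS):
    """Whitespace-tokenize both sentences and compare their averaged vectors."""
    v1 = sentence_vector(s1.lower().split(), vectors)
    v2 = sentence_vector(s2.lower().split(), vectors)
    if v1 is None or v2 is None:
        return 0.0  # assumption of this sketch: no known words means no similarity
    return cosine_similarity(v1, v2)


print(sentence_similarity("I like cats", "I love dogs"))
```

Whitespace tokenization is a simplification that does not work for Chinese text, which would need a word segmenter first.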
Given two sets A and B, the Jaccard coefficient is defined as the ratio of the size of their intersection to the size of their union:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

When both A and B are empty, J(A, B) is defined as 1.
The Jaccard distance is the complementary measure and describes the dissimilarity between sets: the larger the Jaccard distance, the lower the similarity of the samples. It is defined as:

$$d_J(A, B) = 1 - J(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}$$
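A minimal sketch of the Jaccard coefficient and distance on sets of words follows. The function names are chosen for this example, and splitting on whitespace is an assumption. It is not necessarily how the project tokenizes text.

```python
def jaccard_coefficient(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union; 1.0 if both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def jaccard_distance(a: set, b: set) -> float:
    """Dissimilarity between two sets: one minus the Jaccard coefficient."""
    return 1.0 - jaccard_coefficient(a, b)


x = set("how are you".split())
y = set("how old are you".split())

print(jaccard_coefficient(x, y))  # 3 shared words out of 4 distinct words: 0.75
print(jaccard_distance(x, y))     # 0.25
```

Like edit distance, this compares surface tokens only, so synonyms count as entirely different elements.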
Try some other methods of calculating text similarity based on word vectors:
- average the word vectors and compute similarity
- TF-IDF weighted average of word vectors
- weighted word vectors with PCA
- sentence semantic similarity based on deep learning
- train with a new, larger corpus
- access Tai to retrieve chat corpus
- learn from user conversations with Xiaobai V3's dual engine