Implementation of a custom Chinese full-text index in Neo4j

Time: 2020-12-10

Generally, the first step in optimizing database retrieval efficiency is the index; more complex measures such as load balancing, read-write separation, and distributed horizontal/vertical sharding of databases and tables come later, as demand requires. An index improves retrieval efficiency through information redundancy: it trades space for time and slows down writes, so choosing the indexed fields carefully matters.

Neo4j can create an index on the nodes of a specified label; when a matching node property is added or updated, the index is updated automatically. Neo4j indexes are implemented with Lucene by default (this is customizable, e.g. the R-tree index used by the spatial plugin). By default, however, a newly created index only supports exact matching (get). For fuzzy queries you need a full-text index, which means controlling how Lucene tokenizes the text behind the scenes.
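For reference, an ordinary exact-match schema index on the text property of AddressNode nodes can be created through the embedded Java API. This is only a minimal sketch, assuming the same embedded graphDBService (Neo4j 2.x/3.x API) used later in this article:

// Create a schema (exact-match) index on AddressNode.text;
// it is maintained automatically as nodes are added or updated.
try (Transaction tx = graphDBService.beginTx()) {
    graphDBService.schema()
            .indexFor(DynamicLabel.label("AddressNode"))
            .on("text")
            .create();
    tx.success();
}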

The default tokenizers of Neo4j's full-text indexes target Western languages: exact queries use Lucene's keyword analyzer and full-text queries use a whitespace tokenizer, neither of which is meaningful for Chinese. For Chinese word segmentation you therefore need to plug in a Chinese analyzer such as IK Analyzer or Ansj; deep-learning-based segmentation services such as pullword are more powerful still.

This article shows how to create a new full-text index in Neo4j to support fuzzy queries, using the widely used IK Analyzer tokenizer as an example.

IK Analyzer tokenizer

IK Analyzer is an open-source, lightweight Chinese word segmentation toolkit developed in Java.

IK Analyzer 3.0 features

  • It uses a unique "forward iteration fine-grained segmentation algorithm", supports both fine-grained and maximum-word-length segmentation modes, and processes about 830,000 characters per second (1600 KB/s).

  • It uses a multi-subprocessor analysis mode that segments English letters, digits, and Chinese words, and is compatible with Korean and Japanese characters. Its optimized dictionary storage keeps memory consumption small, and user dictionary extensions are supported.

  • IKQueryParser, a query analyzer optimized for Lucene full-text retrieval (enthusiastically recommended by the author), introduces a simple search expression and uses an ambiguity analysis algorithm to optimize the arrangement and combination of query keywords, which can greatly improve the hit rate of Lucene searches.

IK Analyzer is currently not available from the Maven central repository, so it has to be downloaded and installed into the local repository by hand (a private Maven repository on GitHub would be a good place for toolkits like this that are missing from Maven central); see the example install command below.
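A typical local install looks like the following; the jar file name and Maven coordinates are illustrative and should match the version actually downloaded:

mvn install:install-file -Dfile=IKAnalyzer3.2.8.jar \
    -DgroupId=org.wltea.analyzer -DartifactId=ik-analyzer \
    -Dversion=3.2.8 -Dpackaging=jar

The groupId/artifactId chosen here only need to agree with the dependency declared in the project's pom.xml.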

IK Analyzer custom user dictionary

Dictionary file

Dictionary files use the .dic suffix and must be saved as UTF-8 without a BOM.
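For example, an extension dictionary (ext.dic) is simply one entry per line; the entries below are purely illustrative:

唯亭镇
金陵路
有限公司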

Dictionary configuration

Mind the paths of the dictionaries and the IKAnalyzer.cfg.xml configuration file: IKAnalyzer.cfg.xml must sit at the root of src (the classpath root). The dictionaries can be placed anywhere, as long as they are configured correctly in IKAnalyzer.cfg.xml. With the configuration below, ext.dic and stopword.dic must be in the same directory as the configuration file.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extended configuration</comment>

    <!-- users can configure their own extension dictionary here -->
    <entry key="ext_dict">/ext.dic;</entry>

    <!-- users can configure their own extension stop-word dictionary here -->
    <entry key="ext_stopwords">/stopword.dic</entry>
</properties>
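A quick way to verify that IK Analyzer (and the extension dictionary) is being picked up is to run some text through it with the standard Lucene TokenStream API. This is only a sketch; it assumes the IK Analyzer jar and the Lucene version bundled with Neo4j are on the classpath, and the sample sentence is illustrative:

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class IKAnalyzerSmokeTest {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new IKAnalyzer();
        // Tokenize an illustrative sentence and print every term IK Analyzer produces
        TokenStream ts = analyzer.tokenStream("text", new StringReader("苏州教育有限公司"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.toString());
        }
        ts.close();
    }
}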

Building the Neo4j full-text index

Specify IKAnalyzer as the Lucene analyzer and create a new full-text index on the chosen property of all matching nodes:

@Override
public void createAddressNodeFullTextIndex() {
    try (Transaction tx = graphDBService.beginTx()) {
        IndexManager index = graphDBService.index();
        // Legacy (manual) index backed by Lucene, using IKAnalyzer as the analyzer
        Index<Node> addressNodeFullTextIndex = index.forNodes("addressNodeFullTextIndex",
                MapUtil.stringMap(IndexManager.PROVIDER, "lucene", "analyzer", IKAnalyzer.class.getName()));

        ResourceIterator<Node> nodes = graphDBService.findNodes(DynamicLabel.label("AddressNode"));
        while (nodes.hasNext()) {
            Node node = nodes.next();
            // Index the text property of each AddressNode (skip nodes without it)
            Object text = node.getProperty("text", null);
            if (text != null) {
                addressNodeFullTextIndex.add(node, "text", text);
            }
        }
        tx.success();
    }
}
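Unlike schema indexes, a legacy index built this way is not maintained automatically, so its entries have to be refreshed whenever the indexed property changes. A minimal sketch, assuming node is the modified node and the index name is the one created above:

// Refresh the full-text entry for a node whose text property has changed
try (Transaction tx = graphDBService.beginTx()) {
    Index<Node> fullTextIndex = graphDBService.index().forNodes("addressNodeFullTextIndex");
    fullTextIndex.remove(node, "text");                        // drop the stale entry for this key
    fullTextIndex.add(node, "text", node.getProperty("text")); // re-index the new value
    tx.success();
}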

Neo4j full-text index test

With a single keyword (such as 'limited company') or a multi-keyword fuzzy query (such as 'Suzhou Education Company'), matching nodes can now be retrieved, and the results come back sorted by relevance.

package uadb.tr.neodao.test;

import org.junit.Test;
import org.junit.runner.RunWith;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.index.Index;
import org.neo4j.graphdb.index.IndexHits;
import org.neo4j.graphdb.index.IndexManager;
import org.neo4j.helpers.collection.MapUtil;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.test.context.ContextConfiguration;
import org.springframework.test.context.junit4.SpringJUnit4ClassRunner;
import org.wltea.analyzer.lucene.IKAnalyzer;

import com.lt.uadb.tr.entity.adtree.AddressNode;
import com.lt.util.serialize.JsonUtil;

/**
 * AddressNodeNeoDaoTest
 *
 * @author geosmart
 */
@RunWith(SpringJUnit4ClassRunner.class)
@ContextConfiguration(locations = { "classpath:app.neo4j.cfg.xml" })
public class AddressNodeNeoDaoTest {
    @Autowired
    GraphDatabaseService graphDBService;

    @Test
    public void test_selectAddressNodeByFullTextIndex() {
        try (Transaction tx = graphDBService.beginTx()) {
            IndexManager index = graphDBService.index();
            Index<Node> addressNodeFullTextIndex = index.forNodes("addressNodeFullTextIndex",
                    MapUtil.stringMap(IndexManager.PROVIDER, "lucene", "analyzer", IKAnalyzer.class.getName()));
            // Fuzzy, multi-keyword query against the full-text index; hits come back sorted by relevance
            IndexHits<Node> foundNodes = addressNodeFullTextIndex.query("text", "Suzhou Education Company");
            for (Node node : foundNodes) {
                AddressNode entity = JsonUtil.ConvertMap2POJO(node.getAllProperties(), AddressNode.class, false, true);
                // Print the real full address name of the matched entity
                // (hypothetical getter name; the exact accessor comes from the AddressNode entity class)
                System.out.println(entity.getAddrRealFullName());
            }
            tx.success();
        }
    }
}

Using the custom full-text index in Cypher

Regular query

profile  
match (a:AddressNode{ruleabbr:'TOW', text:'Weiting town'})<-[r:BELONGTO]-(b:AddressNode{ruleabbr:'STR'})
where b.text =~ 'Jinling.*'
return a,b

Full-text index query

profile
START b=node:addressNodeFullTextIndex("text:Jinling*")
match (a:AddressNode{ruleabbr:'TOW', text:'Weiting town'})<-[r:BELONGTO]-(b:AddressNode)
where b.ruleabbr='STR'
return a,b

Combining exact and full-text legacy indexes

Nodes with the AddressNode label are classified by the node property ruleabbr: the province, city, district/county, township, street, and residential (property) community levels go into addressnode_fulltext_index, while the house number, building number, unit number, floor number, and room number levels go into addressnode_exact_index. A different type of index is thus created on the text property depending on the node's level.
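Creating the two legacy indexes might look like the following sketch; the index names match the query below, and "type" of "exact" / "fulltext" are the standard legacy-index configuration options. Nodes would then be added to whichever index matches their ruleabbr, just as in createAddressNodeFullTextIndex above:

try (Transaction tx = graphDBService.beginTx()) {
    IndexManager index = graphDBService.index();

    // Full-text (IK-analyzed) index for coarse levels such as province / city / street
    Index<Node> fulltextIndex = index.forNodes("addressnode_fulltext_index",
            MapUtil.stringMap(IndexManager.PROVIDER, "lucene",
                    "type", "fulltext",
                    "analyzer", IKAnalyzer.class.getName()));

    // Exact index for fine levels such as house / building / unit / floor / room numbers
    Index<Node> exactIndex = index.forNodes("addressnode_exact_index",
            MapUtil.stringMap(IndexManager.PROVIDER, "lucene", "type", "exact"));

    tx.success();
}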

profile
START a=node:addressnode_fulltext_index("text:Mall"), b=node:addressnode_exact_index("text:phase II 19")
match (a:AddressNode{ruleabbr:'STR'})-[r:BELONGTO]-(b:AddressNode{ruleabbr:'TAB'})
return a,b limit 10