In order to realize the complex query of online library, are you still double writing?


1、 Online complex query is not supported in online library

Do online business developers often encounter such a problem: online database running slightly more complex queries, online business will hang up! Whether it is a stand-alone database such as mysql, PG or distributed database, HBase, mongodb and Cassandra all have this problem. Next, this paper takes HBase as an example to illustrate the problem, and the principles of other libraries are similar.

As a massive online storage engine, HBase is widely used in recommendation, risk control, Internet of things, portrait, form and other big data scenarios. Phoenix, as the SQL layer of HBase, greatly reduces the user‘s threshold, and realizes a large number of practical functions such as secondary index, salt table, dynamic column, etc. The underlying storage of HBase is based on LSM. LSM can change the random write of business into sequential write, which can effectively improve the write throughput. However, its query is only suitable for prefix matching of rowkey, and the query mode is single. Phoenix secondary index, the underlying is the index table associated with the original table, which is also prefix matching. A table can have multiple indexes, which can increase the query mode, but the number of indexes is not It can be too much, or the problem of write amplification will be more serious.

For more complex query scenarios, such as fuzzy search in form and log query, random condition combination in user profile, etc., the combination of HBase + Phoenix cannot be supported. This problem is a general problem of NoSQL online database based on LSM. In addition to HBase, Cassandra, leveldb, rocksdb and mongodb engines all have the same problems.

Some developers choose to do complex query on the standby database. However, as mentioned above, the query ability of online database itself is often limited, or it is very slow, or it cannot be found out, which can not meet the real-time requirements of online complex query.

2、 Problems in double writing

In order to realize the complex query of online library, are you still double writing?

In order to solve problem 1, users naturally think of using search engines, such as es, Solr, Lucene, etc. to solve this problem. Many users choose the way of double writing, that is, each record is written to the online library and search engine at the same time, which seems simple, but there are many problems in the actual use process. We have learned that customers who have better solutions to this set of solutions tend to invest a lot of time and manpower at the monthly level. Take double write HBase and Solr as examples. For example, several users have encountered many problems.

  1. Consistency is hard to guarantee
    It is difficult to ensure the consistency between online library and search engine. For example, if two links are written concurrently and there are modification operations, it is difficult to ensure that the write order of the same field in HBase is the same as that of the same doc in Solr. Then the data in HBase and Solr are inconsistent and the problems are difficult to troubleshoot. In addition, in the line library, only the data in the latest period of time needs to be saved, and the data beyond TTL will be automatically cleaned up Well, that’s what’s needed in Solr. But HBase makes TTL according to kV and Solr according to doc. There will also be inconsistencies between the two in data cleaning. There are a lot of inconsistent scenes, which will not be introduced here.
  2. Write performance degradation
    Under the same configuration, the throughput of HBase is much higher than that of Solr, which is due to many factors such as different starting points of software design and different optimization directions. If there are two writes, Solr’s write throughput will limit HBase’s write throughput.
  3. Synchronization of historical data
    Double write only solves the problem of new data, but not for historical data. Users need to solve the problem of batch synchronization of historical data by themselves. Especially, for the scenes that can not be stopped, how to solve the problem of mutual coverage with new data and historical data in the process of historical data rebuild is also a very difficult problem.
  4. Redundant storage space
    The retrieval engine specially solves the index problem, and its data storage format is more complex than the online library. The data of an online library may need to store multiple copies in the retrieval engine, such as the original data storage, inverted index storage, and the storage of docvalue for improving the aggregation and sorting columns. Then, there is bound to be the problem of storage redundancy, and how to reduce costs is also a big challenge.
  5. stability
    Double write requires HBase and Solr to ensure stability at the same time. If Solr fails, the write process will be blocked, which will affect online business.

3、 HBase + Solr is not easy to use

Alibaba cloud HBase Solr full-text search engine adopts the method of data conversion and synchronization at the system level to solve most problems encountered by users when using dual engines. However, users who have tried it will have an experience that it is too flexible to use, and the steps are complicated and easy to have problems. If it is not for senior players, it is difficult to control. Here are a few user pain points:

  1. Use threshold height
    Users need to understand HBase, Solr, indexer (data synchronization service), operate HBase shell, indexer command line and Solr interface at the same time to get through the process.
  2. It is difficult to ensure the alignment between HBase of schemaless and Solr data type of strong schema
    First of all, users need to define their own mapping from HBase column to Solr field; second, users need to ensure that the type actually written to HBase is correct. For example, a column in HBase corresponds to a long type in Solr, because the HBase API does not check whether the actual value written by the user is legal, resulting in a successful write of HBase, but synchronization to Solr is not successful. This requires users to write a set of type checking system based on HBase API, which is time-consuming and laborious.
  3. HBase + Solr are not friendly to solve the problem of redundant data storage
    Users need to decide whether to turn on the stored and docvalued options in Solr. For fields that only turn on the indexed option, users can get the final result data by reading back HBase. For fields that turn on the stored or docvalued options, the performance of returning the result directly from Solr will be better. This set of optimized logic needs to be managed and implemented by users themselves.

4、 Searchindex flexible and easy to use integrated online library engine

In order to realize the complex query of online library, are you still double writing?

Searchindex is a new index implementation of Alibaba cloud HBase SQL (Phoenix) based on HBase + Solr dual engines. Its architecture is shown in the figure above. Phoenix layer transforms SQL (DDL, DML) statements into specific operations for HBase and Solr. Searchservice is responsible for index synchronization, consistency, metadata management, etc. Searchservice will manage the corresponding relationship between timestamp in HBase and docversion in Solr to achieve final consistency. In short, the docversion of Solr row data is equal to the maximum timestamp of each column of the corresponding row of HBase that has been synchronized. When solving the disorder, if the previous new cell has been synchronized, the old cell will be directly lost. For TTL, we implement the HBase comparison mechanism based on rows to ensure the consistency.

Searchindex solves all the problems mentioned above. In just a few minutes, a few SQL statements can run through the whole process. You can refer to the quick start document. Phoenix strong type directly maps Solr type and supports complex types such as word segmentation and array. The optimization strategy of adaptive query better solves the problem of redundant data storage. Compared with HBase Solr full-text search engine, it greatly improves the usability and covers most of the scenarios and requirements. But at present, searchindex can’t completely replace HBase + Solr. For senior players, they prefer to write the flexibility brought by HBase API and Solr API directly. They can still choose to use HBase Solr full-text search engine.

Searchindex is an integrated cloud native online NoSQL database engine customized and developed for public cloud customers of Alibaba cloud. It has the characteristics of low cost, flexibility, ease of use and stability. It has been used in industries and scenarios such as logistics gun, offline payment form, e-commerce form, pharmaceutical experiment log, etc. the user data volume has reached tens of billions, and has experienced the test of the double 11. In the first step, users can only purchase HBase instances, full-text services and SQL services can be opened and upgraded separately. Welcome interested developers to communicate with us.

Author: Mingshuo

Read the original text

This is the original content of yunqi community, which can not be reproduced without permission.