How to break fuzzy matching and similarity query? Look at the PG billion level retrieval millisecond response

Time:2021-7-29

Reading makes people precise, writing makes people agile.

Demand scenario assumptions
Aken heard a song in a coffee shop one day. He thought it was very nice, but he didn’t know the specific name of the song. He only knew that the song was “folk song”, and the lyrics included “a broken guitar”, “traveler”, and then a male singer.

If I want to collect this song, how can Xiaobian find the target song, corresponding singer and album for me?

Why discuss this problem
First of all, there are too many actual scenarios of search requirements, which can be seen everywhere, such as:

  • Baidu and Google input keyword search information;
  • Search the corresponding target audio and video files in the pan entertainment industry;
  • Face recognition, fingerprint verification, specific motion capture authentication, etc.

Secondly, in the era of Internet of things, there is a vast amount of data and information generated by the network. Efficient search engine technology is needed to help us quickly capture highly relevant matching information.

Therefore, if we don’t do technology research that is incompatible with the business application scenario, we might as well look at more technologies that fit the business.

Then, the technical problems discussed here are closely related to today’s and future business scenarios. The solution of relevant technical problems will certainly promote the implementation of applications.

Traditional solutions based on DB
If we are a student who has used relational database, we can easily think of SQL fuzzy search. For example:

  • Post fuzzy search through BTREE

select * from tab_aken where col like 'aken%';

  • Or the pre fuzzy search realized by BTREE reverse index

select * from tab_aken where col like '%aken';

However, many times, users do not know what prefix or suffix they need. They usually only remember some keywords or incomplete information, and then find the most similar target.

For example, witnesses only remember some characteristics of criminals: beard, scar on face, black jacket, obscene… And then lock a certain range of suspects according to keyword fuzzy query.

Therefore, in reality, more query requirements may be as follows:

  • Front and back blur

select * from tab_aken where col like '%aken%';

  • Multi field combination

select * from tab_aken where a=? and b=? or c=? and d=? or e between ? and ? and f in (?);

  • Similarity query, vector distance

select * from tab_aken order by similarity(col, 'aken') desc limit 100;

N query scenarios are omitted here··················

So the problem is that the database usually does not have the efficient retrieval ability of the fuzzy query before and after the above and more scenes after it.

Let’s take MySQL as an example to see a negative case.

On the 32c-64g-ssd high-end storage device, a mysql-5.7.27 is run here. In the example, there is a table with a data volume of about 800W:

mysql> select count(*) from test.tab_test_txt;

+----------+

| count(*) |

+----------+

|  8814848 |

+----------+

1 row in set (3.46 sec)

mysql>

mysql>select table_name,sum(truncate((data_length+index_length)/1024/1024/1024, 2)) data_GB,sum(truncate((data_length)/1024/1024/1024, 2)) tabsize_gb,sum(truncate((index_length)/1024/1024/1024, 2)) idxsize_gb from information_schema.tables where table_name

+--------------+---------+------------+------------+

| table_name   | data_GB | tabsize_gb | idxsize_gb |

+--------------+---------+------------+------------+

| tab_test_txt |    5.59 |       4.12 |       1.47 |

+--------------+---------+------------+------------+

1 row in set (0.00 sec)

mysql>

Then use the field name to do fuzzy query, hit more than 600W data, the performance is very good, and it takes 11.14 seconds.

mysql> SELECT name from test.tab_test_txt

WHERE (NOT (`tab_ test_ Txt `. ` name ` like binary '% test%' and ` tab_ test_ txt`.`name` IS NOT NULL)

AND NOT (`tab_ test_ Txt `. ` name ` like binary '%' and ` tab_ test_ txt`.`name` IS NOT NULL)

AND NOT (`tab_ test_ Txt `. ` name ` like binary '% monitor%' and ` tab_ test_ txt`.`name` IS NOT NULL)) ;

+--------------------------------------+

| name                                 |

+--------------------------------------+

|[n] [QQ wide fill]                        |

|[n] [k]                      |

|[newpc customer side connector [easy access]|

|            .............             |

|[conference information edit server]_ [upper]    |

|[conference information edit server]_ [di Bian]    | 

+--------------------------------------+

6578432 rows in set (11.14 sec)

In order to eliminate the influence of large results, let’s look at small results. The simplest query:

mysql> SELECT * from test.tab_ test_ txt    Where name like '% search test module%';

Empty set, 1 warning (11.86 sec)

mysql>

It can be seen that no matching results or single line records take 11.86 seconds, which is not different from the case of a large number of result sets, because they are all scanned.

mysql> insert into test.tab_ test_ Txt (ID, name) values (666666, 'MySQL full text retrieval test');

Query OK, 1 row affected, 0 warnings (0.59 sec)

mysql> select count(*) from test.tab_test_txt  where id = 666666;

+----------+

|  count(*)   |

+----------+

|        1   |

+----------+

1 row in set (0.00 sec)

mysql> SELECT * from test.tab_ test_ txt    Where name like '% SQL full%';

+--------+---------+--------+-----------+-------+------+-------+---------------------+---------------------+------------+----------------+----------------+--------------+--------+----------+--------------+------------+--------+--------+

| id     | name    | owners | parent_id | busid | uid  | level | update_date         | create_date         | limit_load | children_count | limit_low_load | history_load | status | group_id | _alarm_types | star_level | remark | enable |

+--------+---------+--------+-----------+-------+------+-------+---------------------+---------------------+------------+----------------+----------------+--------------+--------+----------+--------------+------------+--------+--------+

|666666 MySQL full null   |       NULL |    NULL | NULL |    NULL | 0000-00-00 00:00:00 | 0000-00-00 00:00:00 |           65 |                0 |               30 | NULL          |       0 |      21576 | NULL          |           0 | NULL    |       1 |

+--------+---------+--------+-----------+-------+------+-------+---------------------+---------------------+------------+----------------+----------------+--------------+--------+----------+--------------+------------+--------+--------+

1 row in set (12.69 sec)

mysql>

In fact, today’s search demand is increasingly developing towards socialization, real-time, situational, mobile and other trends, which will be mixed with more association relations, composite conditions and other factors, making query changes more complex.

For example, considering the social relationship, the above statement is slightly upgraded, which is an example of multi table Association:

SELECT `core_data`.`id` FROM `core_data`

INNER JOIN `core_extend` ON ( `core_data`.`ip` = `core_extend`.`ip` )

INNER JOIN `core_business` ON ( `core_extend`.`business_id` = `core_business`.`busid` )

WHERE (NOT (`core_ Business `. ` name ` like binary '% test%' and ` core_ business`.`name` IS NOT NULL)

 AND NOT (`core_ Business `. ` name ` like binary '%' and ` core_ business`.`name` IS NOT NULL)

 AND NOT (`core_ Business `. ` name ` like binary '% host%' and ` core_ business`.`name` IS NOT NULL)

 AND NOT (`core_extend`.`business_id` IN (789274, 879201, 1334489, ......,1121451, 1162547, 1168113, 1071955))

 AND `core_data`.`is_pushed` IS NULL

 AND `core_data`.`update_date` > '2020-12-03 23:47:00');

Let’s not discuss whether there is room for optimization in the writing of the above statements. This is a real example encountered in the production environment recently, which hits the weakness of MySQL multi table Association and text retrieval.

For the performance of MySQL in text retrieval, see the following article: MySQL full-text retrieval performance test and problem summary

Therefore, if there is a need for full-text retrieval in the application, running similar queries directly in the database may not be far away. If the data volume and performance requirements are slightly higher, it can not be supported, and there is no need to talk about massive real-time and high concurrency.

Industry wide solutions
Text retrieval is usually the specialty of search engines. Therefore, the popular general scheme in the industry is to synchronize data to professional search engine systems to support users’ search needs, such as using ES clusters.

However, the scheme also introduces the problem of cross product interaction, so there will be problems of data synchronization delay and data consistency. There are still many challenges for scenes with high real-time and consistency requirements.

In addition, the function of search engine is not perfect in similarity retrieval, regular expression and fuzzy query.

At present, ES cluster can achieve the millisecond level in the accurate query of static text retrieval, but for the above demand scenario with multi table Association and text search, ES obviously can not solve it. If the data will change frequently, it may be necessary to consider integrating other schemes.

So, is there a search engine scheme that can efficiently realize data retrieval without losing the inherent attributes of the database?

Can’t we not only realize text retrieval efficiently, but also deal with transaction processing, data change, association query, similarity matching and complex regular expressions efficiently?

Wait·······

Exploration of new scheme
Here we try to explore a search engine scheme based on DB. If it can support efficient text retrieval and still maintain the inherent attributes of the database, it is a friendly alternative, especially for companies that do not want to consume additional technical costs to build a special search engine, so they do not need to maintain an additional technical component.

This involves features such as word segmentation algorithm, vector distance calculation, fuzziness and regularity. PostgreSQL, as the most versatile open source database product, has these features for a long time. For example:

  • Word segmentation
akendb=# select to_tsvector('english', 'akengan-love-db,oracle mysql postgresql');

 to_tsvector 

--------------------------------------------------------------------------------

 'akengan':2 'akengan-love-db':1 'db':4 'love':3 'mysql':6 'oracl':5 'postgresql':7

(1 row)

akendb=#

akendb=# select show_trgm('hello');

 show_trgm 

---------------------------------

 {"  h"," he",ell,hel,llo,"lo "}

(1 row) akendb=#
  • Similarity calculation
akendb=# SELECT smlar('{5,2,0}'::int[], '{5,2,8}');   -- Calculate the similarity of two arrays

 smlar 

-----------

 0.6666667

(1 row)

akendb=#

akendb=# SELECT word_ similarity('aken', 'akengan');    -- Calculate the similarity of two words

 word_similarity

-----------------

 0.8

(1 row)

akendb=#
  • Regular matching

select * from tab_ account where email ~ ‘^[A-H]’; — Query email addresses starting with A-H
Therefore, we might as well take a look at the ability of PostgreSQL in full-text retrieval.

  • Create test table

Partition table, 64 partitions in total, parallelism 64.

akendb=# d+ tab_aken_text

 Unlogged table "public.tab_aken_text"

 Column |  Type   | Collation | Nullable | Default | Storage  | Stats target | Description

--------+---------+-----------+----------+---------+----------+--------------+-------------

 id     | integer |           | not null |         | plain    |              |

 info   | text    |           |          |         | extended |              |

Indexes:

 "tab_aken_text_pkey" PRIMARY KEY, btree (id)

 "idx_tab_aken_text_info" gin (info gin_trgm_ops)

Child tables: tab_aken_text0,

 tab_aken_text1,

 tab_aken_text10,

 tab_aken_text11,

 tab_aken_text12,

 tab_aken_text13,

 tab_aken_text14,

 tab_aken_text15,

 tab_aken_text16,

 tab_aken_text17,

 tab_aken_text18,

 tab_aken_text19,

 tab_aken_text2,

 tab_aken_text20,

 tab_aken_text21,

 tab_aken_text22,

 tab_aken_text23,

 tab_aken_text24,

 tab_aken_text25,

 tab_aken_text26,

 tab_aken_text27,

 tab_aken_text28,

 tab_aken_text29,

 tab_aken_text3,

 tab_aken_text30,

 tab_aken_text31,

 tab_aken_text32,

 tab_aken_text33,

 tab_aken_text34,

 tab_aken_text35,

 tab_aken_text36,

 tab_aken_text37,

 tab_aken_text38,

 tab_aken_text39,

 tab_aken_text4,

 tab_aken_text40,

 tab_aken_text41,

 tab_aken_text42,

 tab_aken_text43,

 tab_aken_text44,

 tab_aken_text45,

 tab_aken_text46,

 tab_aken_text47,

 tab_aken_text48,

 tab_aken_text49,

 tab_aken_text5,

 tab_aken_text50,

 tab_aken_text51,

 tab_aken_text52,

 tab_aken_text53,

 tab_aken_text54,

 tab_aken_text55,

 tab_aken_text56,

 tab_aken_text57,

 tab_aken_text58,

 tab_aken_text59,

 tab_aken_text6,

 tab_aken_text60,

 tab_aken_text61,

 tab_aken_text62,

 tab_aken_text63,

 tab_aken_text7,

 tab_aken_text8,

 tab_aken_text9

Access method: heap

Options: parallel_workers=64

akendb=#
  • Test data accuracy

In order to simulate the massive data scene, 1 billion random 64 character Chinese text data are inserted.

akendb=# select count(*)  from tab_aken_text;

 count 

------------

 1000000000

(1 row)

akendb=#

akendb=# select * from tab_aken_text limit 3;

 id    |                                                                                               info 

---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

 27993938: it's a doubt that the two sides of the two parties have a lot of ideas, and the people's idea is that the people's idea is that the people's idea is that the people's idea is that the people's idea is that the people's idea is that the people's idea is that the people's idea is that the people's idea, the people's idea, the people's idea, the people's idea, the people's idea, the people's idea, the people's idea, the people's idea, the people's idea, the people's idea, the people's idea, the people's idea, the people's idea, the people's idea, the people's idea, the people's idea, the people's idea, the people's idea, the people, the people's idea, the people, the people, the people, the people, the people, the people, the people, the people, the people, the people, the people, the people, the people, the people, the I'm talking about you in the garden

 The 27999939 is a great way to make a difference and make a difference. It's a great way to make a difference. It's a great way to make a difference. It's a great way to make a difference. It's a great way to make a difference. It's a great way to make a difference. It is a great way to make a difference. It is a great way to make a difference. It is a great way to make a difference. It is a great way to make a difference. It is a great way to make a difference. It is also a great way to make a difference. It is a great way to make a difference. It is also a great way to make a difference. It is a long way to make a big deal. It is an indicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicaindicawhat's the matter with the fire? It's not easy to get excited

 The 27999440 is a young man who is going to make a lot of money in the account book. He is going to make a lot of money in the account book. He is going to make a lot of money in the account book. He is going to make a lot of money in the account book. He is going to make a lot of money in the book book. He is going to make a lot of money in the account book. He is going to make a lot of money in the account book. He is going to make a lot of money in the account book. He is going to consider the idea of making a lot of money, making a lot of money, making a lot of money, making a lot of money, making a lot of money, making a lot of money, making a lot of money, making a lot of people, making a lot of money, making a lot of people, making a lot of money, making a lot of money, making a lot of money, making a lot of money, making a lot of money, making a lot of money, making a lot of money, making a lot of money, making a lot of hook locking ring

(3 rows)

akendb=#

To facilitate the test, the text information of the line id = 10325230 is modified into the target information:

update tab_ aken_ Text6 set info = 'search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test' where id = 10325230;
  • Query performance verification

1. Fuzzy query before and after full-text retrieval

It can be seen that with a data volume of 1 billion, fuzzy query takes only 9.899 Ms.

akendb=# select   *  from tab_ aken_ Text where info like '% fuzzy match%';

 id    |  info 

----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

 10325230 | search engine full-text retrieval fuzzy matching test search engine full-text retrieval fuzzy matching test search engine full-text retrieval fuzzy matching test search engine full-text retrieval fuzzy matching test

(1 row)

Time: 9.899 ms

akendb=#

2. Text retrieval similarity query

Prepare 10 target test data in 1 billion data

update tab_ aken_ Text6 set info = 'search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test search 80% test target' where id = 2400002;

update tab_ aken_ Text6 set info = 'search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test search 80% test target' where id = 2500002;

update tab_ aken_ Text6 set info = 'search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test search 80% test target' where id = 2600002;

update tab_ aken_ Text6 set info = 'search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test search 80% test target' where id = 2200002;

update tab_ aken_ Text6 set info = 'search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test search engine full-text search module 90% target' where id = 2000002;

update tab_ aken_ Text6 set info = 'search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test search engine full-text search module 90% target' where id = 2700002;

update tab_ aken_ Text6 set info = 'search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test search engine full-text search module 90% target' where id = 2800002;

update tab_ aken_ Text6 set info = 'search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test search engine full-text search fuzzy matching 95' where id = 2100002;

update tab_ aken_ Text6 set info = 'search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test search engine full-text search fuzzy matching 95' where id = 2300002;

update tab_ aken_ Text6 set info = 'search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test search engine full-text search fuzzy matching 95' where id = 2900002;

It takes 61.175 MS to query the target data with more than 50% similarity among 1 billion data, and no consumption of CPU and other resources is observed.

akendb=# select set_limit(0.5);

 set_limit

-----------

 0.5

(1 row)

Time: 0.283 ms

Akendb = #select similarity (info, 'search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test') as SML*

from tab_aken_text 

Where info% 'search engine full-text retrieval fuzzy matching test search engine full-text retrieval fuzzy matching test search engine full-text retrieval fuzzy matching test search engine full-text retrieval fuzzy matching test'    --  Similarity exceeds threshold 

order by sml desc;

 sml     |    id    |                                                                                         info 

------------+----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

 1 | 10325230 | test of search engine full-text retrieval fuzzy matching test of search engine full-text retrieval fuzzy matching test of search engine full-text retrieval fuzzy matching test of search engine full-text retrieval fuzzy matching

 0.77272725 |    2900002 | search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test search engine full-text search fuzzy matching 95

 0.77272725 |    2100002 | search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test search engine full-text search fuzzy matching 95

 0.77272725 |    2300002 | search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test search engine full-text search fuzzy matching 95

 0.68 |    2800002 | search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test search engine full-text search module 90% target

 0.68 |    2000002 | search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test search engine full-text search module 90% target

 0.68 |    2700002 | search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test search engine full-text search module 90% target

 0.56666666 |    2200002 | search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test search 80% test target

 0.56666666 |    2400002 | search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test search 80% test target

 0.56666666 |    2500002 | search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test search 80% test target

(10 rows)

Time: 61.175 ms

akendb=#

If the query similarity exceeds 90%, the higher the accuracy requirements, the faster the query performance. As follows, the query of 1 billion data takes only 29.020 Ms.

akendb=# select set_limit(0.9);

 set_limit

-----------

 0.9

(1 row)

Time: 0.239 ms

Akendb = #select similarity (info, 'search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test search engine full-text search fuzzy matching test') as SML*

from tab_aken_text 

Where info% 'search engine full-text retrieval fuzzy matching test search engine full-text retrieval fuzzy matching test search engine full-text retrieval fuzzy matching test search engine full-text retrieval fuzzy matching test' 

order by sml desc limit 10;

 sml |    id    |                                                                                         info 

-----+----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

 1 | 10325230 | test of search engine full-text retrieval fuzzy matching test of search engine full-text retrieval fuzzy matching test of search engine full-text retrieval fuzzy matching test of search engine full-text retrieval fuzzy matching

(1 row)

Time: 29.020 ms

akendb=#

Therefore, using PostgreSQL can achieve millisecond retrieval of more than 1 billion level data. In fact, in the case of 10 billion level data, the performance of PostgreSQL text retrieval can still be comparable to that of 1 billion level, which can fully meet the needs of real-time search engines. Moreover, one advantage of using PostgreSQL is that various functions can be expanded as needed.

reference material
1.https://www.postgresql.org/do…
2.https://github.com/eulerto/pg…

For more wonderful content, please pay attention to the following platforms and websites:

Official account of China PostgreSQL branch (technical articles and technical activities):
PostgreSQL branch of Open Source Software Alliance

Technical Q & a community of China PostgreSQL branch:
www.pgfans.cn

Official website of China PostgreSQL branch:
www.postgresqlchina.com

Recommended Today

Implementation example of go operation etcd

etcdIt is an open-source, distributed key value pair data storage system, which provides shared configuration, service registration and discovery. This paper mainly introduces the installation and use of etcd. Etcdetcd introduction etcdIt is an open source and highly available distributed key value storage system developed with go language, which can be used to configure sharing […]