For a long time, the optimization of fuzzy matching in search has been a headache. Fortunately, powerful PgSQL provides an optimization scheme. Here’s a brief talk about how to optimize fuzzy matching through index.
We have an inspection report form with tens of millions of data. We need to fuzzy search a certain condition through the inspection report. First, we create the following index:
CREATE INDEX lab_report_report_name_index ON lab_report USING btree (report_name);
After searching for a simple fuzzy matching condition such as LIKE “blood routine%”, we can find that the query plan is generated as follows, and the index is not used, because the traditional BTREE index does not support the fuzzy matching.
After consulting the document, it is found that PgSQL can specify operators on the Btree index: text_pattern_ops, varchar_pattern_ops and bpchar_pattern_ops, which correspond to field types text, varchar and char, respectively, and are officially interpreted as“The difference between them and the default operator class is that the comparison of values is strictly based on character rather than region-related sorting rules. This makes these operator classes suitable for queries involving pattern matching expressions (LIKE or POSIX regular expressions) when a database does not use the standard “C” region.” There’s some abstraction. Let’s try it first. Create the following index and query the condition LIKE “blood routine%” (refer to the PgSQL document https://www.postgresql.org/docs/10/indexes-opclass.html)
CREATE INDEX lab_report_report_name_index ON lab.lab_report (report_name varchar_pattern_ops);
It is found that indexed scanning can indeed be done, and the execution time is optimized from 213 MS to 125 Ms. However, if LIKE’% blood routine%’is searched, the whole table scanning will be done again! Here we introduce the protagonists of this blog, “pg_trgm” and “pg_bigm”.
Before creating these two indexes, we need to introduce the following two extension packages:
CREATE EXTENSION pg_trgm; CREATE EXTENSION pg_bigm;
The difference between the two indexes is that “pg_tigm” is the official index for pgsql, and “pg_tigm” is for Japanese developers. The following is a detailed comparison: (refer to the pg_bigm document http://pgbigm.osdn.jp/pg_bigm_en-1-2.html)
Comparison with pg_trgm
Thepg_trgmcontrib module which provides full text search capability using 3-gram (trigram) model is included in PostgreSQL. The pg_bigm was developed based on the pg_trgm. They have the following differences:
|Functionalities and Features||pg_trgm||pg_bigm|
|Phrase matching method for full text search||3-gram||2-gram|
|Available index||GIN and GiST||GIN only|
|Available text search operators||LIKE (~~), ILIKE (~~*), ~, ~*||LIKE only|
|Full text search for non-alphabetic language
|Not supported (*1)||Supported|
|Full text search with 1-2 characters keyword||Slow (*2)||Fast|
|Similarity search||Supported||Supported (version 1.1 or later)|
|Maximum indexed column size||238,609,291 Bytes (~228MB)||107,374,180 Bytes (~102MB)|
(*1) You can use full text search for non-alphabetic language by commenting out KEEPONLYALNUM macro variable in contrib/pg_trgm/pg_trgm.h and rebuilding pg_trgm module. But pg_bigm provides faster non-alphabetic search than such a modified pg_trgm.
(*2) Because, in this search, only sequential scan or index full scan (not normal index scan) can run.
pg_bigm 1.1 or later can coexist with pg_trgm in the same database, but pg_bigm 1.0 cannot.
If “pg_bigm” is recommended without special requirements, let’s test the effect:
Bitmap index scanning can be used. For this case, the effect of using pg_trgm is the same as that of pg_bigm.
This article is just a brief introduction to many details without in-depth analysis. Welcome to leave a comment or discussion.
Above is the whole content of this article. I hope the content of this article has some reference value for your study or work. Thank you for your support to developpaer.