Based on the full-text retrieval features of gaussdb (DWS), let’s have a look?

Time:2021-3-18

abstract: full text retrieval is a widely used feature in the Internet scene, which can be used in search engine, website search, e-commerce search and other scenes. Gaussdb (DWS) also supports full-text retrieval function, which is based on gin index. Let’s introduce the full-text retrieval function of gaussdb (DWS) in detail.

The function of full-text retrieval is to search the relevant information from the full-text field according to the keywords. When the full-text retrieval feature is not used, the fuzzy matching can only be done by the way of like ‘% keyword%’, the index can not be used, and the full table scan can only be carried out, which is very inefficient. The full-text retrieval feature can effectively improve the retrieval performance.

The basis of full-text retrieval is gin index, which is also called general inverted index. It is an index structure that stores the collection of key and posting list, in which key is a key value and posting list is a group of locations where keys have appeared. For example, in (‘Hello ‘, 2,3), it means that hello has appeared in 2 and 3.

Based on the full-text retrieval features of gaussdb (DWS), let's have a look?

Let’s look at a few interfaces first

to_tsvector

to_tsvector(text, text)

This function converts a text into a vector that holds words and the order in which they appear,

test=# SELECT to_tsvector(‘english’, ‘huawei cloud data warehouse‘);

            to_tsvector --------------------------------------------

‘cloud’:2 ‘data’:3 ‘huawei’:1 ‘warehous’:4 (1 row)

test=# SELECT to_ Tsvector (‘zhparser ‘,’huawei cloud data warehouse’);

        to_tsvector -----------------------------------

‘cloud’: 2 ‘warehouse’: 4 ‘Huawei’: 1 ‘data’: 3 (1 row)

to_tsquery

to_tsquery(text)

This function is used to convert text into a query able statement

SELECT to_ Tsquery (‘huawei & Shenzhen ‘);

In this way, we can find sentences that contain both “Huawei” and “Shenzhen”, & is and and operation, | is or or operation

If you want to know whether a tsvector makes tsquery work, you can use the @ operator, for example:

SELECT to_ Tsvector (‘zhparser ‘,’huawei Shenzhen’) @ @ to_ Tsquery (‘huawei & Shenzhen ‘);

This statement returns true

ts_rank

ts_rank(tsvector, tsquery)

This function can calculate the approximation degree of tsvector and tsquery. After rank is calculated by this function, it can be sorted

SELECT

ts_rank(

to_ Tsvector ('zhparser ','huawei Shenzhen'),

to_ Tsquery ('huawei & Shenzhen ')

);

After talking about these interfaces, let’s do some specific practices

1. Create database

CREATE DATABASE test ENCODING ‘utf8’ template = template0;

[note] full text retrieval must be performed on databases with utf8 or GBK encoding.

2. Create a table

CREATE TABLE t1(id int, news text, location text);

3. Import data

Insert into T1 values (1, ‘Huawei cloud gaussdb (DWS) data warehouse has passed the evaluation and certification of ICT Institute with 2048 large-scale nodes, and Huawei cloud data warehouse has become the first commercial data warehouse product with a single cluster breakthrough of 2000 in China’,’Beijing ‘);

Insert into T1 values (2),’gaussdb (DWS) is a leading enterprise cloud distributed data warehouse service in the industry after ten years of training ‘,’shenzhen’);

Insert into T1 values (3, ‘Huawei gaussdb (DWS) data warehouse helps CMB “everyone uses data and goes ahead with innovation. CMB customers said in Huawei forum that Lianchuang Laboratory of Huawei and CMB will further explore the practice of cloud data warehouse supporting OBS storage’,’shenzhen ‘);

Insert into T1 values (4, ‘digital intelligence finance enables innovation, Huawei digital intelligence Finance Forum 2020 is successfully held in Xicun’,’dongguan ‘);

Insert into T1 values (5, ‘Huawei cloud AI training camp Xi’an station: “modelarts pro – a new tool for Industry AI landing” theme salon grand opening’,’xi’an ‘);

4. Create index

CREATE INDEX t1_news_idx ON t1 USING gin(to_tsvector(‘zhparser’, news));

Execute query:

SELECT * FROM t1 WHERE to_ tsvector(‘zhparser’,news) @@ to_ Tsquery (‘huawei ‘);

Query results:

As long as the phrase “Huawei” is included in the news, it will be retrieved

5. Create a multi field joint index

CREATE INDEX t1_news_location_idx ON t1 USING gin(to_tsvector(‘zhparser’, news||location));

Implementation results:

  • Find two phrases at the same time

SELECT * FROM t1 WHERE to_ tsvector(‘zhparser’, news||location) @@ to_ Tsquery (‘huawei & Shenzhen ‘);

News and location will be retrieved as long as one field contains “Huawei” and “Shenzhen”

test=# SELECT * FROM t1 WHERE to_ tsvector(‘zhparser’, news||location) @@ to_ Tsquery (‘huawei & Shenzhen ‘);
id | news | location —-+———————————————————————————————————————————————-+———-
3 | Huawei’s Gaussian dB (DWS) data warehouse helps CMB to “use data for everyone and move forward with innovation”. CMB customers said in Huawei forum that Lianchuang Laboratory of Huawei and CMB will further explore the practice of cloud data warehouse supporting OBS storage | Shenzhen
(1 row)

  • The query contains one of the phrases

SELECT * FROM t1 WHERE to_ tsvector(‘zhparser’, news||location) @@ to_ Tsquery (‘cloud Shenzhen ‘);

Any one of news and location containing the words “cloud” and “Shenzhen” will be retrieved.

test=# SELECT * FROM t1 WHERE to_ tsvector(‘zhparser’, news||location) @@ to_ Tsquery (‘cloud Shenzhen ‘);
id | news | location —-+———————————————————————————————————————————————-+———-
5 | Huawei cloud AI training camp Xi’an station: the theme Salon of “modelarts pro – a new tool for Industry AI landing” opened grandly | Xi’an 3| Huawei’s gaussdb (DWS) data warehouse helps CMB to “use data for everyone and move forward in innovation”. CMB customers said in Huawei forum that Lianchuang Laboratory of Huawei and CMB will further explore the practice of cloud data warehouse supporting OBS storage | Shenzhen 1| Huawei cloud gaussdb (DWS) data warehouse has passed the evaluation and certification of ICT Institute with 2048 large-scale nodes. Huawei cloud data warehouse has become the first commercial data warehouse product with a single cluster breakthrough of 2000 in China. After ten years of training, Beijing 2 gaussdb (DWS) is the industry-leading enterprise cloud distributed data warehouse service |Shenzhen
(4 rows)

6. Sorting

  • Sort by the weight of a phrase contained in a column

SELECT id, news, location ,ts_ rank_ cd(to_ tsvector(‘zhparser’,news), query) AS rank FROM t1, to_ Tsquery (‘huawei ‘) query where query @ to_ tsvector(‘zhparser’,news) order by rank DESC;

Implementation results:

test=# SELECT id, news, location ,ts_ rank_ cd(to_ tsvector(‘zhparser’,news), query) AS rank FROM t1, to_ Tsquery (‘huawei ‘) query where query @ to_ tsvector(‘zhparser’,news) order by rank DESC;
id | news | location | rank —-+———————————————————————————————————————————————-+———-+——
3 | Huawei gaussdb (DWS) data warehouse helps CMB to “use data for everyone and move forward with innovation”. CMB customers said in Huawei forum that Lianchuang Laboratory of Huawei and CMB will further explore the practice of cloud data warehouse supporting OBS storage | Shenzhen | 3
1. Huawei cloud gaussdb (DWS) data warehouse has passed the evaluation and certification of ICT Institute with 2048 large-scale nodes. Huawei cloud data warehouse has become the first commercial data warehouse product with a single cluster breakthrough of 2000 in China | Beijing | 2
4. Digital finance enables innovation. Huawei digital Finance Forum 2020 is successfully held in Xicun. 1
5 | Huawei cloud AI training camp Xi’an station: the theme Salon of “modelarts pro – a new tool for Industry AI landing” grand opening | Xi’an | 1 (4 rows)

Search out the phrases containing “Huawei” and sort them according to the weight

  • Sort by the weight of multiple phrases in a column (including two phrases at the same time)

SELECT id, news, location,ts_ rank_ cd(to_ tsvector(‘zhparser’, news||location), query) AS rank FROM t1, to_ Tsquery (‘huawei & Shenzhen ‘) query where query @ to_ tsvector(‘zhparser’,news||location) order by rank DESC;

Implementation results:

test=# SELECT id, news, location,ts_ rank_ cd(to_ tsvector(‘zhparser’, news||location), query) AS rank FROM t1, to_ Tsquery (‘huawei & Shenzhen ‘) query where query @ to_ tsvector(‘zhparser’,news||location) order by rank DESC;
id | news | location | rank —-+———————————————————————————————————————————————-+———-+———–
3 | Huawei gaussdb (DWS) data warehouse helps CMB to “use data for everyone and move forward with innovation”. CMB customers said in Huawei forum that Lianchuang Laboratory of Huawei and CMB will further explore the practice of cloud data warehouse supporting OBS storage | Shenzhen | 0055556 (1 row)

Search out the phrases containing “Huawei” and “Shenzhen”, and sort them according to the weight

  • Sort by the weight of multiple phrases in a column (including one of two phrases)

SELECT id, news, location,ts_ rank_ cd(to_ tsvector(‘zhparser’, news||location), query) AS rank FROM t1, to_ Tsquery (‘cloud Shenzhen ‘) query where query @ to_ tsvector(‘zhparser’, news||location) order by rank DESC;

To view the execution results:

test=# SELECT id, news, location,ts_ rank_ cd(to_ tsvector(‘zhparser’, news||location), query) AS rank FROM t1, to_ Tsquery (‘cloud Shenzhen ‘) query where query @ to_ tsvector(‘zhparser’, news||location) order by rank DESC;
id | news | location | rank —-+———————————————————————————————————————————————-+———-+——
1. Huawei cloud gaussdb (DWS) data warehouse has passed the evaluation and certification of ICT Institute with 2048 large-scale nodes. Huawei cloud data warehouse has become the first commercial data warehouse product with a single cluster breakthrough of 2000 in China | Beijing | 2
2 | gaussdb (DWS) is a leading enterprise class cloud distributed data warehouse service in the industry after ten years of training | Shenzhen | 2
5 | Huawei cloud AI training camp Xi’an station: the theme Salon of “modelarts pro – a new tool for Industry AI landing” grand opening | Xi’an | 1
3 | Huawei gaussdb (DWS) data warehouse helps CMB to “use data for everyone and move forward with innovation”. CMB customers said in Huawei forum that Lianchuang Laboratory of Huawei and CMB will further explore the practice of cloud data warehouse supporting OBS storage | Shenzhen | 1 (4 rows)

Retrieve the records containing “cloud” or “Shenzhen”, and sort them according to the weight.

Through the above cases, I believe you have some understanding of the full-text retrieval of gaussdb (DWS). In fact, there are other usages of full-text retrieval, such as NGram segmentation, custom dictionary and so on. If you are interested, you can visit DWS product documents or ask questions in the community for more comprehensive answers.

This article is shared from the Huawei cloud community “preliminary study on the full text retrieval features of Gaussian dB (DWS)”, original author: DWS_ Jack 。

Click follow to learn about Huawei’s new cloud technology for the first time~

Recommended Today

Review of SQL Sever basic command

catalogue preface Installation of virtual machine Commands and operations Basic command syntax Case sensitive SQL keyword and function name Column and Index Names alias Too long to see? Space Database connection Connection of SSMS Connection of command line Database operation establish delete constraint integrity constraint Common constraints NOT NULL UNIQUE PRIMARY KEY FOREIGN KEY DEFAULT […]