Detailed explanation of Elasticsearch SQL

Time:2022-11-24

Elasticsearch SQL is an X-Pack component that allows users to query in ES using SQL-like syntax. Users can use SQL in REST, JDBC, and command lines to perform data retrieval and data aggregation operations on ES. ES SQL has the following characteristics:

  • Local integration, the SQL module is built by ES itself and directly integrated into the released version.
  • No external components are required, and the use of SQL modules does not require additional dependencies such as hardware, runtime libraries, etc.
  • Lightweight and efficient, the SQL module does not abstract ES and its search capabilities, but exposes the SQL interface, allowing proper full-text search in the same declarative, concise manner.

In the following content, we will learn the functions provided by the Elasticsearch SQL module based on ES 7.13.

If you are interested in ES, welcome to subscribe to my Elasticsearch booklet from entry to practice, let’s learn and progress together!

1. Use of Elasticsearch SQL

Before starting to use the functions provided by the SQL module, execute the following commands in kibana to create data:

PUT /library/_bulk?refresh
{"index":{"_id": "Leviathan Wakes"}}
{"name": "Leviathan Wakes", "author": "James S.A. Corey", "release_date": "2011-06-02", "page_count": 561}
{"index":{"_id": "Hyperion"}}
{"name": "Hyperion", "author": "Dan Simmons", "release_date": "1989-05-26", "page_count": 482}
{"index":{"_id": "Dune"}}
{"name": "Dune", "author": "Frank Herbert", "release_date": "1965-06-01", "page_count": 604}

After importing data, you can execute the following SQL to search for data:

POST /_sql?format=txt
{
  "query": "SELECT * FROM library WHERE release_date < '2000-01-01'"
}

As in the above example, use _sql to indicate the use of the SQL module, and specify the SQL statement to be executed in the query field. Use format to specify the format of the returned data. The data format options are as follows, all of which are self-explanatory:

format Accept Http header illustrate
csv text/csv comma separated
json application/json JSON format
tsv text/tab-separated-values tab separated
txt text/plain text format
yaml application/yaml yaml
cbor application/cbor compact binary object representation format
smile application/smile Another binary format similar to cbor

The results of the above SQL execution are as follows:
Detailed explanation of Elasticsearch SQL
More return formats, you can try it yourself.

In addition to directly executing SQL, you can also filter the results. Use the filter field to specify filter conditions in parameters. You can use standard ES DSL query statements to filter the results of SQL operations. Examples are as follows:

POST /_sql?format=txt
{
  "query": "SELECT * FROM library ORDER BY page_count DESC",
  "filter": {
    "range": {
      "page_count": {
        "gte" : 500,
        "lte" : 600
      }
    }
  },
  "fetch_size": 5
}

In the example above, the result is:

     author     |     name      |  page_count   |      release_date      
----------------+---------------+---------------+------------------------
James S.A. Corey|Leviathan Wakes|561            |2011-06-02T00:00:00.000Z

In addition, you can use the ‘?’ placeholder to pass parameters, and then assemble the parameters and statements into a complete SQL statement:

POST /_sql?format=txt
{
        "query": "SELECT YEAR(release_date) AS year FROM library WHERE page_count > ? AND author = ? GROUP BY year HAVING COUNT(*) > ?",
        "params": [300, "Frank Herbert", 0]
}

As in the example above, use the ‘?’ placeholder to pass parameters.

2. Conceptual mapping relationship between traditional SQL and Elasticsearch SQL

While SQL and Elasticsearch have different terms for how data is organized (and different semantics), essentially they serve the same purpose. The following is their mapping relationship table:

SQL Elasticsearch illustrate
column field In Elasticsearch fields, SQL calls such entries as columns. Note that in Elasticsearch, a field can contain multiple values ​​of the same type (essentially a list), whereas in SQL, a column can contain only one value representing a type. Elasticsearch SQL will do its best to preserve SQL semantics and, depending on the query, reject fields that return multiple values.
row document Columns and fields do not exist by themselves; they are part of the row or document. The semantics of the two are slightly different: rows tend to be strict (and have more enforcement), while documents tend to be more flexible or loose (while still having structure).
table index Whether to query in SQL or Elasticsearch against the target
schema implicit In relational databases, the schema is primarily a namespace for tables, often used as a security boundary. Elasticsearch doesn’t provide an equivalent concept for it.

Although the mapping between these concepts is somewhat different semantically, there are more commonalities than differences between them.

3. SQL Translate API

The SQL Translate API receives SQL statements in JSON format and converts them into ES DSL query statements, but this statement will not be executed. We can use this API to translate SQL to DSL statements. Examples are as follows:

POST /_sql/translate
{
  "query": "SELECT * FROM library ORDER BY page_count DESC",
  "fetch_size": 10
}

As in the above example, the translated DSL is as follows:

{
  "size": 10,
  "_source": false,
  "fields": [
    { "field": "author" },
    { "field": "name" },
    { "field": "page_count" },
    {
      "field": "release_date",
      "format": "strict_date_optional_time_nanos"
    }
  ],
  "sort": [
    {
      "page_count": {
        "order": "desc",
        "missing": "_first",
        "unmapped_type": "short"
      }
    }
  ]
}

4. Introduction to SQL syntax

Let’s learn about the SQL syntax and semantics provided by ES.

1. Lexical structure

The lexical structure of ES SQL is largely similar to ANSI SQL itself. ES SQL can only accept one command at a time, and the command here is a sequence of tokens terminated by the end of the input stream. These tokens can be keywords, identifiers (with or without quotes), text (or constants), special character symbols (usually delimiters).

  1. keywords
    Keyword This is actually the same as the definition of keywords that we write SQL statements, for example, SELECT, FROM, etc. are all keywords. It should be noted that keywords are not case-sensitive.

SELECT * FROM my_table

As in the above example, there are 4 tokens in total: SELECT,, FROM, my_table, where SELECT,, FROM is a keyword, which means a word with a fixed meaning in SQL. And my_table is an identifier, which represents entities in SQL, such as tables, columns, and so on.

It can be seen that keywords and identifiers have the same lexical structure, which are almost the same length in SQL, and sometimes it is difficult to distinguish. There are many keywords supported by ES SQL, which are not listed here. You can refer to the official documentation.

  1. identifier

There are two types of identifiers: quoted and unquoted, examples are as follows:

SELECT ip_address FROM "hosts-*"

As in the example above, there are two identifiers in the query, ip_address without quotes and hosts-* with quotes (wildcard pattern). Since ip_address does not conflict with any keywords, it can be unquoted. And hosts-* conflicts with – (minus sign operation) and *, so quotes are required.

For identifiers, you should try to avoid using complex names and names that conflict with keywords, and use quotation marks as identifiers when typing, which can eliminate ambiguity.

  1. direct constant

ES SQL supports two types of implicit type constants: strings and numbers.

  • String, string can be delimited by single quotes, for example: ‘mysql’. If a single quote is included in the string, it needs to be escaped with another single quote, for example: ‘Captain EO”s Voyage’.
  • Numerical constants, numeric constants can be represented using decimal and scientific notation, examples are as follows:
1969    -- integer notation
3.14    -- decimal notation
.1234   -- decimal notation starting with decimal point
4E5     -- scientific notation (with exponent marker)
1.2e-3  -- scientific notation with decimal point

A numeric constant containing a decimal point is interpreted as a Double. If it is suitable to be parsed as an integer, it is parsed as an Integer, otherwise it is parsed as a long integer (Long).

  1. single quotes, double quotes

In SQL, single quotes and double quotes have different meanings and cannot be used interchangeably. Single quotes are used to declare strings while double quotes are used to denote identifiers. Examples are as follows:

SELECT "first_name" FROM "musicians"  WHERE "last_name"  = 'Carroll'    

As in the example above, first_name, musicians, and last_name are all identifiers, enclosed in double quotes. And Carroll is a string, use single quotes.

  1. Special characters

Some non-numeric and alphabetical characters have special meanings different from operators, special characters are:

character describe
* In some contexts, it represents all the fields of the data table, and can also represent the parameters of some aggregate functions.
, elements used to enumerate lists
. Used for numeric constants or delimited identifier qualifiers (table, column, etc.)
() Used for specific SQL commands, function declarations, or to enforce precedence.
  1. operator

Most operators in ES SQL have the same precedence and are left associative. If you need to modify the priority, use parentheses to force the change of its priority. The following table is the operators supported by ES SQL and their precedence:

operator Combination illustrate
. left associative qualifier or separator
:: left associative PostgreSQL-style type conversion operators
+ – right associative unary plus and minus
* / % left associative multiplication, division, modulo
+ – left associative Addition and subtraction operations
BETWEEN IN LIKE range contains, character matches
< > <= >= = <=> <> != comparison operation
NOT right associative logical NOT
AND left associative logic and
OR left associative logical or
  1. note

ES SQL supports two kinds of comments: single-line and multi-line comments, examples of which are as follows:

-- single line comment, single line comment

/* multi
   line
   comment
   that supports /* nested comments */
   multiline comment
   */

2. SQL commands

Let’s introduce the SQL commands.

  1. DESCRIBE TABLE

Use this command to view the structure of the index, its syntax is as follows:

DESCRIBE
    [table identifier | 
    [LIKE pattern]]     

On line 1, the keyword DESCRIBE can be abbreviated to DESC.
Line 2, single table identifier or double quote ES multi-index mode.
Line 3, SQL Like matching pattern.

An example of using the DESCRIBE command is as follows:

DESCRIBE table;
  1. SELECT

This is actually very familiar to us. Use SELECT to return the columns to be displayed. The syntax is as follows:

SELECT [TOP [ count ] ] select_expr [, ...]
[ FROM table_name ]
[ WHERE condition ]
[ GROUP BY grouping_element [, ...] ]
[ HAVING condition]
[ ORDER BY expression [ ASC | DESC ] [, ...] ]
[ LIMIT [ count ] ]
[ PIVOT ( aggregation_expr FOR column IN ( value [ [ AS ] alias ] [, ...] ) ) ]

The syntax of using SELECT query in ES is basically the same as that in database, so it will not be described here.

  1. SHOW COLUMNS

Use the SHOW COLUMNS command to list all the columns, their types, and other attributes of the table. The syntax is as follows:

SHOW COLUMNS [ FROM | IN ]?
    [table identifier | 
    [LIKE pattern] ] 

An example of its use is as follows:

SHOW COLUMNS IN emp;
SHOW COLUMNS IN emp LIKE 'birth_da%'; // match columns starting with birth_da
  1. SHOW FUNCTIONS

Use SHOW FUNCTIONS to list all SQL-supported functions and their types. The LIKE clause matches the corresponding results, and the syntax is as follows:

SHOW FUNCTIONS [LIKE pattern?]? 

Examples are as follows:

SHOW FUNCTIONS;
SHOW FUNCTIONS LIKE 'ABS'; // exact match
SHOW FUNCTIONS LIKE 'A__'; // A '_' represents a character, so exactly match A + two characters, such as AVG, ABS.
SHOW FUNCTIONS LIKE '%DAY%'; // Match functions with DAY
  1. SHOW TABLES

We can use SHOW TABLES to view all tables (indexes in ES), the syntax is as follows:

SHOW TABLES
    [INCLUDE FROZEN]?   
    [table identifier | 
    [LIKE pattern ]]?   

A simple example is as follows:

SHOW TABLES;
SHOW TABLES "*,-l*"; // Match using ES multi-target syntax
SHOW TABLES LIKE 'emp'; // exact match
SHOW TABLES LIKE 'emp%'; // table matching emp + multiple characters
SHOW TABLES LIKE 'em_'; // table matching em + single character
SHOW TABLES LIKE '%em_'; // table matching multiple characters + em + single character
  1. index pattern
    ES SQL supports two types of pattern matching methods to match multiple indexes or tables: multi-index pattern and LIKE pattern.
  • multi-index mode
    Wildcard * or exclusive matching is supported, examples are as follows:

    SHOW TABLES "*,-l*";

    Multi-index mode is supported through ES multi-target syntax.

  • LIKE mode
    We are more familiar with this, and there are many examples above, so I won’t go into details here.

Five, ES SQL usage practice

Before using SQL, we first prepare the data, here we will use the flight data provided by Kibana:
Detailed explanation of Elasticsearch SQL

As shown above, in Kibana, click Overview under Analytics in the left column, select the Tab of Sample data in the pop-up page, and then click the add data button to add flight data.

The data structure of the flight data can be viewed using the following statement:

POST /kibana_sample_data_flights/_search
{
  "query": { "match_all": {} }
}

ok, let’s take a look at how to write commonly used SQL.

  1. WHERE

We filter out data destined for US:

POST /_sql?format=txt
{
  "query": "SELECT FlightNum, OriginWeather, OriginCountry, Carrier FROM kibana_sample_data_flights WHERE DestCountry = 'US'"
}

As in the above example, it will certainly be familiar to you who have used SQL for data query, and the final result is:
Detailed explanation of Elasticsearch SQL

  1. GROUP BY

You can use the GROUP BY statement to perform group aggregation and statistical operations on data, such as querying the average flight distance of flight groups. An example of this is as follows:

POST /_sql?format=txt
{
  "query": "SELECT count(*),max(DistanceMiles), avg(DistanceMiles) FROM kibana_sample_data_flights GROUP BY DestCountry"
}

As in the above example, we group by destination country, and then count the number of each group, the maximum flight distance, and the average flight distance. The result is as follows:
Detailed explanation of Elasticsearch SQL

  1. HAVING

HAVING can be used to perform secondary filtering on the grouped data, such as filtering data with more than 100 records in the grouping, the example is as follows:

POST /_sql?format=txt
{
  "query": "SELECT count(*),max(DistanceMiles), avg(DistanceMiles) FROM kibana_sample_data_flights GROUP BY DestCountry HAVING COUNT(*) > 100"
}

As in the above example, we filter out the data with more than 100 records in the group, and the results are as follows:

Detailed explanation of Elasticsearch SQL

  1. ORDER BY

We can use ORDER BY to sort, for example, to sort the average flight distance in descending order, the example is as follows:

POST /_sql?format=txt
{
  "query": "SELECT count(*),max(DistanceMiles), avg(DistanceMiles) as avgDistance FROM kibana_sample_data_flights GROUP BY DestCountry HAVING COUNT(*) > 100 ORDER BY avgDistance desc"
}

As in the above example, we sort the data by the average distance, and the result is:
Detailed explanation of Elasticsearch SQL

  1. paging

There are many ways to implement paging, and you can use limit, top, and fetch_size for paging.

1. limit paging operation

POST /_sql?format=txt
{
  "query": "SELECT FlightNum, OriginWeather, OriginCountry, Carrier FROM kibana_sample_data_flights WHERE DestCountry = 'US' limit 10"
}

2. Use top to paginate

POST /_sql?format=txt
{
  "query": "SELECT top 10 FlightNum, OriginWeather, OriginCountry, Carrier FROM kibana_sample_data_flights WHERE DestCountry = 'US'"
}

3. Use fetch_size for pagination

POST /_sql?format=txt
{
  "query": "SELECT FlightNum, OriginWeather, OriginCountry, Carrier FROM kibana_sample_data_flights WHERE DestCountry = 'US'",
  "fetch_size": 10
}
  1. subquery

ES SQL can support simple subqueries like SELECT X FROM (SELECT * FROM Y), examples are as follows:

POST /_sql?format=txt
{
  "query": "SELECT avg(data.DistanceMiles) from (SELECT FlightNum, OriginWeather, OriginCountry, Carrier, DistanceMiles FROM kibana_sample_data_flights WHERE DestCountry = 'US') as data"
}

It should be noted that subqueries that may be more complicated will not be supported. For more restrictions, please refer to the official documentation.

6. Summary

This article introduces the relevant knowledge of ES SQL in detail. Generally speaking, most of them are literally translated based on official documents. For more information on how to use ES SQL, please refer to the official documentation.