[MySQL source code analysis] MySQL lexical analysis

Time:2020-10-31

Preface

Recently I have been studying the MySQL source code in depth. I have split the material into several topics, including lexical parsing, syntax parsing, query execution, and the optimizer. While preparing slides on these topics, I am writing each one up as a separate article.

MySQL version: 8.0.20

Debugging tool: lldb

System environment: MacOS 10.14.3

Before looking at lexical analysis, let us start with several questions:

(1) What is lexical analysis?

(2) What optimizations does MySQL 8.0.20 make to lexical parsing?

(3) What is the lexical parsing process in MySQL 8?

1. What is lexical analysis?

In computer science, lexical analysis is the process of converting a sequence of characters into a sequence of tokens. The program or function that performs lexical analysis is called a lexer, also known as a scanner. A lexer usually takes the form of a function that is called by the parser.

Lexical analysis is the first stage of compilation and the foundation for the later stages. Its task is to read the source program character by character from left to right, that is, to scan the character stream of the source program and recognize words (also called tokens or symbols) according to the word formation rules. The program that carries out this task is the lexical analyzer, and it can be generated automatically with tools such as lex.

In short, lexical analysis is the first and a required stage of a compiler. Its core task is to scan and identify words, then attach a kind and a fixed-length representation to each identified word. A lexical analyzer is commonly either generated automatically or written by hand.

For automatically generating a lexer, see an article I wrote earlier:
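
To make the idea of turning a character sequence into a token sequence concrete, here is a minimal hand-written lexer sketch. It is not MySQL code: the names Kind, Token, and tokenize are made up for this illustration, and the rules (words, digit runs, one-character symbols) are deliberately simpler than SQL.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical token kinds for this sketch; MySQL uses bison-generated ids.
enum class Kind { Word, Number, Symbol };

struct Token {
  Kind kind;
  std::string text;
};

// Scan the input from left to right and group characters into tokens.
std::vector<Token> tokenize(const std::string &input) {
  std::vector<Token> tokens;
  std::size_t i = 0;
  while (i < input.size()) {
    unsigned char c = static_cast<unsigned char>(input[i]);
    if (std::isspace(c)) {  // whitespace separates tokens and is dropped
      ++i;
    } else if (std::isalpha(c) || c == '_') {  // word: letters, digits, '_'
      std::size_t start = i;
      while (i < input.size() &&
             (std::isalnum(static_cast<unsigned char>(input[i])) ||
              input[i] == '_'))
        ++i;
      tokens.push_back({Kind::Word, input.substr(start, i - start)});
    } else if (std::isdigit(c)) {  // number: a run of digits
      std::size_t start = i;
      while (i < input.size() &&
             std::isdigit(static_cast<unsigned char>(input[i])))
        ++i;
      tokens.push_back({Kind::Number, input.substr(start, i - start)});
    } else {  // anything else is a one-character symbol
      tokens.push_back({Kind::Symbol, std::string(1, input[i])});
      ++i;
    }
  }
  return tokens;
}
```

Calling tokenize("select * from t1;") yields the five tokens select, *, from, t1, and ;. A parser would then consume these tokens one at a time instead of raw characters.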
https://blog.csdn.net/byxiaoyuonly/article/details/107851764

2. Lexical analysis

2.1 Lexical parsing state machine

The lexical parsing state machine runs during the scanning phase of lexical analysis. Figure 2-1-1 shows how the state machine processes a token.
Figure 2-1-1: Token process of the state machine

The main purpose of the state machine is to drive the processing of each token. For example, the MY_LEX_IDENT state loops over the matching characters, parses them, and returns the corresponding token.

States of the state machine and their meaning:
MY_LEX_START: Start parsing a token
MY_LEX_CHAR: Parse a single character, such as *, : or ;
MY_LEX_IDENT: Parse an identifier and match keywords such as "table" and "select"
MY_LEX_IDENT_SEP: Found an identifier followed by the character '.'
MY_LEX_IDENT_START: Parse the token that starts after '.'
MY_LEX_REAL: Incomplete real number
MY_LEX_HEX_NUMBER: Hex string
MY_LEX_BIN_NUMBER: Binary string
MY_LEX_CMP_OP: Incomplete comparison operator
MY_LEX_LONG_CMP_OP: Incomplete comparison operator
MY_LEX_STRING: Character string
MY_LEX_COMMENT: Comment
MY_LEX_END: End
MY_LEX_NUMBER_IDENT: Number
MY_LEX_INT_OR_REAL: Complete integer or incomplete real number
MY_LEX_REAL_OR_POINT: Parse '.', which is either the start of an incomplete real number or the character '.' itself
MY_LEX_BOOL: Boolean operator (reached from & or |)
MY_LEX_EOL: End of line; if the input is at EOF, the next state is set to MY_LEX_END
MY_LEX_LONG_COMMENT: Long comment
MY_LEX_END_LONG_COMMENT: End of a long comment
MY_LEX_SEMICOLON: The separator ;
MY_LEX_SET_VAR: Check for :=
MY_LEX_USER_END: End at '@'
MY_LEX_HOSTNAME: Parse a hostname
MY_LEX_SKIP: Whitespace
MY_LEX_USER_VARIABLE_DELIMITER: Quote character
MY_LEX_SYSTEM_VAR: Parse a system variable reference
MY_LEX_IDENT_OR_KEYWORD: Return an identifier or a keyword token
MY_LEX_IDENT_OR_HEX: Identifier or hex literal
MY_LEX_IDENT_OR_BIN: Identifier or binary literal
MY_LEX_IDENT_OR_NCHAR: Identifier or national character string
MY_LEX_STRING_OR_DELIMITER: Switch to the string state or to the delimiter state

2.2 Debugging and reading the source code

We can step through the source code together. If you have not compiled and installed MySQL from source yet, see my previous article:
Why MySQL sometimes selects wrong index and cost calculation
https://blog.csdn.net/byxiaoyuonly/article/details/107651106

To start debugging, first start MySQL 8.0.20. Then open two terminals, one for running MySQL statements and one for debugging, as shown in figure 2-2-1.
Figure 2-2-1: Open terminals
You can use lldb for debugging:

# lldb -p <process ID>

Figure 2-2-2: Lexical parsing call process
According to figure 2-2-2, MySQL 8.0.20 calls the MYSQLlex method for lexical parsing, and MYSQLlex calls lex_one_token to parse a single token. To debug the lexer, set a breakpoint on lex_one_token.

(lldb) b lex_one_token

After the breakpoint is set, run a statement in the MySQL terminal, such as "select * from t1;". The debugging terminal then stops at the breakpoint, as shown in figure 2-2-3.
Figure 2-2-3: Debugging diagram
According to the figure above, the first state is MY_LEX_START. After entering the switch, the lexer reads a character with the yyPeek method, as shown in figure 2-2-4, and checks whether it is whitespace. Once a non-whitespace character is found, the statement "state = state_map[c];" selects the next state. The lookup table state_map is described in Section 2.3.
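
The peek, skip, and get steps can be sketched with a small cursor class. This is only an illustration under assumptions: MiniStream and start_token are hypothetical names, not part of MySQL, and the real Lex_input_stream tracks more pointers than a single position.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in for Lex_input_stream: a cursor over the query text.
class MiniStream {
 public:
  explicit MiniStream(std::string text) : text_(std::move(text)) {}

  // Look at the current character without consuming it ('\0' at the end).
  unsigned char peek() const {
    return pos_ < text_.size() ? static_cast<unsigned char>(text_[pos_]) : 0;
  }
  // Return the current character and advance the cursor.
  unsigned char get() {
    unsigned char c = peek();
    if (pos_ < text_.size()) ++pos_;
    return c;
  }
  // Consume the current character without returning it.
  void skip() {
    if (pos_ < text_.size()) ++pos_;
  }
  // Step back one character so it can be read again.
  void unget() {
    if (pos_ > 0) --pos_;
  }
  // Remaining unread text, comparable to inspecting m_ptr in the debugger.
  std::string rest() const { return text_.substr(pos_); }

 private:
  std::string text_;
  std::size_t pos_ = 0;
};

// Mirrors the MY_LEX_START step: skip whitespace, then read the first
// character of the next token.
unsigned char start_token(MiniStream &in) {
  while (in.peek() == ' ' || in.peek() == '\t' || in.peek() == '\n') in.skip();
  return in.get();
}
```

For the input "  select * from t1;", start_token skips the two leading spaces and returns 's'; that character would then be used as the index into the state map.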

Figure 2-2-4: MY_LEX_START debugging chart
Because the character obtained is s, and s maps to MY_LEX_IDENT in state_map, the lexer enters the MY_LEX_IDENT state, which matches the corresponding keyword and returns a token. The first keyword matched is select.

Figure 2-2-5: MY_LEX_IDENT debug chart
According to figure 2-2-5, the find_keyword method matches the corresponding token. After "select" is matched for the first time we get token 748, which corresponds to SELECT_SYM and can be found in /mysql-8.0.20/sql/sql_yacc.h. At this point the value of m_ptr is "* from t1"; the pointer was moved back by one character by the call to lip->yyUnget() before returning. lip->next_state is set to MY_LEX_START again.
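
The keyword match can be sketched as a word scan followed by a table lookup. This is a simplified illustration, not the real find_keyword: scan_word and IDENT_TOKEN are made-up names, the hash map replaces MySQL's generated keyword lookup, and the ids 748 and 452 are simply the values observed in this 8.0.20 debugging session.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <unordered_map>

// Illustrative values: 748 and 452 are the ids seen in the 8.0.20 debugging
// session; IDENT_TOKEN is a made-up id for this sketch.
constexpr int SELECT_TOKEN = 748;
constexpr int FROM_TOKEN = 452;
constexpr int IDENT_TOKEN = 1000;

// Reads one word starting at pos, leaves pos on the first character after
// the word, and returns the keyword id or IDENT_TOKEN.
int scan_word(const std::string &sql, std::size_t &pos) {
  static const std::unordered_map<std::string, int> keywords = {
      {"select", SELECT_TOKEN}, {"from", FROM_TOKEN}};
  std::string word;
  while (pos < sql.size() &&
         (std::isalnum(static_cast<unsigned char>(sql[pos])) ||
          sql[pos] == '_')) {
    // SQL keywords are case-insensitive, so normalize before the lookup.
    word += static_cast<char>(
        std::tolower(static_cast<unsigned char>(sql[pos])));
    ++pos;
  }
  auto it = keywords.find(word);
  return it != keywords.end() ? it->second : IDENT_TOKEN;
}
```

After scanning "select", pos rests on the character that ended the word, which plays the same role as the yyUnget call: the terminating character is left in the input for the next token.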
Figure 2-2-6: MY_LEX_END_LONG_COMMENT debug chart
When lex_one_token is called again, the MY_LEX_START state first skips a space character. It then reads the "*" character and sets the state to MY_LEX_END_LONG_COMMENT, after which the state is set to MY_LEX_CHAR; on return, the next state is set to MY_LEX_START. The token returned is 42, which is simply the ASCII code of "*". At this point the value of m_ptr is "from t1". Running MY_LEX_START again sets the state to MY_LEX_IDENT, and the MY_LEX_IDENT state returns token 452, which can be found in /mysql-8.0.20/sql/sql_yacc.h and corresponds to from. The next call returns IDENT_QUOTED, then the lexer reaches the MY_LEX_EOL state, and finally MY_LEX_END.
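
The reason a single character can be returned as a token is that its character code doubles as the token id. A minimal sketch, assuming the usual yacc/bison convention that named tokens are numbered above 255; char_token and is_char_token are hypothetical helpers, not MySQL functions.

```cpp
#include <cassert>

// Hypothetical helper: a single-character token is its own character code.
constexpr int char_token(char c) { return static_cast<unsigned char>(c); }

// Named tokens from the grammar are numbered above 255, so any id in the
// range 1..255 can be read back as a literal character.
constexpr bool is_char_token(int token) { return token > 0 && token < 256; }

static_assert(char_token('*') == 42, "'*' is 42 in ASCII");
```

With this convention, 42 is recognized as the character "*", while 748 and 452 from the debugging session fall outside the character range and are treated as named tokens.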
Figure 2-2-7: Complete debugging and parsing process

2.3 Introduction to state_map

state_map is the key to selecting the state for each character. It is initialized in the init_state_maps method in /mysql-8.0.20/mysys/sql_chars.cc, which is implemented as follows:

bool init_state_maps(CHARSET_INFO *cs) {
  uint i;
  uchar *ident_map;
  enum my_lex_states *state_map = nullptr;

  lex_state_maps_st *lex_state_maps = (lex_state_maps_st *)my_once_alloc(
      sizeof(lex_state_maps_st), MYF(MY_WME));

  if (lex_state_maps == nullptr) return true;  // null pointer: OOM

  cs->state_maps = lex_state_maps;
  state_map = lex_state_maps->main_map;

  if (!(cs->ident_map = ident_map = (uchar *)my_once_alloc(256, MYF(MY_WME))))
    return true;  // OOM

  hint_lex_init_maps(cs, lex_state_maps->hint_map);

  /*Fill state for faster parsers*/
  for (i = 0; i < 256; i++) {
    if (my_isalpha(cs, i))
      state_map[i] = MY_LEX_IDENT;  // identifier state
    else if (my_isdigit(cs, i))
      state_map[i] = MY_LEX_NUMBER_IDENT;
    else if (my_ismb1st(cs, i))
      /* To get whether it's a possible leading byte for a charset. */
      state_map[i] = MY_LEX_IDENT;
    else if (my_isspace(cs, i))
      state_map[i] = MY_LEX_SKIP;  // whitespace state
    else
      state_map[i] = MY_LEX_CHAR;  // single-character state
  }
  state_map[(uchar)'_'] = state_map[(uchar)'$'] = MY_LEX_IDENT;
  state_map[(uchar)'\''] = MY_LEX_STRING;
  state_map[(uchar)'.'] = MY_LEX_REAL_OR_POINT;
  state_map[(uchar)'>'] = state_map[(uchar)'='] = state_map[(uchar)'!'] =
      MY_LEX_CMP_OP;  // comparison operator state
  state_map[(uchar)'<'] = MY_LEX_LONG_CMP_OP;
  state_map[(uchar)'&'] = state_map[(uchar)'|'] = MY_LEX_BOOL;
  state_map[(uchar)'#'] = MY_LEX_COMMENT;
  state_map[(uchar)';'] = MY_LEX_SEMICOLON;
  state_map[(uchar)':'] = MY_LEX_SET_VAR;
  state_map[0] = MY_LEX_EOL;  // end-of-input marker
  state_map[(uchar)'/'] = MY_LEX_LONG_COMMENT;
  state_map[(uchar)'*'] = MY_LEX_END_LONG_COMMENT;  // state for '*'
  state_map[(uchar)'@'] = MY_LEX_USER_END;  // state for '@'
  state_map[(uchar)'`'] = MY_LEX_USER_VARIABLE_DELIMITER;
  state_map[(uchar)'"'] = MY_LEX_STRING_OR_DELIMITER;

  /*
    Create a second map to speed up finding identifiers
  */
  for (i = 0; i < 256; i++) {
    ident_map[i] = (uchar)(state_map[i] == MY_LEX_IDENT ||
                           state_map[i] == MY_LEX_NUMBER_IDENT);
  }

  /* Special handling of hex and binary strings */
  state_map[(uchar)'x'] = state_map[(uchar)'X'] = MY_LEX_IDENT_OR_HEX;
  state_map[(uchar)'b'] = state_map[(uchar)'B'] = MY_LEX_IDENT_OR_BIN;
  state_map[(uchar)'n'] = state_map[(uchar)'N'] = MY_LEX_IDENT_OR_NCHAR;

  return false;
}

The code can select a state quickly because the state map is initialized in advance, so each character maps directly to its state. The state enum (my_lex_states) is defined in mysql-8.0.20/include/sql_chars.h.

2.4 Source code analysis

Analysis of the key function lex_one_token:
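
The table-driven idea can be reproduced in a few lines. The sketch below is a reduced illustration, not the MySQL implementation: the State enum keeps only a handful of states, build_state_map and classify are hypothetical names, and the standard C character classification functions stand in for the charset-aware my_isalpha, my_isdigit, and my_isspace.

```cpp
#include <array>
#include <cassert>
#include <cctype>

// Hypothetical, reduced set of states for this sketch.
enum class State {
  Ident,
  NumberIdent,
  Skip,
  Char,
  Eol,
  Semicolon,
  EndLongComment
};

// Build the 256-entry table once; afterwards each lookup is one array index.
std::array<State, 256> build_state_map() {
  std::array<State, 256> map{};
  for (int i = 0; i < 256; ++i) {
    if (std::isalpha(i))
      map[i] = State::Ident;
    else if (std::isdigit(i))
      map[i] = State::NumberIdent;
    else if (std::isspace(i))
      map[i] = State::Skip;
    else
      map[i] = State::Char;
  }
  // Overrides for characters with a special meaning, as in init_state_maps.
  map['_'] = map['$'] = State::Ident;
  map[';'] = State::Semicolon;
  map['*'] = State::EndLongComment;
  map[0] = State::Eol;
  return map;
}

// Classify one character; the cast avoids negative indexes for bytes >= 0x80.
State classify(const std::array<State, 256> &map, char c) {
  return map[static_cast<unsigned char>(c)];
}
```

The design trades a small amount of memory (one entry per byte value) for a constant-time lookup, which is why the lexer does not need a chain of character comparisons at the start of each token.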

static int lex_one_token(Lexer_yystype *yylval, THD *thd) {
  uchar c = 0;
  bool comment_closed;
  int tokval, result_state;
  uint length;
  enum my_lex_states state;
  Lex_input_stream *lip = &thd->m_parser_state->m_lip;  // input stream
  const CHARSET_INFO *cs = thd->charset();              // character set
  const my_lex_states *state_map = cs->state_maps->main_map;  // state map
  const uchar *ident_map = cs->ident_map;  // marks identifier characters
  lip->yylval = yylval;  // token value shared with the parser

  lip->start_token();              // initialize the token string
  state = lip->next_state;         // get the state saved by the last call
  lip->next_state = MY_LEX_START;  // set the next state

  for (;;) {  // loop over the state machine
    switch (state) {
      case MY_LEX_START:  // start parsing a token
        while (state_map[c = lip->yyPeek()] == MY_LEX_SKIP) {  // whitespace?
          if (c == '\n') lip->yylineno++;

          lip->yySkip();  // skip the whitespace character
        }

        /* Start of real token */
        lip->restart_token();  // set m_tok_start and m_cpp_tok_start
        c = lip->yyGet();      // get one character; advances m_cpp_ptr and m_ptr
        state = state_map[c];  // MY_LEX_IDENT if the character is a letter
        break;
      //...
      case MY_LEX_IDENT:  // parse identifiers and keywords such as select, table
        const char *start;
        if (use_mb(cs)) {
          result_state = IDENT_QUOTED;
          switch (my_mbcharlen(cs, lip->yyGetLast())) {
            case 1:
              break;
            case 0:
              if (my_mbmaxlenlen(cs) < 2) break;
              /* else fall through */
            default:
              int l =
                  my_ismbchar(cs, lip->get_ptr() - 1, lip->get_end_of_query());
              if (l == 0) {
                state = MY_LEX_CHAR;
                continue;
              }
              lip->skip_binary(l - 1);
          }
          while (ident_map[c = lip->yyGet()]) {  // loop over identifier characters
            switch (my_mbcharlen(cs, c)) {
              case 1:
                break;
              case 0:
                if (my_mbmaxlenlen(cs) < 2) break;
                /* else fall through */
              default:
                int l;
                if ((l = my_ismbchar(cs, lip->get_ptr() - 1,
                                     lip->get_end_of_query())) == 0)
                  break;
                lip->skip_binary(l - 1);
            }
          }
        } else {
          for (result_state = c; ident_map[c = lip->yyGet()]; result_state |= c)
            ;
          /* If there were non-ASCII characters, mark that we must convert */
          result_state = result_state & 0x80 ? IDENT_QUOTED : IDENT;
        }
        length = lip->yyLength();
        start = lip->get_ptr();
        if (lip->ignore_space) {
          /*
            If we find a space then this can't be an identifier. We notice this
            below by checking start != lex->ptr.
          */
          for (; state_map[c] == MY_LEX_SKIP; c = lip->yyGet()) {
            if (c == '\n') lip->yylineno++;
          }
        }
        if (start == lip->get_ptr() && c == '.' &&
            ident_map[lip->yyPeek()])  // is the next character '.'?
          lip->next_state = MY_LEX_IDENT_SEP;
        else {  // '(' must follow directly if function
          lip->yyUnget();
          if ((tokval = find_keyword(lip, length, c == '('))) {  // find token
            lip->next_state = MY_LEX_START;  // Allow signed numbers
            return (tokval);                 // return token
          }
          lip->yySkip();  // next state does a unget
        }
        yylval->lex_str = get_token(lip, 0, length);
        //...
        return (result_state);  // IDENT or IDENT_QUOTED
      //...
      case MY_LEX_EOL:  // '\0' terminator
        if (lip->eof()) {
          lip->yyUnget();  // Reject the last '\0'
          lip->set_echo(false);
          lip->yySkip();
          lip->set_echo(true);
          /* Unbalanced comments with a missing '*' '/' are a syntax error */
          if (lip->in_comment != NO_COMMENT) return (ABORT_SYM);
          lip->next_state = MY_LEX_END;  // the next call starts in MY_LEX_END
          return (END_OF_INPUT);         // return token
        }
    }
  }
}

3. What are the optimizations of lexical parsing in MySQL 8.0.20?

When parsing the statement select * from t1;, MySQL 5 goes through one extra state, MY_LEX_OPERATOR_OR_IDENT. MySQL 8.0.20 optimizes this step away. Figure 3-1 shows the MySQL 5.6.48 debugging session for comparison.
Figure 3-1: Debugging process in MySQL 5.6.48

Summary

(1) The state enum of the state machine is in /mysql-8.0.20/include/sql_chars.h.

(2) Characters are matched to states quickly because the init_state_maps method initializes the state map in advance.

(3) The token values can be found in /mysql-8.0.20/sql/sql_yacc.h.

(4) Some characters are returned directly as their ASCII code, for example *.
