Preface
I have recently been studying the MySQL source code in depth. I have split the material into several topics, including lexical analysis, syntax parsing, the query executor, the optimizer, and so on. While preparing slides on these topics, I am also writing up the corresponding articles.
MySQL version: 8.0.20
Debugging tool: lldb
System environment: MacOS 10.14.3
Before looking at lexical analysis in detail, let's start with a few questions:
(1) What is lexical analysis?
(2) What optimizations does MySQL 8.0.20 make to lexical parsing?
(3) What does the lexical parsing process look like in MySQL 8?
1. What is lexical analysis?
Lexical analysis is the process of converting a sequence of characters into a sequence of tokens. The program or function that performs it is called a lexer, also known as a scanner. A lexer usually takes the form of a function that the parser calls.
Lexical analysis is the first stage of compilation and the basis for the stages that follow. Its task is to read the source program character by character from left to right, that is, to scan the character stream of the source program, and to recognize words (also called symbols or tokens) according to the word-formation rules. The program that carries out this task is the lexical analyzer, and it can be generated automatically with tools such as lex.
In short, the core task of lexical analysis is to scan and identify words, then classify each one and give it a fixed-length representation. A lexical analyzer is commonly either generated automatically or written by hand.
For automatic lexer generation, see an article I wrote earlier:
https://blog.csdn.net/byxiaoyuonly/article/details/107851764
2. Lexical analysis
2.1 Lexical parsing state machine
The lexical parsing state machine runs during the scanning phase of lexical analysis. Figure 2-1-1 shows how the states are executed to parse a token.
Figure 2-1-1 Token parsing process of the state machine
The main purpose of the state machine is to drive the parsing of each token. For example, the MY_LEX_IDENT state loops over the matching characters, then resolves them and returns the corresponding token.
State | Remarks |
---|---|
MY_LEX_START | Start parsing a token |
MY_LEX_CHAR | Parse a single character such as '*', ':' or ';' |
MY_LEX_IDENT | Parse identifiers and match keywords, such as "table", "select", etc. |
MY_LEX_IDENT_SEP | Found an identifier followed by '.' |
MY_LEX_IDENT_START | Start of an identifier after '.' |
MY_LEX_REAL | Incomplete real number |
MY_LEX_HEX_NUMBER | Hex string |
MY_LEX_BIN_NUMBER | Binary string |
MY_LEX_CMP_OP | Incomplete comparison operator starting with '>', '=' or '!' |
MY_LEX_LONG_CMP_OP | Incomplete comparison operator starting with '<' |
MY_LEX_STRING | Quoted string |
MY_LEX_COMMENT | Comment |
MY_LEX_END | End |
MY_LEX_NUMBER_IDENT | Number, or an identifier that starts with a digit |
MY_LEX_INT_OR_REAL | Complete integer or incomplete real number |
MY_LEX_REAL_OR_POINT | Parse '.': either the start of a real number or the '.' character itself |
MY_LEX_BOOL | Boolean operator ('&' or '|') |
MY_LEX_EOL | End of line ('\0'); if it is EOF, the next state is set to MY_LEX_END |
MY_LEX_LONG_COMMENT | Long comment (starts with '/') |
MY_LEX_END_LONG_COMMENT | End of a long comment (starts with '*') |
MY_LEX_SEMICOLON | Separator ';' |
MY_LEX_SET_VAR | Check for ':=' |
MY_LEX_USER_END | The '@' that ends the user part of user@hostname |
MY_LEX_HOSTNAME | Parse the hostname |
MY_LEX_SKIP | Whitespace |
MY_LEX_USER_VARIABLE_DELIMITER | Backquote (`) delimiter character |
MY_LEX_SYSTEM_VAR | System variable: handles the second '@' of an '@@' prefix |
MY_LEX_IDENT_OR_KEYWORD | Return an identifier or a keyword token |
MY_LEX_IDENT_OR_HEX | Identifier or hex literal (starts with 'x' or 'X') |
MY_LEX_IDENT_OR_BIN | Identifier or binary literal (starts with 'b' or 'B') |
MY_LEX_IDENT_OR_NCHAR | Identifier or national character string (starts with 'n' or 'N') |
MY_LEX_STRING_OR_DELIMITER | String or identifier delimiter (double quote) |
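How states like these drive a scan can be sketched in a few lines of C++. This is a simplified model and not MySQL's code: the `State` enum, the `ToyLexer` class and the token codes `TOK_IDENT` and `TOK_END` are hypothetical names, and only four states are modeled.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical, reduced set of states modeled on the table above.
enum class State { Start, Ident, Char, End };

constexpr int TOK_IDENT = 1000;  // made-up token code for identifiers
constexpr int TOK_END = 0;       // made-up token code for end of input

class ToyLexer {
 public:
  explicit ToyLexer(std::string s) : input(std::move(s)) {}

  std::string text;  // text of the most recent token

  // Return the next token code, moving between states until one completes.
  int next() {
    assert(pos <= input.size());
    State state = State::Start;
    unsigned char c = 0;
    for (;;) {
      switch (state) {
        case State::Start:
          // Skip whitespace, then choose the next state from the first char.
          while (pos < input.size() &&
                 std::isspace(static_cast<unsigned char>(input[pos])))
            ++pos;
          if (pos >= input.size()) {
            state = State::End;
            break;
          }
          c = static_cast<unsigned char>(input[pos++]);
          text.assign(1, static_cast<char>(c));
          state = (std::isalpha(c) || c == '_') ? State::Ident : State::Char;
          break;
        case State::Ident:
          // Loop while characters can still belong to the identifier.
          while (pos < input.size() &&
                 (std::isalnum(static_cast<unsigned char>(input[pos])) ||
                  input[pos] == '_'))
            text.push_back(input[pos++]);
          return TOK_IDENT;
        case State::Char:
          return c;  // a single character is returned as its own code
        case State::End:
          text.clear();
          return TOK_END;
      }
    }
  }

 private:
  std::string input;
  std::size_t pos = 0;
};
```

Calling `next()` repeatedly on the input "a * b" yields an identifier, the character code of "*", another identifier, and finally the end code.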
2.2 Debugging and parsing the source code
We can follow the source code together. If you have not yet compiled and installed MySQL from source, take a look at my previous article:
Why MySQL sometimes selects wrong index and cost calculation
https://blog.csdn.net/byxiaoyuonly/article/details/107651106
To start debugging, first start MySQL 8.0.20. Then prepare two terminals, one for running MySQL statements and the other for debugging, as shown in figure 2-2-1.
Figure 2-2-1 Open terminals
You can use lldb for debugging:
# lldb -p <process ID>
Figure 2-2-2 Lexical parsing call process
According to figure 2-2-2, MySQL 8.0.20 calls the MYSQLlex method for lexical parsing, and MYSQLlex calls lex_one_token to parse a single token. To debug, we can set a breakpoint on lex_one_token:
(lldb) b lex_one_token
After the breakpoint is set, run a statement in the MySQL terminal, such as "select * from T1;". The debugging terminal then stops at the breakpoint, as shown in figure 2-2-3.
Figure 2-2-3 Debugging diagram
According to the figure above, the first state is MY_LEX_START. After entering the switch, the lexer reads a character with the yyPeek method, as shown in figure 2-2-4, and checks whether it is whitespace. Once it reaches a character that is not whitespace, "state = state_map[c];" selects the next state. The lookup table state_map is described in Section 2.3.
Figure 2-2-4 MY_LEX_START debugging chart
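The difference between looking at a character and consuming it matters in the steps that follow. Below is a simplified model of the input-stream calls that appear in this walkthrough. `ToyStream` and its `rest()` helper are hypothetical and are not MySQL's Lex_input_stream; the sketch assumes the cursor is a plain index into the statement text and ignores everything else the real class tracks.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-in for the lexer's input stream; only the cursor
// movement is modeled.
class ToyStream {
 public:
  explicit ToyStream(std::string text) : buf_(std::move(text)) {}

  // Look at the current character without consuming it ('\0' at the end).
  unsigned char yyPeek() const {
    return pos_ < buf_.size() ? static_cast<unsigned char>(buf_[pos_]) : 0;
  }

  // Return the current character and advance the cursor.
  unsigned char yyGet() {
    unsigned char c = yyPeek();
    ++pos_;
    return c;
  }

  // Step the cursor back by one character.
  void yyUnget() {
    assert(pos_ > 0);
    --pos_;
  }

  // Advance the cursor without returning the character.
  void yySkip() { ++pos_; }

  // Input that has not been consumed yet (hypothetical helper).
  std::string rest() const {
    return pos_ < buf_.size() ? buf_.substr(pos_) : std::string();
  }

 private:
  std::string buf_;
  std::size_t pos_ = 0;
};
```

In this model, reading "select" plus the character that ends it leaves the cursor at "* from T1", and a single `yyUnget()` moves it back onto the space in front of the "*".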
Because the character obtained is "s", and "s" maps to MY_LEX_IDENT in state_map, the MY_LEX_IDENT state runs next. This state can match the corresponding keyword and return a token. The first keyword matched is select.
Figure 2-2-5 MY_LEX_IDENT debugging chart
According to figure 2-2-5, the find_keyword method matches the corresponding token. After matching "select" for the first time, we get token 748, which corresponds to SELECT_SYM and can be found in /mysql-8.0.20/sql/sql_yacc.h. At this point the m_ptr value is "* from T1", and the pointer is moved one character to the left by calling lip->yyUnget() before returning. lip->next_state is set to MY_LEX_START again.
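The keyword step can be pictured as a case-insensitive table lookup that returns a token code, or 0 when the word is an ordinary identifier. The sketch below is only an illustration: `toy_find_keyword`, its table and the name `FROM_TOKEN` are hypothetical, MySQL's find_keyword has its own signature and lookup structure, and the two codes are simply the values observed in this debugging session.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <string>

// Token codes as observed in the debugging session for 8.0.20.
constexpr int SELECT_SYM = 748;
constexpr int FROM_TOKEN = 452;  // hypothetical name for the "from" token

// Hypothetical lookup: returns the token code for a keyword, or 0 if the
// word is not a keyword and should be treated as an identifier.
int toy_find_keyword(const std::string &word) {
  static const std::map<std::string, int> keywords = {
      {"SELECT", SELECT_SYM},
      {"FROM", FROM_TOKEN},
  };
  // Normalize to upper case so the match is case-insensitive.
  std::string upper;
  for (unsigned char ch : word)
    upper.push_back(static_cast<char>(std::toupper(ch)));
  auto it = keywords.find(upper);
  return it == keywords.end() ? 0 : it->second;
}
```

With this table, "select" and "FROM" resolve to 748 and 452, while "T1" returns 0 and would be handled as an identifier.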
Figure 2-2-6 MY_LEX_END_LONG_COMMENT debugging chart
When lex_one_token is called again, the MY_LEX_START state skips the space character, reads the "*" character and sets the state to MY_LEX_END_LONG_COMMENT. Because the lexer is not inside a comment, that state then sets the state to MY_LEX_CHAR, and on return the next state is set to MY_LEX_START. Finally token 42 is returned, which is simply the ASCII code of "*". At this point the m_ptr value is "from T1". Running MY_LEX_START again sets the state to MY_LEX_IDENT, and the MY_LEX_IDENT state returns token 452, which corresponds to "from" and can be found in /mysql-8.0.20/sql/sql_yacc.h. On the next call the result is IDENT_QUOTED, then the lexer reaches the MY_LEX_EOL state, and finally MY_LEX_END.
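The detour through MY_LEX_END_LONG_COMMENT for "*" can be modeled as a small decision: the state closes a comment only when the lexer is inside one and the next character is "/"; otherwise the character itself becomes the token. This is a simplified sketch with hypothetical names (`StarResult`, `handle_star`), not the MySQL code.

```cpp
#include <cassert>

// Outcome of seeing '*': either a comment was closed, or a token is returned.
struct StarResult {
  bool closed_comment;
  int token;  // meaningful only when closed_comment is false
};

// in_comment: whether the lexer is currently inside a long comment
// next: the character that follows the '*'
StarResult handle_star(bool in_comment, char next) {
  if (in_comment && next == '/') return {true, 0};
  return {false, '*'};  // plain '*': the token is the character's own code
}
```

Outside a comment the returned token is the character code of "*", which is 42 in ASCII and matches the value seen in the debugger.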
Figure 2-2-7 Complete debugging and parsing process
2.3 Introduction to state_map
state_map is the key to selecting the state. It is initialized mainly in the init_state_maps method of /mysql-8.0.20/mysys/sql_chars.cc, which is implemented as follows:
bool init_state_maps(CHARSET_INFO *cs) {
  uint i;
  uchar *ident_map;
  enum my_lex_states *state_map = nullptr;
  lex_state_maps_st *lex_state_maps = (lex_state_maps_st *)my_once_alloc(
      sizeof(lex_state_maps_st), MYF(MY_WME));
  if (lex_state_maps == nullptr) return true;  // null pointer: OOM
  cs->state_maps = lex_state_maps;
  state_map = lex_state_maps->main_map;
  if (!(cs->ident_map = ident_map = (uchar *)my_once_alloc(256, MYF(MY_WME))))
    return true;  // OOM
  hint_lex_init_maps(cs, lex_state_maps->hint_map);
  /* Fill state_map for a faster parser */
  for (i = 0; i < 256; i++) {
    if (my_isalpha(cs, i))
      state_map[i] = MY_LEX_IDENT;  // identifier state
    else if (my_isdigit(cs, i))
      state_map[i] = MY_LEX_NUMBER_IDENT;
    else if (my_ismb1st(cs, i))
      /* To get whether it's a possible leading byte for a charset. */
      state_map[i] = MY_LEX_IDENT;
    else if (my_isspace(cs, i))
      state_map[i] = MY_LEX_SKIP;  // whitespace state
    else
      state_map[i] = MY_LEX_CHAR;  // single-character state
  }
  state_map[(uchar)'_'] = state_map[(uchar)'$'] = MY_LEX_IDENT;
  state_map[(uchar)'\''] = MY_LEX_STRING;
  state_map[(uchar)'.'] = MY_LEX_REAL_OR_POINT;
  state_map[(uchar)'>'] = state_map[(uchar)'='] = state_map[(uchar)'!'] =
      MY_LEX_CMP_OP;  // comparison operator state
  state_map[(uchar)'<'] = MY_LEX_LONG_CMP_OP;
  state_map[(uchar)'&'] = state_map[(uchar)'|'] = MY_LEX_BOOL;
  state_map[(uchar)'#'] = MY_LEX_COMMENT;
  state_map[(uchar)';'] = MY_LEX_SEMICOLON;
  state_map[(uchar)':'] = MY_LEX_SET_VAR;
  state_map[0] = MY_LEX_EOL;  // end-of-input state
  state_map[(uchar)'/'] = MY_LEX_LONG_COMMENT;
  state_map[(uchar)'*'] = MY_LEX_END_LONG_COMMENT;  // state for '*'
  state_map[(uchar)'@'] = MY_LEX_USER_END;          // state for '@'
  state_map[(uchar)'`'] = MY_LEX_USER_VARIABLE_DELIMITER;
  state_map[(uchar)'"'] = MY_LEX_STRING_OR_DELIMITER;
  /*
    Create a second map to speed up finding identifiers
  */
  for (i = 0; i < 256; i++) {
    ident_map[i] = (uchar)(state_map[i] == MY_LEX_IDENT ||
                           state_map[i] == MY_LEX_NUMBER_IDENT);
  }
  /* Special handling of hex and binary strings */
  state_map[(uchar)'x'] = state_map[(uchar)'X'] = MY_LEX_IDENT_OR_HEX;
  state_map[(uchar)'b'] = state_map[(uchar)'B'] = MY_LEX_IDENT_OR_BIN;
  state_map[(uchar)'n'] = state_map[(uchar)'N'] = MY_LEX_IDENT_OR_NCHAR;
  return false;
}
The code can select a state quickly because the state map is initialized in advance: each possible character value already has its state stored in the table, so choosing a state is a single array lookup. The state constants (enum my_lex_states) are defined in mysql-8.0.20/include/sql_chars.h.
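The idea behind the precomputed map is a 256-entry array indexed by the byte value of the character. A reduced sketch follows, assuming plain `<cctype>` classification in the default "C" locale instead of a CHARSET_INFO; `ToyState`, `build_state_map` and `build_ident_map` are hypothetical names, and only a few of the overrides are reproduced.

```cpp
#include <array>
#include <cassert>
#include <cctype>

// Hypothetical, reduced set of states.
enum ToyState { TOY_CHAR, TOY_IDENT, TOY_NUMBER_IDENT, TOY_SKIP, TOY_EOL };

// Build the table once; afterwards every lookup is just state_map[c].
std::array<ToyState, 256> build_state_map() {
  std::array<ToyState, 256> map{};
  for (int i = 0; i < 256; i++) {
    if (std::isalpha(i))
      map[i] = TOY_IDENT;
    else if (std::isdigit(i))
      map[i] = TOY_NUMBER_IDENT;
    else if (std::isspace(i))
      map[i] = TOY_SKIP;
    else
      map[i] = TOY_CHAR;
  }
  // Overrides for characters that need a dedicated state.
  map[static_cast<unsigned char>('_')] = TOY_IDENT;
  map[static_cast<unsigned char>('$')] = TOY_IDENT;
  map[0] = TOY_EOL;
  return map;
}

// Second table: 1 if the character can continue an identifier, else 0.
std::array<unsigned char, 256> build_ident_map(
    const std::array<ToyState, 256> &state_map) {
  std::array<unsigned char, 256> ident{};
  for (int i = 0; i < 256; i++) {
    bool part = state_map[i] == TOY_IDENT || state_map[i] == TOY_NUMBER_IDENT;
    ident[i] = part ? 1 : 0;
  }
  return ident;
}
```

The cost of classification is paid once at startup, which is the trade-off the real init_state_maps makes per character set.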
2.4 Source code analysis
Analysis of the key function lex_one_token:
static int lex_one_token(Lexer_yystype *yylval, THD *thd) {
  uchar c = 0;
  bool comment_closed;
  int tokval, result_state;
  uint length;
  enum my_lex_states state;
  Lex_input_stream *lip = &thd->m_parser_state->m_lip;  // input stream
  const CHARSET_INFO *cs = thd->charset();              // character set
  const my_lex_states *state_map = cs->state_maps->main_map;  // character-to-state map
  const uchar *ident_map = cs->ident_map;  // marks identifier characters
  lip->yylval = yylval;                    // the global state
  lip->start_token();                      // initialize the token start
  state = lip->next_state;                 // state saved by the previous call
  lip->next_state = MY_LEX_START;          // default state for the next call
  for (;;) {  // loop over the state machine
    switch (state) {
      case MY_LEX_START:  // start parsing a token
        while (state_map[c = lip->yyPeek()] == MY_LEX_SKIP) {  // skip leading whitespace
          if (c == '\n') lip->yylineno++;
          lip->yySkip();  // skip the whitespace character
        }
        /* Start of real token */
        lip->restart_token();  // set m_tok_start and m_cpp_tok_start
        c = lip->yyGet();      // get one character and advance m_ptr and m_cpp_ptr
        state = state_map[c];  // e.g. MY_LEX_IDENT if the character is a letter
        break;
      //...
      case MY_LEX_IDENT:  // parse identifiers and keywords, such as select, table, etc.
        const char *start;
        if (use_mb(cs)) {
          result_state = IDENT_QUOTED;
          switch (my_mbcharlen(cs, lip->yyGetLast())) {
            case 1:
              break;
            case 0:
              if (my_mbmaxlenlen(cs) < 2) break;
              /* else fall through */
            default:
              int l =
                  my_ismbchar(cs, lip->get_ptr() - 1, lip->get_end_of_query());
              if (l == 0) {
                state = MY_LEX_CHAR;
                continue;
              }
              lip->skip_binary(l - 1);
          }
          while (ident_map[c = lip->yyGet()]) {  // consume identifier characters
            switch (my_mbcharlen(cs, c)) {
              case 1:
                break;
              case 0:
                if (my_mbmaxlenlen(cs) < 2) break;
                /* else fall through */
              default:
                int l;
                if ((l = my_ismbchar(cs, lip->get_ptr() - 1,
                                     lip->get_end_of_query())) == 0)
                  break;
                lip->skip_binary(l - 1);
            }
          }
        } else {
          for (result_state = c; ident_map[c = lip->yyGet()]; result_state |= c)
            ;
          /* If there were non-ASCII characters, mark that we must convert */
          result_state = result_state & 0x80 ? IDENT_QUOTED : IDENT;
        }
        length = lip->yyLength();
        start = lip->get_ptr();
        if (lip->ignore_space) {
          /*
            If we find a space then this can't be an identifier. We notice this
            below by checking start != lex->ptr.
          */
          for (; state_map[c] == MY_LEX_SKIP; c = lip->yyGet()) {
            if (c == '\n') lip->yylineno++;
          }
        }
        if (start == lip->get_ptr() && c == '.' && ident_map[lip->yyPeek()])  // identifier followed by '.'
          lip->next_state = MY_LEX_IDENT_SEP;
        else {  // '(' must follow directly if function
          lip->yyUnget();
          if ((tokval = find_keyword(lip, length, c == '('))) {  // look up the keyword token
            lip->next_state = MY_LEX_START;  // Allow signed numbers
            return (tokval);                 // return the keyword token
          }
          lip->yySkip();  // next state does a unget
        }
        yylval->lex_str = get_token(lip, 0, length);
        //...
        return (result_state);  // IDENT or IDENT_QUOTED
      //...
      case MY_LEX_EOL:  // '\0' terminator
        if (lip->eof()) {
          lip->yyUnget();  // Reject the last '\0'
          lip->set_echo(false);
          lip->yySkip();
          lip->set_echo(true);
          /* Unbalanced comments with a missing '*' '/' are a syntax error */
          if (lip->in_comment != NO_COMMENT) return (ABORT_SYM);
          lip->next_state = MY_LEX_END;  // the next call starts in MY_LEX_END
          return (END_OF_INPUT);         // return the end-of-input token
        }
    }
  }
}
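In the single-byte branch of MY_LEX_IDENT, the loop ORs every byte of the identifier into result_state and then tests bit 0x80, which tells it in one pass whether any non-ASCII byte occurred. The same trick in isolation, with `has_non_ascii` as a hypothetical helper name that does not exist in MySQL:

```cpp
#include <cassert>
#include <string>

// OR all bytes together. The high bit of the result is set if and only if
// at least one byte had its high bit set, i.e. was outside 7-bit ASCII.
bool has_non_ascii(const std::string &ident) {
  unsigned acc = 0;
  for (unsigned char byte : ident) acc |= byte;
  return (acc & 0x80) != 0;
}
```

For a plain word such as "select" the accumulated value never gets its high bit set, which corresponds to the IDENT result; any byte of 0x80 or above flips the outcome to the IDENT_QUOTED case.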
3. What are the optimizations of lexical parsing in MySQL 8.0.20?
When parsing the statement "select * from T1;", MySQL 5 goes through one extra state, MY_LEX_OPERATOR_OR_IDENT. This step has been optimized away in MySQL 8.0.20. Figure 3-1 shows the MySQL 5 process for comparison.
Figure 3-1 Debugging process in MySQL 5.6.48
Summary
(1) The state constants are defined in /mysql-8.0.20/include/sql_chars.h.
(2) Characters are matched to states quickly because the init_state_maps method initializes the state map in advance.
(3) The token values can be found in /mysql-8.0.20/sql/sql_yacc.h.
(4) Some characters are returned directly as their ASCII code, such as "*".