Markdown Compilation Principle

Time:2020-5-22

This article is reproduced from: please call me HR

I usually write in Markdown, and I have noticed that some engines parse it quickly while others are slow, but there is not much I can do about it.

I like the way you can't stand me yet can't get rid of me.

So here is a simple analysis of Markdown syntax engines, which may also help you write a micro Markdown parser yourself.
A Markdown engine is not complicated: find the matching regular expression, then replace the match with HTML tags. At present there are only a few popular Markdown parsers, such as marked and markdown-js.
Markdown began as a syntax and a parser written by John Gruber in Perl. As Markdown grew popular, different engines appeared to support it. Later, GitHub proposed the GFM (GitHub Flavored Markdown) standard, and the parsing rules of most engines have largely converged on it.
The most basic Markdown engine should be able to parse: inline HTML, automatic paragraphs, headers, blockquotes, lists, code blocks, horizontal rules, links, emphasis, inline code and images. For details, please refer to: MD features
Next, let's build a Markdown parser.

Preparation

The basic tools of a Markdown parser are regular expressions and the exec method. Let's briefly go over exec.

RegExp exec

exec matches a regular expression against a given string. In many cases you can use String.prototype.match instead. Basic usage:

regexObj.exec(str)

It returns an array when there is a match and null when there is none.
If an array is returned:

  • [0]: the full matched text

  • [1] … [n]: the content captured by each group

  • index: the position in the string where the match starts

  • input: the original string

Specific demo:

var re = /quick\s(brown).+?(jumps)/ig;
var result = re.exec('The Quick Brown Fox Jumps Over The Lazy Dog');

//The result is
[ 'Quick Brown Fox Jumps',
  'Brown',
  'Jumps',
  index: 4,
  input: 'The Quick Brown Fox Jumps Over The Lazy Dog' ]
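
Because the pattern above carries the g flag, exec remembers where it stopped in re.lastIndex and resumes from there on the next call. Here is a minimal sketch of collecting every match this way. The helper name findAll is hypothetical and not part of any library:

```javascript
// Hypothetical helper: collect every match of a global regex via repeated exec calls.
function findAll(re, str) {
  if (!re.global) throw new Error('findAll needs a regex with the g flag');
  var results = [];
  var m;
  re.lastIndex = 0; // always start from the beginning of the string
  while ((m = re.exec(str)) !== null) {
    results.push({ text: m[0], groups: m.slice(1), index: m.index });
    // Guard against zero-length matches, which would otherwise never advance.
    if (m[0].length === 0) re.lastIndex++;
  }
  return results;
}

var found = findAll(/(\w)o(\w)/g, 'The Quick Brown Fox');
console.log(found);
```

Without the g flag, lastIndex is never advanced and the same match would be returned forever, which is why the sketch rejects non-global patterns.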

Next come the basic regular expressions:

Basic regular expressions

The regular expressions are easy to find in the source code:

  regexobject: {
    headline: /^(\#{1,6})([^\#\n]+)$/m,
    code: /\s\`\`\`\n?([^`]+)\`\`\`/g,
    hr: /^(?:([\*\-_] ?)+)$/gm,
    lists: /^((\s*((\*|\-)|\d(\.|\))) [^\n]+)\n)+/gm,
    bolditalic: /(?:([\*_~]{1,3}))([^\*_~\n]+[^\*_~\s])/g,
    links: /!?\[([^\]<>]+)\]\(([^ \)<>]+)( "[^\(\)\"]+")?\)/g,
    reflinks: /\[([^\]]+)\]\[([^\]]+)\]/g,
    smlinks: /\@([a-z0-9]{3,})\@(t|gh|fb|gp|adn)/gi,
    mail: /<(([a-z0-9_\-\.])+\@([a-z0-9_\-\.])+\.([a-z]{2,7}))>/gmi,
    tables: /\n(([^|\n]+ *\| *)+([^|\n]+\n))((:?\-+:?\|)+(:?\-+:?)*\n)((([^|\n]+ *\| *)+([^|\n]+)\n)+)/g,
    include: /[\[<]include (\S+) from (https?:\/\/[a-z0-9\.\-]+\.[a-z]{2,9}[a-z0-9\.\-\?\&\/]+)[\]>]/gi,
    url: /<([a-zA-Z0-9@:%_\+.~#?&\/=]{2,256}\.[a-z]{2,4}\b(\/[\-a-zA-Z0-9@:%_\+.~#?&\/\/=]*)?)>/g
  }

This article refers to a teaching-oriented Markdown parser on GitHub. Its source code is interesting and worth a look. It is very simple to read and contains little processing logic, which is why it serves as the basis for this explanation.

Simple match

The simplest match is the headline. Its regular expression is /^(\#{1,6})([^\#\n]+)$/m. The trailing m flag is very important, because every heading must start at the beginning of a line, for example:

# abc
## sub_abc

With the m flag, ^ and $ match at the start and end of every line rather than only of the whole string. Perfect~
Then a single loop is all we need:

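var headline = /^(\#{1,6})([^\#\n]+)$/m;
var stra, count;
// Convert one heading per iteration until none is left
while ((stra = headline.exec(str)) !== null) {
  count = stra[1].length; // number of # characters = heading level
  str = str.replace(stra[0], '<h' + count + '>' + stra[2].trim() + '</h' + count + '>').trim();
}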

Of course, this is not complete processing: the heading text is inserted without being escaped. The simplest remedy is to escape the string. There are many ways to do that; the most direct is plain replacement:

function escape(html, encode) {
  return html
    .replace(!encode ? /&(?!#?\w+;)/g : /&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}
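
The encode parameter decides whether entities that are already in the text get escaped a second time. Here is a small sketch of the difference. The function is the same logic, renamed escapeHtml (a hypothetical name) so that it does not shadow the global escape:

```javascript
// Same replacement chain as above, under a hypothetical name.
function escapeHtml(html, encode) {
  return html
    // Without encode, an & that already starts an entity (e.g. &amp;) is left alone.
    .replace(!encode ? /&(?!#?\w+;)/g : /&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

var input = 'a &amp; b & <c>';
var keepEntities = escapeHtml(input);    // 'a &amp; b &amp; &lt;c&gt;'
var encodeAll = escapeHtml(input, true); // 'a &amp;amp; b &amp; &lt;c&gt;'
console.log(keepEntities, encodeAll);
```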

This is simple replacement. Alternatively, the browser's text node provides built-in escaping:

// Use the text node's built-in escaping to replace characters such as < > & (but not ' and ")
var escape = function(str) {
    'use strict';
    var div = document.createElement('div');
    div.appendChild(document.createTextNode(str));
    str = div.innerHTML;
    div = undefined;
    return str;
}

Then the above content can be written as:

var headline = /^(\#{1,6})([^\#\n]+)$/m;
var stra, count;
// Same loop as before, but the heading text is escaped before insertion
while ((stra = headline.exec(str)) !== null) {
  count = stra[1].length;
  str = str.replace(stra[0], '<h' + count + '>' + escape(stra[2].trim()) + '</h' + count + '>').trim();
}
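
The same conversion can also be done in a single pass, since String.prototype.replace accepts a callback and, with the g flag, visits every match. This is a sketch of that alternative, not the approach of the parser discussed here. renderHeadings and escapeText are hypothetical names, and escapeText is a simplified stand-in for the escape function:

```javascript
// Simplified stand-in for the escape function (only & < > are handled).
function escapeText(s) {
  return s.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

function renderHeadings(str) {
  // g: replace every heading; m: let ^ and $ match at each line boundary.
  return str.replace(/^(#{1,6})([^#\n]+)$/gm, function (match, hashes, text) {
    var level = hashes.length;
    return '<h' + level + '>' + escapeText(text.trim()) + '</h' + level + '>';
  });
}

var html = renderHeadings('# abc\n## sub <b>\nplain text');
console.log(html);
```

A callback also sidesteps the special meaning of $ sequences in a replacement string, which matters if heading text ever contains something like $1.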

Building on this, we can generalize a little. marked.js, for example, builds a customizable rendering scheme on top of its regular expressions.

marked.js features

There are some details of the regular expressions and of matching that we will not discuss here, because they mainly deal with characters such as \r\n, ' and ". Let's take a brief look at the essence of marked.js instead, especially its customizable rendering, namely the Renderer.

//Official demo
var marked = require('marked');
var renderer = new marked.Renderer();

renderer.heading = function (text, level) {
  var escapedText = text.toLowerCase().replace(/[^\w]+/g, '-');

  return '<h' + level + '><a name="' +
                escapedText +
                 '" class="anchor" href="#' +
                 escapedText +
                 '"><span class="header-link"></span></a>' +
                  text + '</h' + level + '>';
};

console.log(marked('# heading+', { renderer: renderer }));

Let's look at the ideas in its source code.
First, it has a Renderer constructor:

function Renderer(options) {
  this.options = options || {};
}

Next come the methods bound to its prototype:

Renderer.prototype.blockquote = function(quote) {
  return '<blockquote>\n' + quote + '</blockquote>\n';
};

Some readers may wonder: isn't there any syntax parsing here?
Look closely at its parameter, quote, and then at the rendered output, and it becomes clear at a glance: quote is the matched content, already escaped.
Next, let's look at how it is called:

// url (gfm)
 if (!this.inLink && (cap = this.rules.url.exec(src))) {
   src = src.substring(cap[0].length);
   text = escape(cap[1]);
   href = text;
   // out accumulates the entire output
   out += this.renderer.link(href, null, text);
   continue;
 }

Some readers may ask again: isn't everything supposed to be matched at once? Won't this leave information out?
marked.js sacrifices some performance in order to support customization. Let's look at its regular expressions:

var block = {
  newline: /^\n+/,
  code: /^( {4}[^\n]+\n*)+/,
  fences: noop,
  hr: /^( *[-*_]){3,} *(?:\n+|$)/,
  heading: /^ *(#{1,6}) *([^\n]+?) *#* *(?:\n+|$)/,
  lheading: /^([^\n]+)\n *(=|-){2,} *(?:\n+|$)/,
  blockquote: /^( *>[^\n]+(\n(?!def)[^\n]+)*\n*)+/,
  list: /^( *)(bull) [\s\S]+?(?:hr|def|\n{2,}(?! )(?!bull )\n*|\s*$)/,
  html: /^ *(?:comment *(?:\n|\s*$)|closed *(?:\n{2,}|\s*$)|closing *(?:\n{2,}|\s*$))/,
  def: /^ *\[([^\]]+)\]: *<?([^\s>]+)>?(?: +["(]([^\n]+)[")])? *(?:\n+|$)/,
  paragraph: /^((?:[^\n]+\n?(?!hr|heading|lheading|blockquote|tag|def))+)\n*/,
  text: /^[^\n]+/
};

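As you can see, none of these expressions carries a flag such as g or m, and every one is anchored with ^, so each only matches at the start of the remaining source. This is the clever part of marked.js. So the out line above involves no magic: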

out += this.renderer.link(href, null, text);
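
The line above simply delegates to whichever renderer was supplied. Here is a minimal, self-contained sketch of that idea. MiniRenderer and renderAutolinks are hypothetical names, and this is not marked.js's actual code:

```javascript
// Hypothetical miniature of the renderer pattern.
function MiniRenderer() {}
MiniRenderer.prototype.link = function (href, title, text) {
  return '<a href="' + href + '">' + text + '</a>';
};

// The "parser" only finds <url> spans; how they look is left to the renderer.
function renderAutolinks(src, renderer) {
  return src.replace(/<(https?:\/\/[^\s>]+)>/g, function (match, url) {
    return renderer.link(url, null, url);
  });
}

// Override one method on an instance to customize the output.
var custom = new MiniRenderer();
custom.link = function (href, title, text) {
  return '<a target="_blank" href="' + href + '">' + text + '</a>';
};

var plain = renderAutolinks('see <http://a.io>', new MiniRenderer());
var fancy = renderAutolinks('see <http://a.io>', custom);
console.log(plain);
console.log(fancy);
```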

Overriding methods on the renderer object is therefore how custom output is created, which is a nice design. One more thing worth explaining is the helper marked.js uses to build regular expressions by substitution:

function replace(regex, opt) {
  regex = regex.source;
  opt = opt || '';
  return function self(name, val) {
    if (!name) return new RegExp(regex, opt);
    val = val.source || val;
    val = val.replace(/(^|[^\[])\^/g, '$1');
    regex = regex.replace(name, val);
    return self;
  };
}
// Let's look at how it is called
// Every block.xxx here is a regular expression, so I won't go into details
block.paragraph = replace(block.paragraph)
  ('hr', block.hr)
  ('heading', block.heading)
  ('lheading', block.lheading)
  ('blockquote', block.blockquote)
  ('tag', '<' + block._tag)
  ('def', block.def)
  ();
// The result is a new regular expression: each placeholder word in the source is replaced by the corresponding regular expression
// For example, hr and heading in paragraph:
 paragraph: /^((?:[^\n]+\n?(?!hr|heading|lheading|blockquote|tag|def))+)\n*/
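
To see the substitution at work, here is a runnable sketch that reuses the helper with two deliberately simplified rules. The rules object and its patterns are my own assumptions for illustration, not marked.js's real rules:

```javascript
// The helper from above, reproduced so the example runs on its own.
function replace(regex, opt) {
  regex = regex.source;
  opt = opt || '';
  return function self(name, val) {
    if (!name) return new RegExp(regex, opt);
    val = val.source || val;
    val = val.replace(/(^|[^\[])\^/g, '$1'); // drop ^ anchors from the sub-pattern
    regex = regex.replace(name, val);
    return self;
  };
}

// Hypothetical, simplified rules.
var rules = {
  hr: /^-{3,}/,
  heading: /^#{1,6} /,
  paragraph: /^((?:[^\n]+\n?(?!hr|heading))+)/
};

var paragraph = replace(rules.paragraph)
  ('hr', rules.hr)
  ('heading', rules.heading)
  ();

console.log(paragraph.source);
// The paragraph stops before the heading line instead of swallowing it.
var cap = paragraph.exec('first line\nsecond line\n# title\n');
console.log(JSON.stringify(cap[1]));
```

Stripping the ^ from each sub-pattern matters here: inside the lookahead the sub-pattern has to match in the middle of the text, not at the start of the input.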

Of course, there are other ways to do this, but marked.js handles it neatly.

marked's actual parsing order

We mentioned above that parsing uses out +=. A question naturally follows:
How is syntax nested inside blocks parsed?
In fact, at the block level it does not use out += at all. Look at the following source code:

// code
if (cap = this.rules.code.exec(src)) {
  src = src.substring(cap[0].length);
  cap = cap[0].replace(/^ {4}/gm, '');
  this.tokens.push({
    type: 'code',
    text: !this.options.pedantic
      ? cap.replace(/\n+$/, '')
      : cap
  });
  continue;
}

Here it pushes a token onto tokens, which are later handed to the outer layer (the parser) for further processing:

Parser.prototype.tok = function() {
  switch (this.token.type) {
    case 'space': {
      return '';
    }
    case 'hr': {
      return this.renderer.hr();
    }
    ...
  }
};
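
Here is a minimal sketch of this two-phase design: a lexer that consumes the source from the front and pushes tokens, then a parser that switches on the token type. miniRules, lex and parse are hypothetical names, and the rules are much simpler than the real ones in marked.js:

```javascript
// Hypothetical, simplified block rules. All are anchored with ^ and carry no flags.
var miniRules = {
  newline: /^\n+/,
  heading: /^(#{1,6}) +([^\n]+)(?:\n|$)/,
  hr: /^-{3,} *(?:\n|$)/,
  text: /^[^\n]+/
};

function lex(src) {
  var tokens = [];
  var cap;
  while (src) {
    // Each rule inspects only the front of src; a match is cut off and tokenized.
    if ((cap = miniRules.newline.exec(src))) {
      src = src.substring(cap[0].length);
      tokens.push({ type: 'space' });
      continue;
    }
    if ((cap = miniRules.heading.exec(src))) {
      src = src.substring(cap[0].length);
      tokens.push({ type: 'heading', depth: cap[1].length, text: cap[2] });
      continue;
    }
    if ((cap = miniRules.hr.exec(src))) {
      src = src.substring(cap[0].length);
      tokens.push({ type: 'hr' });
      continue;
    }
    if ((cap = miniRules.text.exec(src))) {
      src = src.substring(cap[0].length);
      tokens.push({ type: 'text', text: cap[0] });
      continue;
    }
    throw new Error('No rule matched at: ' + src.charAt(0));
  }
  return tokens;
}

function parse(tokens) {
  return tokens.map(function (token) {
    switch (token.type) {
      case 'space': return '';
      case 'hr': return '<hr>\n';
      case 'heading':
        return '<h' + token.depth + '>' + token.text + '</h' + token.depth + '>\n';
      default: return '<p>' + token.text + '</p>\n';
    }
  }).join('');
}

var tokens = lex('# Title\n---\nhello\n');
var out = parse(tokens);
console.log(out);
```

Because the output is built from tokens rather than by appending strings during matching, a later stage is free to inspect or reorder tokens before rendering.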

So marked.js takes on a lot of extra complexity to support customized parsing. But compared with global matching in replacement mode, it is somewhat more flexible.

flexibility + speed = const

OK, now we have a basic understanding of how marked.js parses. Next, let's look at the harder problem of parsing code.

Code parsing principle

If we only parse code blocks superficially, it is very simple. Use the following regular expression:

code: /\s?\`\`\`\n?([^`]+)\`\`\`/g

However, this only wraps the code in the following structure:

<pre>
    <code>
        ....
    </code>
</pre>
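
A sketch of that replacement, using the regular expression above. renderCodeBlocks is a hypothetical name, and the inline escaping is a simplified assumption:

```javascript
var codeRe = /\s?```\n?([^`]+)```/g;

function renderCodeBlocks(str) {
  return str.replace(codeRe, function (match, body) {
    // Escape the code so that < and > are shown literally instead of parsed as HTML.
    var escaped = body
      .replace(/&/g, '&amp;')
      .replace(/</g, '&lt;')
      .replace(/>/g, '&gt;');
    return '<pre><code>' + escaped + '</code></pre>';
  });
}

var rendered = renderCodeBlocks('intro\n```\nvar a = 1 < 2;\n```\nend');
console.log(rendered);
```

Note that [^`]+ cannot cross a backtick, so a block whose body contains a backtick is not matched by this pattern.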

It does not produce colored highlighting like this:

var a =1;
var b =2;

The principle of simple highlighting is easy to explain: wrap each kind of token in a span with a different class.

// Input:

'abc'

// Generated span:
<span class="str">'abc'</span>
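
Here is a naive sketch of this span wrapping for string literals and a handful of keywords. highlightNaive and the class name kwd are hypothetical, and real highlighters are far more involved:

```javascript
// Hypothetical naive highlighter: one pass, one combined regex.
function highlightNaive(code) {
  // Group 1: single-quoted string literal; group 2: a few keywords.
  var token = /('(?:[^'\\\n]|\\.)*')|\b(var|function|return)\b/g;
  return code.replace(token, function (match, str, keyword) {
    if (str !== undefined) return '<span class="str">' + str + '</span>';
    return '<span class="kwd">' + keyword + '</span>';
  });
}

var highlighted = highlightNaive("var a = 'abc';");
console.log(highlighted);
```

Doing everything in a single pass matters: a second pass over the output would start matching words inside the generated span tags. The sketch also skips HTML escaping, which a real implementation would need.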

Its mechanism is to add different class names according to the syntax rules of each language.
For details, we can refer to the highlight.js source code:

  function highlightBlock(block) {
    var node, originalStream, result, resultNode, text;
    var language = blockLanguage(block);
    ...
    text = node.textContent;
    ...
    result = language ? highlight(language, text, true) : highlightAuto(text);
    ...
  }

blockLanguage finds the programming language of the given code block. The lookup relies on one important function:

function registerLanguage(name, language) {
  var lang = languages[name] = language(hljs);
  if (lang.aliases) {
  lang.aliases.forEach(function(alias) {aliases[alias] = name;});
  }
}
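
Here is a self-contained sketch of how such a registry can resolve both names and aliases. The languages and aliases maps mirror the snippet above, while getLanguage and the sample definition are written for illustration and should be checked against the real source:

```javascript
// Hypothetical miniature of a language registry with alias lookup.
var languages = {};
var aliases = {};

function registerLanguage(name, definition) {
  var lang = languages[name] = definition();
  if (lang.aliases) {
    lang.aliases.forEach(function (alias) { aliases[alias] = name; });
  }
}

function getLanguage(name) {
  name = (name || '').toLowerCase();
  // Try the canonical name first, then fall back to an alias.
  return languages[name] || languages[aliases[name]];
}

registerLanguage('javascript', function () {
  return { aliases: ['js', 'jsx'], keywords: 'var function return' };
});

var viaAlias = getLanguage('js');
var viaName = getLanguage('JavaScript');
console.log(viaAlias === viaName);
```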

This function registers a language definition manually. Let's look at the definition for JavaScript:

/*
Language: JavaScript
Category: common, scripting
*/

function(hljs) {
  return {
    aliases: ['js', 'jsx'],
    keywords: {
      keyword:
        'in of if for while finally var new function do return void else break catch ' +
        'instanceof with throw case default try this switch continue typeof delete ' +
        'let yield const export super debugger as async await static ' +
        // ECMAScript 6 modules import
        'import from as'
      ,
      literal:
        'true false null undefined NaN Infinity',
      built_in:
        'eval isFinite isNaN parseFloat parseInt decodeURI decodeURIComponent ' +
    ...
}

Matching and replacement are then done with the specified regular expressions. This is why Markdown engines generally do not ship with code highlighting: it is too complex, and there are so many programming languages. highlight.js therefore defines a set of common, reusable modes, which also serve the case where no language is passed in:

hljs.COMMENT = function (begin, end, inherits) {
    var mode = hljs.inherit(
      {
        className: 'comment',
        begin: begin, end: end,
        contains: []
      },
      inherits || {}
    );
    mode.contains.push(hljs.PHRASAL_WORDS_MODE);
    mode.contains.push({
      className: 'doctag',
      begin: "(?:TODO|FIXME|NOTE|BUG|XXX):",
      relevance: 0
    });
    return mode;
  };
  hljs.C_LINE_COMMENT_MODE = hljs.COMMENT('//', '$');
  hljs.C_BLOCK_COMMENT_MODE = hljs.COMMENT('/\\*', '\\*/');
  hljs.HASH_COMMENT_MODE = hljs.COMMENT('#', '$');
  hljs.NUMBER_MODE = {
    className: 'number',
    begin: hljs.NUMBER_RE,
    relevance: 0
  };
  hljs.C_NUMBER_MODE = {
    className: 'number',
    begin: hljs.C_NUMBER_RE,
    relevance: 0
  };
  hljs.BINARY_NUMBER_MODE = {
    className: 'number',
    begin: hljs.BINARY_NUMBER_RE,
    relevance: 0
  };
  hljs.CSS_NUMBER_MODE = {
    className: 'number',
    begin: hljs.NUMBER_RE + '(' +
      '%|em|ex|ch|rem'  +
      '|vw|vh|vmin|vmax' +
      '|cm|mm|in|pt|pc|px' +
      '|deg|grad|rad|turn' +
      '|s|ms' +
      '|Hz|kHz' +
      '|dpi|dpcm|dppx' +
      ')?',
    relevance: 0
  };
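
The begin values above are strings that are later compiled into regular expressions. Here is a rough sketch of what the begin of CSS_NUMBER_MODE would match. The value of NUMBER_RE is my assumption and should be checked against the highlight.js source, and the unit list is shortened:

```javascript
// Assumed value of hljs.NUMBER_RE; verify against the highlight.js source.
var NUMBER_RE = '\\b\\d+(\\.\\d+)?';

// Shortened unit list, for illustration only.
var cssNumber = new RegExp(NUMBER_RE + '(%|em|px|s|ms)?', 'g');

var numbers = 'width: 1.5em; margin: 10px 0; transition: 200ms'.match(cssNumber);
console.log(numbers);
```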

On a side note, an interviewer stood me up recently and I was in a bad mood, so I am venting a little at the end of this post.

Mdzz, what happened to the agreed time? An interviewer who cannot even keep to the schedule.
