An example of using lilac-parser instead of regular expressions to extract text

Time: 2020-10-23

lilac-parser is a library I implemented in ClojureScript that covers some of what regular expressions do.
As the name suggests, the design follows the idea of a parser.
In practice it also works smoothly as a replacement for regular expressions, although it is not as concise.
The drawback of regular expressions is that they are written as strings, which makes them awkward to combine.
lilac-parser rules, in contrast, are very easy to combine. I will give some examples.

First is the is+ rule, which does exact matching:

(parse-lilac "x" (is+ "x"))      ; {:ok? true, :rest nil}
(parse-lilac "xyz" (is+ "xyz"))  ; {:ok? true, :rest nil}
(parse-lilac "xy" (is+ "x"))     ; {:ok? true, :rest ("y")}
(parse-lilac "y" (is+ "x"))      ; {:ok? false}

As you can see, every expression that matches at the head of the input returns true.
Whether any content is left over is checked separately through the :rest field.
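
A small sketch of that separate check: since a head match and a full match both report :ok? true, a caller that needs the whole input consumed has to look at :rest as well. The helper name full-match? is hypothetical (it is not part of lilac-parser), and the maps below are copied from the results shown above, assuming :rest is nil when nothing is left over.

```clojure
; full-match? is a hypothetical helper, not a lilac-parser function.
; It treats a result as a full match only when :ok? is true and :rest is nil.
(defn full-match? [result]
  (and (:ok? result) (nil? (:rest result))))

(full-match? {:ok? true, :rest nil})     ; true
(full-match? {:ok? true, :rest '("y")})  ; false, "y" was left over
(full-match? {:ok? false})               ; false
```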

Exact matching is the easy case. Next is matching one character out of several choices:

(parse-lilac "x" (one-of+ "xyz"))  ; {:ok? true}
(parse-lilac "y" (one-of+ "xyz"))  ; {:ok? true}
(parse-lilac "z" (one-of+ "xyz"))  ; {:ok? true}
(parse-lilac "w" (one-of+ "xyz"))  ; {:ok? false}
(parse-lilac "xy" (one-of+ "xyz")) ; {:ok? true, :rest ("y")}

Conversely, there is a rule for exclusion:

(parse-lilac "x" (other-than+ "abc"))  ; {:ok? true, :rest nil}
(parse-lilac "xy" (other-than+ "abc")) ; {:ok? true, :rest ("y")}
(parse-lilac "a" (other-than+ "abc"))  ; {:ok? false}

On top of these, some logic can be added. optional+ marks a rule whose match may be absent.
Naturally, when the match is absent the result is still true:

(parse-lilac "x" (optional+ (is+ "x"))) ; {:ok? true, :rest nil}
(parse-lilac "" (optional+ (is+ "x"))) ; {:ok? true, :rest nil}
(parse-lilac "x" (optional+ (is+ "y"))) ; {:ok? true, :rest ("x")}

A rule can also be matched repeatedly. many+ matches one or more times (the exact count cannot be controlled at present):

(parse-lilac "x" (many+ (is+ "x")))
(parse-lilac "xx" (many+ (is+ "x")))
(parse-lilac "xxx" (many+ (is+ "x")))
(parse-lilac "xxxy" (many+ (is+ "x")))

If zero matches are allowed, the rule to use is some+ rather than many+:

(parse-lilac "" (some+ (is+ "x")))
(parse-lilac "x" (some+ (is+ "x")))
(parse-lilac "xx" (some+ (is+ "x")))
(parse-lilac "xxy" (some+ (is+ "x")))
(parse-lilac "y" (some+ (is+ "x")))
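
The distinction between many+ and some+ mirrors the + and * quantifiers of regular expressions. As a point of comparison only, the following uses Clojure's built-in re-find rather than lilac-parser: one-or-more fails on input without a leading x, while zero-or-more succeeds with an empty match.

```clojure
; regex analogy: + needs at least one match, * also accepts zero
(re-find #"^x+" "xxy")  ; "xx"
(re-find #"^x+" "y")    ; nil
(re-find #"^x*" "xxy")  ; "xx"
(re-find #"^x*" "y")    ; ""
```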

Similarly, an or rule can be written:

(parse-lilac "x" (or+ [(is+ "x") (is+ "y")]))
(parse-lilac "y" (or+ [(is+ "x") (is+ "y")]))
(parse-lilac "z" (or+ [(is+ "x") (is+ "y")]))

combine+ chains multiple rules in sequence:

(parse-lilac "xy" (combine+ [(is+ "x") (is+ "y")]))  ; {:ok? true, :rest nil}
(parse-lilac "xyz" (combine+ [(is+ "x") (is+ "y")])) ; {:ok? true, :rest ("z")}
(parse-lilac "xy" (combine+ [(is+ "y") (is+ "x")]))  ; {:ok? false}

interleave+ repeats two rules alternately.
It is mostly used for things like comma-separated expressions:

(parse-lilac "xy" (interleave+ (is+ "x") (is+ "y")))
(parse-lilac "xyx" (interleave+ (is+ "x") (is+ "y")))
(parse-lilac "xyxy" (interleave+ (is+ "x") (is+ "y")))
(parse-lilac "yxy" (interleave+ (is+ "x") (is+ "y")))

In addition, the current code provides several built-in rules for letters, digits, and Chinese characters:

(parse-lilac "a" lilac-alphabet)
(parse-lilac "A" lilac-alphabet)
(parse-lilac "." lilac-alphabet) ; {:ok? false}

(parse-lilac "1" lilac-digit)
(parse-lilac "a" lilac-digit) ; {:ok? false}

(parse-lilac "汉" lilac-chinese-char)
(parse-lilac "E" lilac-chinese-char)  ; {:ok? false}
(parse-lilac "," lilac-chinese-char)  ; {:ok? false}
(parse-lilac "，" lilac-chinese-char) ; {:ok? false}

Some special characters can only be specified through a Unicode range:

(parse-lilac "a" (unicode-range+ 97 122))
(parse-lilac "z" (unicode-range+ 97 122))
(parse-lilac "A" (unicode-range+ 97 122))

With these rules, you can combine them to emulate regular expression features, such as finding how many matches there are:

(find-lilac "write cumulo and respo" (or+ [(is+ "cumulo") (is+ "respo")]))
; find 2
(find-lilac "write cumulo and phlox" (or+ [(is+ "cumulo") (is+ "respo")]))
; find 1
(find-lilac "write cumulo and phlox" (or+ [(is+ "cirru") (is+ "respo")]))
; find 0
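
For comparison, the regular expression side of these counts can be sketched with Clojure's built-in re-seq. This only illustrates the regex counterpart and does not use lilac-parser.

```clojure
; count the non-overlapping matches of either word
(count (re-seq #"cumulo|respo" "write cumulo and respo"))  ; 2
(count (re-seq #"cumulo|respo" "write cumulo and phlox"))  ; 1
(count (re-seq #"cirru|respo" "write cumulo and phlox"))   ; 0
```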

Or replace matches in a string directly, much like a regular expression replace:

(replace-lilac "cumulo project" (or+ [(is+ "cumulo") (is+ "respo")]) (fn [x] "my"))
; "my project"
(replace-lilac "respo project" (or+ [(is+ "cumulo") (is+ "respo")]) (fn [x] "my"))
; "my project"
(replace-lilac "phlox project" (or+ [(is+ "cumulo") (is+ "respo")]) (fn [x] "my"))
; "phlox project"
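
The regex counterpart, for comparison, is clojure.string/replace, which also accepts a function of the matched text as the replacement. This is a sketch of the regex side only, not of lilac-parser.

```clojure
(require '[clojure.string :as string])

; the replacement is a function of the matched text
(string/replace "cumulo project" #"cumulo|respo" (fn [x] "my"))  ; "my project"
(string/replace "respo project" #"cumulo|respo" (fn [x] "my"))   ; "my project"
(string/replace "phlox project" #"cumulo|respo" (fn [x] "my"))   ; "phlox project"
```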

As you can see, this approach works by composition. It is longer than a regular expression, but it lets you define variables and build abstractions.
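
The point about variables and abstraction can be illustrated without the library. The following is a toy sketch, not lilac-parser's implementation: rules are plain functions, so they can be named, stored in vars, and passed to other rules. The names toy-is, toy-or, and toy-combine are hypothetical, the input is assumed to be a sequence of one-character strings, and results are reduced to :ok? and :rest.

```clojure
; a rule is a function: seq of one-character strings -> result map
(defn toy-is [target]
  (let [expected (map str target)
        n (count expected)]
    (fn [chars]
      (if (= expected (take n chars))
        {:ok? true, :rest (seq (drop n chars))}
        {:ok? false}))))

; succeeds with the result of the first rule that matches
(defn toy-or [rules]
  (fn [chars]
    (or (first (filter :ok? (map (fn [rule] (rule chars)) rules)))
        {:ok? false})))

; runs the rules one after another, each on what the previous one left over
(defn toy-combine [rules]
  (fn [chars]
    (reduce (fn [result rule]
              (if (:ok? result) (rule (:rest result)) (reduced result)))
            {:ok? true, :rest (seq chars)}
            rules)))

; rules are ordinary values, so they can be named and reused
(def project-name (toy-or [(toy-is "cumulo") (toy-is "respo")]))
(def project-phrase (toy-combine [project-name (toy-is " project")]))

(project-phrase (map str "respo project"))  ; {:ok? true, :rest nil}
(project-phrase (map str "phlox project"))  ; {:ok? false}
```

A regular expression would need string concatenation to reuse project-name inside a larger pattern. Here it is passed along like any other value.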

A simple example may not show the benefit. It may just look longer and perform worse.
In my project there is a simple JSON parsing example, which cannot be done with regular expressions.
The code is as follows:

; match true or false and return a boolean
(def boolean-parser
  (label+ "boolean" (or+ [(is+ "true") (is+ "false")] (fn [x] (if (= x "true") true false)))))

(def space-parser (label+ "space" (some+ (is+ " ") (fn [x] nil))))

; a parser for a comma surrounded by optional whitespace. label+ is just an annotation and can be ignored
(def comma-parser
  (label+ "comma" (combine+ [space-parser (is+ ",") space-parser] (fn [x] nil))))

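; string/join assumes clojure.string is required as string, e.g. (:require [clojure.string :as string])
(def digits-parser (many+ (one-of+ "0123456789") (fn [xs] (string/join "" xs))))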

; for simplicity, null and undefined both return nil directly
(def nil-parser (label+ "nil" (or+ [(is+ "null") (is+ "undefined")] (fn [x] nil))))

; for numbers, account for an optional minus sign in front and an optional decimal part after
; I'm being lazy here and do not handle scientific notation
(def number-parser
  (label+
   "number"
   (combine+
    ; the negative sign, optional
    [(optional+ (is+ "-"))
     digits-parser
     ; the decimal part, combined and optional
     (optional+ (combine+ [(is+ ".") digits-parser] (fn [xs] (string/join "" xs))))]
    (fn [xs] (js/Number (string/join "" xs))))))

(def string-parser
  (label+
   "string"
   (combine+
    ; a string starts and ends with a quotation mark
    [(is+ "\"")
     ; in between: characters other than a quote or backslash, or an escape sequence
     (some+ (or+ [(other-than+ "\"\\") (is+ "\\\"") (is+ "\\\\") (is+ "\\n")]))
     (is+ "\"")]
    (fn [xs] (string/join "" (nth xs 1))))))

(defparser
 value-parser+
 ()
 identity
 (or+
  [number-parser string-parser nil-parser boolean-parser (array-parser+) (object-parser+)]))

(defparser
 object-parser+
 ()
 identity
 (combine+
  [(is+ "{")
   (optional+
    ; objects are more complex. The key part is interleave+, the curly braces are only handled on the outside
    (interleave+
     (combine+
      [string-parser space-parser (is+ ":") space-parser (value-parser+)]
      (fn [xs] [(nth xs 0) (nth xs 4)]))
     comma-parser
     (fn [xs] (take-nth 2 xs))))
   (is+ "}")]
  (fn [xs] (into {} (nth xs 1)))))

(defparser
 array-parser+
 ()
 (fn [x] (vec (first (nth x 1))))
 (combine+
  [(is+ "[")
   ; arrays are also handled with interleave+
   (some+ (interleave+ (value-parser+) comma-parser (fn [xs] (take-nth 2 xs))))
   (is+ "]")]))

As you can see, when rules are built with lilac-parser, it is fairly easy to produce a JSON parser.
Although the supported rules are relatively simple and the performance is not ideal, the code is much more readable than regular expressions.
I believe the idea can be applied in many text processing scenarios.
It could also serve as the basis for a simplified version to use directly in JavaScript instead of regular expressions.