A crawler expert gave me the regular expression summary he put together!

Time:2021-4-14

Author: Xiao Fu Ge
Blog:https://bugstack.cn

Accumulate, share, and grow, so that both you and others can benefit!

1、 Preface

Programming always produces results in practice!

A regular expression, often abbreviated as regex, regexp, or RE in code, is a concept from computer science. Regular expressions are usually used to search for and replace text that conforms to a certain pattern (rule).

Regex engines fall into two categories: DFA and NFA. Both have a long history (more than 20 years by now) and have spawned many variants. POSIX was introduced to rein in unnecessary variants. As a result, mainstream regex engines are grouped into three kinds: DFA, traditional NFA, and POSIX NFA.

Regular expressions are an interesting technology, but in day-to-day programming it is easy to forget how the symbols are used. This article summarizes them so that it can serve as a quick reference whenever you need to handle something with regular expressions.

2、 Rules

1. Common symbols

  • x  The character x
  • \\  The backslash character
  • \0n  The character with octal value 0n (0 <= n <= 7)
  • \0nn  The character with octal value 0nn (0 <= n <= 7)
  • \0mnn  The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
  • \xhh  The character with hexadecimal value 0xhh
  • \uhhhh  The character with hexadecimal value 0xhhhh
  • \t  The tab character ('\u0009')
  • \n  The newline (line feed) character ('\u000A')
  • \r  The carriage-return character ('\u000D')
  • \f  The form-feed character ('\u000C')
  • \a  The alert (bell) character ('\u0007')
  • \e  The escape character ('\u001B')
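As a quick check of the escapes above, here is a minimal sketch that matches the letter A through its octal, hexadecimal, and Unicode escapes. The class name EscapeDemo and the helper matchesChar are illustrative and not part of the original article. Each regex backslash has to be doubled inside a Java string literal.

```java
import java.util.regex.Pattern;

public class EscapeDemo {
    // Hypothetical helper: true when the regex matches the whole input.
    static boolean matchesChar(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matchesChar("\\0101", "A"));  // octal 101 is 'A'
        System.out.println(matchesChar("\\x41", "A"));   // hexadecimal 41 is 'A'
        System.out.println(matchesChar("\\u0041", "A")); // same code point, four hex digits
        System.out.println(matchesChar("\\t", "\t"));    // tab escape
        System.out.println(matchesChar("\\\\", "\\"));   // escaped backslash
    }
}
```

All five calls print true.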

2. Character classes

  • [abc]  a, b, or c (simple class)
  • [^abc]  Any character except a, b, or c (negation)
  • [a-zA-Z]  a through z or A through Z, inclusive (range)
  • [a-d[m-p]]  a through d, or m through p: [a-dm-p] (union)
  • [a-z&&[def]]  d, e, or f (intersection)
  • [a-z&&[^bc]]  a through z, except for b and c: [ad-z] (subtraction)
  • [a-z&&[^m-p]]  a through z, and not m through p: [a-lq-z] (subtraction)
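The union, intersection, and subtraction forms can be tried out with a short sketch like the one below. The class name CharClassDemo and the helper inClass are assumptions made for illustration.

```java
import java.util.regex.Pattern;

public class CharClassDemo {
    // Hypothetical helper: true when the one-character input belongs to the class.
    static boolean inClass(String charClass, String input) {
        return Pattern.matches(charClass, input);
    }

    public static void main(String[] args) {
        System.out.println(inClass("[a-d[m-p]]", "n"));    // union: true
        System.out.println(inClass("[a-z&&[def]]", "e"));  // intersection: true
        System.out.println(inClass("[a-z&&[^bc]]", "b"));  // subtraction: false
        System.out.println(inClass("[a-z&&[^m-p]]", "q")); // subtraction: true
    }
}
```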

3. Predefined character classes

  • .  Any character (may or may not match line terminators)
  • \d  A digit: [0-9]
  • \D  A non-digit: [^0-9]
  • \s  A whitespace character: [ \t\n\x0B\f\r]
  • \S  A non-whitespace character: [^\s]
  • \w  A word character: [a-zA-Z_0-9]
  • \W  A non-word character: [^\w]
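One way to see what each predefined class covers is to delete everything it matches from a sample string. The sketch below does that; the class name PredefinedDemo, the helper strip, and the sample text are made up for illustration.

```java
public class PredefinedDemo {
    // Hypothetical helper: removes every character matched by the regex.
    static String strip(String input, String regex) {
        return input.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        String s = "id_42 ok\t!";
        System.out.println(strip(s, "\\D")); // keeps digits only: 42
        System.out.println(strip(s, "\\s")); // drops space and tab: id_42ok!
        System.out.println(strip(s, "\\W")); // keeps word characters: id_42ok
    }
}
```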

4. POSIX character classes (US-ASCII only)

  • \p{Lower}  A lowercase alphabetic character: [a-z]
  • \p{Upper}  An uppercase alphabetic character: [A-Z]
  • \p{ASCII}  All ASCII: [\x00-\x7F]
  • \p{Alpha}  An alphabetic character: [\p{Lower}\p{Upper}]
  • \p{Digit}  A decimal digit: [0-9]
  • \p{Alnum}  An alphanumeric character: [\p{Alpha}\p{Digit}]
  • \p{Punct}  Punctuation: one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
  • \p{Graph}  A visible character: [\p{Alnum}\p{Punct}]
  • \p{Print}  A printable character: [\p{Graph}\x20]
  • \p{Blank}  A space or a tab: [ \t]
  • \p{Cntrl}  A control character: [\x00-\x1F\x7F]
  • \p{XDigit}  A hexadecimal digit: [0-9a-fA-F]
  • \p{Space}  A whitespace character: [ \t\n\x0B\f\r]
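The following sketch exercises a few POSIX classes and shows that, by default, they only cover US-ASCII; the Pattern.UNICODE_CHARACTER_CLASS flag widens them to Unicode. The class name PosixDemo and the helper isLower are illustrative assumptions.

```java
import java.util.regex.Pattern;

public class PosixDemo {
    // Hypothetical helper: checks \p{Lower}+ with or without Unicode semantics.
    static boolean isLower(String s, boolean unicode) {
        int flags = unicode ? Pattern.UNICODE_CHARACTER_CLASS : 0;
        return Pattern.compile("\\p{Lower}+", flags).matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(Pattern.matches("\\p{Alpha}+", "Regex"));   // true
        System.out.println(Pattern.matches("\\p{Punct}+", "!?;"));     // true
        System.out.println(Pattern.matches("\\p{XDigit}+", "CAFE01")); // true
        System.out.println(isLower("\u00e9", false)); // e with acute accent, ASCII-only: false
        System.out.println(isLower("\u00e9", true));  // Unicode-aware: true
    }
}
```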

5. java.lang.Character classes

  • \p{javaLowerCase}  Equivalent to java.lang.Character.isLowerCase()
  • \p{javaUpperCase}  Equivalent to java.lang.Character.isUpperCase()
  • \p{javaWhitespace}  Equivalent to java.lang.Character.isWhitespace()
  • \p{javaMirrored}  Equivalent to java.lang.Character.isMirrored()

6. Classes for Unicode blocks and categories

  • \p{InGreek}  A character in the Greek block (simple block)
  • \p{Lu}  An uppercase letter (simple category)
  • \p{Sc}  A currency symbol
  • \P{InGreek}  Any character except one in the Greek block (negation)
  • [\p{L}&&[^\p{Lu}]]  Any letter except an uppercase letter (subtraction)
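A short sketch of the block and category classes follows. The class name UnicodeDemo and the helper is are illustrative; the Greek letter alpha is written with its code point escape so the source file stays ASCII.

```java
import java.util.regex.Pattern;

public class UnicodeDemo {
    // Hypothetical helper: true when the regex matches the whole input.
    static boolean is(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(is("\\p{InGreek}", "\u03b1"));      // Greek alpha: true
        System.out.println(is("\\P{InGreek}", "a"));           // not in the Greek block: true
        System.out.println(is("\\p{Lu}", "A"));                // uppercase letter: true
        System.out.println(is("\\p{Sc}", "$"));                // currency symbol: true
        System.out.println(is("[\\p{L}&&[^\\p{Lu}]]", "a"));   // letter, not uppercase: true
        System.out.println(is("[\\p{L}&&[^\\p{Lu}]]", "A"));   // uppercase is excluded: false
    }
}
```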

7. Boundary matchers

  • ^  The beginning of a line
  • $  The end of a line
  • \b  A word boundary
  • \B  A non-word boundary
  • \A  The beginning of the input
  • \G  The end of the previous match
  • \Z  The end of the input but for the final terminator, if any
  • \z  The end of the input
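The difference between \b, \Z, and \z is easiest to see on small inputs. The sketch below is only an illustration; the class name BoundaryDemo and the helper firstIndex are assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundaryDemo {
    // Hypothetical helper: start index of the first match, or -1 if none.
    static int firstIndex(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        System.out.println(firstIndex("cat", "concat cat"));       // 3, inside "concat"
        System.out.println(firstIndex("\\bcat\\b", "concat cat")); // 7, the whole word only
        System.out.println(firstIndex("end\\Z", "the end\n"));     // 4, final terminator is ignored
        System.out.println(firstIndex("end\\z", "the end\n"));     // -1, input ends with a newline
    }
}
```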

8. Greedy quantifiers

  • X?  X, once or not at all
  • X*  X, zero or more times
  • X+  X, one or more times
  • X{n}  X, exactly n times
  • X{n,}  X, at least n times
  • X{n,m}  X, at least n but not more than m times

9. Reluctant quantifiers

  • X??  X, once or not at all
  • X*?  X, zero or more times
  • X+?  X, one or more times
  • X{n}?  X, exactly n times
  • X{n,}?  X, at least n times
  • X{n,m}?  X, at least n but not more than m times

10. Possessive quantifiers

  • X?+  X, once or not at all
  • X*+  X, zero or more times
  • X++  X, one or more times
  • X{n}+  X, exactly n times
  • X{n,}+  X, at least n times
  • X{n,m}+  X, at least n but not more than m times
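The three quantifier families accept the same repetition counts and differ only in how they give characters back. A minimal sketch on one input, with the illustrative class name QuantifierDemo and helper firstMatch:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierDemo {
    // Hypothetical helper: text of the first match, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        // Greedy: takes everything, then backtracks just enough to find "foo".
        System.out.println(firstMatch(".*foo", input));  // xfooxxxxxxfoo
        // Reluctant: takes as little as possible before "foo".
        System.out.println(firstMatch(".*?foo", input)); // xfoo
        // Possessive: takes everything and never gives characters back.
        System.out.println(firstMatch(".*+foo", input)); // null
    }
}
```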

11. Logical operators

  • XY  X followed by Y
  • X|Y  Either X or Y
  • (X)  X, as a capturing group

12. Back references

  • \n  Whatever the nth capturing group matched

13. Quotation

  • \  Nothing, but quotes the following character
  • \Q  Nothing, but quotes all characters until \E
  • \E  Nothing, but ends quoting started by \Q
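Quoting matters whenever the text to match contains metacharacters. The sketch below compares an unquoted pattern with a \Q...\E pattern and with Pattern.quote, which produces the same kind of quoting. The class name QuoteDemo and the sample text are illustrative.

```java
import java.util.regex.Pattern;

public class QuoteDemo {
    // Hypothetical helper: matches the input literally, whatever it contains.
    static boolean matchesLiterally(String literal, String input) {
        return Pattern.matches(Pattern.quote(literal), input);
    }

    public static void main(String[] args) {
        // Unquoted, "1+" means "one or more 1", so the literal text does not match.
        System.out.println(Pattern.matches("1+1=2", "1+1=2"));         // false
        System.out.println(Pattern.matches("1+1=2", "11=2"));          // true
        // Quoted with \Q...\E, every character is taken literally.
        System.out.println(Pattern.matches("\\Q1+1=2\\E", "1+1=2"));   // true
        System.out.println(matchesLiterally("1+1=2", "1+1=2"));        // true
    }
}
```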

14. Special constructs (non-capturing)

  • (?:X)  X, as a non-capturing group
  • (?idmsux-idmsux)  Nothing, but turns match flags i d m s u x on - off
  • (?idmsux-idmsux:X)  X, as a non-capturing group with the given flags i d m s u x on - off
  • (?=X)  X, via zero-width positive lookahead
  • (?!X)  X, via zero-width negative lookahead
  • (?<=X)  X, via zero-width positive lookbehind
  • (?<!X)  X, via zero-width negative lookbehind
  • (?>X)  X, as an independent, non-capturing group
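Lookahead and lookbehind check the surrounding text without including it in the match. The following sketch pulls numbers out of a sentence depending on what stands next to them. The class name LookaroundDemo, the helper firstMatch, and the sample sentences are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaroundDemo {
    // Hypothetical helper: text of the first match, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "cost: $42, tax: 7";
        // Positive lookbehind: digits that come right after a dollar sign.
        System.out.println(firstMatch("(?<=\\$)\\d+", text));      // 42
        // Negative lookbehind: a whole number not preceded by a dollar sign.
        System.out.println(firstMatch("(?<!\\$)\\b\\d+", text));   // 7
        // Positive lookahead: digits that are followed by a percent sign.
        System.out.println(firstMatch("\\d+(?=%)", "up 15% from 120")); // 15
    }
}
```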

3、 Cases

1. Character matching

"a".matches(".")
  • Result: true
  • Description: . matches any single character

"a".matches("[abc]")
  • Result: true
  • Description: [abc] matches any one of a, b, or c; a character class matches exactly one character by default

"a".matches("[^abc]")
  • Result: false
  • Description: [^abc] matches any character except a, b, or c (negation)

"A".matches("[a-zA-Z]")
  • Result: true
  • Description: a through z or A through Z, inclusive (range)

"A".matches("[a-z]|[A-Z]")
  • Result: true
  • Description: the same range as above, written as an alternation of two classes

"A".matches("[a-z(A-Z)]")
  • Result: true
  • Description: the matched letters are the same, a-z and A-Z; note that inside a character class the parentheses are literal ( and ) characters, not a capture group

"R".matches("[A-Z&&[RFG]]")
  • Result: true
  • Description: matches the intersection of A-Z and [RFG]

"a_8".matches("\\w{3}")
  • Result: true
  • Description: \w is a word character, equivalent to [a-zA-Z_0-9]; {3} matches it exactly three times

"\\".matches("\\\\")
  • Result: true
  • Description: matches a single backslash \; in Java source the backslash must be escaped both in the string and in the regex

"hello sir".matches("h.*")
  • Result: true
  • Description: . is any character, and * matches it zero or more times

"hello sir".matches(".*ir$")
  • Result: true
  • Description: .* matches any characters, and ir$ requires the line to end with ir

"hello sir".matches("^h[a-z]{1,3}o\\b.*")
  • Result: true
  • Description: ^h matches h at the beginning, [a-z]{1,3}o matches one to three letters a-z followed by the letter o, and \b is a word boundary. \b does not consume a separator character; it matches only a position, here the position right after o.

"hellosir".matches("^h[a-z]{1,3}o\\b.*")
  • Result: false
  • Description: o is followed by the letter s rather than a space, so there is no word boundary after o and \b cannot match.

" \n".matches("^[\\s&&[^\\n]]*\\n$")
  • Result: true
  • Description: ^[\\s&&[^\\n]]* matches leading whitespace that is not a newline, and \\n$ requires the last character to be a newline

System.out.println("java".matches("(?i)JAVA"));
  • Result: true
  • Description: (?i) is an embedded flag rather than a capture group; it turns on case-insensitive matching

2. Pattern matching

2.1 verification matching

// requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
Pattern p = Pattern.compile("[a-z]{3,}");
Matcher m = p.matcher("fgha");
System.out.println(m.matches()); // true: the whole input is three or more lowercase letters
  • Result: true
  • Description: Pattern works together with Matcher. The Matcher class adds group support and repeated matching on top of a compiled pattern. Used on its own, Pattern offers Pattern.matches(String regex, CharSequence input), the most basic and simplest form of matching.
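When the same expression is checked against many inputs, compiling it once and reusing the Pattern avoids recompiling on every call. The sketch below illustrates this under the assumption of a hypothetical validator class, CompileOnceDemo, that does not appear in the original.

```java
import java.util.regex.Pattern;

public class CompileOnceDemo {
    // Compiled once and reused; Pattern instances are immutable and thread-safe.
    private static final Pattern LOWER_3_PLUS = Pattern.compile("[a-z]{3,}");

    // Hypothetical helper: true when the whole input is 3 or more lowercase letters.
    static boolean isValid(String input) {
        return LOWER_3_PLUS.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("fgha")); // true
        System.out.println(isValid("ab"));   // false, too short
        System.out.println(isValid("abc1")); // false, contains a digit
        // One-shot form: compiles the regex again on every call.
        System.out.println(Pattern.matches("[a-z]{3,}", "fgha")); // true
    }
}
```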

2.2 matching function

Pattern p = Pattern.compile("\\d{3,5}");
Matcher m = p.matcher("123-4536-89789-000");
System.out.println(m.matches());
m.reset(); // reset the matcher; matches() has already consumed input, so without this the find() calls below may not start from the beginning
System.out.println(m.find());
System.out.println(m.start() + "-" + m.end()); // start and end index of the match (only valid after a successful find())
System.out.println(m.find());
System.out.println(m.start() + "-" + m.end()); // start and end index of the second match
System.out.println(m.find());
System.out.println(m.start() + "-" + m.end()); // start and end index of the third match
System.out.println(m.find());
System.out.println(m.lookingAt()); // lookingAt() always starts matching at the beginning of the input

Test result:

false
true
0-3
true
4-8
true
9-14
true
true
  • m.matches() tries to match the entire input against the pattern
  • m.reset() resets the matcher so that matching starts again from the beginning; without it, the input already consumed by matches() may be skipped by the following find() calls
  • m.find() looks for the next match
  • m.start() returns the start index of the matched substring
  • m.end() returns the index just after the end of the matched substring
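The repeated find(), start(), and end() calls above are usually written as a loop. A minimal sketch, with the illustrative class name FindSpansDemo and helper spans, that collects every match position:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindSpansDemo {
    // Hypothetical helper: "start-end" for every match, in order.
    static List<String> spans(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start() + "-" + m.end());
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints [0-3, 4-8, 9-14, 15-18]
        System.out.println(spans("\\d{3,5}", "123-4536-89789-000"));
    }
}
```

The fourth span, 15-18, corresponds to the final find() call in the snippet above, which returns true but whose position is not printed there.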

2.3 matching common substitution

Pattern p = Pattern.compile("java", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher("java_Java_jAva_jAVa_IloveJava");
System.out.println(m.replaceAll("JAVA"));
  • Result: JAVA_JAVA_JAVA_JAVA_IloveJAVA
  • Description: every occurrence of java, regardless of case, is replaced with uppercase JAVA

2.4 matching logic replacement

Pattern p = Pattern.compile("java", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher("java_Java_jAva_jAVa_IloveJava fdasfas");
StringBuffer sb = new StringBuffer();
int i = 0;
while (m.find()) {
    i++;
    if (i % 2 == 0) {
        m.appendReplacement(sb, "java"); // even-numbered match
    } else {
        m.appendReplacement(sb, "JAVA"); // odd-numbered match
    }
}
m.appendTail(sb); // copy the text after the last match
System.out.println(sb);
  • Result: JAVA_java_JAVA_java_IloveJAVA fdasfas
  • Description: the program logic i % 2 decides the replacement, so odd-numbered and even-numbered matches are replaced differently
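On Java 9 or later, the same odd/even logic can also be sketched with Matcher.replaceAll and a replacement function, as an alternative to the appendReplacement loop. This is not the author's method; the class name AlternateReplaceDemo and the helper alternate are illustrative.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.regex.Pattern;

public class AlternateReplaceDemo {
    // Hypothetical helper: odd matches become "JAVA", even matches become "java".
    static String alternate(String input) {
        Pattern p = Pattern.compile("java", Pattern.CASE_INSENSITIVE);
        AtomicInteger count = new AtomicInteger();
        // Matcher.replaceAll(Function) is available since Java 9.
        return p.matcher(input).replaceAll(
                r -> count.incrementAndGet() % 2 == 0 ? "java" : "JAVA");
    }

    public static void main(String[] args) {
        // Prints JAVA_java_JAVA_java_IloveJAVA fdasfas
        System.out.println(alternate("java_Java_jAva_jAVa_IloveJava fdasfas"));
    }
}
```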

2.4 grouping matching

Pattern p = Pattern.compile("(\\d{3,5})([a-z]{2})");
Matcher m = p.matcher("123bb_78987dd_090po");
while (m.find()) {
    System.out.println(m.group(1));
}
  • Result:

    123
    78987
    090
  • Description: parentheses define groups, and here only the digit group is printed. group(0) is the entire match, group(1) is the first parenthesis from the left, and group(2) is the second parenthesis from the left
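Groups can also be given names with (?<name>X), which reads better than counting parentheses. The sketch below is a variation on the example above, not part of the original; the group names num and tag and the class name NamedGroupDemo are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NamedGroupDemo {
    // Hypothetical helper: "digits:letters" for every match.
    static List<String> pairs(String input) {
        Pattern p = Pattern.compile("(?<num>\\d{3,5})(?<tag>[a-z]{2})");
        Matcher m = p.matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            // group("num") is the same text as group(1), group("tag") as group(2).
            out.add(m.group("num") + ":" + m.group("tag"));
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints [123:bb, 78987:dd, 090:po]
        System.out.println(pairs("123bb_78987dd_090po"));
    }
}
```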

2.5 greedy matching and non-greedy matching

Pattern p = Pattern.compile("(.{3,10}?)[0-9]");
Matcher m = p.matcher("aaaa5dddd8");
while (m.find()) {
    System.out.println(m.start() + "-" + m.end());
}
  • Result:

    0-5
    5-10
  • Description: without a question mark after {3,10}, the quantifier is greedy and matches as much as possible. With {3,10}?, the quantifier is reluctant (lazy) and matches as little as possible, starting from 3 characters. If the while loop is replaced with a single if (m.find()) { ... }, only the first match is printed.
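To make the contrast concrete, the sketch below runs the greedy and the reluctant form of the same pattern on the same input. The class name GreedyLazyDemo and the helper spans are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyLazyDemo {
    // Hypothetical helper: "start-end" for every match, in order.
    static List<String> spans(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start() + "-" + m.end());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "aaaa5dddd8";
        // Greedy: one long match that ends at the last digit. Prints [0-10]
        System.out.println(spans("(.{3,10})[0-9]", input));
        // Reluctant: stops at the first digit it can reach. Prints [0-5, 5-10]
        System.out.println(spans("(.{3,10}?)[0-9]", input));
    }
}
```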

2.6 common capture

Pattern p = Pattern.compile(".{3}");
Matcher m = p.matcher("ab4dd5");
while (m.find()) {
    System.out.println(m.group());
}
  • Result:

    ab4
    dd5
  • Description: matches three arbitrary characters at a time and prints each match with m.group().

2.7 non-capture group (?=a)

Pattern p = Pattern.compile(".{3}(?=a)");
Matcher m = p.matcher("ab4add5");
while (m.find()) {
    System.out.println("followed by a: " + m.group());
}
  • Result: followed by a: ab4
  • Description: (?=a) is a zero-width positive lookahead, a non-capturing construct: the three characters must be followed by a, but that a is not part of the match. Writing (?=a) at the front of the pattern would behave differently.

Pattern p = Pattern.compile("(?!a).{3}");
Matcher m = p.matcher("abbsab89");
while (m.find()) {
    System.out.println("does not start with a: " + m.group());
}
  • Results: does not start with a: bbs, does not start with a: b89
  • Description: (?!a) is a negative lookahead placed at the front, so a match cannot begin with a; bbs and b89 are found in the string.

2.8 matching content between > and <

Pattern p = Pattern.compile("(?!>).+(?=<)");
Matcher m = p.matcher(">xiaofuge<");
while (m.find()) {
    System.out.println(m.group());
}
  • Result: xiaofuge
  • Description: this kind of pattern can be used to pull the content out of tag-like strings in a web page.
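For input with several tags, a tighter variant is sketched below: a lookbehind for > and a character class that excludes angle brackets, so each match stays inside one tag pair. This is a swapped-in alternative rather than the author's pattern, and the class name BetweenTagsDemo, the helper contents, and the sample markup are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BetweenTagsDemo {
    // Hypothetical helper: every run of text that sits between '>' and '<'.
    static List<String> contents(String input) {
        // (?<=>) requires '>' just before; [^<>]+ cannot cross a tag; (?=<) requires '<' after.
        Matcher m = Pattern.compile("(?<=>)[^<>]+(?=<)").matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(contents("<b>xiaofuge</b>"));      // [xiaofuge]
        System.out.println(contents("<li>a</li><li>b</li>")); // [a, b]
    }
}
```

Regular expressions like this are fine for small, predictable fragments; for full HTML documents a real parser is generally the safer choice.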

2.9 back reference

Pattern p = Pattern.compile("(\\d\\d)\\1");
Matcher m = p.matcher("1212");
System.out.println(m.matches());
  • Result: true
  • Description: \1 is a back reference to the first group. The group first matches 12, and \1 then requires the same text, 12, to appear again, so the result is true
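A common practical use of back references is spotting accidentally repeated words. The sketch below is an illustration only; the class name BackReferenceDemo, the helper repeatedWords, and the sample sentence are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BackReferenceDemo {
    // Hypothetical helper: words that appear twice in a row.
    static List<String> repeatedWords(String input) {
        // (\w+) captures a word; \1 requires the same word again after whitespace.
        Matcher m = Pattern.compile("\\b(\\w+)\\s+\\1\\b").matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints [is, test]
        System.out.println(repeatedWords("this is is a test test"));
    }
}
```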

4、 Summary

Regular expressions involve many symbols, character types, matching ranges, match counts, and matching principles, such as greedy matching, negation, and back references. None of this is difficult in practice: as long as you follow the regex rules, you can combine them to match and extract exactly the string content you need.
