PHP real regular expression (2): extracting HTML elements

Time:2021-1-25

This article shows how to extract HTML elements with regular expressions. Along the way it introduces pattern modifiers, greedy and non-greedy matching, Unicode mode, lookaround, and related topics.
Before reading this article, it is best to read the previous article in this series, PHP real regular expression (1): verifying mobile phone number, carefully first.

Basic extraction

Suppose we have the following table:

User name    | Occupation
Kobe Bryant  | Basketball player
Jay Chou     | Singer, songwriter, producer, actor, director
Lionel Messi | Football player

Its source code is as follows:

<table>
  <thead>
    <tr><th>user name</th><th>occupation</th></tr>
  </thead>
  <tbody>
    <tr>
      <td>Kobe Bryant</td><td>basketball player</td>
    </tr>
    <tr>
      <td>Jay Chou</td><td>singer, songwriter, producer, actor, director</td>
    </tr>
    <tr>
      <td>Lionel Messi</td><td>football player</td>
    </tr>
  </tbody>
</table>

Now let's extract the first <tr> element inside <tbody>. The simplest regular expression is as follows:

<tbody>\s+<tr>.*<\/tr>

where:

  • \s is a shorthand character class, introduced in PHP real regular expression (1): verifying mobile phone number, that matches whitespace characters such as carriage return, newline, space, and tab.

  • The quantifier + means that the character or group it modifies occurs one or more times.

  • The dot . is a special metacharacter in regular expressions that matches "any character".

  • The slash / in the closing tag </tr> is escaped because the slash is used as the pattern delimiter in PHP regular expressions here, so it must be escaped to stand for a literal slash character.

In fact, however, this expression cannot extract the first <tr> element from the <tbody> above.
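
A minimal sketch of this failure, assuming the table source above is held in a PHP string (the variable names $html, $pattern, and $found are illustrative, not from the original article):

```php
<?php
// The sample table from this article, kept with its line breaks.
$html = <<<'HTML'
<table>
  <thead>
    <tr><th>user name</th><th>occupation</th></tr>
  </thead>
  <tbody>
    <tr>
      <td>Kobe Bryant</td><td>basketball player</td>
    </tr>
    <tr>
      <td>Jay Chou</td><td>singer, songwriter, producer, actor, director</td>
    </tr>
    <tr>
      <td>Lionel Messi</td><td>football player</td>
    </tr>
  </tbody>
</table>
HTML;

// No modifier: the dot cannot cross the newline that follows <tr>,
// so <\/tr> is never reached.
$pattern = '/<tbody>\s+<tr>.*<\/tr>/';
$found = preg_match($pattern, $html, $matches);

var_dump($found); // int(0): no match
```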

The main problem is that, by default, the dot . cannot match the newline \n. There are two ways to solve this:

  • Use the pattern modifier s. The regular expression becomes /<tbody>\s+<tr>.*<\/tr>/s or (?s)<tbody>\s+<tr>.*<\/tr>. The s modifier makes the dot . match newlines as well.

  • Replace the dot . with [\s\S], [\w\W], or [\d\D], each of which matches every character. The regular expression becomes <tbody>\s+<tr>[\s\S]*<\/tr>
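
The two fixes can be compared with a short sketch. It assumes the same sample table in a string named $html (an illustrative name); the inline form is wrapped in / delimiters because PHP requires them:

```php
<?php
$html = <<<'HTML'
<table>
  <thead>
    <tr><th>user name</th><th>occupation</th></tr>
  </thead>
  <tbody>
    <tr>
      <td>Kobe Bryant</td><td>basketball player</td>
    </tr>
    <tr>
      <td>Jay Chou</td><td>singer, songwriter, producer, actor, director</td>
    </tr>
    <tr>
      <td>Lionel Messi</td><td>football player</td>
    </tr>
  </tbody>
</table>
HTML;

$patterns = [
    's modifier'           => '/<tbody>\s+<tr>.*<\/tr>/s',
    'inline (?s)'          => '/(?s)<tbody>\s+<tr>.*<\/tr>/',
    '[\s\S] in place of .' => '/<tbody>\s+<tr>[\s\S]*<\/tr>/',
];

// Collect the matched text of each variant so they can be compared.
$results = [];
foreach ($patterns as $label => $pattern) {
    $found = preg_match($pattern, $html, $m);
    $results[$label] = $found === 1 ? $m[0] : null;
    echo $label, ': ', $found === 1 ? 'matched' : 'no match', "\n";
}
```

All three variants now match, and they match the same text.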

Pattern modifiers deserve a closer look here (the PHP manual lists all the pattern modifiers that PHP supports). Pattern modifiers change some of the default rules of regular expressions. Common ones include i, s, U, and u. We will not describe the function of each modifier here, since several of them are used later in this article. Instead, we compare the two ways of writing them: /.../{modifier} and ...(?{modifier})...

Form                       | /.../{modifier}                      | ...(?{modifier})...
Example                    | /<tr>.*<\/tr>/s                      | <tr>(?s).*<\/tr>
Name (PHP manual)          | pattern modifier                     | in-pattern modifier
Name (regex guide)         | predefined constant                  | pattern modifier
Scope                      | The entire regular expression        | Outside a group (subexpression), it applies to everything in the regular expression that follows it; inside a group, it applies to the rest of that group. With no grouping and placed at the very front of the regular expression, it is equivalent to /.../{modifier}
Support level              | All pattern modifiers are supported  | Only some pattern modifiers are supported
Other programming languages | May not be supported                | Generally supported
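
The scope rules in the table can be checked with a small sketch. It uses the i (case-insensitive) modifier on made-up subjects, because its effect is easy to see; the $cases array is purely illustrative:

```php
<?php
// Each entry is [pattern, subject].
$cases = [
    ['/ab/i',       'AB'],  // trailing modifier: whole pattern is case-insensitive
    ['/(?i)ab/',    'AB'],  // inline modifier at the very start: same effect
    ['/a(?i)b/',    'aB'],  // only what follows (?i) is case-insensitive
    ['/a(?i)b/',    'AB'],  // "a" precedes (?i), so it stays case-sensitive
    ['/(a(?i)b)c/', 'aBc'], // inside a group, (?i) lasts until the group closes
    ['/(a(?i)b)c/', 'aBC'], // "c" is outside the group: case-sensitive again
];

$results = [];
foreach ($cases as [$pattern, $subject]) {
    $result = preg_match($pattern, $subject);
    $results[] = $result;
    printf("%-14s %-4s %d\n", $pattern, $subject, $result);
}
// $results is [1, 1, 1, 0, 1, 0]
```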

Either corrected expression extracts three <tr> elements, not just one. This is because quantifiers in regular expressions are greedy by default. Here, .* first matches every character up to the end of the string and then backtracks until it reaches the last </tr> inside <tbody>, where <\/tr> in the regular expression can match and complete the overall match. The final result therefore contains all three <tr> elements.

You can use the pattern modifier U to make the entire regular expression non-greedy (under U, quantifiers are non-greedy by default and a trailing ? makes them greedy), or you can use a non-greedy quantifier to make a single quantifier non-greedy:

  • Make the entire regular expression non-greedy:

    • /<tbody>\s+<tr>.*<\/tr>/Us

    • or (?Us)<tbody>\s+<tr>.*<\/tr>

  • Use a non-greedy quantifier:
    /<tbody>\s+<tr>.*?<\/tr>/s
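
As a rough illustration, the sketch below runs the greedy, U-modifier, and non-greedy-quantifier variants against the sample table and counts how many closing </tr> tags each match contains. The variable names are illustrative:

```php
<?php
$html = <<<'HTML'
<table>
  <thead>
    <tr><th>user name</th><th>occupation</th></tr>
  </thead>
  <tbody>
    <tr>
      <td>Kobe Bryant</td><td>basketball player</td>
    </tr>
    <tr>
      <td>Jay Chou</td><td>singer, songwriter, producer, actor, director</td>
    </tr>
    <tr>
      <td>Lionel Messi</td><td>football player</td>
    </tr>
  </tbody>
</table>
HTML;

preg_match('/<tbody>\s+<tr>.*<\/tr>/s', $html, $greedy);    // default: greedy
preg_match('/<tbody>\s+<tr>.*<\/tr>/Us', $html, $ungreedy); // U flips the default
preg_match('/<tbody>\s+<tr>.*?<\/tr>/s', $html, $lazy);     // only .* is non-greedy

echo substr_count($greedy[0], '</tr>'), "\n";   // 3: runs to the last </tr>
echo substr_count($ungreedy[0], '</tr>'), "\n"; // 1: stops at the first </tr>
echo substr_count($lazy[0], '</tr>'), "\n";     // 1
```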

The complete list of greedy quantifiers (match-first quantifiers) and non-greedy quantifiers (ignore-first quantifiers) is shown in the following table:

Greedy quantifier | Non-greedy quantifier | Number of occurrences
*                 | *?                    | Zero or more times, no upper limit
+                 | +?                    | At least once, no upper limit
?                 | ??                    | 0 or 1 times
{m,n}             | {m,n}?                | At least m times and at most n times
{m,}              | {m,}?                 | At least m times, no upper limit
{0,n}             | {0,n}?                | 0 to n times
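
To make the difference concrete, here is a small sketch that applies each greedy quantifier and its non-greedy counterpart to the made-up subject "aaaa" (the subject and the concrete bounds 2 and 3 are illustrative choices):

```php
<?php
$subject = 'aaaa';

// Pairs of [greedy, non-greedy] patterns for the same quantifier.
$pairs = [
    ['a*', 'a*?'],
    ['a+', 'a+?'],
    ['a?', 'a??'],
    ['a{2,3}', 'a{2,3}?'],
    ['a{2,}', 'a{2,}?'],
    ['a{0,3}', 'a{0,3}?'],
];

$results = [];
foreach ($pairs as [$greedy, $lazy]) {
    preg_match('/' . $greedy . '/', $subject, $g);
    preg_match('/' . $lazy . '/', $subject, $l);
    $results[$greedy] = [$g[0], $l[0]];
    printf("%-7s => '%s'   %-8s => '%s'\n", $greedy, $g[0], $lazy, $l[0]);
}
```

The greedy form takes as many characters as its bounds allow, while the non-greedy form takes as few as its bounds allow, which can be the empty string.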

Extracting rows that contain specified content

Suppose we want to extract the rows about players. We might use a regular expression such as /<tr>.*player.*<\/tr>/s.

With a keyword written in Chinese characters, this expression matches in a Unicode environment but not in a GBK environment. We can use the pattern modifier u to specify Unicode mode, in which both the pattern and the subject string are treated as UTF-8:

/<tr>.*player.*<\/tr>/us

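In Unicode mode, we can even use code point values instead of the Chinese characters themselves. The following pattern searches for U+8FD0 U+52A8 U+5458, the Chinese word 运动员 ("athlete"):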

/<tr>.*\x{8fd0}\x{52a8}\x{5458}.*<\/tr>/us

The advantage of the \x{hex} notation in PHP regular expressions is that it can be combined with a character class to express a range. For example, [\x{4e00}-\x{9fff}] matches any character in the CJK Unified Ideographs block, which covers the commonly used Chinese characters.
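
A short sketch of Unicode mode follows. It assumes the script file is saved as UTF-8, and the sample strings are illustrative rather than taken from the table above:

```php
<?php
// Assumption: this file and its string literals are UTF-8 encoded.
$keyword = '运动员';

// Without u the dot matches one byte; with u it matches one character.
$bytes = preg_match_all('/./', $keyword);  // 9 (three bytes per character)
$chars = preg_match_all('/./u', $keyword); // 3

// Code point escapes stand in for the literal characters.
$byCodePoint = preg_match('/\x{8fd0}\x{52a8}\x{5458}/u', '篮球运动员');

// A code point range inside a character class picks out the ideographs.
preg_match_all('/[\x{4e00}-\x{9fff}]/u', 'PHP 正则 regex', $han);

var_dump($bytes, $chars, $byCodePoint);
print_r($han[0]);
```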

The expression above finds a match, but not the right one: it matches from the first <tr> of the entire string to the last </tr>.
Intuitively, we want the regular expression to find "player" first, then look left for the nearest <tr> and look right for the nearest </tr>. In fact, a regular expression matches from left to right, so the search starts at the first <tr>. The following table shows what each part of the regular expression matches (whitespace characters are not shown).

Expression | Matched text
/          |
<tr>       | <tr>
.*         | <th>user name</th><th>occupation</th></tr></thead><tbody><tr><td>Kobe Bryant</td><td>basketball player</td></tr><tr><td>Jay Chou</td><td>singer, songwriter, producer, actor, director</td></tr><tr><td>Lionel Messi</td><td>football
player     | player
.*         | </td>
<\/tr>     | </tr>
/us        |

The expression matches more than expected because both .* are greedy. The first .* consumes every remaining character up to the end of the string and then backtracks only as far as the last "player", so it swallows the header row and the first two body rows. The second .* likewise runs to the end of the string and then backtracks to the last </tr>. Specifying non-greedy matching solves the problem on the right-hand side, because the match then stops at the nearest </tr> after the keyword. On the left-hand side, however, matching more than expected is normal even then: regular expressions match from left to right, so <tr> matches the first <tr> in the string, and the quantifier after it has to extend across everything up to "player".

Let's first look at the result with non-greedy matching, /<tr>.*?player.*?<\/tr>/us. The match now runs from the <tr> of the header row to the </tr> that closes Kobe Bryant's row.

The second quantifier now matches exactly the characters we want. So how do we stop the first one from matching more characters than expected?

Using only what this series has covered so far, there is already a way to solve it. We can first match every row (<tr>...</tr>) from left to right with the preg_match_all function and a non-greedy pattern, and then loop over the rows to keep those that contain "player".
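
A minimal sketch of this two-step approach, assuming the sample table is in $html and using strpos for the filter step (the variable names and the choice of strpos are illustrative):

```php
<?php
$html = <<<'HTML'
<table>
  <thead>
    <tr><th>user name</th><th>occupation</th></tr>
  </thead>
  <tbody>
    <tr>
      <td>Kobe Bryant</td><td>basketball player</td>
    </tr>
    <tr>
      <td>Jay Chou</td><td>singer, songwriter, producer, actor, director</td>
    </tr>
    <tr>
      <td>Lionel Messi</td><td>football player</td>
    </tr>
  </tbody>
</table>
HTML;

// Step 1: collect every row with a non-greedy pattern.
preg_match_all('/<tr>.*?<\/tr>/s', $html, $rows);

// Step 2: keep only the rows that contain the keyword.
$playerRows = array_values(array_filter($rows[0], function ($row) {
    return strpos($row, 'player') !== false;
}));

echo count($rows[0]), " rows in total\n";        // 4, including the header row
echo count($playerRows), " rows with player\n";  // 2
```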

Of course, we can also solve this problem with a pure regular expression. If you have some experience with regular expressions, the negated character class may come to mind. We introduced the character class [...] earlier, which lists the characters that may appear at a given position. A negated character class, written [^...] with a ^ immediately after the opening bracket [, lists the characters that may not appear at that position. For example, [^\d] matches any character other than a digit.
If there were a negated subexpression, something like (^<tr>)*, we could simply make the first .* exclude <tr>. Unfortunately, regular expressions have no negated subexpression or negated group. In this case we have to use lookaround, while keeping the second quantifier non-greedy:

/<tr>(.(?!<tr>))*player.*?<\/tr>/us

Lookaround does not match any characters; it is used to "stop and look around". The expression above uses negative lookahead, written (?!...). Taking (.(?!<tr>))* specifically: each time . matches a character, the engine looks to the right, and the character is accepted only if <tr> does not appear immediately to its right.

The complete set of lookarounds is as follows:

Name                | Notation | Meaning
Positive lookahead  | (?=...)  | Looks to the right; matches if the lookaround content appears on the right
Negative lookahead  | (?!...)  | Looks to the right; matches if the lookaround content does not appear on the right
Positive lookbehind | (?<=...) | Looks to the left; matches if the lookaround content appears on the left
Negative lookbehind | (?<!...) | Looks to the left; matches if the lookaround content does not appear on the left
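
The four forms can be tried out with a small sketch. The subjects and patterns below are made up for illustration and are unrelated to the table example:

```php
<?php
// Positive lookahead: digits that are followed by " USD".
preg_match_all('/\d+(?= USD)/', '100 USD, 200 EUR', $ahead);

// Negative lookahead: whole numbers that are not followed by " USD".
// The \b anchors stop the engine from settling for a shorter "10".
preg_match_all('/\b\d+\b(?! USD)/', '100 USD, 200 EUR', $notAhead);

// Positive lookbehind: digits that are preceded by "$".
preg_match_all('/(?<=\$)\d+/', '$30 and #40', $behind);

// Negative lookbehind: whole numbers that are not preceded by "$".
preg_match_all('/(?<!\$)\b\d+/', '$30 and #40', $notBehind);

print_r([$ahead[0], $notAhead[0], $behind[0], $notBehind[0]]);
// ['100'], ['200'], ['30'], ['40']
```

In every case the lookaround content itself ("USD" or "$") is checked but not included in the matched text.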

Because the regular expression above contains a group (subexpression), the match result has an index 1 in addition to index 0. The index 1 result is of little use here, so we can use the non-capturing group introduced earlier:

/<tr>(?:.(?!<tr>))*player.*?<\/tr>/us

Our real goal is to extract all the rows containing "player", but the above extracts only the first one, so we need to replace the preg_match function with preg_match_all.
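
Putting it together, a sketch of the final extraction might look like this. It assumes the sample table is in $html and uses the keyword "player" with a non-greedy second quantifier, so that each match ends at its own </tr>:

```php
<?php
$html = <<<'HTML'
<table>
  <thead>
    <tr><th>user name</th><th>occupation</th></tr>
  </thead>
  <tbody>
    <tr>
      <td>Kobe Bryant</td><td>basketball player</td>
    </tr>
    <tr>
      <td>Jay Chou</td><td>singer, songwriter, producer, actor, director</td>
    </tr>
    <tr>
      <td>Lionel Messi</td><td>football player</td>
    </tr>
  </tbody>
</table>
HTML;

// (?:.(?!<tr>))* may not run past the next <tr>, so each match
// stays inside a single row; .*? stops at that row's </tr>.
$pattern = '/<tr>(?:.(?!<tr>))*player.*?<\/tr>/us';
$count = preg_match_all($pattern, $html, $matches);

echo $count, " rows found\n"; // 2
foreach ($matches[0] as $row) {
    echo $row, "\n";
}
```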
