Unix/Linux shell regular expressions: syntax explained with usage examples

Time:2020-10-8

Almost every significant problem requires separating useful data from useless data. Learn how many UNIX command-line utilities use regular expressions to refine their input.
Strangely enough, to this day I can still recite the classic Saturday morning song “Conjunction Junction.” Whether that is a good thing (too much TV) or a bad one (perhaps a harbinger of my current career) is open to debate. Either way, the ditty conveys its basic message in a cheerful rhythm.

I haven’t come up with anything like “Conjunction Junction” for learning UNIX, but I’ll try to write one in the next few months. In the meantime, riding the good mood that happy memories bring, let’s continue to conquer the command line in the traditional Schoolhouse Rock way.

Now class begins. Spit out your gum, go back to your seat and take out a number two pencil. And you, Spicoli.

Imitation show

You can think of the UNIX command line as a sentence:

  • Executable commands, such as cat or ls, are verbs: actions.
  • The output of a command is a noun — the data to look up or use.
  • Shell operators, such as | (pipe) or > (redirect standard output), are conjunctions — used to connect sentences.

For example, the command line ls -A | wc -l counts the entries in the current directory (ignoring the special entries . and ..). It contains two clauses. The first, ls -A, is a verb phrase that lists the contents of the current directory. The second, wc -l, is another verb phrase that counts lines. The output of the first clause becomes the input of the second, and the two are joined by a conjunction (the pipe).
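
As a minimal sketch of that sentence, the following builds a scratch directory and counts its entries, hidden one included. The directory and file names are made up for illustration.

```shell
# Create a scratch directory with three entries, one of them hidden.
# The names are illustrative only.
dir=$(mktemp -d)
touch "$dir/notes.txt" "$dir/todo.txt" "$dir/.hidden"

# Verb (ls -A), conjunction (|), verb (wc -l).
# -A lists hidden entries but leaves out . and ..
count=$(ls -A "$dir" | wc -l)
echo "$count"

rm -r "$dir"
```

With plain ls instead of ls -A, the hidden entry would not be counted.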

Many of the command-line sentence patterns that you may have learned in this series and others have this sentence structure.

However, a command line without modifiers is plain. A basic sentence can certainly get the job done, but it is not elegant. (I’d like to apologize to Ms. Rad and Ms. Perlstein, the high school English singing duo.) Solving more interesting problems requires adjectives.

Almost every significant problem requires separating useful data from useless data. Although the number and kinds of attributes vary, every solution describes, implicitly or explicitly, the information it wants to find and process in some form or format, and so produces information in another form.

On the command line, a regular expression acts as an adjective – a description or qualifier. When applied to output, regular expressions can distinguish between related and unrelated data.

Punctuation overview

Let’s look at an example question.

The grep utility filters its input line by line, looking for matches. The simplest use of grep is to print the lines that contain text matching a pattern. grep can find a fixed sequence of characters, and it can even ignore case if you add the -i option.

Suppose the file heroes.txt contains the following lines:

Catwoman
Batman
The Tick
Spider Man
Black Cat
Batgirl
Danger Girl
Wonder Woman
Luke Cage
The Punisher
Ant Man
Dead Girl
Aquaman
SCUD
Spider Woman
Blackbolt
Martian Manhunter

Command line:

grep -i man heroes.txt

Will generate:

Catwoman
Batman
Spider Man
Wonder Woman
Ant Man
Aquaman
Martian Manhunter

grep scans each line of heroes.txt, looking for the letter m, followed by a, and then n. Apart from having to be adjacent, the letters can appear anywhere on the line, even inside a longer word. Ignoring case (the -i option), Catwoman, Batman, Spider Man, Wonder Woman, Ant Man, Aquaman, and Martian Manhunter all contain the string man.

The grep utility has other built-in options to refine your search. For example, the -w option restricts matches to whole words, so grep -i -w man excludes Catwoman and Batman (among others).
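
The whole-word search can be tried with a short script like the one below. It is a sketch that recreates the sample file in a scratch directory; the variable names are arbitrary.

```shell
# Recreate the sample file from this article in a scratch directory.
dir=$(mktemp -d)
cat > "$dir/heroes.txt" <<'EOF'
Catwoman
Batman
The Tick
Spider Man
Black Cat
Batgirl
Danger Girl
Wonder Woman
Luke Cage
The Punisher
Ant Man
Dead Girl
Aquaman
SCUD
Spider Woman
Blackbolt
Martian Manhunter
EOF

# -i ignores case; -w accepts a match only when it is a whole word.
whole_words=$(grep -i -w man "$dir/heroes.txt")
echo "$whole_words"

rm -r "$dir"
```

Only Spider Man and Ant Man survive, because in every other name the letters m, a, n are part of a longer word.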

The tool also has a handy feature that excludes, rather than includes, all matching results. Use the -v option to leave out matching lines. For example:

grep -v -i 'spider' heroes.txt

prints every line except those that contain the string spider:

Catwoman
Batman
The Tick
Black Cat
Batgirl
Danger Girl
Wonder Woman
Luke Cage
The Punisher
Ant Man
Dead Girl
Aquaman
SCUD
Blackbolt
Martian Manhunter

But how do you handle the following situations? What if you want only names that start with “Bat”? Or names that start with “bat,” “Bat,” “cat,” or “Cat”? Or perhaps you want to know how many comic book avengers have names that end with “man.” In these cases, simple string searches like the three examples above are not enough, because they are insensitive to position.

Location, location, location and alternatives

Regular expressions can filter on specific positions, such as the beginning or end of a line and the beginning or end of a word. Regular expressions (often abbreviated as regex) can also describe alternatives (“this” or “that”); repetition of fixed, variable, or indefinite length; ranges (for example, “any letter from A through M”); and classes, or kinds, of characters (“printable characters” or “punctuation”), among other techniques.

Table 1 shows some common regular expression operators. You can combine the elements shown in Table 1 (and other operators) to build (very) complex regular expressions.

Table 1. Common regular expression operators

Operator           Purpose
. (period)         Matches any single character.
^ (caret)          Matches the empty string at the beginning of a line or string.
$ (dollar sign)    Matches the empty string at the end of a line.
A                  Matches the uppercase letter A.
a                  Matches the lowercase letter a.
\d                 Matches any single digit.
\D                 Matches any single non-digit character.
\w                 Matches any single word character (a letter, a digit, or the underscore); [[:alnum:]_] is a near synonym.
[A-E]              Matches any one of uppercase A, B, C, D, or E.
[^A-E]             Matches any character except uppercase A, B, C, D, and E.
X?                 Matches zero or one uppercase X.
X*                 Matches zero or more uppercase X's.
X+                 Matches one or more uppercase X's.
X{n}               Matches exactly n uppercase X's.
X{n,m}             Matches at least n and no more than m uppercase X's. If you omit m, the expression matches at least n X's.
(abc|def)+         Matches a sequence of at least one abc or def; abc and def both match.

Here are some examples of regular expressions that use grep as the search tool. Many other UNIX tools support regular expressions, including the interactive editors vi and Emacs, the stream editors sed and awk, and all modern programming languages. Once you learn the syntax of regular expressions (which can be quite arcane), you can apply your expertise across tools, programming languages, and operating systems.

Find names that start with “Bat”

To find names that begin with Bat, use:

grep -E '^Bat' heroes.txt

The -E option tells grep to treat the pattern as an extended regular expression. The ^ (caret) character matches the beginning of a line or string, an imaginary character that appears before the first character of each line or string. The letters B, a, and t are literals and match only those specific characters. Therefore, the command grep -E '^Bat' heroes.txt produces:

Batman
Batgirl

Because many regex operators are also used by the shell (some for different purposes, some for similar ones), it is good practice to enclose every regex on the command line in single quotation marks, which protects the regex operators from interpretation by the shell. For example, * (asterisk) and $ (dollar sign) are regex operators that also have special meaning to your shell.

Find names ending with “man”

To find names ending with “man,” you can use the regex man$, which matches the sequence m, a, n followed immediately by the end of the line (or string), which is what the regex operator $ matches.
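
A minimal sketch of the end-of-line anchor follows. It uses a shortened sample of the hero list for brevity, and the variable names are illustrative.

```shell
# A shortened sample of the hero list, for illustration.
names=$(printf 'Batman\nSpider Man\nMartian Manhunter\nAquaman\n')

# $ anchors the match to the end of the line; the match is case sensitive,
# so "Spider Man" (capital M) is not selected.
ends_with_man=$(printf '%s\n' "$names" | grep -E 'man$')
echo "$ends_with_man"

# Adding -i also accepts "Man"; -c prints the number of matching lines.
count=$(printf '%s\n' "$names" | grep -i -c -E 'man$')
echo "$count"
```

Martian Manhunter contains man but does not end with it, so it is excluded in both runs.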

Find empty lines

Given what ^ and $ do, you can use the regex ^$ to find empty lines (in effect, lines that end immediately after they start).
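
As a small sketch, the following counts the empty lines in some made-up input and then strips them out:

```shell
# Five lines of input, two of them empty (illustrative text).
text=$(printf 'first\n\nsecond\n\nthird')

# ^$ matches a line with nothing between its start and its end;
# -c counts the matching lines.
empty=$(printf '%s\n' "$text" | grep -c '^$')
echo "$empty"

# -v inverts the match, which removes the empty lines.
non_empty=$(printf '%s\n' "$text" | grep -v '^$')
echo "$non_empty"
```

Note that a line holding only spaces or tabs is not empty in this sense and would not match ^$.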

Alternation and the set operator

To find names that begin with “bat,” “Bat,” “cat,” or “Cat,” you can use either of two techniques. The first is alternation, which produces a match if any one of the alternative patterns matches. For example, the command:

grep -E '^(bat|Bat|cat|Cat)' heroes.txt

does the trick. The regex operator | (vertical bar) denotes alternation, so this|that matches either the string this or the string that. Hence, ^(bat|Bat|cat|Cat) means “the beginning of the line, followed immediately by one of bat, Bat, cat, or Cat.” Of course, you can use grep -i to ignore case, which simplifies the command to:

grep -i -E '^(bat|cat)' heroes.txt

Another way to match “bat,” “Bat,” “cat,” or “Cat” is to use the [ ] (square brackets) set operator. A set matches any one of the characters you place inside it. (You can think of a set as shorthand for alternation between single characters.)

For example, the command line:

grep -E '^[bcBC]at' heroes.txt

produces the same results as the command:

grep -E '^(bat|Bat|cat|Cat)' heroes.txt

Again, you can use -i to simplify the regex to ^[bc]at.
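
To check that the three spellings really select the same lines, here is a sketch that runs them side by side. The sample names are a shortened, slightly altered list (batgirl is lowercased on purpose to exercise the lowercase alternative).

```shell
# Illustrative sample; "batgirl" is lowercased to exercise the b alternative.
names=$(printf 'Catwoman\nBatman\nBlack Cat\nbatgirl\nAquaman\n')

# Alternation: one of four literal prefixes at the start of the line.
with_alternation=$(printf '%s\n' "$names" | grep -E '^(bat|Bat|cat|Cat)')

# Set: one character out of b, c, B, C, followed by "at".
with_set=$(printf '%s\n' "$names" | grep -E '^[bcBC]at')

# Case-insensitive set: -i makes the uppercase members unnecessary.
with_ignore_case=$(printf '%s\n' "$names" | grep -i -E '^[bc]at')

echo "$with_set"
```

Black Cat is left out in all three cases, because its Cat does not sit at the beginning of the line.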

Also, you can use the - (hyphen) operator to specify a range of characters within a set. For example, a user name usually starts with a letter. Suppose you want to validate such a user name in a web form submitted to your server. You could use a regex like ^[A-Za-z], which means “the beginning of the string, followed immediately by any uppercase letter (A-Z) or any lowercase letter (a-z).” Note that [A-z] is not quite the same as [A-Za-z]: in ASCII, the range A-z also takes in the six punctuation characters that sit between Z and a ([, \, ], ^, _, and `).

You can also mix ranges and individual characters in a set. The regex [A-MXYZ] matches any one of uppercase A through M, X, Y, and Z.

And if you want the inverse of a set (that is, any character not in the set), use the special set [^ ] and list the ranges or characters to exclude. Here is an example of an inverted set. To find every superhero with at in the name while excluding names where the at follows a b (such as the Dark Knight, Batman), type:

grep -i -E '[^b]at' heroes.txt

This command generates:

Catwoman
Black Cat

Because some sets are needed so often, shorthand notations stand in for large groups of characters. For example, the set [A-Za-z0-9_] is so common that it can be abbreviated as \w. Likewise, the operator \W is shorthand for the set [^A-Za-z0-9_]. You can also write [[:alnum:]_] in place of \w and [^[:alnum:]_] in place of \W.

By the way, \w (like the class [:alnum:]) is locale-aware, whereas [A-Za-z0-9_] is literal: the letters A-Z and a-z, the digits 0-9, and the underscore. If you are developing an international application, use the locale-aware forms so that your code ports across many locales.
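
Here is a brief sketch of character classes in use. Note that a class name such as [:upper:] is written inside a set, which is why the brackets are doubled. The sample names (including R2D2, which is not in heroes.txt) are illustrative, and the locale is pinned to C so the result does not depend on the environment.

```shell
# Illustrative sample; R2D2 is added to show a name that mixes letters and digits.
names=$(printf 'Batman\nSCUD\nAnt Man\nR2D2\n')

# Lines made up only of uppercase letters.
upper_only=$(printf '%s\n' "$names" | LC_ALL=C grep -E '^[[:upper:]]+$')
echo "$upper_only"

# Lines made up only of letters and digits (no spaces or punctuation).
alnum_only=$(printf '%s\n' "$names" | LC_ALL=C grep -E '^[[:alnum:]]+$')
echo "$alnum_only"
```

In a different locale, the same classes would also accept that locale's accented or non-Latin letters, which is the portability benefit described above.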

Repeat with me: repeat, repeat, repeat

So far, you’ve covered literals, positional operators, and two kinds of alternation. With these alone, you can match most patterns of predictable length. Returning to user names, you can ensure that each user name starts with a letter and is followed by exactly seven letters or digits with the following regex:

[a-z][a-z0-9][a-z0-9][a-z0-9][a-z0-9][a-z0-9][a-z0-9][a-z0-9]

But that is a bit clumsy. Moreover, it matches only user names that are exactly eight characters long. It does not match names of three to seven characters, which are usually valid user names too.

Regular expressions can also include repetition modifiers. A repetition modifier specifies a quantity, such as none, one, many, one or more, zero or one, five to ten, or exactly three. A repetition modifier must be attached to another pattern; on its own, it has no meaning.

For example, the regex:

^[A-Za-z][A-Za-z0-9]{2,7}$

implements the user name filter described above: a user name starts with a letter, which is followed by at least two but no more than seven letters or digits, followed by the end of the string.

The positional anchors are essential here. Without both of them, a user name of any length would be wrongly accepted. Why? Consider the regex:

^[A-Za-z][A-Za-z0-9]{2,7}

It asks whether a string begins with a letter followed by two to seven letters or digits, but it says nothing about where the string must end. The string samuelclemens therefore satisfies it, even though it is obviously too long to be a valid user name. Similarly, omitting the start anchor ^ matches strings that merely end with something like munster1313, and omitting both anchors matches strings that merely contain it. Whenever you must match a specific length, remember to put anchors at the beginning and end of the pattern.
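
The effect of the end anchor can be seen by running both forms of the regex over the same candidates. This is a sketch; the candidate names are made up for illustration.

```shell
# Illustrative candidate user names, one per line.
candidates=$(printf 'sam\nsamuel99\nsamuelclemens\n9lives\nmunster1313\nab\n')

# Both anchors: a letter, then 2 to 7 letters or digits, then end of line.
anchored=$(printf '%s\n' "$candidates" | grep -E '^[A-Za-z][A-Za-z0-9]{2,7}$')
echo "$anchored"

# End anchor omitted: anything that merely starts correctly is accepted,
# however long it runs.
unanchored=$(printf '%s\n' "$candidates" | grep -E '^[A-Za-z][A-Za-z0-9]{2,7}')
echo "$unanchored"
```

With both anchors only sam and samuel99 pass; without the $ the overlong samuelclemens and munster1313 slip through as well.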

Here are some other examples:

  • You can use {2,} to find two or more repetitions. The regex ^G[o]{2,}gle matches Google, Gooogle, Goooogle, and so on.
  • The repetition modifiers ?, +, and * find zero or one, one or more, and zero or more repetitions, respectively. (You can think of ? as shorthand for {0,1}, for example.)

    The regex boys? matches boy or boys; the regex Goo?gle matches Gogle or Google.

    The regex Goo+gle matches Google, Gooogle, Goooogle, and so on.

    The construct Goo*gle matches Gogle, Google, Gooogle, and so on.

  • Repetition modifiers can be applied to individual characters (as shown above) and to more complex combinations. Use the ( and ) parentheses, as in mathematics, to apply a modifier to a subexpression. Here is an example. Given the text file test.txt:

The rain in Spain falls mainly
on the the plain.

It was the best of of times;
it was the worst of times.

The command grep -i -E '(\b(of|the)\W+){2,}' test.txt produces:

on the the plain.
It was the best of of times;

The regex operator \b matches a word boundary, that is, the position described by (\w\W|\W\w). The regex reads: “a sequence of two or more whole words, each either the or of, each followed by one or more non-word characters.” You may wonder why \W+ is needed: \b matches only the empty string at the beginning or end of a word, so the character or characters between the words must be matched explicitly, or the regex finds nothing.
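
The repetition examples above can be reproduced with a sketch like the following. It assumes GNU grep, because \b and \W are extensions that not every grep supports inside -E patterns; the scratch directory and variable names are arbitrary.

```shell
# Recreate test.txt in a scratch directory.
dir=$(mktemp -d)
cat > "$dir/test.txt" <<'EOF'
The rain in Spain falls mainly
on the the plain.

It was the best of of times;
it was the worst of times.
EOF

# Quantifier on a single character: two or more o's between G and gle.
googles=$(printf 'Gogle\nGoogle\nGooogle\n' | grep -E '^Go{2,}gle$')
echo "$googles"

# Quantifier on a group: two or more consecutive words, each "of" or "the".
# \b and \W are GNU extensions (an assumption about the grep in use).
doubled=$(grep -i -E '(\b(of|the)\W+){2,}' "$dir/test.txt")
echo "$doubled"

rm -r "$dir"
```

The first line of test.txt is not reported even with -i, because The is followed by rain rather than by a second of or the.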

Capture what needs attention

Finding text is a common problem, but more often you want to extract text once it has been found. In other words, you want to refine the result.

Regular expressions extract information through capture. To separate the text you want from everything around it, enclose the pattern in parentheses. In fact, you have already used parentheses to group terms; by default, parentheses capture automatically.

To see capture in action, switch to Perl. (The grep utility does not support capture, because its purpose is to print the lines that contain a pattern.)

The following commands:

perl -n -e '/^The\s+(.*)$/ && print "$1\n"' heroes.txt

Will print:

Tick
Punisher

With perl -e, you can run a Perl program directly from the command line. The -n option runs the program once for each line of the input file. The regex, the text between the slashes (/), says “match the beginning of the string, then the letters T, h, e, followed by one or more whitespace characters (\s+), and then capture every character up to the end of the string.”

Perl places captured text in special variables, beginning with $1. The rest of the Perl program prints what was captured.

Each pair of parentheses, numbered by its opening parenthesis counting from the left, places its capture in the next numbered variable. For example, given hyphenated names such as Spider-Man, Ant-Man, and Spider-Woman as input, the command:

perl -n -e '/^(\w+)-(\w+)$/ && print "$1 $2\n"'

Will generate:

Spider Man
Ant Man
Spider Woman

Capturing the text of interest only scratches the surface. Once you can pinpoint material precisely, you can replace it with other material. Editors such as vi and Emacs combine pattern matching and substitution so that you can find and replace text in one step. You can also use patterns, substitution, and sed to change text from the command line.
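
As a sketch of capture plus substitution with sed, the following moves a leading The to the end of each name. It assumes a sed that accepts -E for extended regular expressions (GNU and BSD sed both do); the sample names are a shortened list.

```shell
# A shortened sample of the hero list, for illustration.
names=$(printf 'The Tick\nThe Punisher\nBatman\n')

# The parentheses capture everything after "The " and the following spaces;
# \1 in the replacement inserts the captured text.
reordered=$(printf '%s\n' "$names" | sed -E 's/^The +(.*)$/\1, The/')
echo "$reordered"
```

Lines that do not match the pattern, such as Batman, pass through unchanged.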

Rich themes

Regular expressions are extremely powerful, and a great many operators are available. The subject is so rich in information and practical lore that only a small part of it can be covered here.

Fortunately, three excellent sources of regular expression lore are available:

  • If you have Perl on your system, refer to the Perl regular expression man page (type perldoc perlre). It provides an excellent introduction to regex and contains many useful examples. Many programming languages have adopted Perl-compatible regular expressions (PCRE), so what you read on this man page carries over directly to PHP, Python, Java, and Ruby, as well as to many other recent tools.
  • Jeffrey Friedl’s Mastering Regular Expressions (Third Edition) is considered the bible of regex usage. Detailed, accurate, clear, and pragmatic, the book explains how matching works, all the regex operators, greediness (limiting how many characters + and * match), and more. In addition, Friedl’s book includes some astonishing regular expressions that accurately match fully qualified e-mail addresses and other strings defined by Request for Comments (RFC) documents.
  • Regular Expression Recipes, by Nathan Good, provides useful solutions to many common data processing and filtering problems. If you need to extract a postal code, a phone number, or a quoted string, try Nathan’s solutions.

There are many ways to use regular expressions on the command line. Almost every command that processes text supports some form of regular expression. Most shells also apply a regex-like syntax to matching file names, although the operators may behave differently.

For example, type ls [a-c] to find files named a, b, or c. Type ls [a-c]* to find all file names that begin with a, b, or c. Here the * does not modify [a-c] as it would in grep; in the shell, * means what .* means in a regex. The ? operator also works in the shell, where it means what . means in a regex: it matches any single character.
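
The difference between shell patterns and regular expressions can be seen with a sketch like this one. The file names are made up for illustration, and the patterns are left unquoted on purpose so that the shell expands them.

```shell
# Scratch directory with five illustrative files.
dir=$(mktemp -d)
touch "$dir/a" "$dir/b" "$dir/c" "$dir/apple" "$dir/dog"

# [a-c] matches exactly one character from the set, so only a, b, and c.
single=$(cd "$dir" && LC_ALL=C ls [a-c])

# [a-c]* matches one such character followed by anything, so apple joins in.
prefixed=$(cd "$dir" && LC_ALL=C ls [a-c]*)

# ? matches any single character, like . in a regex.
one_char=$(cd "$dir" && LC_ALL=C ls ?)

echo "$prefixed"
rm -r "$dir"
```

In a regex, [a-c]* would instead mean “zero or more characters from a through c,” which is why the same text behaves differently in grep.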

Check the documentation for your favorite utility or shell to determine which regex operators are supported and what might be unique about them.

UNIX grep regular expression metacharacters

A regular expression is a text pattern made up of ordinary characters (such as the letters a through z) and special characters (called metacharacters). The pattern describes one or more strings to match when searching a body of text. The regular expression acts as a template that matches a character pattern against the string being searched.
\
Marks the next character as a special character, a literal, a backreference, or an octal escape. For example, 'n' matches the character “n”. '\n' matches a newline character. The sequence '\\' matches “\”, and '\(' matches “(”.
^
Matches the start of the input string.
$
Matches the end of the input string.
*
Matches the preceding subexpression zero or more times. For example, 'zo*' matches “z” and “zoo”. * is equivalent to {0,}.
+
Matches the preceding subexpression one or more times. For example, 'zo+' matches “zo” and “zoo”, but not “z”. + is equivalent to {1,}.
?
Matches the preceding subexpression zero or one time. For example, 'do(es)?' matches “do” or “does”. ? is equivalent to {0,1}.
{n}
n is a nonnegative integer. Matches exactly n times. For example, 'o{2}' does not match the “o” in “Bob”, but matches the two o’s in “food”.
{n,}
n is a nonnegative integer. Matches at least n times. For example, 'o{2,}' does not match the “o” in “Bob”, but matches all the o’s in “foooood”. 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m}
m and n are nonnegative integers, where n <= m. Matches at least n and at most m times. For example, 'o{1,3}' matches the first three o’s in “fooooood”. 'o{0,1}' is equivalent to 'o?'. Note that you cannot put a space between the comma and the numbers.
?
When this character immediately follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), the match is non-greedy. A non-greedy pattern matches as little of the searched string as possible, whereas the default greedy pattern matches as much as possible. For example, in the string “oooo”, 'o+?' matches a single “o”, while 'o+' matches all the o’s.
.
Matches any single character except '\n'. To match any character including '\n', use a pattern such as '[\s\S]'.
(pattern)
Matches pattern and captures the match. The captured match can be retrieved from the resulting Matches collection, using the SubMatches collection in VBScript or the $0 through $9 properties in JScript. To match parenthesis characters, use '\(' or '\)'.
(?:pattern)
Matches pattern but does not capture the match; that is, it is a non-capturing match and is not stored for later use. This is useful for combining parts of a pattern with the “or” character (|). For example, 'industr(?:y|ies)' is a more economical expression than 'industry|industries'.
(?=pattern)
Positive lookahead matches the search string at any point where a string matching pattern begins. This is a non-capturing match; that is, the match is not stored for later use. For example, 'Windows (?=95|98|NT|2000)' matches “Windows” in “Windows 2000”, but not “Windows” in “Windows 3.1”. Lookaheads do not consume characters; that is, after a match occurs, the search for the next match begins immediately after the last match, not after the characters that make up the lookahead.
(?!pattern)
Negative lookahead matches the search string at any point where a string not matching pattern begins. This is a non-capturing match; that is, the match is not stored for later use. For example, 'Windows (?!95|98|NT|2000)' matches “Windows” in “Windows 3.1”, but not “Windows” in “Windows 2000”. Lookaheads do not consume characters; that is, after a match occurs, the search for the next match begins immediately after the last match, not after the characters that make up the lookahead.
x|y
Matches either x or y. For example, 'z|food' matches “z” or “food”. '(z|f)ood' matches “zood” or “food”.
[xyz]
A character set. Matches any one of the enclosed characters. For example, '[abc]' matches the “a” in “plain”.
[^xyz]
A negative character set. Matches any character that is not enclosed. For example, '[^abc]' matches the “p” in “plain”.
[a-z]
A range of characters. Matches any character in the specified range. For example, '[a-z]' matches any lowercase letter in the range “a” through “z”.
[^a-z]
A negative range of characters. Matches any character that is not in the specified range. For example, '[^a-z]' matches any character that is not in the range “a” through “z”.
\b
Matches a word boundary, that is, the position between a word and a space. For example, 'er\b' matches the “er” in “never”, but not the “er” in “verb”.
\B
Matches a non-word boundary. 'er\B' matches the “er” in “verb”, but not the “er” in “never”.
\cx
Matches the control character indicated by x. For example, \cM matches a Control-M or carriage return character. The value of x must be in the range A-Z or a-z. If it is not, c is treated as a literal “c” character.
\d
Matches a digit character. Equivalent to [0-9].
\D
Matches a non-digit character. Equivalent to [^0-9].
\f
Matches a form feed character. Equivalent to \x0c and \cL.
\n
Matches a newline character. Equivalent to \x0a and \cJ.
\r
Matches a carriage return character. Equivalent to \x0d and \cM.
\s
Matches any white space character, including space, tab, form feed, and so on. Equivalent to [ \f\n\r\t\v].
\S
Matches any non-white space character. Equivalent to [^ \f\n\r\t\v].
\t
Matches a tab character. Equivalent to \x09 and \cI.
\v
Matches a vertical tab character. Equivalent to \x0b and \cK.
\w
Matches any word character, including the underscore. Equivalent to '[A-Za-z0-9_]'.
\W
Matches any non-word character. Equivalent to '[^A-Za-z0-9_]'.
\xn
Matches n, where n is a hexadecimal escape value. Hexadecimal escape values must be exactly two digits long. For example, '\x41' matches “A”. '\x041' is equivalent to '\x04' followed by “1”. This allows ASCII codes to be used in regular expressions.
\num
Matches num, where num is a positive integer. A reference back to a captured match. For example, '(.)\1' matches two consecutive identical characters.
\n
Identifies either an octal escape value or a backreference. If \n is preceded by at least n captured subexpressions, n is a backreference. Otherwise, n is an octal escape value if n is an octal digit (0-7).
\nm
Identifies either an octal escape value or a backreference. If \nm is preceded by at least nm captured subexpressions, nm is a backreference. If \nm is preceded by at least n captures, n is a backreference followed by the literal m. If neither condition holds, \nm matches the octal escape value nm when n and m are octal digits (0-7).
\nml
Matches the octal escape value nml when n is an octal digit (0-3) and m and l are octal digits (0-7).
\un
Matches n, where n is a Unicode character expressed as four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©).
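
Not every entry in this list is understood by every tool, so it is worth testing an operator before relying on it. As a sketch, here are two of the entries tried with grep: the {n} quantifier with -E, and a backreference written in the basic regular expression syntax that grep uses by default, where the grouping parentheses are escaped. The sample words are illustrative.

```shell
# {n}: exactly two o's in a row ("Bob" has only one).
exactly_two=$(printf 'Bob\nfood\n' | grep -E 'o{2}')
echo "$exactly_two"

# Backreference: \(.\) captures any one character and \1 requires
# the same character again, so only words with a doubled letter match.
doubled=$(printf 'food\nbob\nbook\n' | grep '\(.\)\1')
echo "$doubled"
```

Perl-style escapes such as \d or lookaheads such as (?=pattern) are a different matter: whether they work depends on the tool and on the options it was built with.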

In practice, most regular expression dialects share the same core syntax; the differences lie mainly in how each tool invokes them. For more, see this article:

https://www.jb51.net/tools/shell_regex.html