PHP 正则表达式 · ZetCode 中文系列教程

# PHP 正则表达式 > 原文： [https://zetcode.com/lang/php/regex/](https://zetcode.com/lang/php/regex/) 在 PHP 教程的这一部分中，我们介绍了 PHP 中的正则表达式。正则表达式用于文本搜索和更高级的文本操作。正则表达式是内置工具，如`grep`，`sed`，文本编辑器（如 vi，emacs），编程语言（如 Tcl，Perl 和 Python）。 PHP 也具有对正则表达式的内置支持。在 PHP 中，有两个用于正则表达式的模块：POSIX Regex 和 PCRE。 POSIX 正则表达式已贬值。在本章中，我们将使用 PCRE 示例。 PCRE 代表与 Perl 兼容的正则表达式。使用正则表达式时，需要做两件事：正则表达式函数和模式。模式是一个正则表达式，用于定义我们正在搜索或操纵的文本。它由文本文字和元字符组成。该模式放置在两个定界符内。这些通常是`//`，`##`或`@@`字符。它们通知正则表达式函数模式的开始和结束位置。这是 PCRE 中使用的部分元字符列表。 | | | | --- | --- | | `.` | 匹配任何单个字符。 | | `*` | 与前面的元素匹配零次或多次。 | | `[ ]` | 括号表达。与方括号内的字符匹配。 | | `[^ ]` | 匹配方括号中未包含的单个字符。 | | `^` | 匹配字符串中的起始位置。 | | `$` | 匹配字符串中的结束位置。 | | <code>|</code> | 备用运算符。 | ## PRCE 函数我们定义一些 PCRE regex 函数。它们都有一个`preg`前缀。 * `preg_split()` - 按正则表达式模式分割字符串 * `preg_match()` - 执行正则表达式匹配 * `preg_replace()` - 用正则表达式模式搜索和替换字符串 * `preg_grep()` - 返回与正则表达式模式匹配的数组条目接下来，我们将为每个函数提供一个示例。 ```php php> print_r(preg_split("@\s@", "Jane\tKate\nLucy Marion")); Array ( [0] => Jane [1] => Kate [2] => Lucy [3] => Marion ) ``` 我们有四个名字，用空格分开。 `\s`是代表空格的字符类。 `preg_split()`函数返回数组中的拆分字符串。 ```php php> echo preg_match("#[a-z]#", "s"); 1 ``` `preg_match()`函数查看字符类`[a-z]`中是否包含`'s'`字符。该类代表从`a`到`z`的所有字符。成功返回 1。 ```php php> echo preg_replace("/Jane/","Beky","I saw Jane. Jane was beautiful."); I saw Beky. Beky was beautiful. ``` `preg_replace()`函数将单词`"Jane"`的所有出现替换为单词`"Beky"`。 ```php php> print_r(preg_grep("#Jane#", ["Jane", "jane", "Joan", "JANE"])); Array ( [0] => Jane ) ``` `preg_grep()`函数返回与给定模式匹配的单词数组。在此示例中，数组中仅返回一个单词。这是因为默认情况下，搜索区分大小写。 ```php php> print_r(preg_grep("#Jane#i", ["Jane", "jane", "Joan", "JANE"])); Array ( [0] => Jane [1] => jane [3] => JANE ) ``` 在此示例中，我们执行不区分大小写的`grep`。我们将`i`修饰符放在右定界符之后。现在，返回的数组包含三个单词。 ## 点元字符 `.`（点）元字符代表文本中的任何单个字符。 `single.php` ```php <?php $words = [ "Seven", "even", "Maven", "Amen", "Leven" ]; $pattern = "/.even/"; foreach ($words as $word) { if (preg_match($pattern, $word)) { echo "$word matches the pattern\n"; } else { echo "$word does not match the pattern\n"; } } ``` 在`$words`数组中，我们有五个词。 ```php $pattern = "/.even/"; ``` 在这里，我们定义搜索模式。模式是一个字符串。正则表达式位于分隔符内。分隔符是强制性的。在我们的例子中，我们使用正斜杠`/ /`作为分隔符。请注意，如果需要，我们可以使用不同的定界符。点字符代表任何单个字符。 ```php if (preg_match($pattern, $word)) { echo "$word matches the pattern\n"; } else { echo "$word does not match the pattern\n"; } ``` 如果这五个词与模式匹配，我们将对其进行测试。 ```php $ php single.php Seven matches the pattern even does not match the pattern Maven does not match the pattern Amen does not match the pattern Leven matches the pattern ``` 七个和七个词与我们的搜索模式匹配。 ## 锚点锚点匹配给定文本内字符的位置。在下一个示例中，我们查看字符串是否位于句子的开头。 `anchors.php` ```php <?php $sentence1 = "Everywhere I look I see Jane"; $sentence2 = "Jane is the best thing that happened to me"; if (preg_match("/^Jane/", $sentence1)) { echo "Jane is at the beginning of the \$sentence1\n"; } else { echo "Jane is not at the beginning of the \$sentence1\n"; } if (preg_match("/^Jane/", $sentence2)) { echo "Jane is at the beginning of the \$sentence2\n"; } else { echo "Jane is not at the beginning of the \$sentence2\n"; } ``` 我们有两个句子。模式是`^Jane`。该模式检查'Jane'字符串是否位于文本的开头。 ```php $ php anchors.php Jane is not at the beginning of the $sentence1 Jane is at the beginning of the $sentence2 ``` ```php php> echo preg_match("#Jane$#", "I love Jane"); 1 php> echo preg_match("#Jane$#", "Jane does not love me"); 0 ``` `Jane$`模式与字符串`"Jane"`结尾的字符串匹配。 ## 完全匹配在以下示例中，我们显示了如何查找完全匹配的单词。 ```php php> echo preg_match("/mother/", "mother"); 1 php> echo preg_match("/mother/", "motherboard"); 1 php> echo preg_match("/mother/", "motherland"); 1 ``` `mother`模式适合单词`mother`，`motherboard`和`motherland`。说，我们只想查找完全匹配的单词。我们将使用前面提到的锚点`^`和`$`字符。 ```php php> echo preg_match("/^mother$/", "motherland"); 0 php> echo preg_match("/^mother$/", "Who is your mother?"); 0 php> echo preg_match("/^mother$/", "mother"); 1 ``` 使用定位字符，我们可以获得与模式完全匹配的单词。 ## 量词标记或组之后的量词指定允许前面的元素出现的频率。 ```php ? - 0 or 1 match * - 0 or more + - 1 or more {n} - exactly n {n,} - n or more {,n} - n or less (??) {n,m} - range n to m ``` 上面是常见量词的列表。问号`?`表示存在零或前一个元素之一。 `zeroorone.php` ```php <?php $words = [ "color", "colour", "comic", "colourful", "colored", "cosmos", "coloseum", "coloured", "colourful" ]; $pattern = "/colou?r/"; foreach ($words as $word) { if (preg_match($pattern, $word)) { echo "$word matches the pattern\n"; } else { echo "$word does not match the pattern\n"; } } ``` `$words`数组中有四个 9。 ```php $pattern = "/colou?r/"; ``` 颜色用于美式英语，颜色用于英式英语。此模式匹配两种情况。 ```php $ php zeroorone.php color matches the pattern colour matches the pattern comic does not match the pattern colourful matches the pattern colored matches the pattern cosmos does not match the pattern coloseum does not match the pattern coloured matches the pattern colourful matches the pattern ``` 这是`zeroorone.php`脚本的输出。 `*`元字符与前面的元素匹配零次或更多次。 `zeroormore.php` ```php <?php $words = [ "Seven", "even", "Maven", "Amen", "Leven" ]; $pattern = "/.*even/"; foreach ($words as $word) { if (preg_match($pattern, $word)) { echo "$word matches the pattern\n"; } else { echo "$word does not match the pattern\n"; } } ``` 在上面的脚本中，我们添加了`*`元字符。 `.*`组合表示零个，一个或多个单个字符。 ```php $ php zeroormore.php Seven matches the pattern even matches the pattern Maven does not match the pattern Amen does not match the pattern Leven matches the pattern ``` 现在该模式匹配三个单词：`Seven`，`even`和`Leven`。 ```php php> print_r(preg_grep("#o{2}#", ["gool", "root", "foot", "dog"])); Array ( [0] => gool [1] => root [2] => foot ) ``` `o{2}`模式匹配恰好包含两个'o'字符的字符串。 ```php php> print_r(preg_grep("#^\d{2,4}$#", ["1", "12", "123", "1234", "12345"])); Array ( [1] => 12 [2] => 123 [3] => 1234 ) ``` 我们有这个`^\d{2,4}$`模式。 `\d`是一个字符集；它代表数字。该模式匹配具有 2、3 或 4 位数字的数字。 ## 交替下一个示例说明了交替运算符`|`。该运算符使您可以创建具有多种选择的正则表达式。 `alternation.php` ```php <?php $names = [ "Jane", "Thomas", "Robert", "Lucy", "Beky", "John", "Peter", "Andy" ]; $pattern = "/Jane|Beky|Robert/"; foreach ($names as $name) { if (preg_match($pattern, $name)) { echo "$name is my friend\n"; } else { echo "$name is not my friend\n"; } } ``` `$names`数组中有八个名称。 ```php $pattern = "/Jane|Beky|Robert/"; ``` 这是搜索模式。该模式查找`"Jane"`，`"Beky"`或`"Robert"`字符串。 ```php $ php alternation.php Jane is my friend Thomas is not my friend Robert is my friend Lucy is not my friend Beky is my friend John is not my friend Peter is not my friend Andy is not my friend ``` 这是脚本的输出。 ## 子模式我们可以使用方括号`()`在样式内创建子模式。 ```php php> echo preg_match("/book(worm)?$/", "bookworm"); 1 php> echo preg_match("/book(worm)?$/", "book"); 1 php> echo preg_match("/book(worm)?$/", "worm"); 0 ``` 我们有以下正则表达式模式：`book(worm)?$`。 `(worm)`是子模式。？字符跟随子模式，这意味着该子模式在最终模式中可能出现 0、1 次。这里的`$`字符用于字符串的精确末尾匹配。没有它，诸如`bookstore`，`bookmania`之类的词也将匹配。 ```php php> echo preg_match("/book(shelf|worm)?$/", "book"); 1 php> echo preg_match("/book(shelf|worm)?$/", "bookshelf"); 1 php> echo preg_match("/book(shelf|worm)?$/", "bookworm"); 1 php> echo preg_match("/book(shelf|worm)?$/", "bookstore"); 0 ``` 子模式经常交替使用。 `(shelf|worm)`子模式可创建多个单词组合。 ## 字符类我们可以使用方括号将字符组合成字符类。字符类与括号中指定的任何字符匹配。 `characterclass.php` ```php <?php $words = [ "sit", "MIT", "fit", "fat", "lot" ]; $pattern = "/[fs]it/"; foreach ($words as $word) { if (preg_match($pattern, $word)) { echo "$word matches the pattern\n"; } else { echo "$word does not match the pattern\n"; } } ``` 我们用两个字符定义一个字符集。 ```php $pattern = "/[fs]it/"; ``` 这是我们的模式。 `[fs]`是字符类。请注意，我们一次只能处理一个字符。我们要么考虑`f`要么`s`，但不能同时考虑两者。 ```php $ php characterclass.php sit matches the pattern MIT does not match the pattern fit matches the pattern fat does not match the pattern lot does not match the pattern ``` 这是脚本的结果。我们还可以将速记元字符用于字符类。 `\w`代表字母数字字符，`\d`代表数字，`\s`空格字符。 `shorthand.php` ```php <?php $words = [ "Prague", "111978", "terry2", "mitt##" ]; $pattern = "/\w{6}/"; foreach ($words as $word) { if (preg_match($pattern, $word)) { echo "$word matches the pattern\n"; } else { echo "$word does not match the pattern\n"; } } ``` 在上面的脚本中，我们测试包含字母数字字符的单词。 `\w{6}`代表六个字母数字字符。仅单词`mitt##`不匹配，因为它包含非字母数字字符。 ```php php> echo preg_match("#[^a-z]{3}#", "ABC"); 1 ``` `#[^a-z]{3}#`模式代表三个不在`a-z`类中的字符。 `"ABC"`字符符合条件。 ```php php> print_r(preg_grep("#\d{2,4}#", [ "32", "234", "2345", "3d3", "2"])); Array ( [0] => 32 [1] => 234 [2] => 2345 ) ``` 在上面的示例中，我们有一个匹配 2、3 和 4 位数字的模式。 ## 提取匹配 `preg_match()`具有可选的第三个参数。如果提供，则将其填充为搜索结果。变量是一个数组，其第一个元素包含与完整模式匹配的文本，第二个元素包含第一个捕获的带括号的子模式，依此类推。 `extract_matches.php` ```php <?php $times = [ "10:10:22", "23:23:11", "09:06:56" ]; $pattern = "/(\d\d):(\d\d):(\d\d)/"; foreach ($times as $time) { $r = preg_match($pattern, $time, $match); if ($r) { echo "The $match[0] is split into:\n"; echo "Hour: $match[1]\n"; echo "Minute: $match[2]\n"; echo "Second: $match[3]\n"; } } ``` 在示例中，我们提取时间字符串的一部分。 ```php $times = [ "10:10:22", "23:23:11", "09:06:56" ]; ``` 我们在英语语言环境中有三个时间字符串。 ```php $pattern = "/(\d\d):(\d\d):(\d\d)/"; ``` 模式使用方括号分为三个子模式。我们想确切地指每个部分。 ```php $r = preg_match($pattern, $time, $match); ``` 我们将第三个参数传递给`preg_match()`函数。如果匹配，则包含匹配字符串的文本部分。 ```php if ($r) { echo "The $match[0] is split into:\n"; echo "Hour: $match[1]\n"; echo "Minute: $match[2]\n"; echo "Second: $match[3]\n"; } ``` `$match[0]`包含与完整模式匹配的文本，`$match[1]`包含与第一个子模式匹配的文本，`$match[2]`与第二个子模式匹配，`$match[3]`与第三个子模式匹配。 ```php $ php extract_matches.php The 10:10:22 is split into: Hour: 10 Minute: 10 Second: 22 The 23:23:11 is split into: Hour: 23 Minute: 23 Second: 11 The 09:06:56 is split into: Hour: 09 Minute: 06 Second: 56 ``` 这是示例的输出。 ## 电子邮件示例接下来有一个实际的例子。我们创建一个正则表达式模式来检查电子邮件地址。 `emails.php` ```php <?php $emails = [ "luke@gmail.com", "andy@yahoocom", "34234sdfa#2345", "f344@gmail.com"]; # regular expression for emails $pattern = "/^[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+\.[a-zA-Z.]{2,18}$/"; foreach ($emails as $email) { if (preg_match($pattern, $email)) { echo "$email matches \n"; } else { echo "$email does not match\n"; } } >? ``` 请注意，此示例仅提供一种解决方案。它不一定是最好的。 ```php $pattern = "/^[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+\.[a-zA-Z.]{2,18}$/"; ``` 这就是模式。这里的第一个`^`和最后一个`$`字符可得到精确的模式匹配。模式前后不允许有字符。电子邮件分为五个部分。第一部分是本地部分。这通常是公司，个人或昵称的名称。 `[a-zA-Z0-9._-]+`列出了所有可能的字符，我们可以在本地使用。它们可以使用一次或多次。第二部分是文字`@`字符。第三部分是领域部分。通常是电子邮件提供商的域名，例如 yahoo 或 gmail。 `[a-zA-Z0-9-]+`是一个字符集，提供了可在域名中使用的所有字符。 `+`量词使用这些字符中的一个或多个。第四部分是点字符。它前面有转义字符（`\`）。这是因为点字符是一个元字符，并且具有特殊含义。通过转义，我们得到一个文字点。最后一部分是顶级域。模式如下：`[a-zA-Z.]{2,18}`顶级域可以包含 2 到 18 个字符，例如`sk, net, info, travel, cleaning, travelinsurance`。最大长度可以为 63 个字符，但是今天大多数域都少于 18 个字符。还有一个点字符。这是因为某些顶级域包含两个部分：例如`co.uk`。 ```php $ php emails.php luke@gmail.com matches andy@yahoocom does not match 34234sdfa#2345 does not match f344@gmail.com matches ``` 这是`emails.php`示例的输出。 ## 总结最后，我们快速回顾一下正则表达式模式。 ```php Jane the 'Jane' string ^Jane 'Jane' at the start of a string Jane$ 'Jane' at the end of a string ^Jane$ exact match of the string 'Jane' [abc] a, b, or c [a-z] any lowercase letter [^A-Z] any character that is not a uppercase letter (Jane|Becky) Matches either 'Jane' or 'Becky' [a-z]+ one or more lowercase letters ^[98]?$ digits 9, 8 or empty string ([wx])([yz]) wy, wz, xy, or xz [0-9] any digit [^A-Za-z0-9] any symbol (not a number or a letter) ``` 在本章中，我们介绍了 PHP 中的正则表达式。