regex

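The regex crate's main types are Regex, Match, and Captures.

Regex is a compiled regular expression. It provides the following methods:

Regex::new(): compiles a pattern string with the default configuration and returns a Result<Regex, Error>.

Use RegexBuilder to build a Regex with a custom configuration.

Regex::is_match(): reports whether the regex matches anywhere in the given string.
Regex::find(): returns the first match as a Match carrying its byte offsets; Regex::find_iter() returns an iterator over all non-overlapping matches.
Regex::captures(): returns a Captures holding the byte offsets of the overall match and of each capture group for the first match; Regex::captures_iter() returns an iterator over the Captures of every match.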

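Regex::is_match()

Tests whether the regex matches the given string. Matching is unanchored by default, so the pattern may match anywhere in the string; add ^ and $ to require the whole string to match: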

use regex::Regex;

let re = Regex::new(r"^\d{4}-\d{2}-\d{2}$").unwrap();
assert!(re.is_match("2010-03-14"));

Regex::find()

Returns the first match as an Option<Match>. A Match exposes the byte offsets of its start (m.start()) and end (m.end()), and the matched text itself (m.as_str()):

let re = Regex::new(r"\p{Greek}+").unwrap();
let hay = "Greek: αβγδ";
let m = re.find(hay).unwrap();
assert_eq!(7, m.start());
assert_eq!(15, m.end());
assert!(!m.is_empty());
assert_eq!(8, m.len());
assert_eq!(7..15, m.range());
assert_eq!("αβγδ", m.as_str());

Regex::find_iter()

Returns an iterator that yields a Match for each successive non-overlapping match:

use regex::Regex;

let re = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}").unwrap();
let hay = "What do 1865-04-14, 1881-07-02, 1901-09-06 and 1963-11-22 have in common?";
let dates: Vec<&str> = re.find_iter(hay).map(|m| m.as_str()).collect();
assert_eq!(dates, vec![
    "1865-04-14",
    "1881-07-02",
    "1901-09-06",
    "1963-11-22",
]);

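Regex::captures()

Returns an Option<Captures> for the first match. A Captures holds the Match of the whole regex plus one optional Match per capture group.

caps["name"] and caps.name("name") differ in lifetime: the &str obtained by indexing borrows from the Captures value, whereas the &str from caps.name("name").as_str() borrows from the haystack. Prefer caps.name("name") when the result must outlive the Captures. Indexing also panics when the group is missing or did not participate in the match, while name() returns None.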

use regex::Regex;

// Indexing: access a capture group by its number (0 is the whole match) or by its name.
let re = Regex::new(r"(?<first>\w)(\w)(?:\w)\w(?<last>\w)").unwrap();
let caps = re.captures("toady").unwrap();
assert_eq!("toady", &caps[0]);
assert_eq!("t", &caps["first"]);
assert_eq!("o", &caps[2]);
assert_eq!("y", &caps["last"]);

// Captures::extract(): returns the full matched substring plus an array of all capture groups, starting from group 1.
let re = Regex::new(r"([0-9]{4})-([0-9]{2})-([0-9]{2})").unwrap();
let hay = "On 2010-03-14, I became a Tenneessee lamb.";
let Some((full, [year, month, day])) = re.captures(hay).map(|caps| caps.extract()) else { return };
assert_eq!("2010-03-14", full);
assert_eq!("2010", year);
assert_eq!("03", month);
assert_eq!("14", day);

// get(): returns the Match for the capture group at the given index, if that group participated in the match
let re = Regex::new(r"[a-z]+(?:([0-9]+)|([A-Z]+))").unwrap();
let caps = re.captures("abc123").unwrap();
let substr1 = caps.get(1).map_or("", |m| m.as_str());
let substr2 = caps.get(2).map_or("", |m| m.as_str());
assert_eq!(substr1, "123");
assert_eq!(substr2, "");

// name(): returns the Match for the capture group with the given name, if that group participated in the match
let re = Regex::new(r"[a-z]+(?:(?<numbers>[0-9]+)|(?<letters>[A-Z]+))").unwrap();
let caps = re.captures("abc123").unwrap();
let numbers = caps.name("numbers").map_or("", |m| m.as_str());
let letters = caps.name("letters").map_or("", |m| m.as_str());
assert_eq!(numbers, "123");
assert_eq!(letters, "");

// expand(): expands the given template using the capture groups and appends the result to the given String
let re = Regex::new(r"(?<day>[0-9]{2})-(?<month>[0-9]{2})-(?<year>[0-9]{4})").unwrap();
let hay = "On 14-03-2010, I became a Tenneessee lamb.";
let caps = re.captures(hay).unwrap();
let mut dst = String::new();
caps.expand("year=$year, month=$month, day=$day", &mut dst);
assert_eq!(dst, "year=2010, month=03, day=14");

Regex::captures_iter()

Returns an iterator that yields a Captures for each successive non-overlapping match.

use regex::Regex;

let re = Regex::new(r"'([^']+)'\s+\(([0-9]{4})\)").unwrap();
let hay = "'Citizen Kane' (1941), 'The Wizard of Oz' (1939), 'M' (1931).";
let mut movies = vec![];

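// Note: `?` assumes the enclosing function returns a Result
// whose error type can be built from a ParseIntError.
for (_, [title, year]) in re.captures_iter(hay).map(|c| c.extract()) {
    movies.push((title, year.parse::<i64>()?));
}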
assert_eq!(movies, vec![
    ("Citizen Kane", 1941),
    ("The Wizard of Oz", 1939),
    ("M", 1931),
]);

// Named groups
let re = Regex::new(r"'(?<title>[^']+)'\s+\((?<year>[0-9]{4})\)").unwrap();
let hay = "'Citizen Kane' (1941), 'The Wizard of Oz' (1939), 'M' (1931).";
let mut it = re.captures_iter(hay);

let caps = it.next().unwrap();
assert_eq!(&caps["title"], "Citizen Kane");
assert_eq!(&caps["year"], "1941");

let caps = it.next().unwrap();
assert_eq!(&caps["title"], "The Wizard of Oz");
assert_eq!(&caps["year"], "1939");

let caps = it.next().unwrap();
assert_eq!(&caps["title"], "M");
assert_eq!(&caps["year"], "1931");

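Regex::split()

Splits the string around each match of the regex and returns an iterator over the resulting substrings (collect it into a Vec if needed):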

let re = Regex::new(r"[ \t]+").unwrap();
let hay = "a b \t  c\td    e";
let fields: Vec<&str> = re.split(hay).collect();
assert_eq!(fields, vec!["a", "b", "c", "d", "e"]);

// Adjacent matches yield an empty string between them
let re = Regex::new(r"X").unwrap();
let hay = "lionXXtigerXleopard";
let got: Vec<&str> = re.split(hay).collect();
assert_eq!(got, vec!["lion", "", "tiger", "leopard"]);

let re = Regex::new(r"0").unwrap();
let hay = "010";
let got: Vec<&str> = re.split(hay).collect();
assert_eq!(got, vec!["", "1", ""]);

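Regex::replace()/replace_all()

replace() substitutes the first match and replace_all() substitutes every match. The replacement may be a string template, which refers to capture groups with $N or $name (not \N):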

pub fn replace<'h, R: Replacer>(
    &self,
    haystack: &'h str,
    rep: R
) -> Cow<'h, str>

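// The following types implement the Replacer trait
// (closures of the form FnMut(&Captures) -> T, where T: AsRef<str>, and NoExpand do as well)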
impl Replacer for String
impl<'a> Replacer for &'a Cow<'a, str>
impl<'a> Replacer for &'a str
impl<'a> Replacer for Cow<'a, str>

use regex::{Captures, NoExpand, Regex};
let re = Regex::new(r"[^01]+").unwrap();
assert_eq!(re.replace("1078910", ""), "1010");


let re = Regex::new(r"([^,\s]+),\s+(\S+)").unwrap();
let result = re.replace("Springsteen, Bruce", |caps: &Captures| { format!("{} {}", &caps[2], &caps[1]) });
assert_eq!(result, "Bruce Springsteen");

let re = Regex::new(r"(?<last>[^,\s]+),\s+(?<first>\S+)").unwrap();
let result = re.replace("Springsteen, Bruce", "$first $last");
// Use braces when a group name is followed by text that could be part of a name, e.g. "${first}_$last".
assert_eq!(result, "Bruce Springsteen");

let re = Regex::new(r"(?<last>[^,\s]+),\s+(\S+)").unwrap();
let result = re.replace("Springsteen, Bruce", NoExpand("$2 $last"));
assert_eq!(result, "$2 $last");

// replace_all()
use regex::Regex;

let re = Regex::new(r"(?<y>\d{4})-(?<m>\d{2})-(?<d>\d{2})").unwrap();
let before = "1973-01-05, 1975-08-25 and 1980-10-18";
let after = re.replace_all(before, "$m/$d/$y");
assert_eq!(after, "01/05/1973, 08/25/1975 and 10/18/1980");

Verbose mode

Verbose mode lets you add whitespace and comments to a pattern; both are ignored when the regex is compiled:

use regex::Regex;

let re = Regex::new(r"(?x)
  (?P<y>\d{4}) # the year, including all Unicode digits
  -
  (?P<m>\d{2}) # the month, including all Unicode digits
  -
  (?P<d>\d{2}) # the day, including all Unicode digits
").unwrap();

let before = "1973-01-05, 1975-08-25 and 1980-10-18";
let after = re.replace_all(before, "$m/$d/$y");
assert_eq!(after, "01/05/1973, 08/25/1975 and 10/18/1980");

Use once_cell or lazy_static to avoid recompiling a regex

use {
    once_cell::sync::Lazy,
    regex::Regex,
};

fn some_helper_function(haystack: &str) -> bool {
    static RE: Lazy<Regex> = Lazy::new(|| Regex::new(r"...").unwrap());
    RE.is_match(haystack)
}

fn main() {
    assert!(some_helper_function("abc"));
    assert!(!some_helper_function("ac"));
}

A compiled Regex can be shared across threads (it is Send and Sync).

The regex crate also provides a bytes module for matching against &[u8] instead of &str.

Regex syntax

Matching one character:

   .             any character except new line (includes new line with s flag)
   [0-9]         any ASCII digit
   \d            digit (\p{Nd})
   \D            not digit
   \pX           Unicode character class identified by a one-letter name
   \p{Greek}     Unicode character class (general category or script)
   \PX           Negated Unicode character class identified by a one-letter name
   \P{Greek}     negated Unicode character class (general category or script)

Character classes:

   [xyz]         A character class matching either x, y or z (union).
   [^xyz]        A character class matching any character except x, y and z.
   [a-z]         A character class matching any character in range a-z.
   [[:alpha:]]   ASCII character class ([A-Za-z])
   [[:^alpha:]]  Negated ASCII character class ([^A-Za-z])
   [x[^xyz]]     Nested/grouping character class (matching any character except y and z)

   // Intersection
   [a-y&&xyz]    Intersection (matching x or y)
   [0-9&&[^4]]   Subtraction using intersection and negation (matching 0-9 except 4)
   // Difference
   [0-9--4]      Direct subtraction (matching 0-9 except 4)
   // Symmetric difference
   [a-g~~b-h]    Symmetric difference (matching `a` and `h` only)
   [\[\]]        Escaping in character classes (matching [ or ])
   [a&&b]        An empty character class matching nothing

Composites (| has the lowest precedence):

   xy    concatenation (x followed by y)
   x|y   alternation (x or y, prefer x)

Repetitions:

   // Greedy repetition
   x*        zero or more of x (greedy)
   x+        one or more of x (greedy)
   x?        zero or one of x (greedy)

   // Lazy (non-greedy) repetition
   x*?       zero or more of x (ungreedy/lazy)
   x+?       one or more of x (ungreedy/lazy)
   x??       zero or one of x (ungreedy/lazy)

   x{n,m}    at least n x and at most m x (greedy)
   x{n,}     at least n x (greedy)
   x{n}      exactly n x
   x{n,m}?   at least n x and at most m x (ungreedy/lazy)
   x{n,}?    at least n x (ungreedy/lazy)
   x{n}?     exactly n x

Empty matches (zero-width assertions):

   ^               the beginning of a haystack (or start-of-line with multi-line mode)
   $               the end of a haystack (or end-of-line with multi-line mode)

   \A              only the beginning of a haystack (even with multi-line mode enabled)
   \z              only the end of a haystack (even with multi-line mode enabled)

   \b              a Unicode word boundary (\w on one side and \W, \A, or \z on other)
   \B              not a Unicode word boundary

   \b{start}, \<   a Unicode start-of-word boundary (\W|\A on the left, \w on the right)
   \b{end}, \>     a Unicode end-of-word boundary (\w on the left, \W|\z on the right)
   \b{start-half}  half of a Unicode start-of-word boundary (\W|\A on the left)
   \b{end-half}    half of a Unicode end-of-word boundary (\W|\z on the right)

Grouping

   // Numbered group
   (exp)          numbered capture group (indexed by opening parenthesis)

   // Named group
   (?P<name>exp)  named (also numbered) capture group (names must be alpha-numeric)
   (?<name>exp)   named (also numbered) capture group (names must be alpha-numeric)

   // Non-capturing group
   (?:exp)        non-capturing group

   // Set flags for the rest of the current group
   (?flags)       set flags within current group

   // Set flags for the subexpression exp only
   (?flags:exp)   set flags for exp (non-capturing)

Available flags:

   i     case-insensitive: letters match both upper and lower case
   m     multi-line mode: ^ and $ match begin/end of line
   s     allow . to match \n
   R     enables CRLF mode: when multi-line mode is enabled, \r\n is used
   U     swap the meaning of x* and x*?
   u     Unicode support (enabled by default)
   x     verbose mode, ignores whitespace and allow line comments (starting with `#`)

Examples: (?xy) turns the flags x and y on; (?-xy) turns them off:

     let re = Regex::new(r"(?i)a+(?-i)b+").unwrap();
     let m = re.find("AaAaAbbBBBb").unwrap();
     assert_eq!(m.as_str(), "AaAaAbb");

     let re = Regex::new(r"(?m)^line \d+").unwrap();
     let m = re.find("line one\nline 2\n").unwrap();
     assert_eq!(m.as_str(), "line 2");

Escape sequences

   \*              literal *, applies to all ASCII except [0-9A-Za-z<>]
   \a              bell (\x07)
   \f              form feed (\x0C)
   \t              horizontal tab
   \n              new line
   \r              carriage return
   \v              vertical tab (\x0B)

   \A              matches at the beginning of a haystack
   \z              matches at the end of a haystack

   \b              word boundary assertion
   \B              negated word boundary assertion

   \b{start}, \<   start-of-word boundary assertion
   \b{end}, \>     end-of-word boundary assertion

   \b{start-half}  half of a start-of-word boundary assertion
   \b{end-half}    half of an end-of-word boundary assertion

   \123            octal character code, up to three digits (when enabled)
   \x7F            hex character code (exactly two digits)

   \x{10FFFF}      any hex character code corresponding to a Unicode code point
   \u007F          hex character code (exactly four digits)
   \u{7F}          any hex character code corresponding to a Unicode code point
   \U0000007F      hex character code (exactly eight digits)
   \U{7F}          any hex character code corresponding to a Unicode code point
   \p{Letter}      Unicode character class
   \P{Letter}      negated Unicode character class
   \d, \s, \w      Perl character class
   \D, \S, \W      negated Perl character class

Perl character classes

\d     digit (\p{Nd})
\D     not digit
\s     whitespace (\p{White_Space})
\S     not whitespace
\w     word character (\p{Alphabetic} + \p{M} + \d + \p{Pc} + \p{Join_Control})
\W     not word character

ASCII character classes

   [[:alnum:]]    alphanumeric ([0-9A-Za-z])
   [[:alpha:]]    alphabetic ([A-Za-z])
   [[:ascii:]]    ASCII ([\x00-\x7F])
   [[:blank:]]    blank ([\t ])
   [[:cntrl:]]    control ([\x00-\x1F\x7F])
   [[:digit:]]    digits ([0-9])
   [[:graph:]]    graphical ([!-~])
   [[:lower:]]    lower case ([a-z])
   [[:print:]]    printable ([ -~])
   [[:punct:]]    punctuation ([!-/:-@\[-`{-~])
   [[:space:]]    whitespace ([\t\n\v\f\r ])
   [[:upper:]]    upper case ([A-Z])
   [[:word:]]     word characters ([0-9A-Za-z_])
   [[:xdigit:]]   hex digit ([0-9A-Fa-f])