Regular Expressions - LemuSakuya

Regular Expressions

2026-06-03

Web Crawler

Study Notes

Regular Expressions#

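A regular expression (regex or regexp for short) is a pattern that describes which combinations of characters to match in a string. Regular expressions are useful for text processing, data validation, and search-and-replace, and their syntax is compact yet flexible enough to define complex search patterns.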

Basic Components of Regular Expressions#

Characters:
- A literal character matches itself. For example, a matches the character a.
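Character classes:
- Square brackets [] define a set of characters, any one of which may match. For example, [abc] matches a, b, or c.
- Special character classes:
  - .: matches any single character except a newline.
  - \d: matches any digit (equivalent to [0-9] for ASCII text).
  - \D: matches any non-digit (equivalent to [^0-9] for ASCII text).
  - \w: matches any letter, digit, or underscore (equivalent to [a-zA-Z0-9_] for ASCII text).
  - \W: matches any character that is not a letter, digit, or underscore (equivalent to [^a-zA-Z0-9_] for ASCII text).
  - \s: matches any whitespace character (space, tab, newline, and so on).
  - \S: matches any non-whitespace character.
  - In Python 3, \d, \w, and \s also match Unicode digits, letters, and whitespace in str patterns unless the re.ASCII flag is set.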
Quantifiers:
- Specify how many times the preceding character, character class, or group may occur.
- *: zero or more times.
- +: one or more times.
- ?: zero or one time.
- {n}: exactly n times.
- {n,}: at least n times.
- {n,m}: at least n times, but no more than m times.
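Anchors:
- Specify where a match must occur, without consuming any characters.
- ^: matches the start of the string (and, with the MULTILINE flag, the start of each line).
- $: matches the end of the string or just before a trailing newline (and, with the MULTILINE flag, the end of each line).
- \b: matches a word boundary, that is, the position between a \w character and a \W character or the start or end of the string.
- \B: matches any position that is not a word boundary.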
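Grouping:
- Parentheses () group part of a pattern so that a quantifier can apply to the whole group; each group also captures the text it matched.
- Example: (abc)+ matches abc one or more times in a row.
Character sets:
- Inside [] you can combine single characters, character ranges, and special character classes. A leading ^ negates the set.
- Example: [a-z] matches any lowercase ASCII letter.
Special characters:
- Some characters have a special meaning in regular expressions, such as ., *, +, ?, ^, and $. To match one of them literally, escape it with a backslash \.
- Example: \. matches the character . itself.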
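The building blocks above can be tried out in a few lines of Python. This is only a minimal sketch: the sample strings and variable names are made up for illustration, and the focus is on pieces the later examples do not cover ({n,m}, \b, group repetition, escaping, and the ASCII note on \d).

```python
import re

# Character classes: for str patterns, \d also matches non-ASCII digits
# unless re.ASCII is set. "\u0663" is ARABIC-INDIC DIGIT THREE.
unicode_digit = re.fullmatch(r"\d", "\u0663") is not None
ascii_digit = re.fullmatch(r"\d", "\u0663", flags=re.ASCII) is not None

# Quantifiers: a{2,4} accepts two to four repetitions, inclusive.
counts = [re.fullmatch(r"a{2,4}", "a" * n) is not None for n in range(1, 6)]

# Anchors: \b keeps "cat" from matching inside "concatenate".
words = re.findall(r"\bcat\b", "cat concatenate cat.")

# Grouping: the + applies to the whole group "abc", not just to "c".
repeated = re.fullmatch(r"(abc)+", "abcabcabc") is not None
not_repeated = re.fullmatch(r"(abc)+", "abccc") is not None

# Escaping: an unescaped "." matches any character, "\." only a literal dot.
loose = re.findall(r"3.14", "3.14 3x14")
strict = re.findall(r"3\.14", "3.14 3x14")

print(unicode_digit, ascii_digit)
print(counts)
print(words)
print(repeated, not_repeated)
print(loose, strict)
```

re.fullmatch is used here instead of re.search so that each check says something about the whole string rather than about some substring.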

Examples#

Matching a specific string:

import re

text = "Hello, world!"
pattern = "world"
# re.search scans the whole string and returns the first match, or None
match = re.search(pattern, text)
if match:
    print("Match found:", match.group())  # Output: Match found: world
else:
    print("No match found")

Using character classes:

import re

text = "Hello, 123 world!"
pattern = r"\d+"  # Match one or more digits
match = re.search(pattern, text)
if match:
    print("Match found:", match.group())  # Output: Match found: 123
else:
    print("No match found")

Using quantifiers:

import re

text = "Hello, world!"
pattern = r"l+"  # Match one or more 'l' characters
match = re.search(pattern, text)
if match:
    # Only the first run of 'l' is returned: the "ll" in "Hello"
    print("Match found:", match.group())  # Output: Match found: ll
else:
    print("No match found")

Using anchors:

import re

text = "Hello, world!"
pattern = r"^Hello"  # Match "Hello" at the start of the string
match = re.search(pattern, text)
if match:
    print("Match found:", match.group())  # Output: Match found: Hello
else:
    print("No match found")
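Anchors behave a little differently once a string spans several lines. The following sketch, using a made-up two-line string, compares the default behavior of ^ and $ with the re.MULTILINE flag.

```python
import re

text = "Hello, world!\nHello, regex!"

# Without flags, ^ only matches at the very start of the string.
default_hits = re.findall(r"^Hello", text)

# With re.MULTILINE, ^ also matches right after each newline.
multiline_hits = re.findall(r"^Hello", text, flags=re.MULTILINE)

# In the same mode, $ matches at the end of every line.
line_ends = re.findall(r"\w+!$", text, flags=re.MULTILINE)

print(default_hits)
print(multiline_hits)
print(line_ends)
```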

Using groups:

import re

text = "Hello, world!"
pattern = r"(Hello), (\w+)"  # Match "Hello" followed by a word
match = re.search(pattern, text)
if match:
    # group(0) is the whole match; group(1), group(2), ... are the captured groups
    print("Match found:", match.group(0))  # Output: Match found: Hello, world
    print("Group 1:", match.group(1))  # Output: Group 1: Hello
    print("Group 2:", match.group(2))  # Output: Group 2: world
else:
    print("No match found")
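One thing worth knowing about groups: they change what re.findall returns. A small sketch, reusing the pattern above on a slightly longer, made-up string:

```python
import re

text = "Hello, world! Hello, regex!"
pattern = r"(Hello), (\w+)"

# groups() returns every captured group of one match as a tuple.
first = re.search(pattern, text)
first_groups = first.groups()

# When the pattern contains groups, findall returns one tuple of groups
# per match instead of the full matched text.
pairs = re.findall(pattern, text)

print(first_groups)
print(pairs)
```

If the full matched text is needed together with the groups, iterating over re.finditer and calling group(0) on each match is one option.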

Using replacement:

import re

text = "Hello, world!"
pattern = r"world"
replacement = "Python"
# re.sub returns a new string; the original text is left unchanged
new_text = re.sub(pattern, replacement, text)
print(new_text)  # Output: Hello, Python!
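Replacement can also be combined with groups. The sketch below is only an illustration: it uses backreferences (\1, \2) in the replacement string to reorder captured text, and the count argument of re.sub to limit the number of replacements.

```python
import re

text = "Hello, world!"

# \1 and \2 in the replacement refer to the first and second captured group.
swapped = re.sub(r"(\w+), (\w+)", r"\2, \1", text)

# count=2 stops after the first two replacements.
limited = re.sub(r"l", "L", text, count=2)

print(swapped)
print(limited)
```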

The Regular Expression Module#

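In Python, regular expressions are used mainly through the re module, which provides several functions for working with them, for example:

re.search(): scans the string and returns the first match found anywhere in it.
re.match(): matches only at the beginning of the string.
re.findall(): finds all non-overlapping matches and returns them as a list.
re.sub(): replaces every match with a replacement string.
re.split(): splits the string wherever the pattern matches.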
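To see how these five functions differ, here is a short side-by-side sketch that runs them over the same input. The sample string is invented for the comparison.

```python
import re

text = "id=42, id=7; id=1000"

# match only looks at the start of the string, which is "id", so no match.
at_start = re.match(r"\d+", text)

# search finds the first run of digits anywhere in the string.
anywhere = re.search(r"\d+", text)

# findall collects every non-overlapping run of digits.
every = re.findall(r"\d+", text)

# sub replaces each run of digits with "#".
masked = re.sub(r"\d+", "#", text)

# split cuts the string at each comma or semicolon plus trailing spaces.
parts = re.split(r"[,;]\s*", text)

print(at_start)
print(anywhere.group())
print(every)
print(masked)
print(parts)
```

The difference between re.match and re.search is the one that tends to cause surprises: re.match returns None here even though the string contains digits.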

Common Regular Expression Examples#

Matching email addresses:

import re

text = "Contact us at info@example.com or support@example.org"
# local part, "@", domain, then a dot and a top-level domain of at least two letters
pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
matches = re.findall(pattern, text)
print(matches)  # Output: ['info@example.com', 'support@example.org']
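The same pattern can be reused for data validation by requiring it to cover the whole input. This is a rough sketch, not a complete email validator: the pattern is a simplification of what real addresses allow, and the names EMAIL_RE and looks_like_email are hypothetical helpers introduced here.

```python
import re

# Compiling once avoids re-parsing the pattern on every call.
EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")


def looks_like_email(candidate: str) -> bool:
    """Return True if the whole string has the shape of an email address."""
    # fullmatch rejects inputs with extra text before or after the address.
    return EMAIL_RE.fullmatch(candidate) is not None


for sample in ["info@example.com", "info@example", "mail info@example.com"]:
    print(sample, looks_like_email(sample))
```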

Matching phone numbers:

import re

text = "Call me at 123-456-7890 or 987.654.3210"
# three digits, three digits, four digits, separated by "-" or "."
pattern = r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"
matches = re.findall(pattern, text)
print(matches)  # Output: ['123-456-7890', '987.654.3210']
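Adding groups to the phone pattern makes it possible to rewrite every number into one format. A minimal sketch, assuming a dash-separated output is what you want; PHONE_RE and normalize_phones are hypothetical names used only in this example.

```python
import re

# Same pattern as above, with each block of digits captured in a group.
PHONE_RE = re.compile(r"\b(\d{3})[-.](\d{3})[-.](\d{4})\b")


def normalize_phones(text: str) -> list[str]:
    """Return every phone number found in text, rewritten as NNN-NNN-NNNN."""
    # finditer yields match objects, so the groups of each match are available.
    return ["-".join(m.groups()) for m in PHONE_RE.finditer(text)]


numbers = normalize_phones("Call me at 123-456-7890 or 987.654.3210")
print(numbers)
```

Note that the pattern accepts mixed separators such as 123-456.7890, which may or may not be what you want.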