Regular expressions that match "digits"

Author: gugod. Published: 2024/11/01, updated: 2024/11/03 #regexp

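In a lot of code, public and private alike, it is very common to see \d used to decide whether some input is a number. For example, this attempt to check that the id in a URI contains nothing but digits: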

unless ($uri->query_param('id') =~ /^\d+$/) {
     croak $InvalidIdException;
}

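This does let strings such as "123", "0135", and "00000" pass the check, so it is correct to a degree. However, the character class \d is actually very large. Intuitively one might assume that the set has only ten elements, that is, that it is the same as [0123456789].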

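But if you test every Unicode code point below one million and count the characters that match \d, you find that the set has 680 elements. In other words, among the first one million Unicode code points there are 680 "digits":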

# perl -C7 -E 'my @digits = grep { /\d/ } map { chr($_) } 0..999999; say scalar @digits'

680

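It is true that in the days when ASCII ruled, \d was simply the set of ASCII characters that represent digits, that is, [0123456789]. But since Unicode became widespread, quite a few regular expression engines have redefined \d as "all characters whose Unicode character property is Digit" (in Unicode terms, General_Category Nd, Decimal_Number).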
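As a rough cross-check of that definition, the sketch below (Python, standard library only) compares the characters matched by \d with the characters whose general category is Nd. The function names are made up for illustration, and the exact count printed depends on the Unicode version bundled with the interpreter.

```python
import re
import sys
import unicodedata

# str patterns are Unicode-aware by default in Python 3.
DIGIT = re.compile(r"\d")


def regex_digits():
    # Every character, over all code points, that \d matches.
    return {chr(cp) for cp in range(sys.maxunicode + 1) if DIGIT.match(chr(cp))}


def nd_digits():
    # Every character whose general category is Nd (Decimal_Number).
    return {chr(cp) for cp in range(sys.maxunicode + 1)
            if unicodedata.category(chr(cp)) == "Nd"}


matched = regex_digits()
print(len(matched), matched == nd_digits())
```

If the two sets are equal on your interpreter, then \d there is just another spelling of "category Nd".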

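And since Unicode tries to enumerate the writing systems of as many languages as it can, the digit characters used by those languages are naturally all included as well.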

These include, for example, Thai digits:

๐ ๑ ๒ ๓ ๔ ๕ ๖ ๗ ๘ ๙

or Tibetan digits:

༠ ༡ ༢ ༣ ༤ ༥ ༦ ༧ ༨ ༩
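To make the consequence concrete, here is a minimal Python sketch of a similar check applied to Thai digits. The function name looks_numeric is hypothetical. The point is only that a \d-based check accepts such input, and that Python's int() then turns it into an ordinary integer.

```python
import re


def looks_numeric(value: str) -> bool:
    # Similar in spirit to a /^\d+$/ check, using Python's Unicode-aware \d.
    return re.fullmatch(r"\d+", value) is not None


thai = "๑๒๓"  # Thai digits one, two, three
print(looks_numeric("123"), looks_numeric(thai))  # True True
print(int(thai))  # 123, because int() accepts Unicode decimal digits
```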

Perl is not the only language whose \d is this broad.

raku appears to behave the same as perl:

# raku --version
Welcome to Rakudo™ v2024.09.
Implementing the Raku® Programming Language v6.d.
Built on MoarVM version 2024.09.

# raku -e '(^1000000).map(&chr).grep({ /\d/ }).elems.say' 
680

python 3.12.7 also appears to behave the same as perl:

# python3 --version
Python 3.12.7

# echo "import re; print(len([chr(i) for i in range(1000000) if re.match(r'\d', chr(i))]))" | python3 -
680

In Swift, \d also matches 680 characters:

# swift -version
Swift version 6.0.1 (swift-6.0.1-RELEASE)
Target: x86_64-unknown-linux-gnu

# swift -e '
import Foundation
let digits = (0...999999)
    .compactMap { UnicodeScalar($0) }
    .map { String($0) }
    .filter { $0.range(of: "\\d", options: .regularExpression) != nil }
print(digits.count)
'

680

PHP 8: when using multi-byte strings and mb_ereg_match, \d likewise covers 680 characters:

# php --version
PHP 8.3.13 (cli) (built: Oct 22 2024 18:39:14) (NTS gcc x86_64)
Copyright (c) The PHP Group
Zend Engine v4.3.13, Copyright (c) Zend Technologies

# php -r '
$digits = array_filter(array_map(function($i) {
    return mb_chr($i, "UTF-8");
}, range(0, 999999)), function($char) {
    return mb_ereg_match("\\d", $char);
});

echo count($digits);
'

680

In node js and kotlin (JVM), on the other hand, \d appears to be equal to [0123456789] (see Note 1):

# node --version
v20.17.0

# node -e 'console.log( Array.from({length: 1000000}, (_,i) => String.fromCodePoint(i)).filter(c => /\d/.test(c)).length )'
10

# kotlin -version
Kotlin version 2.0.21-release-482 (JRE 17.0.12+7)

## String(int[], int, int) builds a String from Unicode code points. It comes from Java String: https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#String-int:A-int-int-
# kotlin -e '(0..999999).map { String(intArrayOf(it), 0, 1) }.filter { Regex("\\d").matches(it) }.let { println(it.size) }'
10

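So although the meaning of \d may differ from language to language, and although it still stands for "the set made up of all digits", for some languages that set has expanded to include the digits in Unicode and is no longer just the ten digits 0123456789.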
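If the intent really is "only the ten ASCII digits", one option is to say so explicitly rather than rely on \d. A minimal Python sketch, with hypothetical function names: either spell the class out as [0-9], or keep \d and pass re.ASCII so that it is restricted to ASCII.

```python
import re

ASCII_DIGITS = re.compile(r"[0-9]+")          # explicit character class
ASCII_FLAGGED = re.compile(r"\d+", re.ASCII)  # \d limited to [0-9] by the flag


def is_ascii_id(value: str) -> bool:
    # True only for a non-empty run of the characters 0 through 9.
    return ASCII_DIGITS.fullmatch(value) is not None


def is_ascii_id_flagged(value: str) -> bool:
    # Same check, written with \d plus re.ASCII.
    return ASCII_FLAGGED.fullmatch(value) is not None


print(is_ascii_id("0135"), is_ascii_id("๑๒๓"))                  # True False
print(is_ascii_id_flagged("0135"), is_ascii_id_flagged("๑๒๓"))  # True False
```

In perl, the /a modifier has a similar effect on \d, so /^\d+$/a would be one way to tighten the check shown at the top.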

Finally, here is the complete list of the \d set in perl.

# # Print all characters that match \d.
# perl -C7 -E 'my @digits = grep { /\d/ } map { chr($_) } 0..999999; while (@digits) { say join(" ", splice(@digits,0, 10)) }'
0 1 2 3 4 5 6 7 8 9
٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩
۰ ۱ ۲ ۳ ۴ ۵ ۶ ۷ ۸ ۹
߀ ߁ ߂ ߃ ߄ ߅ ߆ ߇ ߈ ߉
० १ २ ३ ४ ५ ६ ७ ८ ९
০ ১ ২ ৩ ৪ ৫ ৬ ৭ ৮ ৯
੦ ੧ ੨ ੩ ੪ ੫ ੬ ੭ ੮ ੯
૦ ૧ ૨ ૩ ૪ ૫ ૬ ૭ ૮ ૯
୦ ୧ ୨ ୩ ୪ ୫ ୬ ୭ ୮ ୯
௦ ௧ ௨ ௩ ௪ ௫ ௬ ௭ ௮ ௯
౦ ౧ ౨ ౩ ౪ ౫ ౬ ౭ ౮ ౯
೦ ೧ ೨ ೩ ೪ ೫ ೬ ೭ ೮ ೯
൦ ൧ ൨ ൩ ൪ ൫ ൬ ൭ ൮ ൯
෦ ෧ ෨ ෩ ෪ ෫ ෬ ෭ ෮ ෯
๐ ๑ ๒ ๓ ๔ ๕ ๖ ๗ ๘ ๙
໐ ໑ ໒ ໓ ໔ ໕ ໖ ໗ ໘ ໙
༠ ༡ ༢ ༣ ༤ ༥ ༦ ༧ ༨ ༩
၀ ၁ ၂ ၃ ၄ ၅ ၆ ၇ ၈ ၉
႐ ႑ ႒ ႓ ႔ ႕ ႖ ႗ ႘ ႙
០ ១ ២ ៣ ៤ ៥ ៦ ៧ ៨ ៩
᠐ ᠑ ᠒ ᠓ ᠔ ᠕ ᠖ ᠗ ᠘ ᠙
᥆ ᥇ ᥈ ᥉ ᥊ ᥋ ᥌ ᥍ ᥎ ᥏
᧐ ᧑ ᧒ ᧓ ᧔ ᧕ ᧖ ᧗ ᧘ ᧙
᪀ ᪁ ᪂ ᪃ ᪄ ᪅ ᪆ ᪇ ᪈ ᪉
᪐ ᪑ ᪒ ᪓ ᪔ ᪕ ᪖ ᪗ ᪘ ᪙
᭐ ᭑ ᭒ ᭓ ᭔ ᭕ ᭖ ᭗ ᭘ ᭙
᮰ ᮱ ᮲ ᮳ ᮴ ᮵ ᮶ ᮷ ᮸ ᮹
᱀ ᱁ ᱂ ᱃ ᱄ ᱅ ᱆ ᱇ ᱈ ᱉
᱐ ᱑ ᱒ ᱓ ᱔ ᱕ ᱖ ᱗ ᱘ ᱙
꘠ ꘡ ꘢ ꘣ ꘤ ꘥ ꘦ ꘧ ꘨ ꘩
꣐ ꣑ ꣒ ꣓ ꣔ ꣕ ꣖ ꣗ ꣘ ꣙
꤀ ꤁ ꤂ ꤃ ꤄ ꤅ ꤆ ꤇ ꤈ ꤉
꧐ ꧑ ꧒ ꧓ ꧔ ꧕ ꧖ ꧗ ꧘ ꧙
꧰ ꧱ ꧲ ꧳ ꧴ ꧵ ꧶ ꧷ ꧸ ꧹
꩐ ꩑ ꩒ ꩓ ꩔ ꩕ ꩖ ꩗ ꩘ ꩙
꯰ ꯱ ꯲ ꯳ ꯴ ꯵ ꯶ ꯷ ꯸ ꯹
０ １ ２ ３ ４ ５ ６ ７ ８ ９
𐒠 𐒡 𐒢 𐒣 𐒤 𐒥 𐒦 𐒧 𐒨 𐒩
𐴰 𐴱 𐴲 𐴳 𐴴 𐴵 𐴶 𐴷 𐴸 𐴹
𑁦 𑁧 𑁨 𑁩 𑁪 𑁫 𑁬 𑁭 𑁮 𑁯
𑃰 𑃱 𑃲 𑃳 𑃴 𑃵 𑃶 𑃷 𑃸 𑃹
𑄶 𑄷 𑄸 𑄹 𑄺 𑄻 𑄼 𑄽 𑄾 𑄿
𑇐 𑇑 𑇒 𑇓 𑇔 𑇕 𑇖 𑇗 𑇘 𑇙
𑋰 𑋱 𑋲 𑋳 𑋴 𑋵 𑋶 𑋷 𑋸 𑋹
𑑐 𑑑 𑑒 𑑓 𑑔 𑑕 𑑖 𑑗 𑑘 𑑙
𑓐 𑓑 𑓒 𑓓 𑓔 𑓕 𑓖 𑓗 𑓘 𑓙
𑙐 𑙑 𑙒 𑙓 𑙔 𑙕 𑙖 𑙗 𑙘 𑙙
𑛀 𑛁 𑛂 𑛃 𑛄 𑛅 𑛆 𑛇 𑛈 𑛉
𑜰 𑜱 𑜲 𑜳 𑜴 𑜵 𑜶 𑜷 𑜸 𑜹
𑣠 𑣡 𑣢 𑣣 𑣤 𑣥 𑣦 𑣧 𑣨 𑣩
𑥐 𑥑 𑥒 𑥓 𑥔 𑥕 𑥖 𑥗 𑥘 𑥙
𑱐 𑱑 𑱒 𑱓 𑱔 𑱕 𑱖 𑱗 𑱘 𑱙
𑵐 𑵑 𑵒 𑵓 𑵔 𑵕 𑵖 𑵗 𑵘 𑵙
𑶠 𑶡 𑶢 𑶣 𑶤 𑶥 𑶦 𑶧 𑶨 𑶩
𑽐 𑽑 𑽒 𑽓 𑽔 𑽕 𑽖 𑽗 𑽘 𑽙
𖩠 𖩡 𖩢 𖩣 𖩤 𖩥 𖩦 𖩧 𖩨 𖩩
𖫀 𖫁 𖫂 𖫃 𖫄 𖫅 𖫆 𖫇 𖫈 𖫉
𖭐 𖭑 𖭒 𖭓 𖭔 𖭕 𖭖 𖭗 𖭘 𖭙
𝟎 𝟏 𝟐 𝟑 𝟒 𝟓 𝟔 𝟕 𝟖 𝟗
𝟘 𝟙 𝟚 𝟛 𝟜 𝟝 𝟞 𝟟 𝟠 𝟡
𝟢 𝟣 𝟤 𝟥 𝟦 𝟧 𝟨 𝟩 𝟪 𝟫
𝟬 𝟭 𝟮 𝟯 𝟰 𝟱 𝟲 𝟳 𝟴 𝟵
𝟶 𝟷 𝟸 𝟹 𝟺 𝟻 𝟼 𝟽 𝟾 𝟿
𞅀 𞅁 𞅂 𞅃 𞅄 𞅅 𞅆 𞅇 𞅈 𞅉
𞋰 𞋱 𞋲 𞋳 𞋴 𞋵 𞋶 𞋷 𞋸 𞋹
𞓰 𞓱 𞓲 𞓳 𞓴 𞓵 𞓶 𞓷 𞓸 𞓹
𞥐 𞥑 𞥒 𞥓 𞥔 𞥕 𞥖 𞥗 𞥘 𞥙
🯰 🯱 🯲 🯳 🯴 🯵 🯶 🯷 🯸 🯹

Note 1: (updated 11/3)

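timdream and gholk wrote in to point out that String.fromCharCode(i), as used in the node js example, can only produce characters for i up to 65535 (0xFFFF). Once i exceeds 65535 it overflows and produces some character in the range 0..65535. So the following code from the old example, which measured 160, was actually incorrect: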

# node -e 'console.log( Array.from({length: 1000000}, (_,i) => String.fromCharCode(i)).filter(c => /\d/.test(c)).length )'
160

The old kotlin example used toChar(), which likewise only produces characters whose code points fall in 0..65535. So that example was misleading too.

# kotlin -e '(0..999999).map { it.toChar() }.filter { Regex("\\d").matches(it.toString()) }.let { println(it.size) }'
160
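The figure 160 is consistent with that explanation. Both String.fromCharCode and toChar() keep only the low 16 bits of the number, so the ten ASCII digits U+0030..U+0039 are produced again every 65536 steps. The Python sketch below only emulates that truncation with i & 0xFFFF (it does not call node or kotlin), and it assumes \d is [0-9], as measured for those two runtimes.

```python
def count_wrapped_ascii_digits(limit: int = 1_000_000) -> int:
    # Count i in 0..limit-1 whose low 16 bits land on an ASCII digit.
    return sum(1 for i in range(limit) if 0x30 <= (i & 0xFFFF) <= 0x39)


print(count_wrapped_ascii_digits())  # 160: the ten digits are hit 16 times
```

The digits are hit at i = 48..57 plus k * 65536 for k = 0..15, and the last of those, 983097, is still below 1000000.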