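Using unicodedata.category() to detect punctuation
The Unicode general categories that start with P are reserved for punctuation:
Pc: punctuation, connector
Pd: punctuation, dash
Ps: punctuation, open (start)
Pe: punctuation, close (end)
Pi: punctuation, initial quote (may behave like Ps or Pe depending on usage)
Pf: punctuation, final quote (may behave like Ps or Pe depending on usage)
Po: punctuation, other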
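As a quick check of the list above, the sketch below looks up one sample character per category. The sample characters are chosen here for illustration only. It also shows that some ASCII characters often treated as punctuation, such as "$" and "+", belong to the symbol categories (S*) and would not match a P* test.

```python
from unicodedata import category

# One sample character per punctuation category (samples picked for illustration).
samples = {
    "_": "Pc",       # LOW LINE
    "-": "Pd",       # HYPHEN-MINUS
    "(": "Ps",       # LEFT PARENTHESIS
    ")": "Pe",       # RIGHT PARENTHESIS
    "\u00ab": "Pi",  # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
    "\u00bb": "Pf",  # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
    "!": "Po",       # EXCLAMATION MARK
}

for ch, expected in samples.items():
    print(repr(ch), category(ch), category(ch) == expected)

# These are symbols, not punctuation, as far as Unicode is concerned.
print(category("$"), category("+"))
```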
Python 3.8+ (uses the := assignment expression):
>>> import sys
>>> from unicodedata import category
>>> codepoints = range(sys.maxunicode + 1)
>>> punctuation = {c for i in codepoints if category(c := chr(i)).startswith("P")}
>>> "'" in punctuation
True
>>> "’" in punctuation
True
Any Python 3 version:
>>> import sys
>>> from unicodedata import category
>>> chrs = (chr(i) for i in range(sys.maxunicode + 1))
>>> punctuation = set(c for c in chrs if category(c).startswith("P"))
>>> "'" in punctuation
True
>>> "’" in punctuation
True
Python 2 (uses unichr and unicode literals):
>>> import sys
>>> from unicodedata import category
>>> chrs = (unichr(i) for i in range(sys.maxunicode + 1))
>>> punctuation = set(c for c in chrs if category(c).startswith("P"))
>>> u"'" in punctuation
True
>>> u"’" in punctuation
True
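The same scan over all code points can also feed str.translate, which is a convenient way to delete punctuation from text in Python 3. This is a minimal sketch; the name punctuation_table and the sample string are illustrative and not part of the original snippets.

```python
import sys
from unicodedata import category

# Map every punctuation code point to None so that str.translate deletes it.
punctuation_table = dict.fromkeys(
    i for i in range(sys.maxunicode + 1) if category(chr(i)).startswith("P")
)

text = "\u2018Hello\u2019, world! (really)"
print(text.translate(punctuation_table))  # prints: Hello world really
```

Building the table walks the whole code point range once, so it is worth creating it a single time and reusing it.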
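Punctuation check as a container-like object (no set is built up front):
import unicodedata

class DuckType:
    # Supports the "in" operator without storing any characters.
    def __contains__(self, s):
        # s must be a single character; category() raises TypeError otherwise.
        return unicodedata.category(s).startswith("P")

punct = DuckType()
print("'" in punct, '"' in punct, "a" in punct)
# Python 3 prints: True True False

# In Python 2.7, use unicode literals instead:
# print(u"'" in punct, u'"' in punct, u"a" in punct)
# Python 2.7 prints the tuple: (True, True, False)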
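Because unicodedata.category() accepts exactly one character, testing a longer string means checking it character by character. The helpers below are a hypothetical sketch of that idea for Python 3; the names is_punctuation, contains_punctuation, and count_punctuation are introduced here for illustration.

```python
import unicodedata

def is_punctuation(ch):
    """Return True if the single character ch is in a Unicode P* category."""
    return unicodedata.category(ch).startswith("P")

def contains_punctuation(text):
    """Return True if any character of text is punctuation."""
    return any(is_punctuation(ch) for ch in text)

def count_punctuation(text):
    """Count the punctuation characters in text."""
    return sum(1 for ch in text if is_punctuation(ch))

print(contains_punctuation("abc"))     # prints: False
print(contains_punctuation("a,b"))     # prints: True
print(count_punctuation("a,b.c!"))     # prints: 3
```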