Function details
default function
Basic match(), extract(), and segment() functions.
- plane.func.build_new_regex(name, regex, flag=0, repl=' ')[source]
Build a regex pattern. Any space ' ' in name is replaced by '_'.
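The naming rule can be illustrated with a minimal stand-alone sketch. It does not call plane itself: the `Regex` tuple is a hypothetical stand-in whose fields mirror the `Regex(name=..., pattern=..., flag=..., repl=...)` repr shown in this reference, and `build_regex_sketch` is an illustrative name, not the library's implementation.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for plane.pattern.Regex (fields taken from its repr).
Regex = namedtuple("Regex", ["name", "pattern", "flag", "repl"])


def build_regex_sketch(name, regex, flag=0, repl=" "):
    """Build a Regex record; spaces in the name become underscores."""
    re.compile(regex, flag)  # fail early if the pattern is invalid
    return Regex(name.replace(" ", "_"), regex, flag, repl)


date = build_regex_sketch("iso date", r"\d{4}-\d{2}-\d{2}")
print(date.name)  # iso_date
```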
- plane.func.extract(text, regex)[source]
- Parameters
text (str) – text
regex (Regex) – plane.pattern.Regex
Extract tokens with regex pattern.
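As a rough illustration of token extraction, the sketch below uses only the standard `re` module. The `Token` record and `extract_sketch` are assumptions made for this example; the type plane actually returns may differ.

```python
import re
from collections import namedtuple

# Hypothetical token record; plane's real return type may differ.
Token = namedtuple("Token", ["name", "value", "start", "end"])


def extract_sketch(text, name, pattern, flag=0):
    """Yield one Token per non-overlapping match of pattern in text."""
    for m in re.finditer(pattern, text, flag):
        yield Token(name, m.group(), m.start(), m.end())


tokens = list(extract_sketch("call 911 or 112", "number", r"\d+"))
print([t.value for t in tokens])  # ['911', '112']
```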
- plane.func.replace(text, pattern, repl=None)[source]
- Parameters
text (str) – text
pattern (Regex) – plane.pattern.Regex
repl (str) – replacement for pattern; if set, it overrides the pattern's default repl
Replace matched tokens with repl.
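The override rule for repl can be sketched as follows. This is a stand-alone approximation built on `re.sub`; the `Regex` tuple and `replace_sketch` are hypothetical names, and the field layout is assumed from the repr shown in this reference.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for plane.pattern.Regex.
Regex = namedtuple("Regex", ["name", "pattern", "flag", "repl"])


def replace_sketch(text, regex, repl=None):
    """Substitute matches; an explicit repl overrides regex.repl."""
    if repl is None:
        repl = regex.repl  # fall back to the pattern's own default
    return re.sub(regex.pattern, repl, text, flags=regex.flag)


number = Regex("number", r"\d+", 0, "<NUM>")
print(replace_sketch("room 42", number))       # room <NUM>
print(replace_sketch("room 42", number, "#"))  # room #
```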
- plane.func.segment(text, regex=Regex(name='ASCII_word', pattern="[<$#&]?[a-zA-Z0-9_.-]*\\'?[a-zA-Z0-9]+[%>]?", flag=0, repl=' '))[source]
- Parameters
text (str) – text
regex (Regex) – plane.pattern.Regex
Segment a sentence. Chinese words are split into single characters and English words are kept whole.
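One way to picture this behavior is the sketch below, which reuses the default ASCII_word pattern from the signature above. `segment_sketch` is an illustrative function, not plane's code, and it assumes that whitespace is dropped from the output.

```python
import re

# Default pattern copied from the signature above.
ASCII_WORD = r"[<$#&]?[a-zA-Z0-9_.-]*\'?[a-zA-Z0-9]+[%>]?"


def segment_sketch(text, pattern=ASCII_WORD):
    """Keep pattern matches whole; split everything else into characters."""
    tokens, pos = [], 0
    for m in re.finditer(pattern, text):
        # Text between two matches is emitted one character at a time.
        tokens.extend(ch for ch in text[pos:m.start()] if not ch.isspace())
        tokens.append(m.group())
        pos = m.end()
    # Trailing text after the last match (assumption: whitespace is dropped).
    tokens.extend(ch for ch in text[pos:] if not ch.isspace())
    return tokens


print(segment_sketch("你好 hello world"))  # ['你', '好', 'hello', 'world']
```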
Chain function call
The Plane class supports chained function calls.
- class plane.plane.Plane[source]
Plane.text and Plane.values are initialized when the instance is created.
- normalize_punctuation(punc=<plane.punctuation.Punctuation object>)[source]
Normalize punctuation to English punctuation.
- remove_punctuation(repl=' ', punc=<plane.punctuation.Punctuation object>)[source]
- Parameters
repl (str) – replacement for regex; if set, the default value is overridden
Remove all punctuation.
- replace(regex, repl=None, result=False)[source]
- Parameters
Replace matched regex patterns with repl.
- segment(regex=Regex(name='ASCII_word', pattern="[<$#&]?[a-zA-Z0-9_.-]*\\'?[a-zA-Z0-9]+[%>]?", flag=0, repl=' '))[source]
- Parameters
regex (Regex) – the default regex is ASCII_WORD, which keeps English words whole
Segment a sentence. Chinese words are split into single characters and English words are kept whole.
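Chained calls work when each method returns the instance itself. The class below is a hypothetical miniature written for illustration only; `ChainSketch`, its `strip` method, and the way `values` collects matches are assumptions, not the Plane implementation.

```python
import re


class ChainSketch:
    """Hypothetical miniature of a chainable text processor."""

    def __init__(self, text=""):
        self.text = text
        self.values = []  # assumption: matched values are collected here

    def replace(self, pattern, repl=" "):
        self.values.extend(re.findall(pattern, self.text))
        self.text = re.sub(pattern, repl, self.text)
        return self  # returning self is what makes chaining possible

    def strip(self):
        # Collapse runs of whitespace into single spaces.
        self.text = " ".join(self.text.split())
        return self


p = ChainSketch("call 911,  then  wait").replace(r"\d+", "<NUM>").strip()
print(p.text)    # call <NUM>, then wait
print(p.values)  # ['911']
```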
Pipeline
punctuation
- class plane.punctuation.Punctuation(normalization=None)[source]
All punctuation characters in Unicode.
Abbr. Description
Pc - Punctuation, Connector
Pd - Punctuation, Dash
Ps - Punctuation, Open
Pe - Punctuation, Close
Pi - Punctuation, Initial quote (may behave like Ps or Pe)
Pf - Punctuation, Final quote (may behave like Ps or Pe)
Po - Punctuation, Other
Some characters are not counted as punctuation, such as +, ^, $, and ~.
You can use Plane.pattern to process these characters.
- Parameters
normalization (dict) – punctuation normalization map
- normalize(text)[source]
- Parameters
text (str) – input text
Convert punctuation from other languages to English punctuation. Not every punctuation mark is covered.
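A normalization map of this kind can be applied with `str.translate`. The sketch below is only an approximation: the `NORMALIZATION` table is a small hypothetical sample of full-width marks, and `normalize_sketch` is an illustrative name; the map that plane ships is not reproduced here.

```python
# Hypothetical sample map: full-width marks to their ASCII counterparts.
NORMALIZATION = {",": ",", "。": ".", "!": "!", "?": "?", ":": ":"}


def normalize_sketch(text, table=NORMALIZATION):
    """Replace each mapped punctuation mark; leave other characters alone."""
    return text.translate(str.maketrans(table))


print(normalize_sketch("你好,世界!"))  # 你好,世界!
```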
- remove(text, repl=' ')[source]
- Parameters
text (str) – input text
Remove all punctuation.
This method uses unicodedata (https://docs.python.org/3.6/library/unicodedata.html) to get all the punctuation characters.
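The category-based approach can be sketched with the standard library alone. This is an assumption about how such a set could be built, not plane's actual code; `PUNCTUATION` and `remove_sketch` are illustrative names.

```python
import sys
import unicodedata

# Every code point whose general category starts with "P"
# (Pc, Pd, Ps, Pe, Pi, Pf, Po).
PUNCTUATION = {
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith("P")
}


def remove_sketch(text, repl=" "):
    """Replace every punctuation character in text with repl."""
    return "".join(repl if ch in PUNCTUATION else ch for ch in text)


print(remove_sketch("Hello, world!", repl=""))    # Hello world
print([unicodedata.category(c) for c in "+^$~"])  # ['Sm', 'Sk', 'Sc', 'Sm']
```

The second line of output shows why +, ^, $, and ~ are left untouched: Unicode classifies them as symbols (S*), not punctuation (P*).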