Function details

default function

Basic functions: build_new_regex(), extract(), replace(), and segment().

plane.func.build_new_regex(name, regex, flag=0, repl=' ')

Parameters

name (str) – regex pattern name
regex (str) – regular expression string
flag (int) – regex flag, 0 by default
repl (str) – default replacement string, ' ' by default

Build a regex pattern. Any space ' ' in name is replaced by '_'.
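The name handling described above can be sketched as follows. This is a minimal stand-in, not the library's implementation: the `Regex` namedtuple is an assumption modeled on the repr shown in the segment() signature, and `build_regex_sketch` is a hypothetical name.

```python
from collections import namedtuple

# Assumed stand-in for plane.pattern.Regex, using the fields shown in its repr.
Regex = namedtuple("Regex", ["name", "pattern", "flag", "repl"])


def build_regex_sketch(name, regex, flag=0, repl=" "):
    """Hypothetical helper: build a pattern record, replacing spaces in the name."""
    return Regex(name.replace(" ", "_"), regex, flag, repl)


custom = build_regex_sketch("my number", r"\d+", repl="<num>")
print(custom.name)  # my_number
```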

plane.func.extract(text, regex)

Parameters

text (str) – text
regex (Regex) – plane.pattern.Regex

Extract tokens with regex pattern.
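As a rough illustration of extraction with a pattern record, the sketch below uses only the standard `re` module. The `Regex` namedtuple, the `NUMBER` pattern, and `extract_sketch` are hypothetical, and the real function may return richer token objects than these tuples.

```python
import re
from collections import namedtuple

# Assumed stand-in for plane.pattern.Regex.
Regex = namedtuple("Regex", ["name", "pattern", "flag", "repl"])

# Hypothetical pattern for illustration; not one of the library's built-ins.
NUMBER = Regex("number", r"\d+", 0, "<num>")


def extract_sketch(text, regex):
    """Return (start, end, matched text) for every match of the pattern."""
    compiled = re.compile(regex.pattern, regex.flag)
    return [(m.start(), m.end(), m.group()) for m in compiled.finditer(text)]


tokens = extract_sketch("call 911 or 112", NUMBER)
print(tokens)  # [(5, 8, '911'), (12, 15, '112')]
```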

plane.func.replace(text, pattern, repl=None)

Parameters

text (str) – text
pattern (Regex) – plane.pattern.Regex
repl (str) – replacement for pattern; if set, it overrides the pattern's default repl

Replace matched tokens with repl.
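The fallback between an explicit repl and the pattern's default can be sketched like this. It is an illustration under assumptions, not the library's code: `Regex`, `NUMBER`, and `replace_sketch` are hypothetical stand-ins.

```python
import re
from collections import namedtuple

# Assumed stand-in for plane.pattern.Regex.
Regex = namedtuple("Regex", ["name", "pattern", "flag", "repl"])

# Hypothetical pattern whose default replacement is "<num>".
NUMBER = Regex("number", r"\d+", 0, "<num>")


def replace_sketch(text, pattern, repl=None):
    # Use the pattern's own replacement unless the caller supplies one.
    replacement = pattern.repl if repl is None else repl
    return re.sub(pattern.pattern, replacement, text, flags=pattern.flag)


print(replace_sketch("call 911", NUMBER))         # call <num>
print(replace_sketch("call 911", NUMBER, "###"))  # call ###
```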

plane.func.segment(text, regex=Regex(name='ASCII_word', pattern="[<$#&]?[a-zA-Z0-9_.-]*\\'?[a-zA-Z0-9]+[%>]?", flag=0, repl=' '))

Parameters

text (str) – text
regex (Regex) – plane.pattern.Regex

Segment a sentence. Chinese words are split into single characters, and English words are kept whole.
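One way to picture this behavior is the sketch below, which reuses the pattern string from the default ASCII_word regex in the signature. The function name `segment_sketch` and the choice to drop whitespace are assumptions; the library's handling of whitespace and other characters may differ.

```python
import re

# Pattern string copied from the default ASCII_word regex shown in the signature.
ASCII_WORD = re.compile(r"[<$#&]?[a-zA-Z0-9_.-]*\'?[a-zA-Z0-9]+[%>]?")


def segment_sketch(text, regex=ASCII_WORD):
    """Keep regex matches whole and emit every other character on its own."""
    tokens = []
    pos = 0
    for match in regex.finditer(text):
        # Characters between matches become single tokens; whitespace is skipped.
        tokens.extend(ch for ch in text[pos:match.start()] if not ch.isspace())
        tokens.append(match.group())
        pos = match.end()
    tokens.extend(ch for ch in text[pos:] if not ch.isspace())
    return tokens


print(segment_sketch("你好 hello world"))  # ['你', '好', 'hello', 'world']
```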

Chain function call

The Plane class supports chained function calls.

class plane.plane.Plane

Init Plane.text and Plane.values when the instance is created.

extract(regex, result=False)

Parameters

regex (Regex) – Regex
result (bool) – if True, return result directly

Extract tokens. The results are saved in Plane.values.

normalize_punctuation(punc=<plane.punctuation.Punctuation object>)

Normalize punctuation to English punctuation.

remove_punctuation(repl=' ', punc=<plane.punctuation.Punctuation object>)

Parameters: repl (str) – replacement for punctuation; if set, it overrides the default value

Remove all punctuation.

replace(regex, repl=None, result=False)

Parameters

regex (Regex) – Regex
repl (str) – replacement for regex; if set, it overrides the default value
result (bool) – if True, return result directly

Replace matched regex patterns with repl.

segment(regex=Regex(name='ASCII_word', pattern="[<$#&]?[a-zA-Z0-9_.-]*\\'?[a-zA-Z0-9]+[%>]?", flag=0, repl=' '))

Parameters: regex (Regex) – the default regex is ASCII_word, which keeps English words intact

Segment a sentence. Chinese words are split into single characters, and English words are kept whole.

update(text)

Parameters: text (str) – text string.

Init Plane.text and Plane.values.
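The chaining pattern and the role of the result flag can be sketched with a small stand-in class. `ChainSketch` is hypothetical and is not plane.plane.Plane: it takes plain pattern strings instead of Regex objects, and it assumes that update() returns the instance so calls can be chained.

```python
import re


class ChainSketch:
    """Hypothetical stand-in that shows the chaining pattern only."""

    def __init__(self):
        self.text = ""
        self.values = []

    def update(self, text):
        # Reset the state for a new input string.
        self.text = text
        self.values = []
        return self

    def extract(self, pattern, result=False):
        # Store matches on the instance; return them directly if requested.
        self.values = re.findall(pattern, self.text)
        return self.values if result else self

    def replace(self, pattern, repl, result=False):
        # Rewrite the stored text; return it directly if requested.
        self.text = re.sub(pattern, repl, self.text)
        return self.text if result else self


p = ChainSketch()
out = p.update("id 42, id 7").extract(r"\d+").replace(r"\d+", "<num>", result=True)
print(out)       # id <num>, id <num>
print(p.values)  # ['42', '7']
```

Returning the instance by default is what allows the calls to be strung together; result=True ends the chain by returning the value itself.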

Pipeline

class plane.pipeline.Pipeline(*functions)

Initialize pipeline with functions. For example:

pl = Pipeline(
    lambda text: replace(text, EMAIL),
    segment,
)
pl("My email is abc@hello.com")

>> ["My", "email", "is", "<Email>"]

add(func, *args, **kwargs)

Add a function to the pipeline, along with any extra arguments to call it with.

pl = Pipeline()
pl.add(replace, EMAIL)
pl.add(segment)
pl("My email is abc@hello.com")

>> ["My", "email", "is", "<Email>"]
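A pipeline of this kind can be sketched in a few lines of plain Python. `PipelineSketch` is a hypothetical stand-in, not the library class, and it assumes that the extra arguments given to add() are passed after the text argument. Built-in string methods stand in for replace and segment so the example runs on its own.

```python
class PipelineSketch:
    """Hypothetical stand-in: apply a sequence of text functions in order."""

    def __init__(self, *functions):
        self.functions = list(functions)

    def add(self, func, *args, **kwargs):
        # Assumption: extra arguments are bound after the text argument.
        self.functions.append(lambda text: func(text, *args, **kwargs))

    def __call__(self, text):
        # Feed the output of each step into the next one.
        for func in self.functions:
            text = func(text)
        return text


pl = PipelineSketch(str.strip)
pl.add(str.replace, "abc@hello.com", "<Email>")
pl.add(str.split)
print(pl("  My email is abc@hello.com "))  # ['My', 'email', 'is', '<Email>']
```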

punctuation

class plane.punctuation.Punctuation(normalization=None)

All punctuation characters in Unicode.

Abbr. Description

Pc - Punctuation, Connector
Pd - Punctuation, Dash
Ps - Punctuation, Open
Pe - Punctuation, Close
Pi - Punctuation, Initial quote (may behave like Ps or Pe)
Pf - Punctuation, Final quote (may behave like Ps or Pe)
Po - Punctuation, Other

Some characters are not classified as punctuation, such as +, ^, $, and ~.

You can use plane.pattern to process these characters.

Parameters: normalization (dict) – punctuation normalization map

normalize(text)

Parameters: text (str) – input text

Convert punctuation from other languages to English punctuation. Not every punctuation mark is covered.
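A normalization map of this kind can be applied with `str.translate`, as in the sketch below. The map contents and the name `normalize_sketch` are hypothetical; the library's default map is not shown in this reference.

```python
# Hypothetical map for illustration: full-width marks to ASCII equivalents.
NORMALIZATION = {"，": ",", "。": ".", "！": "!", "？": "?"}


def normalize_sketch(text, normalization=NORMALIZATION):
    """Replace each mapped punctuation character and leave the rest untouched."""
    table = str.maketrans(normalization)
    return text.translate(table)


print(normalize_sketch("你好，世界！"))  # 你好,世界!
```

Characters that are missing from the map pass through unchanged, which matches the note that not every punctuation mark is covered.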

remove(text, repl=' ')

Parameters: text (str) – input text

Remove all punctuations.

This method uses unicodedata (https://docs.python.org/3.6/library/unicodedata.html) to identify all punctuation characters.
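The category-based approach can be sketched with the standard library alone. The helper names are hypothetical, and the library may build its punctuation set differently; the sketch only relies on the fact that every Unicode punctuation category listed above starts with "P".

```python
import unicodedata


def is_punctuation(ch):
    # Pc, Pd, Ps, Pe, Pi, Pf, and Po all start with "P".
    return unicodedata.category(ch).startswith("P")


def remove_punctuation_sketch(text, repl=" "):
    """Replace every punctuation character in text with repl."""
    return "".join(repl if is_punctuation(ch) else ch for ch in text)


print(remove_punctuation_sketch("Hello, world!", repl=""))  # Hello world
print([ch for ch in "+^$~" if is_punctuation(ch)])          # []
```

The second call shows why +, ^, $, and ~ are left alone: Unicode classifies them as symbols, not punctuation.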