Welcome to Plane’s documentation!

Plane is a tool for shaving the useless parts off a sentence, much as a carpenter's plane shapes wood.

https://upload.wikimedia.org/wikipedia/commons/e/e3/Kanna2.gif

This package ships with useful regex patterns for HTML tags, URLs, emails, and more. You can also write your own regex patterns and combine them with the existing ones.

Features

built-in regex patterns: plane.pattern.Regex
custom regex patterns
pattern combination
extract, replace patterns
segment sentence
chain function calls: plane.plane.Plane
pipeline: plane.pipeline.Pipeline

Why do we need this?

In NLP (natural language processing) tasks, cleaning text data may be one of the most tedious steps. Plane is built for exactly this:

extract content from web page source
detect URLs, emails, and telephone numbers
split sentences that mix Chinese and English
remove all punctuation to get plain text

Usage

Only Python 3 is supported.

extract and replace

from plane import EMAIL, extract, replace
text = 'fake@no.com & fakefake@nothing.com'

emails = extract(text, EMAIL) # this returns a generator object
for e in emails:
    print(e)

>>> Token(name='Email', value='fake@no.com', start=0, end=11)
>>> Token(name='Email', value='fakefake@nothing.com', start=14, end=34)

print(EMAIL)

>>> Regex(name='Email', pattern='([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-]+)', repl='<Email>')

replace(text, EMAIL) # replace(text, Regex, repl): if repl is not provided, Regex.repl is used

>>> '<Email> & <Email>'

replace(text, EMAIL, '')

>>> ' & '
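
To make the behaviour above concrete, here is a minimal sketch of how extract and replace could be built on the standard re module. It is not Plane's actual implementation: extract_sketch, replace_sketch and the local Token class are hypothetical names introduced for illustration, and only the email pattern is taken from the printed Regex above.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    """Hypothetical stand-in for the Token shown in the output above."""
    name: str
    value: str
    start: int
    end: int


# Email pattern copied from the printed Regex above.
EMAIL_PATTERN = re.compile(r'([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-]+)')


def extract_sketch(text: str, pattern: re.Pattern, name: str) -> Iterator[Token]:
    """Lazily yield one Token per non-overlapping match (a generator)."""
    for m in pattern.finditer(text):
        yield Token(name, m.group(), m.start(), m.end())


def replace_sketch(text: str, pattern: re.Pattern, repl: str) -> str:
    """Substitute every match with repl."""
    return pattern.sub(repl, text)


text = 'fake@no.com & fakefake@nothing.com'
tokens = list(extract_sketch(text, EMAIL_PATTERN, 'Email'))
masked = replace_sketch(text, EMAIL_PATTERN, '<Email>')
```

Because extract_sketch is a generator, nothing is matched until the result is iterated, which mirrors the generator object returned by extract.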

Pattern combination

You can create your own pattern with plane.func.build_new_regex():

from plane import extract, build_new_regex, CHINESE_WORDS
ASCII = build_new_regex('ascii', r'[a-zA-Z0-9]+', ' ')
WORDS = ASCII + CHINESE_WORDS
print(WORDS)

>>> Regex(name='ascii_Chinese_words', pattern='[a-zA-Z0-9]+|[\\U00004E00-\\U00009FFF\\U00003400-\\U00004DBF\\U00020000-\\U0002A6DF\\U0002A700-\\U0002B73F\\U0002B740-\\U0002B81F\\U0002B820-\\U0002CEAF\\U0002CEB0-\\U0002EBEF]+', repl=' ')

text = "自然语言处理太难了！who can help me? (╯▔🔺▔)╯"
print(' '.join([t.value for t in list(extract(text, WORDS))]))

>>> 自然语言处理太难了 who can help me
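
The printed result suggests that adding two patterns joins their names with "_" and their patterns with "|". The sketch below illustrates that idea with a hypothetical RegexSketch dataclass. It is an assumption drawn from the output above, not Plane's source: keeping the left operand's repl is a guess, and the Chinese range is simplified to the basic CJK block.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class RegexSketch:
    """Hypothetical stand-in for plane.pattern.Regex."""
    name: str
    pattern: str
    repl: str

    def __add__(self, other: "RegexSketch") -> "RegexSketch":
        # Assumption: names join with "_", patterns with "|",
        # and the left operand's repl is kept.
        return RegexSketch(
            name=f"{self.name}_{other.name}",
            pattern=f"{self.pattern}|{other.pattern}",
            repl=self.repl,
        )


ascii_words = RegexSketch('ascii', r'[a-zA-Z0-9]+', ' ')
# Simplified stand-in: covers only the basic CJK Unified Ideographs block.
chinese_words = RegexSketch('Chinese_words', r'[\u4e00-\u9fff]+', ' ')
words = ascii_words + chinese_words

text = "自然语言处理太难了！who can help me? (╯▔🔺▔)╯"
found = re.findall(words.pattern, text)
```

Alternation means a match is either an ASCII run or a Chinese run, so the full-width "！" and the emoticon are skipped while both kinds of words survive.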

segment

segment splits a sentence into tokens. Runs of English letters and digits such as 'PS4' are kept whole, while other text such as the Chinese '中文' is split into single characters: ['中', '文'].

from plane import segment
segment('你看起来guaiguai的。<EOS>')
>>> ['你', '看', '起', '来', 'guaiguai', '的', '。', '<EOS>']
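
One way to get this behaviour is a single alternation in which earlier alternatives win. The sketch below is an assumption about the technique, not Plane's code: segment_sketch is a hypothetical name, and treating angle-bracket tags such as <EOS> as one token is inferred from the output above.

```python
import re

# Alternatives are tried left to right at each position:
#   1. angle-bracket tags such as <EOS>
#   2. runs of ASCII letters and digits
#   3. any other single non-whitespace character
SEGMENT_PATTERN = re.compile(r'<[A-Za-z]+>|[A-Za-z0-9]+|\S')


def segment_sketch(text: str) -> list:
    """Return the tokens in order; whitespace is dropped."""
    return SEGMENT_PATTERN.findall(text)


tokens = segment_sketch('你看起来guaiguai的。<EOS>')
```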

replace all punctuations

punc.remove replaces every Unicode punctuation character with ' ', or with whatever you pass as the repl parameter. punc.normalize converts some Unicode punctuation to its English (ASCII) equivalent.

Note: '+', '^', '$', '~' and some other characters are not treated as punctuation.

from plane import punc

text = 'Hello world!'
punc.remove(text)

>>> 'Hello world '

# replace punctuation with special string
punc.remove(text, '<P>')

>>> 'Hello world<P>'

# normalize punctuation
punc.normalize('你读过那本《边城》吗？什么编程？！人生苦短，我用 Python。')

>>> '你读过那本(边城)吗?什么编程?!人生苦短,我用 Python.'
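
The note about '+', '^', '$' and '~' is consistent with Unicode general categories, where these characters are symbols (categories starting with "S") rather than punctuation (categories starting with "P"). The sketch below assumes that rule. Whether Plane uses exactly this test is not confirmed here, and both function names and the small translation table are hypothetical.

```python
import unicodedata


def remove_punc_sketch(text: str, repl: str = ' ') -> str:
    """Replace each character whose Unicode category starts with 'P'."""
    return ''.join(
        repl if unicodedata.category(ch).startswith('P') else ch
        for ch in text
    )


# Tiny illustrative mapping; a real table would cover many more characters.
FULL_TO_ASCII = str.maketrans({
    '《': '(', '》': ')', '？': '?', '！': '!', '，': ',', '。': '.',
})


def normalize_punc_sketch(text: str) -> str:
    """Map a few full-width punctuation marks to ASCII equivalents."""
    return text.translate(FULL_TO_ASCII)


cleaned = remove_punc_sketch('Hello world!')
kept = remove_punc_sketch('1+1~2^3$')  # symbols, not punctuation
```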

chain function calls

The Plane class provides extract, replace, segment, punc.remove and punc.normalize as methods that can be chained. Because segment returns a list rather than the Plane object, it can only be called at the end of a chain.

Plane.text holds the processed text, and Plane.values holds the extracted tokens.

from plane import Plane
from plane.pattern import EMAIL

p = Plane()
p.update('My email is my@email.com.').replace(EMAIL, '').text # update() initializes Plane.text and Plane.values

>>> 'My email is .'

p.update('My email is my@email.com.').replace(EMAIL).segment()

>>> ['My', 'email', 'is', '<Email>', '.']

p.update('My email is my@email.com.').extract(EMAIL).values

>>> [Token(name='Email', value='my@email.com', start=12, end=24)]
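
Chaining like this usually relies on each method returning the object itself. The following sketch shows that pattern with a hypothetical ChainSketch class. It is an illustration under that assumption rather than Plane's implementation, and for brevity it stores plain strings in values instead of Token objects.

```python
import re

EMAIL_PATTERN = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-]+')


class ChainSketch:
    """Hypothetical chainable wrapper; every step but segment returns self."""

    def __init__(self):
        self.text = ''
        self.values = []

    def update(self, text):
        # Reset both the working text and the extracted values.
        self.text = text
        self.values = []
        return self

    def replace(self, pattern, repl):
        self.text = pattern.sub(repl, self.text)
        return self

    def extract(self, pattern):
        self.values.extend(m.group() for m in pattern.finditer(self.text))
        return self

    def segment(self):
        # Returns a list, not self, so it has to be the last call in a chain.
        return re.findall(r'<[A-Za-z]+>|[A-Za-z0-9]+|\S', self.text)


p = ChainSketch()
stripped = p.update('My email is my@email.com.').replace(EMAIL_PATTERN, '').text
```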

pipeline

Pipeline lets you register processing steps once with add() and then apply them, in order, by calling the pipeline on a text.

from plane import Pipeline, replace, segment
from plane.pattern import URL

pipe = Pipeline()
pipe.add(replace, URL, '')
pipe.add(segment)
pipe('http://www.guokr.com is online.')

>>> ['is', 'online', '.']
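
A pipeline of this shape can be sketched as a list of (function, arguments) steps applied in order, each receiving the previous result as its first argument. PipelineSketch, strip_pattern, split_words and the simplified URL pattern below are all hypothetical stand-ins, not Plane's own classes or patterns.

```python
import re


class PipelineSketch:
    """Hypothetical pipeline: stores steps, then applies them in order."""

    def __init__(self):
        self.steps = []

    def add(self, func, *args, **kwargs):
        # Remember the function and the extra arguments to pass later.
        self.steps.append((func, args, kwargs))

    def __call__(self, text):
        result = text
        for func, args, kwargs in self.steps:
            # Each step receives the previous result as its first argument.
            result = func(result, *args, **kwargs)
        return result


# Simplified URL pattern, for illustration only.
URL_PATTERN = re.compile(r'https?://\S+')


def strip_pattern(text, pattern, repl):
    return pattern.sub(repl, text)


def split_words(text):
    return re.findall(r'[A-Za-z0-9]+|\S', text)


pipe = PipelineSketch()
pipe.add(strip_pattern, URL_PATTERN, '')
pipe.add(split_words)
tokens = pipe('http://www.guokr.com is online.')
```

As with chained calls, a step that changes the data type (here, from a string to a list) needs to come last, because later steps expect a string.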

Contents: