Regex patterns

plane.pattern.ASCII_WORD = Regex(name='ASCII_word', pattern="[<$#&]?[a-zA-Z0-9_.-]*\\'?[a-zA-Z0-9]+[%>]?", flag=0, repl=' ')

English words and numbers, such as ‘hash’, ‘3.14’, ‘$100’, ‘<EOS>’, ‘99.9%’.
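To see what this pattern picks up, here is a minimal sketch that feeds the pattern string shown above to Python's standard re module. It does not call plane itself; the sample text and variable names are illustrative assumptions.

```python
import re

# Pattern string copied from the ASCII_WORD entry above.
ascii_word = re.compile(r"[<$#&]?[a-zA-Z0-9_.-]*\'?[a-zA-Z0-9]+[%>]?")

# Illustrative sample text (an assumption, not from plane's test suite).
text = "hash 3.14 $100 <EOS> 99.9% don't"

# Each whitespace-separated item is matched as a single word,
# including the optional prefix ($, <) and suffix (%, >).
words = ascii_word.findall(text)
print(words)
```

The optional apostrophe in the middle of the pattern is what keeps a contraction like "don't" together.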

plane.pattern.BraSCII = Regex(name='BraSCII', pattern='[!-/:-~\\U000000C0-\\U000000FF]+', flag=0, repl=' ')

BraSCII characters: ASCII letters and punctuation plus U+00C0 to U+00FF; digits are not included. See https://en.wikipedia.org/wiki/BraSCII
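The effect of the extra U+00C0 to U+00FF range can be sketched with Python's standard re module, using the pattern string shown above. The sample text is an assumption chosen for illustration, and plane's own API is not used here.

```python
import re

# Pattern string copied from the BraSCII entry above.
brascii = re.compile(r"[!-/:-~\U000000C0-\U000000FF]+")

# "S\u00e3o" is "Sao" with a tilde on the a (U+00E3), inside the accented range.
text = "S\u00e3o Paulo 2024"

# Accented letters stay inside the word; digits are outside every range.
runs = brascii.findall(text)
print(runs)
```

Because the ASCII part of the class skips 0-9, the trailing "2024" is not matched.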

plane.pattern.CHINESE = Regex(name='Chinese', pattern='[\\U00004E00-\\U00009FFF\\U00003400-\\U00004DBF\\U00020000-\\U0002A6DF\\U0002A700-\\U0002B73F\\U0002B740-\\U0002B81F\\U0002B820-\\U0002CEAF\\U0002CEB0-\\U0002EBEF\\U00003000-\\U0000303F\\U0000FE30-\\U0000FE4F\\U0000FF00-\\U0000FFEF]+', flag=0, repl=' ')

All Chinese characters, including most punctuation.

plane.pattern.CHINESE_WORDS = Regex(name='Chinese_words', pattern='[\\U00004E00-\\U00009FFF\\U00003400-\\U00004DBF\\U00020000-\\U0002A6DF\\U0002A700-\\U0002B73F\\U0002B740-\\U0002B81F\\U0002B820-\\U0002CEAF\\U0002CEB0-\\U0002EBEF]+', flag=0, repl=' ')

All Chinese characters, without punctuation.
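The difference between CHINESE and CHINESE_WORDS comes down to three extra punctuation ranges. The sketch below rebuilds both character classes from the ranges listed above and runs them through Python's standard re module; the helper names HAN and PUNCT and the sample text are assumptions made for illustration.

```python
import re

# Ideograph ranges shared by both documented patterns.
HAN = (
    r"\U00004E00-\U00009FFF\U00003400-\U00004DBF\U00020000-\U0002A6DF"
    r"\U0002A700-\U0002B73F\U0002B740-\U0002B81F\U0002B820-\U0002CEAF"
    r"\U0002CEB0-\U0002EBEF"
)
# Punctuation ranges that appear only in the CHINESE pattern.
PUNCT = r"\U00003000-\U0000303F\U0000FE30-\U0000FE4F\U0000FF00-\U0000FFEF"

chinese = re.compile("[" + HAN + PUNCT + "]+")
chinese_words = re.compile("[" + HAN + "]+")

text = "你好，世界！Hello"

with_punct = chinese.findall(text)        # fullwidth comma and "!" join the run
words_only = chinese_words.findall(text)  # punctuation splits the run
print(with_punct, words_only)
```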

plane.pattern.CJK = Regex(name='CJK', pattern='[\\U00004E00-\\U00009FFF\\U00003400-\\U00004DBF\\U00020000-\\U0002A6DF\\U0002A700-\\U0002B73F\\U0002B740-\\U0002B81F\\U0002B820-\\U0002CEAF\\U0002CEB0-\\U0002EBEF\\U00002E80-\\U00002EFF\\U00002F00-\\U00002FDF\\U00002FF0-\\U00002FFF\\U00003000-\\U0000303F\\U000031C0-\\U000031EF\\U00003200-\\U000032FF\\U00003300-\\U000033FF\\U0000F900-\\U0000FAFF\\U0000FE30-\\U0000FE4F\\U0000FF00-\\U0000FFEF\\U0001F200-\\U0001F2FF\\U0002F800-\\U0002FA1F]+', flag=0, repl=' ')

All CJK characters.

plane.pattern.EMAIL = Regex(name='Email', pattern='[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-]+', flag=0, repl='<Email>')

[local-part]@[domain].[top-level-domain]
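A minimal sketch of the replacement behaviour, using the pattern string and the '<Email>' replacement shown above with Python's standard re module rather than plane's API. The addresses are made-up examples.

```python
import re

# Pattern string copied from the EMAIL entry above.
email = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-]+")

# A simple address is replaced as a whole.
simple = email.sub("<Email>", "Contact alice@example.com today.")

# The pattern covers exactly one dot after the domain, so any
# further domain labels are left behind in the text.
multi_label = email.sub("<Email>", "a@mail.example.co.uk")
print(simple, multi_label)
```

The second case shows a limit of the [domain].[top-level-domain] shape: addresses with more than two domain labels are only partly matched.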

plane.pattern.ENGLISH = Regex(name='English', pattern='[!-/:-~]+', flag=0, repl=' ')

English words and punctuation; numbers are not included.
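The two ranges in this class, '!' to '/' and ':' to '~', skip the digits 0-9 that sit between them in ASCII. A short sketch with Python's standard re module (sample text assumed for illustration):

```python
import re

# Pattern string copied from the ENGLISH entry above.
english = re.compile(r"[!-/:-~]+")

text = "Hello, world 42!"

# Letters and punctuation are matched; "42" falls in the gap between
# '/' (0x2F) and ':' (0x3A) and is skipped.
runs = english.findall(text)
print(runs)
```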

plane.pattern.HTML = Regex(name='HTML', pattern='<script.*?>.*?</script>|<style.*?>.*?</style>|<.*?>', flag=re.DOTALL, repl=' ')

HTML tags, including ‘script’ and ‘style’ elements together with their contents, and any other tag.
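The following sketch applies the pattern string and the re.DOTALL flag shown above through Python's standard re module, to illustrate how script blocks are removed along with their contents. The sample markup and variable names are assumptions, and plane's own functions are not used.

```python
import re

# Pattern string and flag copied from the HTML entry above.
html = re.compile(
    r"<script.*?>.*?</script>|<style.*?>.*?</style>|<.*?>",
    re.DOTALL,
)

page = "<p>Hi</p><script>\nvar x = 1;\n</script><b>there</b>"

# Every tag (and the whole script element) becomes a single space.
stripped = html.sub(" ", page)

# Collapse the leftover runs of spaces for readability.
tidy = " ".join(stripped.split())
print(tidy)
```

re.DOTALL matters here: without it, '.' cannot cross the newlines inside the script element, so only the opening and closing tags would be removed and the script body would remain in the text.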

plane.pattern.NUMBER = Regex(name='Numbers', pattern='[0-9]+', flag=0, repl=' ')

Runs of ASCII digits (0-9).

class plane.pattern.Regex(name, pattern, flag=0, repl=' ')
Parameters
  • name (str) – regex name

  • pattern (str) – Python regex

  • flag (int) – flags from Python's re module, such as re.DOTALL (default 0)

  • repl (str) – replacement

regex pattern
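To show how the four fields fit together, here is a hypothetical stand-in built on typing.NamedTuple. RegexSketch and its sub method are assumptions for illustration only; they are not plane's implementation, and the '<Number>' replacement is likewise made up.

```python
import re
from typing import NamedTuple


class RegexSketch(NamedTuple):
    """Hypothetical stand-in mirroring the documented fields; not plane's class."""

    name: str
    pattern: str
    flag: int = 0
    repl: str = " "

    def sub(self, text: str) -> str:
        # Replace every match of the pattern with the configured replacement.
        return re.sub(self.pattern, self.repl, text, flags=self.flag)


# A custom entry in the same shape as the predefined patterns.
number = RegexSketch(name="Numbers", pattern=r"[0-9]+", repl="<Number>")
masked = number.sub("room 101, floor 3")
print(masked)
```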

plane.pattern.SPACE = Regex(name='Space', pattern='\\s+', flag=0, repl=' ')

Matches runs of whitespace (r'\s+').

Replacing each run with a single space removes all extra spaces.
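A minimal sketch of the collapsing behaviour with Python's standard re module, using the pattern and replacement shown above (sample strings assumed):

```python
import re

# Pattern string copied from the SPACE entry above.
space = re.compile(r"\s+")

# Mixed spaces, tabs, and newlines collapse to one space each.
collapsed = space.sub(" ", "a  b\t\nc")

# Leading and trailing runs are reduced to one space, not stripped.
padded = space.sub(" ", "  a  ")
print(repr(collapsed), repr(padded))
```

If fully trimmed output is needed, a separate strip() call on the result handles the remaining edge spaces.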

plane.pattern.TELEPHONE = Regex(name='Telephone', pattern='\\d{3}[ +.-]?\\d{4}[ +.-]?\\d{4}', flag=0, repl='<Telephone>')

Chinese telephone number format of 11 digits ([xxx][xxxx][xxxx]).

A space, +, ., or - may separate the groups, as in 155-5555-5555.
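The delimiter rules can be checked with a short sketch that uses the pattern string and '<Telephone>' replacement shown above with Python's standard re module. The sample numbers are illustrative assumptions.

```python
import re

# Pattern string copied from the TELEPHONE entry above.
telephone = re.compile(r"\d{3}[ +.-]?\d{4}[ +.-]?\d{4}")

samples = ["155-5555-5555", "155 5555 5555", "15555555555", "155/5555/5555"]

# '/' is not one of the accepted delimiters, so the last sample fails.
matched = [bool(telephone.fullmatch(s)) for s in samples]

masked = telephone.sub("<Telephone>", "Call 155-5555-5555 now")
print(matched, masked)
```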

plane.pattern.THAI = Regex(name='Thai', pattern='[\\U00000E01-\\U00000E3A\\U00000E3F-\\U00000E5B]+', flag=0, repl=' ')

Thai characters, including punctuation: https://en.wikipedia.org/wiki/Thai_(Unicode_block)

class plane.pattern.Token(name, value, start, end)
Parameters
  • name (str) – token name

  • value (str) – matched text

  • start (int) – start index of the matched text

  • end (int) – end index of the matched text

matched token
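As an illustration of what the four fields hold, the sketch below builds token-like records from re.finditer. TokenSketch and find_tokens are hypothetical names, not plane's API, and the sketch assumes that end is exclusive in the same way as Python's match.end(); the documentation above does not state this.

```python
import re
from typing import List, NamedTuple


class TokenSketch(NamedTuple):
    """Hypothetical stand-in with the documented fields; not plane's class."""

    name: str
    value: str
    start: int
    end: int  # assumed exclusive, like match.end()


def find_tokens(name: str, pattern: str, text: str) -> List[TokenSketch]:
    # One record per match, carrying the matched text and its span.
    return [
        TokenSketch(name, m.group(), m.start(), m.end())
        for m in re.finditer(pattern, text)
    ]


text = "room 101, floor 3"
tokens = find_tokens("Numbers", r"[0-9]+", text)
print(tokens)
```

Under that assumption, slicing the input with text[token.start:token.end] recovers token.value.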

plane.pattern.URL = Regex(name='URL', pattern='https?:\\/\\/[!-~]+', flag=re.IGNORECASE, repl='<URL>')

URLs must begin with http or https.

Only ASCII characters are supported.
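A short sketch of these two rules, applying the pattern string, the re.IGNORECASE flag, and the '<URL>' replacement shown above through Python's standard re module. The sample strings are assumptions for illustration.

```python
import re

# Pattern string and flag copied from the URL entry above.
url = re.compile(r"https?:\/\/[!-~]+", re.IGNORECASE)

# The scheme is matched case-insensitively. The character class covers
# every printable ASCII character, so the trailing comma is absorbed.
replaced = url.sub("<URL>", "See HTTPS://Example.com/a?b=1, ok")

# Other schemes are left untouched.
untouched = url.sub("<URL>", "ftp://example.com")
print(replaced, untouched)
```

Because punctuation directly after a URL is part of the match, text such as "(see http://example.com)" loses its closing parenthesis when the URL is replaced.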

plane.pattern.VIETNAMESE = Regex(name='Vietnamese', pattern='[\\U00000021-\\U00000080\\U000000C0-\\U000000C3\\U000000C8-\\U000000CA\\U000000CC-\\U000000CD\\U000000D2-\\U000000D5\\U000000D9-\\U000000DA\\U000000E0-\\U000000E3\\U000000E8-\\U000000EA\\U000000EC-\\U000000ED\\U000000F2-\\U000000F5\\U000000F9-\\U000000FA\\U00000102-\\U00000103\\U00000110-\\U00000111\\U00000128-\\U00000129\\U00000168-\\U00000169\\U000001A0-\\U000001B0\\U00001EA0-\\U00001EF9\\U000002C6-\\U00000323\\U000000D0\\U000000DD\\U000000FD]+', flag=0, repl=' ')

Vietnamese characters, including punctuation.