How to Remove Non-Printable Characters from Python Strings?

Susan Sarandon
2024-10-22 06:58:30

Removing Non-Printable Characters from Strings in Python

Question:

In Perl, non-printable characters can be removed with the substitution s/[^[:print:]]//g. Python's re module, however, does not support POSIX character classes such as [:print:]. How can we get similar behavior in Python for both ASCII and Unicode characters?

Answer:

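Because Python's re module has no built-in printable class, we can construct our own character class with the unicodedata module: iterate over every code point, keep those whose Unicode general category is Cc (control), and compile them into a single regular expression. Note that Cc includes tab, newline, and carriage return, so these are removed as well.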

<code class="python">import re
import sys
import unicodedata

# Lazily generate every code point; range() excludes its end value,
# so add 1 to include sys.maxunicode itself
all_chars = (chr(i) for i in range(sys.maxunicode + 1))

# Unicode general categories to strip ('Cc' = control characters)
categories = {'Cc'}
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)

# Build a character class, escaping anything that is special inside [...]
control_char_re = re.compile('[%s]' % re.escape(control_chars))

# Remove every control character from a string
def remove_control_chars(s):
    return control_char_re.sub('', s)</code>
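As a quick illustration of what the Cc class does and does not remove, here is a minimal usage sketch. It repeats the construction above so that it runs on its own; the sample string is a made-up input, not something from the original question.

```python
import re
import sys
import unicodedata

# Same construction as above, repeated so this snippet is self-contained
control_chars = ''.join(
    chr(i) for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)) == 'Cc'
)
control_char_re = re.compile('[%s]' % re.escape(control_chars))

def remove_control_chars(s):
    return control_char_re.sub('', s)

# Hypothetical input: NUL, ESC, tab and newline mixed into ordinary text
sample = 'caf\u00e9\x00 \x1b[0mok\tdone\n'
cleaned = remove_control_chars(sample)
print(repr(cleaned))  # 'café [0mokdone'
```

The accented letter and the ordinary space survive, while the tab and newline are stripped along with NUL and ESC because they also belong to category Cc. If whitespace controls should be preserved, they would need to be excluded from control_chars before compiling the pattern.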

For Python 2:

<code class="python">import re
import sys
import unicodedata

# Lazily generate every code point; xrange() excludes its end value,
# so add 1 to include sys.maxunicode itself
all_chars = (unichr(i) for i in xrange(sys.maxunicode + 1))

# Unicode general categories to strip ('Cc' = control characters)
categories = {'Cc'}
control_chars = u''.join(c for c in all_chars if unicodedata.category(c) in categories)

# Build a character class, escaping anything that is special inside [...]
control_char_re = re.compile(u'[%s]' % re.escape(control_chars))

# Remove every control character from a unicode string
def remove_control_chars(s):
    return control_char_re.sub(u'', s)</code>

Extended Option:

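For more comprehensive removal, add further categories to the categories set (for example 'Cf' for format characters). Each added category enlarges the character class, so the pattern takes longer to build and matching may slow down, especially for the very large Co and Cn categories.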

Character Categories and Counts (the Cf and Cn totals vary with the Unicode version bundled with Python):

  • Cc (control): 65
  • Cf (format): 161
  • Cs (surrogate): 2048
  • Co (private-use): 137468
  • Cn (unassigned): 836601
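Since the larger categories contain long runs of adjacent code points, one way to keep the pattern small is to emit ranges instead of listing every character. The following is a sketch only, assuming Python 3.3+ (where re accepts \UXXXXXXXX escapes); build_category_re is a hypothetical helper name, not part of the original answer.

```python
import re
import sys
import unicodedata

def build_category_re(categories):
    """Compile a regex matching every code point whose general category
    is in `categories` (hypothetical helper, for illustration only)."""
    ranges = []  # list of [start, end] runs of consecutive code points
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)) in categories:
            if ranges and ranges[-1][1] == cp - 1:
                ranges[-1][1] = cp       # extend the current run
            else:
                ranges.append([cp, cp])  # start a new run
    if not ranges:
        raise ValueError('no code points match the given categories')
    parts = []
    for start, end in ranges:
        if start == end:
            parts.append('\\U%08x' % start)
        else:
            parts.append('\\U%08x-\\U%08x' % (start, end))
    return re.compile('[%s]' % ''.join(parts))

# Strip control (Cc) and format (Cf) characters
strip_re = build_category_re({'Cc', 'Cf'})
result = strip_re.sub('', 'zero\u200bwidth\x07 bell')
print(result)  # zerowidth bell
```

Here U+200B (zero width space, category Cf) and U+0007 (bell, category Cc) are removed while the ordinary space is kept. The full scan over all code points still runs once per call, so the compiled pattern is best built once and reused.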
