Unicode and Encodings
textdata doesn’t have any unique friction with Unicode characters and
encodings. That said, any time you use Unicode characters in Python 2 source
files, care is warranted.
Best advice is: It’s time to upgrade already! Python 3 is lovely and ever-improving. Python 2 is now showing its age.
If you do need to continue supporting Python 2, either make sure your literal
strings are marked with a “u” prefix: u"★", or add
from __future__ import unicode_literals at the top of the module to turn Unicode
literal processing on by default.
You can explicitly mark strings as Unicode in Python 3.3 and later, though it’s only necessary if you’re maintaining backwards portability, since Python 3 strings are Unicode strings by default.
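As a minimal sketch of the prefix behavior described above (the names star_marked, star_plain, and text_type are illustrative, not part of textdata), the following compares a prefixed and an unprefixed literal. It assumes the file is saved as UTF-8.

```python
# -*- coding: utf-8 -*-
import sys

# Explicitly marked literal: a text (Unicode) string on Python 2 and on Python 3.3+.
star_marked = u"★"

# Unmarked literal: a text string on Python 3, but a byte string on Python 2
# unless Unicode literal processing has been turned on by default.
star_plain = "★"

# Illustrative helper: the text type is `str` on Python 3 and `unicode` on Python 2.
# The name `unicode` is only evaluated when running under Python 2.
text_type = str if sys.version_info[0] >= 3 else unicode  # noqa: F821

is_text = isinstance(star_marked, text_type)
```

On Python 3 the two literals compare equal, which is why the prefix there matters only for code that must also run on Python 2.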
It can also be helpful (and in Python 2, often strictly necessary) to declare your source encoding by putting a specially formatted PEP 263 comment on the first or second line of the source file:
# -*- coding: utf-8 -*-
The declared encoding will usually be UTF-8, but other encodings are possible. Python 3 assumes UTF-8 source files by default, but Python 2 sadly assumes ASCII.
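To see why the assumed encoding matters, here is a small sketch (variable names are illustrative) that decodes the UTF-8 bytes of ★ (U+2605) once as UTF-8 and once as ASCII. It is only an analogy for what the interpreter does with source bytes, not a description of its internals.

```python
# The three bytes that encode U+2605 (BLACK STAR) in UTF-8.
star_bytes = b"\xe2\x98\x85"

# Decoding with the right encoding yields a single character.
as_utf8 = star_bytes.decode("utf-8")

# Decoding the same bytes as ASCII fails, because every byte is above 0x7F.
try:
    star_bytes.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```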
Finally, if you are reading from or writing to a file on Python 2,
I strongly recommend you use an alternate form of open that
supports automatic encoding (which is built in to Python 3). E.g.:
from codecs import open
with open('filepath', encoding='utf-8') as f:
    data = f.read()
This construction works across Python 2 and 3. Just add mode='w' for
writing.
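A write-then-read round trip might look like the sketch below. Two assumptions to note: it uses io.open rather than the codecs.open shown above (io.open also accepts encoding on both Python 2 and 3), and roundtrip is a hypothetical helper written for this example.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def roundtrip(text, encoding="utf-8"):
    """Write text to a temporary file, read it back, and return what was read."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)  # only the path is needed; the file is reopened below
    try:
        # mode="w" selects writing; the text is encoded on the way out.
        with io.open(path, mode="w", encoding=encoding) as f:
            f.write(text)
        # The default mode is reading; the bytes are decoded on the way in.
        with io.open(path, encoding=encoding) as f:
            return f.read()
    finally:
        os.remove(path)


# The u prefix keeps this literal a text string on Python 2 as well.
result = roundtrip(u"★ five stars\n")
```

Passing the same encoding on both sides is what makes the text come back unchanged.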