Unicode and Character Encoding in Python
This tutorial covers character encoding and number systems: how encoding works in Python through the str and bytes types, and how Python expresses numbering systems through its various forms of int literals. Let's begin with an introduction to character encoding in Python.
What is Character Encoding?
There are many character encodings, and one of the most common and simplest is ASCII. You may have heard of the ASCII table. ASCII is a good place to start learning about character encoding because it is a small, contained encoding. It includes the following groups of characters:

- Lowercase English letters (a through z)
- Uppercase English letters (A through Z)
- Digits (0 through 9)
- Punctuation and symbols
- Whitespace characters
- Non-printable control characters
Character encoding converts letters, punctuation, symbols, whitespace, and control characters to integers and ultimately to binary digits (bits). Each character is mapped to a specific sequence of bits, allowing for the representation of the character in digital form. If you are still getting familiar with the concept of bits, don't worry, as we will cover them in more detail soon.
The categories listed refer to groups of characters that are divided into different ranges within the ASCII table. Each character in the ASCII table is assigned a code point, which can be viewed as an integer value.
In total, the ASCII table consists of 128 characters, with code points ranging from 0 to 127.
The string Module
Python's string module is a convenient tool for working with string constants that fall within ASCII's character set.
Most of these constants have self-explanatory identifier names, making their purpose clear without requiring additional documentation.
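As a brief sketch of what these constants look like in practice (the sample string is taken from this tutorial; the membership check is just one illustrative use):

```python
import string

# Constants covering ASCII's character groupings
print(string.ascii_lowercase)  # 'abcdefghijklmnopqrstuvwxyz'
print(string.digits)           # '0123456789'
print(string.punctuation)      # '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
print(len(string.printable))   # 100 printable ASCII characters in total

# One illustrative use: checking that every character in a string is ASCII-printable
s = "Hello from JavaTpoint"
print(all(c in string.printable for c in s))  # True
```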
An Introduction to Bits
Now let's briefly introduce the bit; it is the most fundamental unit of information that a computer understands. A bit is a signal that can exist in one of two possible states. There are various symbolic representations, all of which have the same meaning.
These decimal numbers can also be represented using a sequence of binary digits (base 2). Here are the binary equivalents of decimal numbers 0 to 10.
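The table of binary equivalents can be reproduced with a short loop; this sketch uses Python's format specifier `b` to render each number in base 2:

```python
# Print decimal numbers 0 through 10 alongside their binary (base 2) forms
for n in range(11):
    print(f"{n:>2}  ->  {n:b}")
```

For example, 5 prints as 101 and 10 prints as 1010.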
It is worth noting that as a decimal number grows larger, more significant bits are required to represent it.
In Python, an ASCII string can conveniently be rendered as a sequence of bits: each character is shown as a pseudo-encoded 8-bit sequence, with a space between the 8-bit groups that each correspond to a single character. For example, the strings "string", "JAVATPOINT", and "$23" render as:

'01110011 01110100 01110010 01101001 01101110 01100111'
'01001010 01000001 01010110 01000001 01010100 01010000 01001111 01001001 01001110 01010100'
'00100100 00110010 00110011'
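A small helper can produce output like the sequences above. Here is a minimal sketch (the function name make_bitseq is illustrative, not part of the standard library):

```python
def make_bitseq(s: str) -> str:
    """Render an ASCII string as space-separated 8-bit binary sequences."""
    if not s.isascii():
        raise ValueError("string must contain only ASCII characters")
    # ord(c) gives the code point; :08b formats it as 8 binary digits
    return " ".join(f"{ord(c):08b}" for c in s)

print(make_bitseq("string"))  # '01110011 01110100 01110010 01101001 01101110 01100111'
print(make_bitseq("$23"))     # '00100100 00110010 00110011'
```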
There is a crucial formula related to the definition of a bit: for a given number of bits, n, the number of distinct values that can be represented equals 2 to the power of n (2**n).
It means that 1 bit expresses 2 values, 2 bits express 4, 3 bits express 8, and a full 8-bit byte expresses 256 values.
A consequence of this formula is that we can work backwards: to determine the number of bits, n, needed to represent x distinct values, we solve for n in the equation 2**n = x, where x is already known.
The ceiling function in the n_bits_required() function handles cases where the number of values is not a clean power of 2. For instance, consider a scenario where a character set of 26 characters needs to be stored. At first glance, this would require log base 2 of 26 (i.e., log(26) / log(2)) bits, which is approximately 4.700. However, since bits cannot be divided into fractional parts, 26 values require 5 bits, with the final bit patterns left unused.
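The n_bits_required() function referred to above is not shown in this text; a minimal sketch of how such a function might be implemented:

```python
from math import ceil, log2

def n_bits_required(nvalues: int) -> int:
    """Number of bits needed to represent nvalues distinct values."""
    # Round up, since a fractional bit cannot exist
    return ceil(log2(nvalues))

print(n_bits_required(26))   # 5  (log2(26) is about 4.700, rounded up)
print(n_bits_required(256))  # 8  (256 is a clean power of 2)
```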
Because ASCII uses only half of the 256 code points offered by the 8-bit bytes of modern computers, a series of informal encoding schemes emerged that attempted to assign additional characters to the remaining 128 code points available in an 8-bit character encoding scheme. Unfortunately, this resulted in a range of conflicting and non-standardized encodings.
What is Unicode?
As we have seen, ASCII has a fundamental problem: its character set is simply too small. A larger, compatible set of characters is needed to accommodate the world's languages, dialects, glyphs, and symbols.
Unicode works the same as ASCII but supports a bigger set of code points. While several encodings have emerged between the ASCII and Unicode standards chronologically, their usage has significantly decreased over time. It is mainly because Unicode and its encoding schemes, particularly UTF-8, have become the dominant encoding standards in the computing industry.
Unicode can be thought of as a much larger version of the ASCII table, with 1,114,112 possible code points. ASCII is a subset of Unicode. However, Unicode goes beyond ASCII by adding code points to represent characters from other languages and scripts.
Unicode is not an encoding but rather a standard that defines a consistent way of representing characters and symbols from various scripts and languages.
Unicode maps characters to unique code points or integers, allowing consistency across different platforms and systems. It is a large database table that maps each character to a unique code point.
However, to represent these code points in a computer's memory or storage, they must be encoded in a particular format. It is where different character encodings come into play. Character encoding is a specific way of representing the code points of Unicode in binary form so that they can be stored, transmitted, and displayed.
Unicode encompasses a vast range of characters and symbols from numerous scripts and languages used across the globe. It includes an extensive set of printable and non-printable characters, covering various symbols, emojis, mathematical notations, and more.
Unicode vs UTF-8
Not all of these characters can be packed into one byte each. It's evident from this that modern, more comprehensive encodings need to use multiple bytes to encode some characters.
The Unicode standard defines a mapping of characters to unique code points or integers, and it also defines several different encoding schemes for representing those code points in binary form. These encoding schemes have different properties, such as variable-length characters, which offer different trade-offs in terms of storage space, efficiency, and compatibility with different systems and applications.
UTF-8, along with its less commonly used counterparts UTF-16 and UTF-32, are character encoding formats used for representing Unicode characters as binary data using one or more bytes per character. While we will discuss UTF-16 and UTF-32 shortly, it is worth noting that UTF-8 has become the most popular encoding format by a wide margin.
Encoding and Decoding in Python
The str type in Python represents human-readable text and can contain any Unicode character.
On the other hand, the bytes type represents binary data, which carries no inherent encoding. Encoding is the process of converting human-readable text to bytes, and decoding transforms bytes back into human-readable text.
Python provides the encode() and decode() methods, which use 'utf-8' as their default encoding parameter.
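A sketch of the round trip, using the string "résumé" for illustration:

```python
text = "résumé"

# str -> bytes: each 'é' (U+00E9) becomes the two-byte UTF-8 sequence \xc3\xa9
encoded = text.encode("utf-8")
print(encoded)  # b'r\xc3\xa9sum\xc3\xa9'

# bytes -> str: decoding with the same encoding recovers the original text
decoded = encoded.decode("utf-8")
print(decoded == text)  # True
```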
The encode() method converts 'résumé' into b'r\xc3\xa9sum\xc3\xa9'. Note that the printed representation of a bytes object permits only ASCII characters, which is why the non-ASCII bytes appear as \x escape sequences.
Variable-length encoding is the crucial feature here: in UTF-8, a single Unicode character can be represented using anywhere from one to four bytes.
One important feature of len() to note is the difference between the length of a single Unicode character as a Python str and its length when encoded as bytes using UTF-8.
When a Unicode character is represented as a Python str, its length is always 1, regardless of how many bytes it occupies in memory. However, when the same character is encoded as bytes using UTF-8, its length can range from 1 to 4 bytes, depending on the specific character being encoded.
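A quick illustration of this length difference, using the snake emoji (U+1F40D) as an example four-byte character:

```python
# A single code point is always length 1 as a str...
snake = "\N{SNAKE}"  # U+1F40D
print(len(snake))    # 1

# ...but occupies four bytes once encoded as UTF-8
print(len(snake.encode("utf-8")))  # 4

# ASCII characters still take just one byte each in UTF-8
print(len("a".encode("utf-8")))    # 1
```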
UTF-16 and UTF-32
The difference between UTF-8 and UTF-16 is substantial. Let's see the following example.
If you encode four Greek letters using UTF-8 and then decode the resulting bytes using UTF-16, the result is text in a completely different script: the Greek letters come out as Korean Hangul syllables.
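A sketch of this round-trip mismatch, using the four Greek letters αβγδ as the sample text:

```python
letters = "αβγδ"
rawdata = letters.encode("utf-8")  # 8 bytes: two per Greek letter
print(rawdata)                     # b'\xce\xb1\xce\xb2\xce\xb3\xce\xb4'

# Decoding UTF-8 bytes as UTF-16 doesn't raise an error here -- it
# silently pairs the bytes into four entirely different characters
mismatched = rawdata.decode("utf-16")
print(mismatched)       # four Hangul (Korean) syllables
print(len(mismatched))  # 4
```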
It highlights the importance of using a consistent encoding scheme in both directions: encoding with one scheme and decoding with another can lead to drastically incorrect results. Two different decodings of the same bytes object can produce text in completely different languages, which can cause significant issues if not accounted for properly.
Another important aspect is that UTF-8 sometimes takes up less space than UTF-16. Let's understand the following example.
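The space difference is easy to measure with len() on the encoded bytes; this sketch compares the two encodings for a short ASCII string:

```python
text = "hello"

# UTF-8: one byte per ASCII character
print(len(text.encode("utf-8")))   # 5

# UTF-16: a 2-byte byte-order mark (BOM) plus two bytes per character
print(len(text.encode("utf-16")))  # 12
```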
Python also provides built-in functions that relate to numbering systems and character encoding, including ord(), chr(), bin(), oct(), hex(), and int().
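A brief sketch of these built-ins, using the character 'a' (code point 97) as the running example:

```python
# ord() and chr() convert between a character and its code point
print(ord("a"))  # 97
print(chr(97))   # 'a'

# bin(), oct(), and hex() render an int in other bases
print(bin(97))   # '0b1100001'
print(oct(97))   # '0o141'
print(hex(97))   # '0x61'

# int() parses string literals in a given base
print(int("0x61", 16))  # 97
```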
This tutorial provided a detailed look at encoding and decoding. Encoding and decoding are fundamental processes in modern computing and are critical to accurately representing and transmitting data. Unicode is the most commonly used character set for representing text data in a wide variety of languages and scripts. To store and transmit Unicode data, different encoding schemes, such as UTF-8, UTF-16, and UTF-32, are used to convert Unicode code points to binary data. As such, it is crucial for developers to understand the encoding and decoding processes and to use them correctly to ensure the integrity and accuracy of their data.