Javatpoint Logo

Program to convert ASCII to Unicode in C++

ASCII and Unicode are two widely used character encoding standards. ASCII can represent only 128 characters using 7 bits, whereas Unicode defines well over 100,000 characters with code points ranging from 0 to 0x10FFFF. When a C++ program has to process or display text beyond the ASCII range, it is often helpful to treat ASCII character codes as their corresponding Unicode code points. This article describes a basic C++ program that converts an ASCII code entered by the user into the appropriate Unicode character. The ASCII values are mapped directly to Unicode code points, which works for the standard ASCII range of 0-127. The conversion takes only a few lines of C++ and provides a building block for more robust Unicode handling in applications.

What is the ASCII Code?

ASCII (American Standard Code for Information Interchange) is a character encoding system that uses seven bits to encode 128 characters. It was created in the 1960s and is based on the English alphabet.

The character set in ASCII encodes:

  • Both uppercase and lowercase English letters (A-Z, a-z).
  • The digits 0 through 9.
  • Punctuation marks.
  • Control codes: line feed, carriage return, etc.
  • Special symbols such as !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~.

Each character corresponds to a 7-bit binary number from 0000000 to 1111111, which is conveniently expressed as a decimal value between 0 and 127. As an illustration:

  • Binary 1000001, or decimal 65, corresponds to 'A'.
  • Binary 1000010, or decimal 66, corresponds to 'B'.
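
The relationship between a character, its decimal code, and its 7-bit pattern can be checked with a short sketch. The helper name toBinary7 is hypothetical, and the code assumes the compiler's character set is ASCII-compatible:

```cpp
#include <bitset>
#include <string>

// Returns the 7-bit binary pattern of an ASCII character as text.
// Assumes c is in the ASCII range (0-127); any higher bit is discarded.
std::string toBinary7(char c) {
    return std::bitset<7>(static_cast<unsigned char>(c)).to_string();
}
```

Here toBinary7('A') yields the pattern for decimal 65, and toBinary7('B') the pattern for decimal 66.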

The first 32 ASCII codes (0-31 decimal) are reserved for non-printable control characters such as null, tab, line feed, and carriage return. Codes 32-126 represent printable characters: the space, letters, digits, and punctuation. Code 127 is the delete (DEL) control character.
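
These ranges translate directly into a few comparisons. The following is a minimal sketch; the function name classifyAscii and the returned labels are illustrative choices, not part of any standard:

```cpp
#include <string>

// Classifies an integer code according to the ASCII layout:
// 0-31 control, 32-126 printable, 127 DEL, anything else outside ASCII.
std::string classifyAscii(int code) {
    if (code < 0 || code > 127) return "not ASCII";
    if (code <= 31) return "control";
    if (code == 127) return "control (DEL)";
    return "printable";
}
```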

The ASCII standard uses only 7 bits per character, but most modern systems store each character in an 8-bit byte with the highest bit set to 0. This allows ASCII to be used alongside other encodings in 8-bit environments.
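
Because the highest bit of every ASCII byte is 0, testing that bit is enough to tell whether a byte sequence is pure ASCII. A minimal sketch, with the hypothetical helper name isPureAscii:

```cpp
#include <string>

// Returns true when every byte has its highest bit clear,
// meaning the data contains only 7-bit ASCII values.
bool isPureAscii(const std::string& bytes) {
    for (unsigned char b : bytes) {
        if (b & 0x80) return false;  // bit 7 set: not an ASCII byte
    }
    return true;
}
```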

What is Unicode?

Unicode is a computing industry standard for the consistent encoding, representation, and handling of text written in most of the world's writing systems. Unicode assigns each character a unique number, regardless of platform, application, or language.

Some key points about Unicode:

  • Unicode enables text processing, storage, and transport independently of language and platform.
  • The Unicode code space has room for over 1 million characters, and the standard includes characters from all major languages in the world.
  • Unicode uses a 21-bit code space containing 1,114,112 code points (0 to 0x10FFFF). Excluding the 2,048 surrogate code points leaves 1,112,064 code points that can be assigned to characters.
  • The code space is divided into 17 planes, each with 65,536 (= 2^16) code points. The first plane (U+0000 to U+FFFF) is called the Basic Multilingual Plane (BMP) and contains characters for almost all modern languages.
  • The Unicode standard also defines rules for bidirectional text, collation, and other text-processing behavior to facilitate internationalization.
  • The Unicode Consortium, a non-profit organization, maintains the Unicode standard. Major companies and organizations participate in developing Unicode standards.
  • Unicode is device- and platform-independent. A given code point identifies the same character on every device, although its visual appearance depends on the font used.
  • Unicode is backwards compatible with ASCII. The first 128 Unicode code points correspond to the ASCII characters.
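
The plane layout and the ASCII-compatible range can be expressed as simple arithmetic on a code point. The helper names below (planeOf, isScalarValue, isAsciiCodePoint) are hypothetical and only sketch the idea:

```cpp
#include <cstdint>

// 17 planes of 65,536 code points, minus 2,048 surrogates.
static_assert(17 * 65536 - 2048 == 1112064, "count of assignable code points");

// Plane number (0-16) of a code point: each plane holds 0x10000 code points.
constexpr int planeOf(std::uint32_t cp) { return static_cast<int>(cp >> 16); }

// A code point usable for a character: within 0..0x10FFFF and not a surrogate.
constexpr bool isScalarValue(std::uint32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// The first 128 code points coincide with ASCII.
constexpr bool isAsciiCodePoint(std::uint32_t cp) { return cp <= 0x7F; }
```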

What is the ASCII Table of Characters?

The ASCII table is a character encoding standard representing 128 characters using 7-bit binary numbers. ASCII is an abbreviation that stands for American Standard Code for Information Interchange.

The ASCII table includes:

  • Uppercase and lowercase English letters
  • Numeric digits
  • Punctuation marks
  • Control codes
  • Special characters

Each ASCII character is mapped to a decimal number between 0 and 127, so every character can be encoded as a binary number from 0000000 to 1111111.

The first 32 ASCII codes (0-31) are reserved for non-printable control function characters like null, tab, line feed, carriage return, etc.

  • Code 32 is the space character, and codes 33 to 47 represent various punctuation symbols.
  • Codes 48 to 57 represent the numeric digits 0 to 9.
  • Codes 65 to 90 are the uppercase letters A to Z.
  • Codes 97 to 122 are the lowercase letters a to z.
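
Because digits and letters occupy contiguous ranges, simple arithmetic on the codes is enough for common conversions. The helpers below (digitValue, toUpperAscii) are illustrative names and assume an ASCII-compatible character set:

```cpp
// Numeric value of an ASCII digit character, or -1 if c is not '0'..'9'.
int digitValue(char c) {
    return (c >= '0' && c <= '9') ? c - '0' : -1;
}

// Uppercase letters (65-90) sit exactly 32 positions below
// their lowercase forms (97-122); other characters are unchanged.
char toUpperAscii(char c) {
    return (c >= 'a' && c <= 'z') ? static_cast<char>(c - 32) : c;
}
```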

The remaining codes are additional symbols, plus the DEL control character at 127. Below is the full ASCII table showing each character with its decimal and hex code value:

Decimal Hex Character
0 00 NUL (null)
1 01 SOH (start of heading)
2 02 STX (start of text)
3 03 ETX (end of text)
4 04 EOT (end of transmission)
5 05 ENQ (enquiry)
6 06 ACK (acknowledge)
7 07 BEL (bell)
8 08 BS (backspace)
9 09 TAB (horizontal tab)
10 0A LF (line feed)
11 0B VT (vertical tab)
12 0C FF (form feed)
13 0D CR (carriage return)
14 0E SO (shift out)
15 0F SI (shift in)
16 10 DLE (data link escape)
17 11 DC1 (device control 1)
18 12 DC2 (device control 2)
19 13 DC3 (device control 3)
20 14 DC4 (device control 4)
21 15 NAK (negative acknowledge)
22 16 SYN (synchronous idle)
23 17 ETB (end of transmission block)
24 18 CAN (cancel)
25 19 EM (end of medium)
26 1A SUB (substitute)
27 1B ESC (escape)
28 1C FS (file separator)
29 1D GS (group separator)
30 1E RS (record separator)
31 1F US (unit separator)
32 20 (space)
33 21 !
34 22 "
35 23 #
36 24 $
37 25 %
38 26 &
39 27 '
40 28 (
41 29 )
42 2A *
43 2B +
44 2C ,
45 2D -
46 2E .
47 2F /
48 30 0
49 31 1
50 32 2
51 33 3
52 34 4
53 35 5
54 36 6
55 37 7
56 38 8
57 39 9
58 3A :
59 3B ;
60 3C <
61 3D =
62 3E >
63 3F ?
64 40 @
65 41 A
66 42 B
67 43 C
68 44 D
69 45 E
70 46 F
71 47 G
72 48 H
73 49 I
74 4A J
75 4B K
76 4C L
77 4D M
78 4E N
79 4F O
80 50 P
81 51 Q
82 52 R
83 53 S
84 54 T
85 55 U
86 56 V
87 57 W
88 58 X
89 59 Y
90 5A Z
91 5B [
92 5C \
93 5D ]
94 5E ^
95 5F _
96 60 `
97 61 a
98 62 b
99 63 c
100 64 d
101 65 e
102 66 f
103 67 g
104 68 h
105 69 i
106 6A j
107 6B k
108 6C l
109 6D m
110 6E n
111 6F o
112 70 p
113 71 q
114 72 r
115 73 s
116 74 t
117 75 u
118 76 v
119 77 w
120 78 x
121 79 y
122 7A z
123 7B {
124 7C |
125 7D }
126 7E ~
127 7F DEL

The table covers the full 128-character ASCII set: control codes, printable characters, punctuation, and special symbols, each with its decimal and hex value.
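
Rows of such a table can also be generated rather than typed by hand. The sketch below formats a single row; the name formatRow is hypothetical, and control codes are shown with a "?" placeholder because their abbreviations (NUL, SOH, and so on) would need a separate name list:

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Formats one table row as "decimal hex character".
// Control codes have no glyph, so a "?" placeholder is printed for them.
std::string formatRow(int code) {
    std::ostringstream out;
    out << code << ' '
        << std::uppercase << std::hex << std::setw(2) << std::setfill('0') << code
        << ' ';
    if (code >= 33 && code <= 126) out << static_cast<char>(code);
    else if (code == 32) out << "(space)";
    else out << "?";
    return out.str();
}
```

Calling formatRow in a loop from 0 to 127 produces one line per code.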

C++ Implementation

  • Get the decimal value of the ASCII character that needs to be converted. For example, 'A' has a decimal value of 65.
  • For ASCII values between 0 and 127, simply assign the ASCII decimal value directly to the Unicode code point. This works because Unicode is backwards compatible with ASCII and maintains the same values for the first 128 characters.
  • So for 'A' with ASCII value 65, the equivalent Unicode code point value is also 65.
  • To convert this to an actual Unicode character, cast the integer code point to a char or wchar_t type in C++.

For example:
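
A minimal sketch of that cast, wrapped in a small helper so it can be checked. The names toWide, asciiCode, and unicodeChar are illustrative only, and the sketch assumes the input is already in the range 0-127:

```cpp
// Converts an ASCII code (assumed to be 0-127) to a wide character.
// The numeric value is unchanged; only the type differs.
wchar_t toWide(int asciiCode) {
    wchar_t unicodeChar = static_cast<wchar_t>(asciiCode);
    return unicodeChar;
}
```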

  • The assignment copies the ASCII value into the Unicode variable, which then holds the same number interpreted as a Unicode code point.
  • Values above 127 are not ASCII. They belong to an extended 8-bit encoding (a code page), so lookup tables or switch statements are required to map them to the appropriate Unicode code points.
  • Library functions such as mbstowcs (C standard library) or MultiByteToWideChar (Windows API) can also convert narrow strings to wide-character strings.
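
For values above 127, the mapping depends on which 8-bit code page the data uses. The sketch below assumes Windows-1252 purely as an example and lists only three entries from its 0x80-0x9F block; the function name cp1252ToUnicode is hypothetical, and a complete table would cover the remaining entries:

```cpp
// Maps one Windows-1252 byte to a Unicode code point (partial sketch).
// Bytes 0x00-0x7F (ASCII) and 0xA0-0xFF have the same value in Unicode.
// Unlisted bytes in 0x80-0x9F return U+FFFD, the replacement character.
char32_t cp1252ToUnicode(unsigned char byte) {
    if (byte < 0x80 || byte >= 0xA0) return byte;
    switch (byte) {
        case 0x80: return 0x20AC;  // euro sign
        case 0x85: return 0x2026;  // horizontal ellipsis
        case 0x99: return 0x2122;  // trade mark sign
        default:   return 0xFFFD;  // not covered by this sketch
    }
}
```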

In summary, for the ASCII range 0-127, simply assign or cast the ASCII decimal value as the Unicode code point. For extended 8-bit values above 127, use a mapping mechanism to get the equivalent Unicode code point. Cast the resulting integer code point to wchar_t (or to char for values in the ASCII range) to get the Unicode character.

Output:

Enter an ASCII code (0-127): 65
Unicode character: A





