Character Set in C

Introduction:

A character set is the collection of characters that a program is permitted to use in a given context. In this article, we cover the history of character encodings, including the historical encoding system EBCDIC, ASCII, and the current standard, Unicode.

Types of Character Set

Generally, there are two types of character sets in C.

Source Character Set (SCS):

The Source Character Set is the set of characters that can be used to write source code. It includes the Basic Character Set and the white-space characters. As its first step, before the preprocessing phase proper, the C preprocessor (CPP) translates the encoding of the source file into the SCS, which it then uses as its internal representation when parsing the code.

Execution Character Set (ECS):

Character constants and string literals are stored in the Execution Character Set. It is the set of characters that the running program can interpret, and it includes the Basic Character Set, the control characters, and the escape sequences. After the preprocessing phase, CPP converts the encoding of character and string constants into the ECS.

The sections below also describe each kind of character in terms of the utility functions that C provides for it.

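In GCC's CPP, both the Source Character Set and the Execution Character Set use UTF-8 encoding by default. The following compiler flags allow the user to alter them.

-finput-charset=charset tells the compiler which encoding the input file uses, so that it can be translated into the SCS.

-fexec-charset=charset sets the ECS used for character and string constants.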

Basic Character Set

The Source and Execution Character Sets have a number of characters in common. This shared collection of standard characters is called the Basic Character Set. Let's look at it in more detail below:

Alphabets:

It includes both uppercase and lowercase letters. In ASCII, uppercase letters fall within the range [65, 90], and lowercase letters fall within the range [97, 122]. Example: A, B, a, b, etc.

In ASCII, an uppercase letter and its lowercase counterpart differ by exactly 32, which is a single bit.

Utility Functions:

isalpha, islower, and isupper determine whether a character is a letter, a lowercase letter, or an uppercase letter, respectively. tolower and toupper convert a letter to the requested case.

Digits:

It includes the digits 0 through 9. In ASCII, the digits fall within the range [48, 57]. Example: 0, 1, 2, etc.

Utility functions:

The function isdigit determines if the supplied character is a digit. The function isalnum determines if a character is an alphanumeric character.

Punctuation/Special Characters:

The following characters are classified as punctuation by the default C locale.

Utility functions:

The function ispunct determines if a character is a punctuation character. The table below lists the ASCII code and the common names of each punctuation character.

| Character | ASCII | Detail |
| :-------: | ----: | :----- |
| ! | 33 | Exclamation mark, exclamation point, or bang. |
| " | 34 | Double quote, quotation mark, or inverted commas. |
| # | 35 | Hash, number sign, pound, octothorpe, or sharp. |
| $ | 36 | Dollar sign or generic currency. |
| % | 37 | Percent. |
| & | 38 | Ampersand, epershand, or and symbol. |
| ' | 39 | Single quote or apostrophe. |
| ( | 40 | Open or left parenthesis. |
| ) | 41 | Close or right parenthesis. |
| * | 42 | Asterisk, sometimes called star; used as the multiplication sign. |
| + | 43 | Plus. |
| , | 44 | Comma. |
| - | 45 | Dash, hyphen, or minus sign. |
| . | 46 | Period, dot, or full stop. |
| / | 47 | Forward slash, solidus, virgule, or whack; used as the division sign. |
| : | 58 | Colon. |
| ; | 59 | Semicolon. |
| < | 60 | Less than, or left angle bracket. |
| = | 61 | Equal. |
| > | 62 | Greater than, or right angle bracket. |
| ? | 63 | Question mark. |
| @ | 64 | At sign, arobase, or asperand. |
| [ | 91 | Open bracket. |
| \ | 92 | Backslash or reverse solidus. |
| ] | 93 | Close bracket. |
| ^ | 94 | Caret or circumflex. |
| _ | 95 | Underscore. |
| ` | 96 | Backtick, backquote, grave accent, left or open quote, or a push. |
| { | 123 | Open brace, squiggly bracket, or curly bracket. |
| \| | 124 | Vertical bar or pipe. |
| } | 125 | Close brace, squiggly bracket, or curly bracket. |
| ~ | 126 | Tilde. |

Control Character Set

These characters have ASCII codes 0 to 31 (inclusive), plus code 127. Although they are not visible, they still affect the program's output in several ways. For example, printing the \a (BEL) character may produce a beep or flash the screen, and the \b (BS) character moves the cursor one step back; unlike the Backspace key on the keyboard, it does not delete the previous character. Codes 9 to 13 are control characters as well, and they are listed in the white-space table further below.

Utility Functions:

The function iscntrl determines if a character is a control character.

| ASCII | Abbreviation |
| ----: | :----------- |
| 00 | NUL '\0' (null character) |
| 01 | SOH (start of heading) |
| 02 | STX (start of text) |
| 03 | ETX (end of text) |
| 04 | EOT (end of transmission) |
| 05 | ENQ (enquiry) |
| 06 | ACK (acknowledge) |
| 07 | BEL '\a' (bell) |
| 08 | BS '\b' (backspace) |
| 14 | SO (shift out) |
| 15 | SI (shift in) |
| 16 | DLE (data link escape) |
| 17 | DC1 (device control 1) |
| 18 | DC2 (device control 2) |
| 19 | DC3 (device control 3) |
| 20 | DC4 (device control 4) |
| 21 | NAK (negative acknowledge) |
| 22 | SYN (synchronous idle) |
| 23 | ETB (end of transmission block) |
| 24 | CAN (cancel) |
| 25 | EM (end of medium) |
| 26 | SUB (substitute) |
| 27 | ESC (escape) |
| 28 | FS (file separator) |
| 29 | GS (group separator) |
| 30 | RS (record separator) |
| 31 | US (unit separator) |
| 127 | DEL (delete) |

Escape Sequences:

The Execution Character Set includes these characters. Each escape sequence begins with a backslash (\). Although an escape sequence is written as two or more characters, the compiler treats it as a single character.

Example: \a, \b, \t, etc.

White-space characters:

The Source Character Set includes these characters. They affect how text is laid out, but they are not themselves visible.

Utility Functions:

The function isspace determines whether a character is a white-space character.

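| Character | ASCII | Detail |
| :-------: | ----: | :----- |
| `<space>` | 32 | space (SPC) |
| \t | 9 | horizontal tab (TAB) |
| \n | 10 | newline (LF) |
| \v | 11 | vertical tab (VT) |
| \f | 12 | form feed (FF) |
| \r | 13 | carriage return (CR) |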

Example:

Let's take an example that prints every printable character together with its ASCII code and type:

Output:

| Character | ASCII | Type        |
| :-------: | ----: | :---------- |
|           |  32   | Space       |
|    !      |  33   | Punctuation |
|    "      |  34   | Punctuation |
|    #      |  35   | Punctuation |
|    $      |  36   | Punctuation |
|    %      |  37   | Punctuation |
|    &      |  38   | Punctuation |
|    '      |  39   | Punctuation |
|    (      |  40   | Punctuation |
|    )      |  41   | Punctuation |
|    *      |  42   | Punctuation |
|    +      |  43   | Punctuation |
|    ,      |  44   | Punctuation |
|    -      |  45   | Punctuation |
|    .      |  46   | Punctuation |
|    /      |  47   | Punctuation |
|    0      |  48   | Digit       |
|    1      |  49   | Digit       |
|    2      |  50   | Digit       |
|    3      |  51   | Digit       |
|    4      |  52   | Digit       |
|    5      |  53   | Digit       |
|    6      |  54   | Digit       |
|    7      |  55   | Digit       |
|    8      |  56   | Digit       |
|    9      |  57   | Digit       |
|    :      |  58   | Punctuation |
|    ;      |  59   | Punctuation |
|    <      |  60   | Punctuation |
|    =      |  61   | Punctuation |
|    >      |  62   | Punctuation |
|    ?      |  63   | Punctuation |
|    @      |  64   | Punctuation |
|    A      |  65   | Alphabet    |
|    B      |  66   | Alphabet    |
|    C      |  67   | Alphabet    |
|    D      |  68   | Alphabet    |
|    E      |  69   | Alphabet    |
|    F      |  70   | Alphabet    |
|    G      |  71   | Alphabet    |
|    H      |  72   | Alphabet    |
|    I      |  73   | Alphabet    |
|    J      |  74   | Alphabet    |
|    K      |  75   | Alphabet    |
|    L      |  76   | Alphabet    |
|    M      |  77   | Alphabet    |
|    N      |  78   | Alphabet    |
|    O      |  79   | Alphabet    |
|    P      |  80   | Alphabet    |
|    Q      |  81   | Alphabet    |
|    R      |  82   | Alphabet    |
|    S      |  83   | Alphabet    |
|    T      |  84   | Alphabet    |
|    U      |  85   | Alphabet    |
|    V      |  86   | Alphabet    |
|    W      |  87   | Alphabet    |
|    X      |  88   | Alphabet    |
|    Y      |  89   | Alphabet    |
|    Z      |  90   | Alphabet    |
|    [      |  91   | Punctuation |
|    \      |  92   | Punctuation |
|    ]      |  93   | Punctuation |
|    ^      |  94   | Punctuation |
|    _      |  95   | Punctuation |
|    `      |  96   | Punctuation |
|    a      |  97   | Alphabet    |
|    b      |  98   | Alphabet    |
|    c      |  99   | Alphabet    |
|    d      | 100   | Alphabet    |
|    e      | 101   | Alphabet    |
|    f      | 102   | Alphabet    |
|    g      | 103   | Alphabet    |
|    h      | 104   | Alphabet    |
|    i      | 105   | Alphabet    |
|    j      | 106   | Alphabet    |
|    k      | 107   | Alphabet    |
|    l      | 108   | Alphabet    |
|    m      | 109   | Alphabet    |
|    n      | 110   | Alphabet    |
|    o      | 111   | Alphabet    |
|    p      | 112   | Alphabet    |
|    q      | 113   | Alphabet    |
|    r      | 114   | Alphabet    |
|    s      | 115   | Alphabet    |
|    t      | 116   | Alphabet    |
|    u      | 117   | Alphabet    |
|    v      | 118   | Alphabet    |
|    w      | 119   | Alphabet    |
|    x      | 120   | Alphabet    |
|    y      | 121   | Alphabet    |
|    z      | 122   | Alphabet    |
|    {      | 123   | Punctuation |
|    |      | 124   | Punctuation |
|    }      | 125   | Punctuation |
|    ~      | 126   | Punctuation |

Explanation:

In this example, the ctype.h header file declares utility functions such as isalpha and isdigit, so it is included at the top. The loop starts at ASCII code 32 because the control characters below it are not visible and are therefore not printed.

The type of each character is determined with the help of the utility functions. The program prints its result as a formatted markdown table of characters.

Conclusion:

C has two character sets: the Source Character Set (SCS) and the Execution Character Set (ECS).

Before preprocessing, CPP converts the source code into the SCS. After preprocessing, CPP converts character and string constants into the ECS. White-space characters appear blank, yet they affect how text is laid out. Control characters are not visible either, yet they can perform a variety of tasks, such as ringing a bell (\a) or moving the cursor one step back (\b).

The ctype.h header provides many useful functions for working with characters, such as isalpha and isdigit.