Character encodings

The topic of characters and fonts is very complex. It started out easy, when computers used ASCII characters that were always printed on paper. Now we have graphical displays that can draw any character from any alphabet in the world. But we don't sit at keyboards with thousands of keys. That just begins to hint at the complexity of letting documents use any character from any alphabet.

This document is a short introduction to characters, character sets, character encodings, glyphs, fonts, and typefaces.

All the example code mentioned in this document is in the subfolder called "character_sets" of the following zip file.

http://cs.pnw.edu/~rlkraft/cs33600/for-class/streams_and_processes.zip

Code pages

We mentioned in the last document that ASCII uses 7 bits to encode 128 characters. This covers the characters on a standard keyboard. Since a byte has one more bit in it, Extended ASCII uses that bit to encode 128 more characters (which are not on a standard keyboard). Unlike ASCII, which is an international standard, Extended ASCII was never standardized. There are hundreds of ways to choose 128 characters and create an Extended ASCII character set.

Extended ASCII

A choice of 128 characters, along with a choice of which number between 128 and 255 will represent each of those characters, is called a code page. Every code page uses ASCII for the character numbers from 0 to 127.

Code page
Code pages - Microsoft Learn

Code pages are not important to most GUI programs, though some text editors use them. Code pages are important when we open a console window. A console window will display text data using a particular code page. The same text data may appear different in different console windows if the consoles are using different code pages. As we have seen many times, byte values are always ambiguous. There must always be an agreement about how to interpret bytes. A code page is an agreement about how to display character bytes.
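
As a small illustration of that agreement, the following sketch decodes a few byte values with two code pages and prints the resulting Unicode code points. The class name CodePageCompare and the helper decodeByte() are made up for this example (they are not in the "character_sets" folder), and the sketch assumes that your JDK includes the Cp437 charset, which standard JDK builds normally do.

```java
import java.nio.charset.Charset;

public class CodePageCompare {

   // Decode one byte value (0 to 255) using the named code page.
   // (Hypothetical helper, written only for this illustration.)
   static String decodeByte(int value, String codePage) {
      byte[] oneByte = { (byte) value };
      return new String(oneByte, Charset.forName(codePage));
   }

   public static void main(String[] args) {
      int[] values = { 0x41, 0x7E, 0xC6, 0xE9 };
      for (int value : values) {
         // Print code points rather than the characters themselves, so the
         // output does not depend on the console's own code page.
         System.out.printf("byte 0x%02X   Cp437 -> U+%04X   Cp1252 -> U+%04X%n",
                           value,
                           decodeByte(value, "Cp437").codePointAt(0),
                           decodeByte(value, "Cp1252").codePointAt(0));
      }
   }
}
```

The two ASCII values (0x41 and 0x7E) should decode to the same code point under both code pages, while the two values above 127 should not.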

Do the following experiments in a console window opened to the "character_sets" folder.

Windows has a command-line program, "chcp", that tells us what code page the console is currently using and it also lets us change the current code page.

    character_sets> chcp
    character_sets> chcp /?

In the "character_sets" folder there is a data file called "CharacterData_Ex_ASCII.txt" that contains byte values in the Extended ASCII range, from 128 to 254. This file does not contain "characters", it contains "byte values". The contents of this file will look different if we look at it using different code pages because each code page will interpret the byte values differently. Different code pages will interpret the same byte value as different characters.

In the "character_sets" folder there is a script file called "CharacterData_Ex_ASCII_codepages.cmd" that displays the data file "CharacterData_Ex_ASCII.txt" using a variety of code pages. If you run the script file (double-click on it) you can see how much the code page's interpretation of the same byte values can change the appearance of the data. The data (the byte values) do not change. Their interpretation, and their appearance, changes.

The code in the script file looks like the following. You should try opening a command-prompt window in the "character_sets" folder and typing these command-lines directly.

    character_sets> chcp 437
    character_sets> type CharacterData_Ex_ASCII.txt
    character_sets> chcp 1252
    character_sets> type CharacterData_Ex_ASCII.txt
    character_sets> chcp 864
    character_sets> type CharacterData_Ex_ASCII.txt
    character_sets> chcp 932
    character_sets> type CharacterData_Ex_ASCII.txt
    character_sets> chcp 1256
    character_sets> type CharacterData_Ex_ASCII.txt

The default code page for the cmd terminal is usually code page 437. Sometimes it is code page 1252. It should be code page 65001, the UTF-8 "code page" (which is not really a code page).

The data file "CharacterData_Ex_ASCII.txt" was created by compiling and running the Java program "CreateCharacterData_Ex_ASCII.java". It is a very simple program.

public class CreateCharacterData_Ex_ASCII {
   public static void main(String[] args) {
      for (int i = 0x80; i < 0xFF; ++i) {
         System.out.write(i); // Write one byte.
      }
      System.out.flush();
   }
}

You should compare this program to the program "CreateData.java" from the folder "data_formats". They are almost the same program.

public class CreateData {
   public static void main(String[] args) {
      for (int i = 0x78; i <= 0x78 + 15; ++i) {
         System.out.write(i); // Write one byte.
      }
      System.out.flush();
   }
}

Both programs write one byte of data at a time in a loop. Neither program gives any meaning to those bytes of data. In the case of the CreateCharacterData_Ex_ASCII class, we interpreted the data as text using different code pages. In the case of the CreateData class, we interpreted the data as different primitive data types. In both cases, the data itself carried no meaning or interpretation. Binary data sitting in a file is always just a stream of bytes. We always need additional information to tell us how to interpret the bytes.

Code pages are used in two ways: for decoding bytes into characters (the examples we just did) and for encoding characters into bytes. Here are examples of each direction. You should run these examples in JShell.

Here is an example of taking an array of bytes and "decoding" it into a string of characters. This is similar to the example above that used the file "CharacterData_Ex_ASCII.txt". Notice how each code page decodes the array of bytes into a different String.

    byte[] bytes = { -128, -121, -79, -58, -75, -74 }
    new String(bytes, "Cp437")
    new String(bytes, "Cp1252")
    new String(bytes, "UTF-16BE")
    new String(bytes, "UTF-16LE")
    new String(bytes, "UTF-8")

Here is an example of taking a String of extended ASCII characters and "encoding" it into an array of bytes.

    String str = "€‡±Æµ¶"
    str.getBytes("Cp437")
    str.getBytes("Cp1252")
    str.getBytes("UTF-16BE")
    str.getBytes("UTF-8")

Notice that each "encoding" of the String resulted in a different array of byte values.

Here are the same encodings, but formatted in hexadecimal.

    import java.util.HexFormat
    String s = "€‡±Æµ¶"
    HexFormat.ofDelimiter(", ").formatHex( s.getBytes("Cp437") )
    HexFormat.ofDelimiter(", ").formatHex( s.getBytes("Cp1252") )
    HexFormat.ofDelimiter(", ").formatHex( s.getBytes("UTF-16BE") )
    HexFormat.ofDelimiter(", ").formatHex( s.getBytes("UTF-8") )

Here is what can happen if we take a String of characters, use one code page to encode it into an array of byte values, and then use a different code page to decode the array back into a String.

    new String( "Æ€æ±¼¿Ç".getBytes("Cp1252"), "Cp437" )
    new String( "Æ€æ±¼¿Ç".getBytes("Cp437"),  "Cp1252" )

Remember that we "encode" characters into bytes and we "decode" bytes into characters.

We have just seen how the choice of a code page can change the appearance of a file when it is displayed (decoded) in a console window. What about displaying a file in a text editor? The same issues will still apply. A text editor must have some agreement on how to interpret the byte values in whatever file it opens and displays.

We know that the file "CharacterData_Ex_ASCII.txt" contains byte values in the extended ASCII range from 128 to 254. We know that these byte values do not have a fixed interpretation or appearance. We should be able to open this file in a text editor and see it displayed (decoded) using any code page we choose, just as we chose different code pages in the console window. But not all text editors make selecting a code page as easy as using the console window's chcp command.

The Windows Notepad text editor offers only a few encodings: "ANSI" (which is usually code page 1252), UTF-8, and UTF-16. We cannot force it to open a file using the older Cp437 code page.

The open source Notepad++ text editor allows us to open a text file and use their "Encoding" menu to view the file's contents using almost any code page.

The VS Code editor can also view an open file using any code page. Open the "Command Palette" (Ctrl+Shift+P) and enter into its text box "Change File Encoding" and tap the Enter key. Select the item "Reopen with Encoding". A drop down list should appear of all the available code pages and encodings.

Remember, opening a file using different code pages (encodings) does not change the contents of the file. Only its appearance in the editor window changes. The byte values in the file should not change.

So far we have seen that the choice of a code page (an encoding) can change the appearance and interpretation of a text file's contents. When the console window needs to open a file, it needs a way to interpret the bytes in the file. When a text editor opens a file, it needs a way to interpret the contents of the file. These ideas are not restricted to just the console window and text editors. All programs, when they open a text file, need a way to interpret the bytes in the file. This applies to compilers, like javac.

A Java source file, a file with the extension ".java", is usually thought of as a "text file". But there is no such thing as a "text file". Every file is just a sequence of bytes stored in a file system. As the javac compiler reads bytes from a source file, it needs to follow some agreement on how to interpret (decode) those bytes as a sequence of text characters. Like a text editor or a console window, the Java compiler uses a code page as its agreement on how to interpret a sequence of bytes as a sequence of characters.

When we use the javac compiler command, we can give the compiler an -encoding command-line argument that tells the compiler what code page to use as it reads Java source files. If no code page is specified on the command-line, the javac compiler uses a default code page. Notice that the command-line option is called -encoding even though the code page will determine how the compiler decodes the byte data in a source file.

The Java compiler's default code page depends on both the version of Java and the operating system. Starting with Java 18, the Java compiler uses UTF-8 as its default agreement for translating byte sequences into character sequences. Before Java 18, the Java compiler on Windows used code page Cp1252 as its default.

Let's look at some examples. In the "character_sets" folder there are two Java source files called "BoxDrawingChars_Cp437.java" and "BoxDrawingChars_UTF_8.java". Each file is encoded using a different character set. The file "BoxDrawingChars_Cp437.java" uses Extended ASCII and the Cp437 code page. The file "BoxDrawingChars_UTF_8.java" uses Unicode and the UTF-8 encoding. Other than the character encoding, the files are identical. In fact, they compile to the exact same .class file. When you open these two files in a text editor, make sure that you instruct the text editor to use the correct character encoding for each file. When opened in a text editor, the two files should look almost identical. In particular, they both should include a String literal that looks like this, "╞╟╚╔╩╦╠═╬╧╨╤╥╙╘╒╓╫╪". If that string looks different in either file, then the text editor is not using the proper encoding.

The string "╞╟╚╔╩╦╠═╬╧╨╤╥╙╘╒╓╫╪" contains 19 characters from the "box drawing" character set. These characters are in the Cp437 code page and they are in the Unicode character set, but they are not part of the Cp1252 code page. That makes these characters useful for doing experiments.

If you look at the table for the Cp437 code page, you will see that the box drawing characters in the above string have encodings between 0xC6 and 0xD8. Each one of these characters is encoded as a single byte.

In the UTF-8 encoding of Unicode, each box drawing character is encoded using three bytes. Each character's encoding looks something like 0xE2 0x95 0x9E. The encoding starts with 0xE2 0x95 followed by a third byte that specifies the character. This means that the file encoded in UTF-8 is larger (has more bytes) than the file encoded using Cp437. You can see this if you look at the hexadecimal dump of each Java source file.

    character_sets> java -cp filters.jar HexDump 16 < BoxDrawingChars_Cp437.java
    character_sets> java -cp filters.jar HexDump 16 < BoxDrawingChars_UTF_8.java

Look near the middle of each hex dump. In the Cp437 dump, look for this sequence.

    C6 C7 C8 C9 CA CB CC CD CE CF D0 D1 D2 D3 D4 D5 D6 D7 D8

In the UTF-8 dump, look for this sequence.

    E2 95 9E E2 95 9F E2 95 9A E2 95 94 E2 95 A9 E2 95 A6 E2 95 A0 ...
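
The same two byte sequences can be reproduced without the source files. The sketch below builds the 19 character string by decoding the Cp437 bytes 0xC6 through 0xD8, and then re-encodes that string as UTF-8. The class name BoxDrawingBytes and its helper methods are hypothetical (they are not part of the "character_sets" folder), and the sketch assumes Java 17 or later for the HexFormat class.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HexFormat;

public class BoxDrawingBytes {

   // The 19 single-byte Cp437 encodings, 0xC6 through 0xD8.
   static byte[] cp437Bytes() {
      byte[] bytes = new byte[0xD8 - 0xC6 + 1];
      for (int i = 0; i < bytes.length; ++i) {
         bytes[i] = (byte) (0xC6 + i);
      }
      return bytes;
   }

   // Decode the Cp437 bytes into a String of box drawing characters.
   static String boxString() {
      return new String(cp437Bytes(), Charset.forName("Cp437"));
   }

   // Format a byte array as upper case hex values separated by spaces.
   static String hex(byte[] bytes) {
      return HexFormat.ofDelimiter(" ").withUpperCase().formatHex(bytes);
   }

   public static void main(String[] args) {
      byte[] cp437 = cp437Bytes();
      byte[] utf8  = boxString().getBytes(StandardCharsets.UTF_8);
      System.out.println("Cp437 (" + cp437.length + " bytes): " + hex(cp437));
      System.out.println("UTF-8 (" + utf8.length  + " bytes): " + hex(utf8));
   }
}
```

The program prints only hexadecimal digits, so its output looks the same no matter which code page the console is using.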

You could also open each of these files using the online hex dump program "ImHex".

Since the Java compiler has only one default character encoding, and since these files use two different encodings, at least one of them will not be using the compiler's default encoding (and maybe both of them). If we compile a file and the compiler is using the wrong encoding, one of two things will happen. The compiler will give us an encoding error and the file will fail to compile, or the file compiles, but not to the program that we want.

First, let us compile each program using the correct encoding.

    character_sets> javac -encoding Cp437  BoxDrawingChars_Cp437.java
    character_sets> javac -encoding utf-8  BoxDrawingChars_UTF_8.java

When we run the programs, they produce the same output (the two programs compile to the exact same .class file).

    character_sets> java  BoxDrawingChars_Cp437
    character_sets> java  BoxDrawingChars_UTF_8

If we compile a program and use the wrong encoding, we can get encoding errors from the compiler.

    character_sets> javac -encoding utf-8  BoxDrawingChars_Cp437.java
    character_sets> javac -encoding Cp1252 BoxDrawingChars_UTF_8.java

If we compile a program and use the wrong encoding, we can get an incorrect version of our program.

    character_sets> javac -encoding Cp437  BoxDrawingChars_UTF_8.java
    character_sets> java BoxDrawingChars_UTF_8

    character_sets> javac -encoding Cp1252  BoxDrawingChars_Cp437.java
    character_sets> java BoxDrawingChars_Cp437

How does it happen that a source file gets compiled to the "wrong program"? What do we even mean by this?

Here is a picture that represents the compilation process. The Java compiler takes as its input a Java source file (with the extension ".java") and produces a Java class file (with the extension ".class").

    +--------------+                              +---------------+
    | Java source  |         +----------+         | Java class    |
    | file encoded |=======\ | javac    |=======\ | file. All     |
    | using a      | decode >| compiler | encode >| strings are   |
    | particular   |=======/ |          |=======/ | encoded using |
    | code page.   |         +----------+         | MUTF-8.       |
    +--------------+                              +---------------+

The source file is character data encoded using some code page. In particular, all String literals are encoded using that code page (and so are variable names). The compiler must know what that code page is (if it's not the default encoding).

When the compiler compiles the source file, the compiler decodes all the String literals (and all the variable names) and converts them into their equivalent UTF-8 encoding (actually, Java uses Modified UTF-8 (MUTF-8), its own version of UTF-8). In the resulting ".class" file, all character data is encoded using UTF-8. That means that the encoding used by the source file has no effect on the resulting class file. A source file might be a Cp437 file, but as long as it compiles correctly, the class file has no trace in it of the Cp437 encodings.

Suppose we tell the compiler to use the wrong encoding scheme (actually, "decoding scheme"). Suppose that the source file contains no byte values that are illegal in the encoding scheme we instructed the compiler to use (if any byte values are illegal, then the source file will not compile). The compiler will translate each String literal from the source file into a UTF-8 string using the encoding (decoding) we told it to use. But this will most likely result in a string inside the class file that is different from what we intended. The wrong encoding (decoding) will almost always change what characters are in a string. The resulting class file does not do what we intended because it is using the wrong string values (and maybe even the wrong variable names).
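
Both outcomes can be imitated in a few lines of Java. The following sketch is only a simulation of the decoding step, not javac's actual implementation, and the names WrongDecodingSketch, literalBytes(), and strictDecode() are invented for it. It takes the 19 bytes that the String literal occupies in the Cp437 source file and decodes them strictly with three different encodings.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;

public class WrongDecodingSketch {

   // The bytes of the box drawing String literal as stored in a Cp437 file.
   static byte[] literalBytes() {
      byte[] bytes = new byte[19];
      for (int i = 0; i < bytes.length; ++i) {
         bytes[i] = (byte) (0xC6 + i);
      }
      return bytes;
   }

   // Decode strictly: return null if the bytes are not legal in the
   // given encoding, instead of silently substituting a replacement.
   static String strictDecode(byte[] bytes, String encoding) {
      try {
         return Charset.forName(encoding)
                       .newDecoder()   // a new decoder reports errors by default
                       .decode(ByteBuffer.wrap(bytes))
                       .toString();
      } catch (CharacterCodingException e) {
         return null;
      }
   }

   public static void main(String[] args) {
      for (String encoding : new String[] { "Cp437", "Cp1252", "UTF-8" }) {
         String s = strictDecode(literalBytes(), encoding);
         if (s == null) {
            System.out.println(encoding + ": illegal byte sequence");
         } else {
            System.out.printf("%s: first character is U+%04X%n",
                              encoding, (int) s.charAt(0));
         }
      }
   }
}
```

Decoding with Cp1252 succeeds but yields different characters (the "wrong program" case), while decoding with UTF-8 fails outright (the "encoding error" case).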

Here is a list of all the encodings that the Java compiler can translate into UTF-8.

Supported Encodings

We have just seen that when we compile a Java source file we need to make sure that the compiler is using the correct character encoding (code page). When we run a Java program it turns out that we may need to tell the Java Virtual Machine (JVM) what character encoding it should use when doing text based input or output.

Like the Java compiler, the JVM has built in default code pages for doing text based input and output. And like the compiler, the defaults depend on both the version of Java and the operating system. But unlike the compiler, which only needs one default encoding, the JVM needs several default encodings. There is a default encoding for reading character data from System.in, a default encoding for writing character data to System.out, another one for System.err, and one more default encoding for all other input or output streams.

We tell the compiler what encoding to use (if we don't want to use the compiler's default encoding) using the -encoding command-line argument. The JVM has six "system properties" for changing default encodings at runtime.

    stdin.encoding     // Java 25
    stdout.encoding    // Java 19
    stderr.encoding    // Java 19
    file.encoding      // Java 18
    native.encoding    // Java 17
    sun.jnu.encoding

A system property is a key-value pair (kind of like an environment variable) that is set as a command-line parameter using the -D option to the java command. The value of each of these six system properties should be the name of a code page. Here is how we set the value of the stdout.encoding system property.

    > javac  ReportEncodings.java
    > java  -Dstdout.encoding=Cp437  ReportEncodings

Use the "ReportEncodings.java" program and try changing some of the other five encoding properties.
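
If you want to see how such a report could be produced, here is a minimal sketch. It is not the "ReportEncodings.java" program from the zip file. The class name EncodingPropertiesSketch is hypothetical, and the sketch just reads the six property names listed above with System.getProperty(). A property that your version of Java does not define is reported as null.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

public class EncodingPropertiesSketch {

   static final String[] NAMES = {
      "stdin.encoding", "stdout.encoding", "stderr.encoding",
      "file.encoding",  "native.encoding", "sun.jnu.encoding"
   };

   // Read each property. A null value means this JVM does not define it.
   static Map<String, String> readAll() {
      Map<String, String> values = new LinkedHashMap<>();
      for (String name : NAMES) {
         values.put(name, System.getProperty(name));
      }
      return values;
   }

   public static void main(String[] args) {
      readAll().forEach((name, value) ->
         System.out.println(name + " = " + value));
      System.out.println("Charset.defaultCharset() = " + Charset.defaultCharset());
   }
}
```

Running it with and without a -D option, for example -Dstdout.encoding=Cp437, should show which values change.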

Notice that these encoding properties are somewhat recent. If you run the examples from the "character_sets" folder, they will behave differently in Java 17, Java 18, and Java 19. We will explain the Java 19 behavior here.

Let's start with the -Dstdout.encoding= command-line option. It is not easy to explain what this property does and how to use it. To fully explain it, we need to go back to how we think about processes and their standard streams.

First, an example. Here is a transcript of compiling and running the program "BoxDrawingChars_UTF_8.java". Some of these command-lines use the -Dstdout.encoding option.

    character_sets> javac -encoding utf-8 BoxDrawingChars_UTF_8.java

    character_sets> chcp 437
    Active code page: 437

    character_sets> java BoxDrawingChars_UTF_8
    ╞╟╚╔╩╦╠═╬╧╨╤╥╙╘╒╓╫╪

    character_sets> java -Dstdout.encoding=utf-8  BoxDrawingChars_UTF_8
    Γò₧ΓòƒΓòÜΓòöΓò⌐ΓòªΓòáΓòÉΓò¼ΓòºΓò¿ΓòñΓòÑΓòÖΓòÿΓòÆΓòôΓò½Γò¬

    character_sets> chcp 65001
    Active code page: 65001

    character_sets> java -Dstdout.encoding=Cp437  BoxDrawingChars_UTF_8
    �������������������

Notice that the UTF-8 encoded program produces proper output when the console uses Cp437. But if we set -Dstdout.encoding to UTF-8, then the output is wrong. Why? Isn't the program encoded in UTF-8? Why should that command-line option break the output?

Also notice that the output is wrong if the console uses Cp65001 (UTF-8) and we set -Dstdout.encoding to Cp437. But the output was correct in the first command-line when the console was using Cp437. Why doesn't this last command-line work?

Let us try another experiment. This time we will redirect the output from the "BoxDrawingChars_UTF_8" program to a file and then look at what is in the file.

    character_sets> chcp 437
    Active code page: 437

    character_sets> java BoxDrawingChars_UTF_8 > temp1.txt

    character_sets> type temp1.txt
    ???????????????????

    character_sets> java -Dstdout.encoding=utf-8 BoxDrawingChars_UTF_8 > temp2.txt

    character_sets> type temp2.txt
    Γò₧ΓòƒΓòÜΓòöΓò⌐ΓòªΓòáΓòÉΓò¼ΓòºΓò¿ΓòñΓòÑΓòÖΓòÿΓòÆΓòôΓò½Γò¬

    character_sets> java -Dstdout.encoding=Cp437 BoxDrawingChars_UTF_8 > temp3.txt

    character_sets> type temp3.txt
    ╞╟╚╔╩╦╠═╬╧╨╤╥╙╘╒╓╫╪

Redirecting the program's output to a file changed everything! We only get the correct output when we set -Dstdout.encoding to Cp437.

This behavior can be difficult to decipher and understand. Exactly what does the stdout.encoding property do and how does it interact with the console's code page?

To answer these questions, we need to go back to our conceptual picture of a process and its standard output stream and add additional detail.

       BoxDrawingChars_UTF_8
       +------------------+
       |                  |
  >--->> stdin     stdout >>---+-----> console window
       |                  |    |
       |           stderr >>---+
       |                  |
       +------------------+

We want to replace the "console window" with a model for what a console window actually is.

       BoxDrawingChars_UTF_8                   OpenConsole.exe
       +------------------+                 +-------------------+
       |                  |        pipe     |  Read byte data,  |      +---------+
  >--->> stdin     stdout >>---+--0====0--->>  decode it into   +====\ | console |
       |                  |    |            |  a character,      draw >| display |
       |           stderr >>---+            |  draw a glyph     +====/ | screen  |
       |                  |                 |  from a font.     |      +---------+
       +------------------+                 +-------------------+

The "console window" is really another process. That process is connected by a pipe to the standard output stream of our process. In Windows, this console process is called "OpenConsole.exe" (or "conhost.exe"). On Linux its name depends on the version of Linux (an old name for this process was "xterm"). The console process reads character data from the pipe, decodes the data to find out what characters are being communicated, and then draws those characters (glyphs) on the graphical console window. This means that the console process is really a GUI program. It is the bridge between our character based console application and the desktop GUI.

Here is the most important point of this picture. Our Java process encodes characters into bytes and writes the byte data into the pipe. The console process reads the byte data from the pipe and decodes it back into characters. If the encoding and decoding are done with different encoding schemes (code pages) then the character data is corrupted.
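
This encode, pipe, decode sequence can be imitated inside a single Java program. In the sketch below, a ByteArrayOutputStream stands in for the pipe, a PrintStream plays the role of the Java process that encodes characters, and a String constructor plays the role of the console process that decodes them. The class and method names are invented for this illustration, and the PrintStream constructor that takes a Charset assumes Java 10 or later.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.nio.charset.Charset;

public class PipeModelSketch {

   // Encode text with one code page (the writing process), then decode
   // the resulting bytes with another code page (the console process).
   static String sendThroughPipe(String text,
                                 String writerCodePage,
                                 String readerCodePage) {
      ByteArrayOutputStream pipe = new ByteArrayOutputStream();
      PrintStream out = new PrintStream(pipe, true,
                                        Charset.forName(writerCodePage));
      out.print(text);
      out.flush();
      return new String(pipe.toByteArray(), Charset.forName(readerCodePage));
   }

   public static void main(String[] args) {
      String text = "\u255E\u255F\u255A"; // three box drawing characters
      String[][] cases = { { "Cp437", "Cp437" },
                           { "UTF-8", "UTF-8" },
                           { "UTF-8", "Cp437" } };
      for (String[] c : cases) {
         String received = sendThroughPipe(text, c[0], c[1]);
         System.out.println(c[0] + " -> " + c[1] + ": "
            + (received.equals(text)
                  ? "intact"
                  : "corrupted (" + received.length() + " characters)"));
      }
   }
}
```

When the two code pages agree, the text arrives intact. When the writer uses UTF-8 and the reader uses Cp437, each three byte character arrives as three unrelated characters.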

Let's use this picture to analyze the results for our previous experiments.

First run the program "BoxDrawingChars_UTF_8" using the Cp437 code page.

    character_sets> chcp 437
    Active code page: 437

    character_sets> java BoxDrawingChars_UTF_8
    ╞╟╚╔╩╦╠═╬╧╨╤╥╙╘╒╓╫╪

The Windows command "chcp" tells the "OpenConsole.exe" program to decode the character data it reads as Cp437 (that is the real meaning of that command). We know that the character data in the "BoxDrawingChars_UTF_8" class file is encoded in UTF-8. If the JVM writes UTF-8 data into the pipe, the "OpenConsole.exe" program will not decode it correctly. The JVM solves this problem by asking the console program what code page it is using for decoding. The JVM then re-encodes its character data from UTF-8 to match the code page used by the console program, in this case Cp437.

If we switch the console to Cp65001 (UTF-8) for decoding, then the JVM will find out from the console program that it is using UTF-8, so the JVM will use UTF-8 for its encoding, and we get the proper output.

    character_sets> chcp 65001
    Active code page: 65001

    character_sets> java BoxDrawingChars_UTF_8
    ╞╟╚╔╩╦╠═╬╧╨╤╥╙╘╒╓╫╪

If we switch the console to use Cp1252 for decoding, then the JVM will switch to using Cp1252 for re-encoding. But the Cp1252 code page does not contain any box drawing characters. The JVM realizes this, and sends the '?' character for every UTF-8 character it cannot re-encode into Cp1252, which, in this case, is all 19 of our box drawing characters.

    character_sets> chcp 1252
    Active code page: 1252

    character_sets> java BoxDrawingChars_UTF_8
    ???????????????????
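
The substitution of '?' can be observed directly with String.getBytes(), which behaves in a similar way when a character has no encoding in the requested code page. This is a sketch of the general mechanism and not a trace of what the JVM does internally for System.out. The names UnmappableSketch, canEncode(), and encode() are hypothetical.

```java
import java.nio.charset.Charset;

public class UnmappableSketch {

   // Can the named code page represent this character at all?
   static boolean canEncode(char c, String codePage) {
      return Charset.forName(codePage).newEncoder().canEncode(c);
   }

   // String.getBytes() substitutes the code page's replacement byte
   // for each character that the code page cannot represent.
   static byte[] encode(String text, String codePage) {
      return text.getBytes(Charset.forName(codePage));
   }

   public static void main(String[] args) {
      char box = '\u255E';
      System.out.println("Cp437  can encode U+255E: " + canEncode(box, "Cp437"));
      System.out.println("Cp1252 can encode U+255E: " + canEncode(box, "Cp1252"));
      // Show the bytes produced when three box drawing characters
      // are encoded with Cp1252.
      for (byte b : encode("\u255E\u255F\u255A", "Cp1252")) {
         System.out.printf("0x%02X ", b);
      }
      System.out.println();
   }
}
```

The byte value 0x3F is the ASCII code for '?', which is why the console shows a row of question marks.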

Now let's consider the case where we redirect standard output to a file.

When the console's (decoding) code page is Cp437 and we redirect the program's output to a file, and then display the file in the console window, the file is not displayed properly.

    character_sets> chcp 437
    Active code page: 437

    character_sets> java BoxDrawingChars_UTF_8 > temp1.txt

    character_sets> type temp1.txt
    ???????????????????

Notice all the '?' characters. That looks suspiciously like Cp1252 is somehow involved.

Here is our picture of the running process. The standard output stream is directly connected to the "temp1.txt" file.

       BoxDrawingChars_UTF_8
       +------------------+
       |                  |
  >--->> stdin     stdout >>---+-----> temp1.txt
       |                  |    |
       |           stderr >>---+
       |                  |
       +------------------+

Unlike a console program, a file does not "decode" the character data written into it. A file stores the byte data it is given as it is. The character data in the class file is encoded in UTF-8. Should the JVM write that UTF-8 byte data into the file? The answer is no, that is not what the JVM does. The JVM re-encodes its character data using the code page stored in the stdout.encoding system property. If we have not set the value of stdout.encoding (using a command-line option when the JVM is started), then the value of stdout.encoding is equal to the value of the native.encoding system property. The value of the native.encoding system property is set by the JVM when it starts and its value depends on the operating system. On Windows, the value of native.encoding is, you guessed it, Cp1252. Therefore, in the above command, stdout.encoding had the value Cp1252. But, as we saw before, Cp1252 does not contain any box drawing characters, so the JVM encodes every character of our string as '?'. The file "temp1.txt" then holds 19 '?' characters.

Our console window is currently using code page Cp437. If we want the file "temp.txt" to display correctly, we need Cp437 encoded data in the file. We can get the JVM to write Cp437 encoded data to the standard output stream by setting stdout.encoding to Cp437.

    character_sets> java -Dstdout.encoding=Cp437 BoxDrawingChars_UTF_8 > temp.txt

    character_sets> type temp.txt
    ╞╟╚╔╩╦╠═╬╧╨╤╥╙╘╒╓╫╪

If we set stdout.encoding to UTF-8, then we get the following.

    character_sets> java -Dstdout.encoding=utf-8 BoxDrawingChars_UTF_8 > temp.txt

    character_sets> type temp.txt
    Γò₧ΓòƒΓòÜΓòöΓò⌐ΓòªΓòáΓòÉΓò¼ΓòºΓò¿ΓòñΓòÑΓòÖΓòÿΓòÆΓòôΓò½Γò¬

The file "temp.txt" holds UTF-8 encoded data, but when we display the file, the console decodes it using Cp437. In UTF-8, each box drawing character is encoded using three bytes (see the section below on UTF-8 encoding). The console, using Cp437, decodes every one of those bytes as a single character. So the displayed string has 3 * 19 = 57 characters.

If we switch the console to use UTF-8 (Cp65001), then the "temp.txt" file is decoded and displayed correctly by the console.

    character_sets> chcp 65001
    Active code page: 65001

    character_sets> type temp.txt
    ╞╟╚╔╩╦╠═╬╧╨╤╥╙╘╒╓╫╪

The stderr.encoding property plays a role similar to the stdout.encoding property.

Let's look briefly at the standard input stream and the stdin.encoding system property.

First, recall our usual illustration of a process with three standard streams.

                         Java process
                      +------------------+
                      |                  |
        keyboard >--->> stdin     stdout >>---+-----> console window
                      |                  |    |
                      |           stderr >>---+
                      |                  |
                      +------------------+

We need to replace both the console window and the keyboard with a better model for what they really are.

                                 OpenConsole.exe
                     +--------------------------------------+
                     |                  |                   |
   +----------+      |  Read keystroke  |  Read byte data,  |      +---------+
   | keyboard +=====\|  data, encode    |  decode it into   +====\ | console |
   | device   +=====/|  it into byte    |  a character,      draw >| display |
   | driver   |      |  data, write     |  draw a glyph     +====/ | screen  |
   +----------+      |  bytes.          |  from a font.     |      +---------+
                     |                  |                   |
                     |        out       |       in          |
                     +--------\|/---------------/|\---------+
                               |                 |
                        pipe   |                 |   pipe
                    +--0====0--+                 +--0====0--+
                    |                                       |
                    |             Java process              |
                    |         +-------------------+         |
                    |         |                   |         |
                    +-------->> stdin      stdout >>--------+
                              |                   |         |
                              |            stderr >>--------+
                              |                   |
                              +-------------------+

Both the console window and the keyboard are really the "OpenConsole.exe" process. The console process is connected by pipes to both the standard input and standard output streams of the Java process. The console process has two parts to it. One part reads byte data from the pipe connected to the standard output stream and uses that data to draw glyphs on the console screen. The other part gets keystroke data (scancodes) from the keyboard device driver, encodes the keystrokes as character data, and then writes the data bytes to the pipe connected to the standard input stream.

The console process also has two code pages associated with it. One code page is used to encode the keystroke data written to the standard input pipe. The other code page is used to decode the character data read from the standard output pipe.

As we mentioned earlier, the "chcp" command sets the code page used to decode the byte data read from the standard output pipe. The command also sets the code page used to encode keystrokes. The "chcp" command always sets the two code pages inside the console process to be the same. A program written in the C language can make use of Windows functions that let these two code pages be different.

When either standard input or standard output is connected to a console process, the JVM will ask the console process what code pages it is using. The JVM will use on each stream the same code page that the console process is using.

When a standard stream is redirected to a file, the JVM will decode (for standard input) or encode (for standard output) the stream using the value of the stream's encoding property, stdin.encoding for standard input and stdout.encoding for standard output. If we have not set those properties on the command-line of the "java" program, then their values default to the value of the native.encoding property (which is Cp1252 on Windows). When a standard stream is redirected to a file, the only way to be sure that the proper encoding is being used is to know ahead of time what the file's encoding is and explicitly set the stream to use that encoding.

Exercise: Create an experiment that shows how stdin.encoding can be used to properly decode a file that is redirected to the standard input stream.

Exercise: Write a Java filter program that translates Cp437 encoded text into UTF-8 encoded text. The program should read from standard input the Cp437 encoded text and write to its standard output the UTF-8 encoded text.

Now let us consider the file.encoding system property.

This system property is not used by the three standard streams. This property applies to certain character streams, input or output, that we open without specifying an encoding. Let's look at an example using an input stream.

When a Java program opens a stream to read data from a text file, the Java program needs a way to decode the byte values stored in the file. If the programmer wants to, they can write code that reads raw byte data from the file and gives that data an interpretation. But that is difficult code to write. We want the Java language to help us with this task. The Java API has several stream classes that make it easier to interpret and read text data from a file.

The following code will open a text file and interpret the byte data using the UTF-8 encoding.

    BufferedReader br = Files.newBufferedReader(
                                Paths.get("file.txt"));

The following code will open a text file and interpret the byte data using a specific encoding, in this case Cp1252.

    BufferedReader br = Files.newBufferedReader(
                                Paths.get("file.txt"),
                                Charset.forName("windows-1252"));

The following code will open a text file and interpret the byte data using the "default encoding".

    BufferedReader br = Files.newBufferedReader(
                                Paths.get("file.txt"),
                                Charset.defaultCharset());

Notice the modern use of static factory methods, Files.newBufferedReader(), Paths.get(), and Charset.forName() (and no constructors).

Since Java 18, the "default choice" for the "default encoding" is UTF-8. We can change the default encoding using the JVM's file.encoding property. For example, if we set -Dfile.encoding=Cp437, then anytime the JVM opens a file using the "default encoding", the file will be opened using the Cp437 encoding. But it will still be the case that the first code example above will open the file using the UTF-8 encoding, and the second code example above can still be used to open a file using a different encoding.
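
Here is a small sketch of why the choice of encoding matters when we open a file. It writes two bytes to a temporary file and then reads them back twice, once as UTF-8 and once as windows-1252. The class name SameBytesTwoCharsets and the method readSample() are invented for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class SameBytesTwoCharsets {
    // Writes the bytes C3 A9 (the UTF-8 encoding of U+00E9) to a temporary file,
    // then reads the file back as text using the given charset.
    static String readSample(Charset cs) {
        try {
            Path file = Files.createTempFile("sample", ".txt");
            try {
                Files.write(file, new byte[] {(byte) 0xC3, (byte) 0xA9});
                try (BufferedReader br = Files.newBufferedReader(file, cs)) {
                    return br.readLine();
                }
            } finally {
                Files.delete(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Decoded as UTF-8, the two bytes are one character.
        System.out.println(readSample(StandardCharsets.UTF_8).length());
        // Decoded as windows-1252, the same two bytes are two characters.
        System.out.println(readSample(Charset.forName("windows-1252")).length());
    }
}
```

The bytes in the file never change. Only our agreement about how to interpret them changes.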

Notice that Java really favors UTF-8. If we do not specify any encoding, the encoding is UTF-8. If we specify the "default" encoding, the "default" encoding defaults to UTF-8. This is where modern Java is different from earlier versions of Java. It used to be that if we chose the "default" encoding, we would be using the preferred encoding of the computer we were running on (the native.encoding), which would usually depend on what country the computer was in. The computer would have a "default" encoding appropriate for that country's local language. In theory, all computers are now supposed to set their "default" encoding to UTF-8, and the "default" encoding becomes redundant.

The trend in Java, and in the computer world in general, is to encode everything using UTF-8. This is a good idea. But there are still a lot of documents and databases that are encoded in "legacy" encodings. We cannot (yet) ignore encoding issues. The modern practice is that the default encoding is UTF-8, we should not use the file.encoding property, and we should always know the encoding of every file we open.

Remember these facts.

We tell the Java compiler to use a specific code page by using the compiler's -encoding command-line option.

We tell the JVM to use specific code pages by using the JVM's stdout.encoding, stderr.encoding, and stdin.encoding properties. Avoid using the file.encoding property.

Here are Javadoc references to the system properties for character encoding.

System Properties Java 25
java.lang.System/getProperties()
System.in, System.out, System.err
java.nio.charset.Charset.defaultCharset()
Default Charset in Java 25
Default Charset in Java 21

Character set

A character set is a choice of characters. A character set is usually some alphabet (like the Latin, Greek, or Cyrillic alphabets) combined with useful symbols (like punctuation or arithmetic symbols).

Character set (definition)

Each code page is a character set. Many code pages were created by countries to contain their native alphabet along with their most important symbols.

List of code pages

The Java language has a notion of a "charset". Most people read charset as "character set", but a Java charset is not exactly a character set. It is more like a "character encoding". The use of the term "charset" causes some confusion, but this term (and its confusion) did not start with Java. The term comes from the early standards for email (RFC 1341 and RFC 2278).

java.nio.charset.Charset
java.nio.charset - Package Summary
RFC 1341 (1992)
RFC 2278 (1998)

Coded character set

If we take the characters in a character set and put them in a specific order, so we have a character that comes first, and a character that comes second, and a character that comes last, then that ordering makes the character set into a coded character set.

This terminology can be confusing, because a "coded character set" is not a "character encoding". A "character encoding" means that we have assigned a binary value to represent each character in a character set. A "coded character set" means that we have put the characters in an ordering.

When we have a coded character set, we have assigned a number to each character (but not a binary representation). The number assigned to a character is called its code point (in that coded character set).

If we go back to extended ASCII, every code page is simultaneously a character set and a coded character set. If we look at the code page's table, the table shows us the character set. The table puts the characters in an ordering, from the table's upper left-hand corner to the table's lower right-hand corner. The code point for each character is determined by its position in the table, starting with 0 at the upper left-hand corner.

Coded character set (definition)
Code point (definition)
Code point

Character encoding

Once we have chosen the characters that will be in a character set and assigned a code point to each character, we then need to choose a binary encoding for each of those characters (code points). Exactly what binary value do we want each code point to be represented by?

Choosing a binary representation for each character in a character set is called a character encoding.

If we go back to extended ASCII, every code page is simultaneously a character set, a coded character set, and a character encoding. The code page's table shows us the character set and puts them in an ordering. The code point for each character is its position in the table, starting with 0 at the upper left-hand corner. Each character's binary encoding is the 8-bit binary number for its code point.
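
Java lets us try this out by decoding one byte value with several code pages. The sketch below is only an illustration. The class name OneByteManyMeanings is invented, and the example assumes that the JDK in use provides the charsets ISO-8859-1, windows-1252, and IBM437 (Java's name for Cp437). Note that the U+ numbers it prints are Unicode code points, not positions in the code page table.

```java
import java.nio.charset.Charset;

class OneByteManyMeanings {
    // Decodes a single byte value using the named single-byte charset.
    static String decode(int byteValue, String charsetName) {
        return new String(new byte[] {(byte) byteValue}, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // The same byte, 0x80, is a different character in each code page.
        for (String name : new String[] {"ISO-8859-1", "windows-1252", "IBM437"}) {
            String s = decode(0x80, name);
            System.out.printf("%-12s byte 0x80 -> U+%04X%n", name, s.codePointAt(0));
        }
        // Bytes below 0x80 are ASCII in every one of these code pages.
        System.out.println(decode(0x41, "windows-1252"));
    }
}
```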

In a large character set, if a character is assigned a code point, say 28,745, then we could give that character the binary encoding that is the binary encoding of its code point number, 28,745. But we will see that such a straightforward binary encoding of code points is usually not the best way to assign encodings.

Code pages are such simple character sets that we tend to blur the distinction between coded character set and character encoding. When we work with a complex character set like Unicode, these distinctions become important. As we will see, Unicode is an ordered character set with many different character encodings.

Code unit

When we set out to define a binary encoding for all the code points in a coded character set, the first decision we must make is what will be the size of the binary words that our encoding uses, that is, what will be our code unit. For example, we could use 8-bit code units (bytes), or 16-bit code units, or 24-bit code units, etc. If we choose 16-bit units, then every code point will be encoded as a sequence of 16-bit words. If we choose 8-bit code units, then every code point will be encoded as a sequence of bytes.

For example, ASCII is a 7-bit code but the code unit is an 8-bit byte (the most significant bit is always 0). Every code point is encoded in a single code unit.

Extended ASCII is an 8-bit code. It uses an 8-bit code unit, and every code point is encoded in a single code unit.

Unicode is a coded character set with several encodings that use different size code units. Some encodings use 8-bit code units, some use 16-bit code units, and some use 32-bit code units.

If our code unit is made up of multiple bytes, then we must specify a byte order for the bytes in a code unit.

It is important to realize that code point and code unit are not the same thing. If an encoding uses 8-bit code units, that does not mean that the coded character set must be at most 256 characters. An encoding can use 8-bit code units, and use two (or more) code units per encoded character. So there can be far more than 256 characters in the coded character set.

Let's do a simple example. Consider the following coded character set and an encoding of the characters using one or two 3-bit code units. Even though each code unit is only three bits, there are 20 characters in the character set.

This is called a variable length encoding because some characters are encoded using one code unit and some characters are encoded using two code units.

Since this character set has 20 characters, we could use a single 5-bit code unit to encode every character. But the variable length encoding has the potential to use fewer bits to encode strings than a 5-bit, fixed length, encoding. If the characters '+', '-', '(', and ')' are more common in our strings than the other characters, then the fact that those characters need only 3 bits each might make the overall encoding of a string shorter than if we used 5 bits for every character.

Here is a grammar for a small language that uses this character set.

   expr ::= expr ( '+' | '-' ) expr
          | '(' expr ')'
          | '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
          | 'u' | 'v' | 'w' | 'x' | 'y' | 'z'

This is a language of arithmetic expressions using single digit numbers and single letter variable names. Strings in this language look like the following.

1+2
x-4+z
(1+2)-(x-8)+(3+4-z)

In the first string, both the 3-bit code unit encoding and a 5-bit code unit encoding need 15 bits. In the second string, the 3-bit code unit encoding needs 24 bits, but a 5-bit code unit encoding needs 25 bits. In the third string, the 3-bit code unit encoding needs (3 * 12) + (6 * 7) = 78 bits, but a 5-bit encoding needs 5 * 19 = 95 bits. For a typical string in this language, the 3-bit, variable length code unit encoding will need fewer bits than a 5-bit fixed length code unit encoding.
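
We can check these bit counts with a short program. The sketch below has to assume a specific code table. It assumes that '+', '-', '(', ')' are the single code units 000, 001, 010, 011, and that the sixteen characters 0 to 9 and u to z use a leading code unit 1xx followed by a trailing code unit 0xx, assigned in order. That assignment is a guess that agrees with the facts stated in this section ('(' is 010, and 010 is the trailing code unit of '2', '6', and 'u'), but the actual table may differ in its details. The class name ToyEncoding is invented for this example.

```java
class ToyEncoding {
    static final String ONE_UNIT = "+-()";              // assumed code units 000..011
    static final String TWO_UNIT = "0123456789uvwxyz";  // assumed lead 1xx, trail 0xx

    // Formats the low three bits of a value as a 3-bit code unit.
    static String unit(int value) {
        String bits = Integer.toBinaryString(value & 0b111);
        return "0".repeat(3 - bits.length()) + bits;
    }

    // Encodes a string as space-separated 3-bit code units.
    static String encode(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            int i = ONE_UNIT.indexOf(c);
            if (i >= 0) {
                out.append(unit(i)).append(' ');
                continue;
            }
            i = TWO_UNIT.indexOf(c);
            if (i < 0) {
                throw new IllegalArgumentException("not in the character set: " + c);
            }
            out.append(unit(0b100 | (i / 4))).append(' ')   // leading code unit, MSB is 1
               .append(unit(i % 4)).append(' ');            // trailing code unit, MSB is 0
        }
        return out.toString().trim();
    }

    // Number of bits in the variable length encoding of the string.
    static int bitCount(String text) {
        String encoded = encode(text);
        return encoded.isEmpty() ? 0 : 3 * encoded.split(" ").length;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"1+2", "x-4+z", "(1+2)-(x-8)+(3+4-z)"}) {
            System.out.println(s + " : " + bitCount(s) + " bits vs. "
                               + 5 * s.length() + " bits, " + encode(s));
        }
    }
}
```

The bit counts depend only on which characters need one code unit and which need two, so they do not change if the table assigns the code units in a different order.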

Exercise: Which strings in the above language need fewer bits in a 5-bit code unit encoding?

This encoding has another property that is important when using a variable length encoding. It is a self-synchronizing code. That means we can point to any code unit in a stream of code units and determine what we are looking at without having to go back to the beginning of the stream and decode the entire stream.

For example, suppose we are given a code unit that has 1 as its most significant bit. That code unit must be the first code unit in a two code unit character. When we are given the next code unit, we can decode the character the two code units encode.

On the other hand, if we are given a code unit with 0 as its most significant bit, then it is either a one code unit character or the second code unit in a two code unit character. If the given code unit is the first code unit in the stream, then it must be a single code unit character. If it's not the first code unit and if we can see the previous code unit, and its MSB is 0, then our code unit is a single code unit character. If we can see that the previous code unit's MSB is 1, then we can decode the two code unit character. If we cannot see the previous code unit (maybe it has already been discarded), then we must discard the given code unit as un-decodable, but we can start decoding correctly on the next code unit.

The above encoding has a property that is not desirable in an encoding. If we want to search a stream of code units for the character '(', then we can look for the code unit 010. But searching for that code unit will also find the trailing code unit for the characters '2', '6', 'u', etc. The code unit 010 "overlaps" several code points. This makes searching for characters inefficient. Well-designed character encodings should be "non-overlapping".

Not all variable-length codes are self-synchronizing. Here is a simple example. Consider this coded character set with an encoding that uses one or two 3-bit code units.

This encoding is not self-synchronizing. In the following two messages, the code unit 000, with a bar over it, cannot be decoded without looking all the way to the beginning of the message. In the first message, that code unit is the character 'a'. In the second message, that code unit is part of the character 'h'.

                    ___
000 111 111 111 111 000 111 111 ==> "aooao" (5 chars)
111 111 111 111 111 000 111 111 ==> "ooho"  (4 chars)
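
The sketch below decodes these two messages from their beginning. It assumes a code table that agrees with the two messages: the code units 000 to 110 are the single code unit characters 'a' to 'g', and the code unit 111 followed by any second code unit is one of the characters 'h' to 'o'. Only 'a', 'h', and 'o' are confirmed by the messages above, so the rest of that table is an assumption, and the class name NotSelfSynchronizing is invented.

```java
class NotSelfSynchronizing {
    // Decodes space-separated 3-bit code units, always starting at the first unit.
    static String decode(String units) {
        String[] u = units.trim().split("\\s+");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < u.length; i++) {
            int value = Integer.parseInt(u[i], 2);
            if (value != 0b111) {
                out.append((char) ('a' + value));    // single code unit: 'a'..'g'
            } else if (i + 1 < u.length) {
                int trail = Integer.parseInt(u[++i], 2);
                out.append((char) ('h' + trail));    // 111 then trail: 'h'..'o'
            } else {
                out.append('?');                     // leading code unit with no trailing unit
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("000 111 111 111 111 000 111 111"));
        System.out.println(decode("111 111 111 111 111 000 111 111"));
        // Starting at the marked code unit, both messages look identical.
        System.out.println(decode("000 111 111"));
    }
}
```

The last three code units of the two messages are the same, and a decoder that starts there gets the same answer for both. The only way to tell that the 000 in the second message is really a trailing code unit is to count the 111 code units all the way back to the start of the message.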

When we get to a more complex character set like Unicode, then understanding the differences between "code point", "code unit", and "character encoding" becomes crucial to understanding the structure of the character encoding.

Unicode

Unicode is a character set. It is supposed to be the set of all characters and symbols used by any, and every, language on Earth. Unicode presently has 159,801 characters in its character set.

Unicode is a coded character set. The 159,801 characters in Unicode are in a specific order. The position a character has in this ordering is called the character's code point.

Unicode is also a character encoding, but with several different encodings. The two most common encodings of Unicode are UTF-8 and UTF-16. There is also UTF-32 which is used in certain special situations. Java, JavaScript, and C# all define their char data type in terms of UTF-16. On the other hand, the Internet uses UTF-8, and most newer programming languages are based on the UTF-8 character encoding.

Let's look at Unicode code points first, and then we will look at Unicode encodings.

Unicode organizes its code points into 17 "code planes" with 2^16 = 65,536 code points per plane. That means that Unicode has 17 * 2^16 = 1,114,112 code points. That is a lot more than the 159,801 characters we said are in the Unicode character set. Unicode is designed to be able to handle all the characters that history has created so far and also all the characters that will be created in the future. So Unicode has a really generous number of code points.

The number

    17 * 2^16 = (1 + 2^4) * 2^16
              = 2^16 + (2^4 * 2^16)
              = 2^16 + 2^20
              = 1,114,112

needs 21 bits to be written in binary, so it is sometimes said that Unicode has a 21-bit address space. But that is not accurate because a 21-bit address space would have 2^21 = 2,097,152 code points, much more than what Unicode defines.

The address space for Unicode code points is from 0x000000 to 0x10FFFF in hexadecimal. Notice that 0x10FFFF is the number 2^20 + 2^16 - 1 (notice that the 1 is in the binary 2^20's place).
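
The arithmetic above is easy to confirm in Java. In this sketch, the class name CodePlanes and the method plane() are invented for the example. Character.MAX_CODE_POINT and Character.isValidCodePoint() are part of the standard Character class.

```java
class CodePlanes {
    static final int PLANE_SIZE = 1 << 16;   // 65,536 code points per plane

    // Returns the plane (0 to 16) that contains the given code point.
    static int plane(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a Unicode code point");
        }
        return codePoint >>> 16;   // drop the 16 bits that index within a plane
    }

    public static void main(String[] args) {
        System.out.println(17 * PLANE_SIZE);                                // 1114112
        System.out.println(Integer.toHexString(Character.MAX_CODE_POINT));  // 10ffff
        System.out.println(plane(0x61) + " " + plane(0x1F680) + " " + plane(0x10FFFF));
    }
}
```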

Unicode organizes its code points into planes of size 65,536 because Unicode started out as a 16-bit character encoding, a kind of super ASCII. For the first few years of Unicode's existence, there were fewer than 65,536 characters in the character set and it was thought that 65,536 characters was enough for the whole world. It turned out that 65,536 was nowhere near enough characters, so Unicode grew to its present form.

Since Unicode is a coded character set, all the characters (code points) are in an ordering from the first character to the last one. The first 128 characters (code points) in the Unicode ordering are the 128 ASCII characters (Unicode calls this the "Basic Latin block" instead of calling them ASCII). The next 128 characters (code points) are called the "Latin-1 Supplement block". These are the most common characters used in Europe. The first 256 characters (code points) in Unicode are essentially Windows code page 28591 (or code page 1252).

Unicode has a notation for Unicode code points. The notation uses the hexadecimal value of a code point. For example, the letter 'a' in ASCII has the code 97 in decimal, which is 0x61 in hexadecimal. Since the first 128 code points in Unicode are the same as ASCII, the Unicode code point for the letter 'a' is written U+0061. In general, the notation for a code point is U+ followed by the hexadecimal digits of its code point number. The notation does not care about leading zeros, so U+0061 is the same as U+61, but the Unicode standard does state that code points should be written with at least four hexadecimal digits. Some Unicode code points need as many as six hexadecimal digits.
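
Converting between an int code point and the U+ notation takes one line in each direction. This is just a sketch, and the class name CodePointNotation and its two methods are invented for the example.

```java
class CodePointNotation {
    // Formats a code point with at least four uppercase hexadecimal digits.
    static String notation(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    // Parses the U+ notation, with or without leading zeros.
    static int parse(String notation) {
        if (!notation.startsWith("U+")) {
            throw new IllegalArgumentException("expected U+ followed by hex digits");
        }
        return Integer.parseInt(notation.substring(2), 16);
    }

    public static void main(String[] args) {
        System.out.println(notation('a'));       // U+0061
        System.out.println(notation(0x1F680));   // U+1F680
        System.out.println(parse("U+61") == parse("U+0061"));
    }
}
```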

If we know a Unicode code point, then we can use a method in the Character class to print that character. For example, the rocket emoji has code point U+1F680.

    jshell> Character.toString(0x1F680)

We can also use Unicode code points as "escape sequences" in Java String literals, but this only works for code points that fit in four hexadecimal digits. A code point above U+FFFF must be written as a "surrogate pair" (see the section below about UTF-16).

    jshell> "\u2560\u255d\u255a\u2550\u2557\u2554\u2563"
    jshell> "\u1F680"  // not the rocket emoji, this is \u1F68 followed by the digit 0

One interesting aspect of Unicode is that every one of its 159,801 characters is given a unique name. Java has a method in the Character class that can tell us the name of a Unicode character as long as we know the character's code point. Here is one of the box drawing characters with its name.

    jshell> Character.toString(0x255F)
    jshell> Character.getName(0x255F)

Here is an Arabic letter, U+FBEE, with a very long name.

    jshell> Character.toString(0xFBEE)
    jshell> Character.getName(0xFBEE)

Here are two elegant Javanese letters, U+A9C5 and U+A9C2.

    jshell> Character.toString(0xA9C5)
    jshell> Character.getName(0xA9C5)
    jshell> Character.toString(0xA9C2)
    jshell> Character.getName(0xA9C2)

The Javanese script has some very interesting characters.

Make sure that you can see these Unicode characters in your version of JShell. You need to be using a recent version of Java, at least Java 18. On Windows, you must use the new Windows Terminal. The old cmd command-prompt window will not work. And you must be sure to change Terminal to use code page 65001 (its default code page is probably Cp437).

     > chcp 65001
     > jshell
     jshell> Character.toString(0x1F680)
     jshell> Character.toString(0xA9C5)

Even if you get the above code to work in JShell, the following line of code may not work properly. It should print out four Japanese characters.

    jshell> System.out.println("\u6587\u5b57\u5316\u3051")

If it doesn't work, exit JShell and restart JShell using the -execution local command-line option, as shown here.

     > chcp 65001
     > jshell  -execution local
     jshell> System.out.println("\u6587\u5b57\u5316\u3051")
     jshell> System.out.println("\uA9C1   \uA9C5    \uA9C2")

Here are some references about Unicode.

UTF-8

UTF-8 is a binary encoding of all the code points in Unicode. UTF-8 uses an 8-bit code unit. UTF-8 is a "variable length encoding". That means that every Unicode code point is encoded as either one, two, three, or four bytes. If a string has five characters, the UTF-8 encoding of that string can be between 5 and 20 bytes long, depending on the exact characters in the string.

Here is a table that shows how a UTF-8 encoding is computed from a Unicode code point.

                    |                                        |data|
 Unicode code point |    UTF-8 code units (bytes 1 to 4)     |bits|   21-bit address
--------------------+----------------------------------------+----+----------------------
  U+0000 - U+007F   | 0xxxxxxx                               |  7 | 00000000000000xxxxxxx
  U+0080 - U+07FF   | 110yyyyy  10xxxxxx                     | 11 | 0000000000yyyyyxxxxxx
  U+0800 - U+FFFF   | 1110zzzz  10yyyyyy  10xxxxxx           | 16 | 00000zzzzyyyyyyxxxxxx
U+010000 - U+10FFFF | 11110www  10zzzzzz  10yyyyyy  10xxxxxx | 21 | wwwzzzzzzyyyyyyxxxxxx

Notice that the first 128 code points are coded using 7 bits with a leading 0 bit. This is exactly ASCII. The one-byte UTF-8 encodings are ASCII encodings. Any pure ASCII document (that does not contain any Extended ASCII characters) is automatically a UTF-8 document. This is important. This means that much of the world's old documents are compatible with Unicode.

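The next 1,920 code points (U+0080 to U+07FF) are coded using two code units (two bytes). The two-byte form has 11 data bits, which is enough for 2^11 = 2,048 values, but the first 128 of those values must use the one-byte form. Unicode is laid out so that these 1,920 code points include many commonly used characters (accented Latin letters and the Greek, Cyrillic, Hebrew, and Arabic alphabets). The three and four code unit encodings are used for the remaining characters.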
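
To make the table concrete, here is a sketch of an encoder that follows it row by row and then compares its result with Java's own UTF-8 encoder. The class name Utf8ByHand is invented for this example. Surrogate values (U+D800 to U+DFFF) are rejected because they have no UTF-8 encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HexFormat;

class Utf8ByHand {
    // Encodes one Unicode code point as 1 to 4 bytes, one branch per table row.
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("not an encodable code point");
        }
        if (cp <= 0x7F) {                                    // 0xxxxxxx
            return new byte[] {(byte) cp};
        } else if (cp <= 0x7FF) {                            // 110yyyyy 10xxxxxx
            return new byte[] {
                (byte) (0b1100_0000 | (cp >> 6)),
                (byte) (0b1000_0000 | (cp & 0x3F))};
        } else if (cp <= 0xFFFF) {                           // 1110zzzz 10yyyyyy 10xxxxxx
            return new byte[] {
                (byte) (0b1110_0000 | (cp >> 12)),
                (byte) (0b1000_0000 | ((cp >> 6) & 0x3F)),
                (byte) (0b1000_0000 | (cp & 0x3F))};
        } else {                                             // 11110www 10zzzzzz 10yyyyyy 10xxxxxx
            return new byte[] {
                (byte) (0b1111_0000 | (cp >> 18)),
                (byte) (0b1000_0000 | ((cp >> 12) & 0x3F)),
                (byte) (0b1000_0000 | ((cp >> 6) & 0x3F)),
                (byte) (0b1000_0000 | (cp & 0x3F))};
        }
    }

    public static void main(String[] args) {
        HexFormat hex = HexFormat.ofDelimiter(" ").withUpperCase();
        for (int cp : new int[] {0x61, 0xE9, 0x20AC, 0x1F680}) {
            byte[] mine = encode(cp);
            byte[] javas = Character.toString(cp).getBytes(StandardCharsets.UTF_8);
            System.out.printf("U+%04X -> %s (matches Java: %b)%n",
                              cp, hex.formatHex(mine), Arrays.equals(mine, javas));
        }
    }
}
```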

UTF-8 is a "self-synchronizing code". This is one of UTF-8's most important properties. It means that we can "jump" into the middle of a UTF-8 byte stream and immediately figure out what we are looking at. If the first byte we see is of the form 10uuuuuu, then we know that we are in the middle of a multi-byte character. We search for the first byte that looks like either 0vvvvvvv or 11vvvvvv and that byte must be the beginning of a new character. We may need to throw away (discard) at most three bytes to get "synchronized" with the UTF-8 byte stream (why three?).
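
The synchronization step is only a few lines of code. In this sketch, the class name Utf8Resync and the method nextCharacterStart() are invented for the example. The test for a continuation byte is exactly the 10uuuuuu bit pattern described above.

```java
import java.nio.charset.StandardCharsets;

class Utf8Resync {
    // Returns the index of the first byte at or after 'start' that begins a character.
    static int nextCharacterStart(byte[] bytes, int start) {
        int i = start;
        while (i < bytes.length && (bytes[i] & 0b1100_0000) == 0b1000_0000) {
            i++;   // 10uuuuuu is a continuation byte, skip it
        }
        return i;
    }

    public static void main(String[] args) {
        // 'a', the rocket emoji (four bytes), 'b'
        byte[] bytes = "a\uD83D\uDE80b".getBytes(StandardCharsets.UTF_8);
        int start = nextCharacterStart(bytes, 3);   // index 3 is inside the emoji
        System.out.println(start);
        System.out.println(new String(bytes, start, bytes.length - start,
                                      StandardCharsets.UTF_8));
    }
}
```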

One important thing to notice is that UTF-8 does not use all possible byte values. Many byte values cannot be in a UTF-8 encoded document. For example, any byte with the binary form 11111xxx is not legal in UTF-8 (why?).

Also, many byte combinations are not legal in UTF-8. For example, we cannot have three bytes in a row that look like this (why not?).

    110uuuuu 10vvvvvv 10wwwwww

Another source of illegal byte sequences is that UTF-8 does not allow any character to have more than one encoding. For example, lower case 'a' is ASCII, and also UTF-8, 0x61 == 0b01100001. Let us use the above table to encode 'a' as a two byte UTF-8 code.

   0x61 ==> 000 0110 0001 ==> 110yyyyy  10xxxxxx ==> 11000001 10100001 ==> C1 A1

We can also use the above table to encode 'a' as a three byte UTF-8 code.

   0x61 ==> 0000 0000 0110 0001 ==> 1110zzzz 10yyyyyy 10xxxxxx ==> 11100000 10000001 10100001 ==> E0 81 A1

The byte sequences C1 A1 and E0 81 A1 appear to be valid UTF-8 codes and they would both decode to 0x61, which is the character 'a'. But 'a' is not allowed to have multiple encodings, so these are in fact illegal byte sequences for UTF-8. (Encodings like these, that use more bytes than necessary, are called "overlong".)

We can ask Java to decode these byte sequences. Only the first one decodes properly.

    jshell> new String(new byte[]{(byte)0x61}, "UTF-8")
    jshell> new String(new byte[]{(byte)0xC1, (byte)0xA1}, "UTF-8")
    jshell> new String(new byte[]{(byte)0xE0, (byte)0x81, (byte)0xA1}, "UTF-8")
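
The String constructor never fails on bad input. It quietly substitutes the replacement character U+FFFD for each malformed sequence. If we would rather have illegal bytes reported as an error, we can use a CharsetDecoder directly. This is a sketch, and the class name StrictUtf8 and the method isValid() are invented for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8 {
    // Returns true only if the bytes are a legal UTF-8 byte sequence.
    static boolean isValid(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid(new byte[] {0x61}));
        // 11000001 10100001, a two byte form of 'a'
        System.out.println(isValid(new byte[] {(byte) 0xC1, (byte) 0xA1}));
        // 11100000 10000001 10100001, a three byte form of 'a'
        System.out.println(isValid(new byte[] {(byte) 0xE0, (byte) 0x81, (byte) 0xA1}));
    }
}
```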

Exercise: Explain why 0xC0 and 0xC1 are both illegal byte values in UTF-8. Is 0xE0 an illegal byte value in UTF-8? Why or why not?

Here are some online versions of the above UTF-8 encoding table.

The third of the above references is a good explanation of UTF-8 (using JavaScript). Here are a couple more explanations of UTF-8.

UTF-16

UTF-16 is a binary encoding of all the code points in Unicode. UTF-16 uses a 16-bit code unit. That means that every Unicode code point is encoded as either one or two 16-bit (two byte) words. Because the UTF-16 code unit is two bytes, byte ordering is important. That leads to two additional encodings, UTF-16BE and UTF-16LE. In UTF-16BE, the big-endian byte order is always used. In UTF-16LE, the little-endian byte order is always used. In UTF-16, the byte order used by a sequence of bytes is declared by a byte order mark (BOM) at the beginning of the sequence.
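
We can see the three byte layouts by encoding the same one character string with each of Java's three UTF-16 charsets. This is a small sketch, and the class name Utf16ByteOrder is invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HexFormat;

class Utf16ByteOrder {
    // Encodes the string with the given charset and shows the bytes in hex.
    static String hex(String s, Charset cs) {
        return HexFormat.ofDelimiter(" ").withUpperCase().formatHex(s.getBytes(cs));
    }

    public static void main(String[] args) {
        System.out.println(hex("a", StandardCharsets.UTF_16BE));  // 00 61
        System.out.println(hex("a", StandardCharsets.UTF_16LE));  // 61 00
        System.out.println(hex("a", StandardCharsets.UTF_16));    // FE FF 00 61
    }
}
```

When Java encodes with the plain UTF-16 charset, it writes the byte order mark FE FF first and then uses big-endian byte order.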

Java uses UTF-16. The char data type represents a 16-bit UTF-16 code unit (not a Unicode character!). If a Unicode code point is represented in UTF-16 by a single code unit, then that char value represents a Unicode character. But some Unicode code points require two code units in UTF-16. In that case, we need two char values to represent that character. Byte order matters only when char values are written out as bytes, which is why the examples below ask for UTF-16BE explicitly.

When a Unicode code point requires two UTF-16 code units, the code units are called a surrogate pair.

An example is the rocket emoji character, Unicode code point U+1F680. This one character is made up of two Java char values. Using two UTF-16 code units to represent a single character can lead to some unusual results. For example, a String containing a single character can have a length of two.

    jshell> "🚀".length()
    jshell> "🚀".toCharArray()
    jshell> "🚀".getBytes("UTF-8")
    jshell> "🚀".getBytes("UTF-16BE")
    jshell> "🚀".getBytes("UTF-32BE")
    jshell> int codepoint = 0x1F680   // Rocket emoji, U+1F680
    jshell> Character.toString(codepoint)
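
Because length() counts code units, we need other String methods when we want to count or visit characters. This sketch uses the standard methods codePointCount() and codePoints(). The class name CountingCharacters is invented for the example, and the rocket emoji is written as its two escape sequences so that the source file's encoding does not matter.

```java
class CountingCharacters {
    // Number of UTF-16 code units (Java char values) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE80b";   // 'a', rocket emoji, 'b'
        System.out.println(codeUnits(s) + " code units");
        System.out.println(codePoints(s) + " code points");
        // Visit the string one code point at a time, not one char at a time.
        s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```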

The following line of code will tell us the two code units in the rocket character's surrogate pair.

    jshell> java.util.HexFormat.of().formatHex( "🚀".getBytes("UTF-16BE") )

If we use these two code units together as Unicode escape sequences in a String literal, then we get the rocket emoji.

    jshell> "\ud83d\ude80"

Notice that Java's Unicode escape sequences are not really code points, they are code units in UTF-16.

Here is another way to compute the two code units for the rocket emoji's surrogate pair.

    jshell> java.util.HexFormat.of().toHexDigits( Character.highSurrogate(0x1F680) )
    jshell> java.util.HexFormat.of().toHexDigits( Character.lowSurrogate(0x1F680) )

U+1F680

UTF-16 is, like UTF-8, a variable length code. Here is a table that shows how a UTF-16 encoding is computed from a Unicode code point.

                    |                                     |data|
 Unicode code point |  UTF-16 code units (words 1 and 2)  |bits|   21-bit address
--------------------+-------------------------------------+----+----------------------
  U+0000 - U+D7FF   | xxxxxxxxxxxxxxxx                    | 16 | 00000xxxxxxxxxxxxxxxx
  U+E000 - U+FFFF   | xxxxxxxxxxxxxxxx                    | 16 | 00000xxxxxxxxxxxxxxxx
U+010000 - U+10FFFF | 110110uuuuuuuuuu  110111vvvvvvvvvv  | 20 | 0uuuuuuuuuuvvvvvvvvvv
                                                               | +   10000000000000000

This table is not as straightforward as the UTF-8 table. Here is how to read it. If a code unit is less than or equal to 0xD7FF or greater than or equal to 0xE000, then that code unit is a code point. Otherwise, the code unit is part of a surrogate pair. If the six most significant bits of the code unit are 110110, then the code unit is the first word in a surrogate pair (called the high surrogate). If the six most significant bits of the code unit are 110111, then the code unit is the second word in a surrogate pair (called the low surrogate). The ten least significant bits from each code unit are concatenated, to form a 20-bit word, and then added to 0x10000.

Unicode has seventeen 16-bit planes, the BMP and 16 other planes. The 16 other planes represent a 20-bit address space (4 + 16). The two code units in a UTF-16 surrogate pair provide 10 bits each, for a 20-bit address. That 20-bit address is added to 0x10000 to form a 21-bit code point. The highest 20-bit address is 0xFFFFF, which gives Unicode's highest code point.

      0x0FFFFF
    + 0x010000
    ----------
      0x10FFFF
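
Going in the other direction, from a code point to its surrogate pair, we subtract 0x10000 and split the 20-bit result into two 10-bit halves. Here is a sketch that does this by hand, using the grinning face emoji U+1F600 as the example, and checks the result against Java's Character class. The class name SurrogatesByHand is invented for this example.

```java
class SurrogatesByHand {
    // Splits a code point above U+FFFF into its high and low surrogates.
    static char[] toSurrogates(int cp) {
        if (cp < 0x10000 || cp > 0x10FFFF) {
            throw new IllegalArgumentException("code point does not need a surrogate pair");
        }
        int offset = cp - 0x10000;                        // a 20-bit number
        char high = (char) (0xD800 | (offset >> 10));     // 110110 + upper 10 bits
        char low  = (char) (0xDC00 | (offset & 0x3FF));   // 110111 + lower 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1F600);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]);
        // Java's own conversions should agree with the hand computation.
        System.out.println(java.util.Arrays.equals(pair, Character.toChars(0x1F600)));
        System.out.printf("U+%04X%n", Character.toCodePoint(pair[0], pair[1]));
    }
}
```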

Be sure to notice that there is a gap between the code points of the first row and the code points of the second row. The "code points" from U+D800 to U+DFFF are not legitimate Unicode code points. Those "code points" are reserved to be used as the code units in a surrogate pair (notice that the code units in the third row are of the form 0xDxxx). So, for example, the symbol U+D801 is not a Unicode code point. It does not represent a character. The table for the UTF-8 encoding should also show this gap in the code points because the "code points" between U+D800 and U+DFFF do not have a UTF-8 encoding. But this gap is almost never shown in UTF-8 encoding tables. That makes the table seem simpler, but it is a bit misleading.

UTF-16 is a self-synchronizing code of 16-bit words, but it is not self-synchronizing as a code of 8-bit bytes. If a UTF-16 byte stream loses one byte, then the stream can no longer be decoded correctly. This makes UTF-16 not a good choice for transmitting text data over a network.

Exercise: The rocket emoji has code point U+1F680. Its UTF-16 code units are 0xD83D and 0xDE80.

    jshell> java.util.HexFormat.ofDelimiter(" ").withUpperCase().formatHex("🚀".getBytes("UTF-16BE"))

Verify that these two code units decode to 0x1F680.

Exercise: Compile the following simple program and then use its class file to show that Java does not use UTF-8 as its internal encoding for string literals; it uses MUTF-8 (Modified UTF-8). (Hint: Encode each word from the emoji's surrogate pair in UTF-8.)

public class Example {
   public static void main(String[] args) {
      System.out.println("🚀");
   }
}

UTF-32

UTF-32 uses a 32-bit code unit (similar to a Java int). UTF-32 uses a single code unit for every code point in Unicode.

UTF-32 is a straightforward character encoding that uses the binary number representation of each code point as the encoding. Since the address space of Unicode code points is 21 bits, this address space easily fits into a 32-bit code unit. But it also wastes a lot of bits. Every encoding wastes at least 11 bits of the code unit. For text that is mostly ASCII, it wastes at least 24 bits per character, because an ASCII character fits in a single byte.

UTF-32 is usually not used for storing or transmitting Unicode characters because it wastes so many bits. But UTF-32 is useful in certain situations. It can be used to make encoding conversions easier to implement. Some text editors use it to represent in memory the text we are editing. Since every character is essentially one int, UTF-32 makes it easy to jump around in the text (think about how tricky it is to "jump ahead 239 characters" in UTF-8 encoded text).
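
In Java, the closest thing to UTF-32 text in memory is an int array of code points, which we can get from String.codePoints(). The sketch below shows the direct indexing that a fixed length encoding gives us. The class name Utf32Indexing is invented for the example, and it assumes the JDK provides the UTF-32BE charset.

```java
import java.nio.charset.Charset;

class Utf32Indexing {
    // One int per code point, like UTF-32 code units held in memory.
    static int[] toCodePoints(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE80b";             // 'a', rocket emoji, 'b'
        int[] cps = toCodePoints(s);
        System.out.println(s.length());          // number of UTF-16 code units
        System.out.println(cps.length);          // number of code points
        System.out.printf("U+%04X%n", cps[1]);   // the second character, by direct index
        // Every code point takes four bytes in UTF-32.
        System.out.println(s.getBytes(Charset.forName("UTF-32BE")).length);
    }
}
```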

Fonts

When we work with computer representations of text (or what are generally called "writing systems") we need to make many detailed distinctions in order to be clear about what we are saying. We need to define a number of terms that describe text and its appearance, words like character, glyph, font, and typeface. We will not give precise definitions for these terms, just definitions that are good enough to talk reasonably about text.

A character is a letter from an alphabet, or a common symbol like a numeral or a punctuation mark.

A glyph is a drawing of a character. Here is a drawing of the letter 'a'. But that is not the only way to draw the letter 'a'. Here are nine different glyphs for the letter 'a'.

a a a a a a a a a

We could use any one of those glyphs to spell the word "cat". The choice of a glyph does not change the meaning of the letter or the word. Most importantly, all nine of these glyphs have the same character encoding. They are all the ASCII (or maybe UTF-8) character with hexadecimal code 0x61 (decimal code 97).

Here are nine glyphs for the letter 'A'. Every one of these glyphs is represented by the ASCII (or UTF-8) character code hexadecimal 0x41 (decimal code 65).

A A A A A A A A A

If every one of those glyphs has the same ASCII code, then how does the display system know that it should draw them differently? The letters 'a' and 'A' have different ASCII codes, so we expect the display system to draw something different for each one. But if we use the same code, 0x41, nine times, how do we get nine different drawings? To (partially) answer this question, we need some more terminology.

When we choose a specific glyph for each letter in an alphabet, that collection of glyphs is called a font. Usually, all the glyphs in a font have the same size and are in a similar style.

A collection of fonts in different sizes, where each font is, more or less, a scaled version of the other fonts, is called a typeface. In HTML, typefaces are called "font families", which is a good descriptive name. A typeface is a family of closely related fonts.

In many current uses of these terms, the distinction between a font and a typeface is blurred. When people talk about a font, they often mean a typeface.

In the above display of the nine glyphs for the letter 'A', just before each character 'A' there is a special code embedded into this document that tells the display system to switch to a different typeface and choose a font from that typeface. Once the display system is told to switch to a different font, it will use that font to display every character it decodes. But in the above display, the font is changed nine times, just before each occurrence of the character 'A'.
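To get a feel for what such an embedded code might look like, here is a hypothetical sketch in Python that builds a line of HTML. The actual markup in this document may be different. The `span` elements, the function name, and the choice of the five generic CSS font families are assumptions made only for this illustration.

```python
import html
from typing import Sequence


def styled_glyphs(ch: str, font_families: Sequence[str]) -> str:
    """Wrap the same character in one HTML span per font family."""
    spans = [
        # The markup changes for each span, but the character inside does not.
        f'<span style="font-family: {family}">{html.escape(ch)}</span>'
        for family in font_families
    ]
    return " ".join(spans)


# The five generic font families defined by CSS (an illustrative choice).
fonts = ["serif", "sans-serif", "monospace", "cursive", "fantasy"]
markup = styled_glyphs("A", fonts)
print(markup)
```

In the generated markup, the character between each pair of tags is always the same 'A' with code 0x41. Only the surrounding font instruction changes.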

If you want to see the hidden code that changes the font for each 'A', use your mouse to point at one of the glyphs, then "right-mouse-click" on the glyph and choose the menu item "Inspect". You can also examine the source code for this document, the file "Readme_7_character_encodings.md" in the zip file "streams_and_processes.zip".

Not all display systems can switch fonts character by character. For example, most text editors cannot do that (but word processing programs can). In a text editor, we choose a global font and then all the text in our document is displayed using that font. Similarly for a console window. The font is a global setting and all the text in the console window is shown in the same font. In some console programs, we need to restart the program in order to be able to change the display font.

Exercise: How does "changing the font" compare with "changing the code page"? Both change the appearance of what we see on the screen without changing the underlying data. Are they similar ideas? Are they the same thing? (Hint: This is a subtle question.)

Exercise: Find out how to change the font in your text editor. Find out how to change the code page in your text editor. Do the same for your console (terminal) program.

Exercise: Look up "ligatures". Install into your text editor a "programmer's font" that uses ligatures (such as Fira Code). How do ligatures fit into the ideas of character encodings, code points, and glyphs? Does Unicode have code points for ligatures?
