Mr Branding: Java, Unicode, and the Mysterious Compile Error

Thursday, August 25, 2016

Java, Unicode, and the Mysterious Compile Error

Unicode is a character encoding standard that supports a broad range of characters and symbols. Although the latest version of the standard is 9.0, JDK 8 supports Unicode 6.2, and JDK 9 is expected to be released with support for Unicode 8.0. Java allows you to insert any supported Unicode character with a Unicode escape, which is essentially a sequence of four hexadecimal digits representing a UTF-16 code unit. In this post I'm going to cover how to use Unicode escapes in Java and how to avoid seemingly inexplicable compiler errors caused by Unicode escape misuse.

What are Unicode Escapes?

Let's start from the beginning. Unicode escapes are used to represent Unicode symbols with only ASCII characters. They come in handy when you need to insert a character that cannot be represented in the source file's character set. According to section 3.3 of the Java Language Specification (JLS), a Unicode escape consists of a backslash character (\) followed by one or more 'u' characters and four hexadecimal digits.

UnicodeEscape:
    \ UnicodeMarker HexDigit HexDigit HexDigit HexDigit

UnicodeMarker:
    u
    UnicodeMarker u

So, for example, \u000A is treated as a line feed.
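To make the grammar concrete, here is a minimal sketch. The class and method names (EscapeForms, singleMarker, repeatedMarker) are made up for illustration. It checks that an escape with one 'u' and an escape with several 'u' characters both denote the same character, as the UnicodeMarker rule allows.

```java
class EscapeForms {
    // One 'u' marker followed by four hex digits: code unit 0041 is 'A'.
    static boolean singleMarker() {
        return '\u0041' == 'A';
    }

    // The grammar permits any number of 'u' markers before the digits.
    static boolean repeatedMarker() {
        return '\uuu0041' == 'A';
    }

    public static void main(String[] args) {
        System.out.println(singleMarker());   // prints true
        System.out.println(repeatedMarker()); // prints true
    }
}
```

Both methods return true because the escape is translated to the character before the compiler ever looks at the character literal.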

Example Usage

The following is a piece of Java code containing a Unicode escape.

public class HelloUnicode {
    public static void main(String[] args) {
        // \u0055 is a Unicode escape for the capital U character (U)
        System.out.println("Hello \u0055nicode".length());
    }
}

Take a moment to think about what will be printed out. If you want, copy and paste the code into a new file, then compile and run it.

At first glance it looks like the program prints out 18. There are 18 characters between the double quotes, so the length of the string should be 18. But if you run the program, the output is 13. As the comment suggests, the Unicode escape is replaced with a single character, so the string the program actually sees is "Hello Unicode".
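The difference between 18 and 13 can be seen side by side in the following sketch. The class and method names are hypothetical. It relies on the JLS 3.3 rule that a backslash preceded by another backslash does not begin a Unicode escape, so doubling the backslash keeps all six characters of the escape text in the string.

```java
class EscapeLength {
    // The escape is translated to 'U', so the string is "Hello Unicode".
    static int translated() {
        return "Hello \u0055nicode".length();
    }

    // The doubled backslash is an ordinary string escape for one backslash,
    // so the text u0055 stays in the string as five separate characters.
    static int literalBackslash() {
        return "Hello \\u0055nicode".length();
    }

    public static void main(String[] args) {
        System.out.println(translated());       // prints 13
        System.out.println(literalBackslash()); // prints 18
    }
}
```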

Equipped with the knowledge that Unicode escapes are replaced with their respective Unicode characters, let's look at the following example.

public class NewLine {
    public static void main(String[] args) {
        // \u000A is a unicode escape for the line feed (LF)
        // \u0055 is a Unicode escape for the capital U character (U)
        System.out.println("Hello \u0055nicode".length());
    }
}

Can you guess what will be printed out now? The answer should be the same as before, right? Some of you might suspect that this is a trick question and, as a matter of fact, it is. This example will not compile at all.

$ javac NewLine.java
NewLine.java:3: error: ';' expected
        // \u000A is a unicode escape for the line feed (LF)
                      ^
NewLine.java:3: error: ';' expected
        // \u000A is a unicode escape for the line feed (LF)
                                     ^
NewLine.java:3: error: '(' expected
        // \u000A is a unicode escape for the line feed (LF)
                                         ^
NewLine.java:3: error: ';' expected
        // \u000A is a unicode escape for the line feed (LF)
                                                  ^
NewLine.java:3: error: ';' expected
        // \u000A is a unicode escape for the line feed (LF)
                                                            ^
NewLine.java:5: error: ')' expected
        System.out.println("Hello \u0055nicode".length());
                                                         ^
6 errors

What!? So many errors! My IDE doesn't show any squiggly red lines and I can't seem to find any syntax errors myself. Error on line 3? But that's a comment. What is going on?
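A hint, based on the JLS rule that Unicode escapes are translated before comments are recognized: the line feed escape ends the comment early, and whatever follows it on that line is compiled as code. The sketch below (class and method names are made up for illustration) puts a valid statement after the escape, so it compiles, and the "commented out" increment really runs.

```java
class CommentEscape {
    static int hits() {
        int count = 0;
        // This comment is cut short by the escape: \u000A count++;
        return count;
    }

    public static void main(String[] args) {
        System.out.println(hits()); // prints 1
    }
}
```

In NewLine the text after the escape is plain English rather than valid Java, which is why the compiler reports syntax errors on what looks like a comment line. Describing the escape in words, or otherwise keeping a backslash immediately followed by u out of comments, avoids the problem.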

by Indrek Ots via SitePoint
